Comprehensive NWT Comparison Project (calling all technically skilled members)

Apognophos

Is anyone aware if this project is already being done by someone? If not, perhaps we can brainstorm how to go about it, now that the PDF for the revised NWT is out. A couple of initial impressions:

- I'm not aware of a decent program for extracting text from these PDFs. One that I tried reads each line left to right straight across the page, from the left text column, through the cross-reference column in the middle, to the right column, so the output is not useful.

- One can copy and paste the text from a PDF viewer into a rich-text editor and then save the content as RTF. The RTF will contain markup like:

\fs18 24 And so he drove the man out and posted at the east of the gar- den of E
\f1 \uc0\u56319 \u56329
\f0 den
\fs10 \up4 n
\fs18 \up0 the cherubs
\fs10 \up4 o
\fs18 \up0 and the flaming blade of a sword that was turning itself continually to guard the way to the tree of life.

This markup can be processed: remove the lines for cross-reference letters, rejoin the words that were hyphenated at line breaks, and replace the special character sequences for quotes and apostrophes with plain quotes and apostrophes.

- Numerous changes will be of the nature of repetitive substitutions like changing "has declared" to "said" or "proceeded to assault" to "assaulted". We'll probably want a way to group changes so that we can sift the more interesting ones from the chaff, since the majority of changes will represent simplified grammar.
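The RTF cleanup described above could be sketched roughly as follows. This is only an illustration built from the one sample shown: it assumes cross-reference letters always sit on their own line with a non-zero `\up` value, that a line made of `\u` escapes is a special character glued to the text on either side, and that a hyphen followed by a space marks a line-break hyphen. The function name `clean_rtf_lines` and the `replacements` mapping are made up for this sketch; what each escape pair actually stands for would have to be worked out from the real files.

```python
import re

SUPERSCRIPT = re.compile(r"\\up[1-9]")          # raised text: assumed to be a cross-reference letter
UNICODE_RUN = re.compile(r"(?:\\u-?\d+ ?)+")    # one or more \uNNNN escapes in a row
CONTROL_WORD = re.compile(r"\\[a-z]+-?\d* ?")   # any other RTF control word, e.g. \fs18 or \f0
HYPHEN_BREAK = re.compile(r"(\w)- (\w)")        # "gar- den" style line-break hyphen


def _codes(run):
    """Turn a run like '\\u56319 \\u56329' into the tuple (56319, 56329)."""
    return tuple(int(n) for n in re.findall(r"-?\d+", run))


def clean_rtf_lines(lines, replacements=None):
    """Return plain text for a list of RTF lines (sketch, see assumptions above).

    replacements maps a tuple of \\u code numbers to the character it should
    become; unknown sequences are dropped.
    """
    replacements = replacements or {}
    out = []
    glue_next = False  # True right after a special-character line
    for line in lines:
        if SUPERSCRIPT.search(line):
            continue  # skip the cross-reference letter
        special = UNICODE_RUN.search(line) is not None
        line = UNICODE_RUN.sub(
            lambda m: replacements.get(_codes(m.group()), ""), line
        )
        piece = CONTROL_WORD.sub("", line).strip()
        if not piece and not special:
            continue
        if out and not (special or glue_next):
            out.append(" ")
        out.append(piece)
        glue_next = special
    text = re.sub(r"\s+", " ", "".join(out)).strip()
    # Caveat: this also removes the hyphen of a genuine compound that
    # happened to be split at a line end.
    return HYPHEN_BREAK.sub(r"\1\2", text)
```

Called on the sample lines above with no mapping, this drops the two letter lines and the escape pair and rejoins "gar- den". Passing something like `{(8217,): "'"}` as `replacements` would turn a `\u8217` escape into a plain apostrophe.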

Before I go about trying to do this, I wanted to see if anyone has any better ideas or experience in this sort of thing.

JeffT

Which PDF reader did you use?

Apognophos

I copied the PDF text from Preview into TextEdit on the Mac to get that result.

ILoveTTATT

There are free OCR programs online that are absolutely amazing! You should try them!

For example,

http://www.onlineocr.net/

It worked when I copied JUST the text area of the PDF... with the snipping tool... It copied everything PERFECTLY.

However, copying just the text area is tedious... If THIS part can be automated, we have "it" made!

Apognophos

Well, the PDFs I have are already textual. So we shouldn't need to do any conversion from images to text, but rather reformat the existing text. The question is just how to turn the two-column layout into text without those annoying cross-reference letters in there, since they were all changed in the new NWT and would ruin the comparison.

I forgot to note something important, which is that I did end up with some lines out of order, and I don't understand why:

Note that a couple of lines are out of place. "Abel" and "from which he had been taken" are for some reason placed at the end of the page's markup. This problem appears on other pages as well and is probably widespread. I don't know the PDF format well. I can only assume that each line carries its own x, y coordinates on the page, which would explain how the lines can be out of order in the document markup.
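If that guess about per-line coordinates is right, the reading order could be rebuilt by sorting on position instead of trusting the order in the markup. Here is a minimal sketch of that idea. It assumes some extraction tool can hand back each line as an (x, y, text) triple, that y grows upward from the bottom of the page as in PDF's default coordinate space, and that the two column boundaries are known. `Fragment`, `reading_order`, `left_max` and `right_min` are names invented for this example.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the line on the page
    y: float   # baseline height, measured from the bottom of the page
    text: str


def reading_order(fragments, left_max, right_min):
    """Return the texts of one page: left column top to bottom, then right column.

    Fragments with left_max <= x < right_min are treated as the middle
    cross-reference column and dropped.
    """
    left = [f for f in fragments if f.x < left_max]
    right = [f for f in fragments if f.x >= right_min]

    def from_top(f):
        return -f.y  # larger y is higher on the page

    return [f.text for f in sorted(left, key=from_top) + sorted(right, key=from_top)]
```

Because the middle column is filtered out by x position, the same step would also get rid of the cross-reference column, though not the superscript letters embedded in the verse text.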

zound

If you put the PDF into InDesign you can create a blank section to cover the middle column (and perhaps the top page numbers) on every page (put it on the master page). Then export it again as a PDF, then convert the image to text.

MeanMrMustard

I am a programmer. I can extract the text quite easily. I've made many programs to parse through PDF files.

I've often wanted to do something like that - get the content out of different versions of the WTLIB and see if anything has changed between versions. Although the WTLIB is not PDF, I can still force the content out through some automated WinAPI calls. But alas! - I don't have the time.

Where do you get the PDF?

MMM

Apognophos

Great, just the sort of person I was looking for! I program too, but I've never worked with PDF so I was first going to try basic stuff like copy-pasting and grepping or xsltprocing before I wrote an actual program, but if you've done this kind of thing before, you are probably the right guy for the job. Well, the new NWT's PDF is right at jw.org. The old one is still up for now at http://www.jw.org/en/publications/bible/nwt/books/

I imagine it won't be hard to filter out those pesky cross-reference letters?

MeanMrMustard

Ummm.. that's a good question. You never really know until you try to extract the text. I usually just use the iText PDF library. There is a Java version and a .NET version. The Java version might interest you, since you seem to be into the Unix/Linux side of the world. I also have some PDF libraries that I paid for, but oddly enough, I don't usually go to them.

The old one can be pulled off the website easily. Notice that you can get to each chapter just by the URL: http://www.jw.org/en/publications/bible/nwt/books/genesis/1/
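Going by that URL pattern, the list of chapter pages for a book could be generated like this. It is only a sketch: the one confirmed example is genesis/1/, so the lowercase book name in the path is an assumption for other books (multi-word names in particular may use a different form), and `chapter_url` and `chapter_urls` are names made up here.

```python
BASE = "http://www.jw.org/en/publications/bible/nwt/books"


def chapter_url(book, chapter):
    """Build the URL for one chapter, following the genesis/1/ pattern."""
    return f"{BASE}/{book.lower()}/{chapter}/"


def chapter_urls(book, chapter_count):
    """Return the URLs for chapters 1..chapter_count of a book."""
    return [chapter_url(book, n) for n in range(1, chapter_count + 1)]
```

For example, `chapter_urls("genesis", 50)` would give the fifty Genesis pages to fetch.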

I'll try to grab it and parse through it tomorrow.

MMM

JeffT

If we can get it into a plain-text document, I'd love to take a stab at comparing some scriptures. I may do that even if we can't convert it. If somebody has the PDF, shoot me a PM.
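Once both versions exist as plain text, the comparison itself, along with the grouping of repetitive substitutions mentioned in the first post, could be sketched with Python's standard difflib. This assumes the verses have already been paired up as (old, new) strings; `substitutions` and `group_changes` are names invented for the example, and the sentences in the usage note are made up, not real verses.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitutions(old, new):
    """Yield (old_words, new_words) for every stretch where the two texts differ."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For pure insertions or deletions one side is an empty string.
            yield " ".join(a[i1:i2]), " ".join(b[j1:j2])


def group_changes(verse_pairs):
    """Count how often each distinct substitution occurs across all verse pairs."""
    counts = Counter()
    for old, new in verse_pairs:
        counts.update(substitutions(old, new))
    return counts
```

Feeding it pairs such as ("he has declared it", "he said it") and ("they proceeded to assault him", "they assaulted him") would count one "has declared" to "said" change and one "proceeded to assault" to "assaulted" change. Sorting with `most_common()` puts the routine substitutions at the top, so the rare changes at the bottom of the list are the ones worth reading by hand.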
