Comprehensive NWT Comparison Project (calling all technically skilled members) (page 2)

88JM

I agree the challenge will be to get rid of or obfuscate those pesky cross-reference superscript letters. I'm looking at it just now in Acrobat Pro trying to convert it to different formats.

ILoveTTATT

I have both PDFs... but why are we comparing PDFs when we have both in text already? It's a matter of copy-paste in Word and then compare the versions.

The PDF route is actually necessary for comparison of 1950 with 1961 and 1984.

BTW, does anyone have the 1961 Bible on PDF?

A cursory look at some of the scriptures in Genesis reveals that pretty much every verse is different in minor ways. I thought they had reprinted the NWT with minor changes here and there, and that we wanted to see where those minor changes were (by using a program to detect all of them).

Now, since the changes are in every verse, how did you envision the output of this process?

MMM

MeanMrMustard

ILoveTTATT,

"It's a matter of copy-paste in Word and then compare the versions."

No, I think Apognophos wants to compare every verse in detail and display the differences by way of a nifty computer program. This can be done, but I am wondering why... it's not like we are trying to find the changed locations. After all, they are everywhere, in every verse.

MMM

Apognophos

Yeah, it's true that there are thousands of changes. Sorry if I made it sound like this would be a quick process. My thought was that first we would find a format for saving the changes (maybe a separate file for each change?), then we could sort out the simple ones that occur over and over, like the examples I gave earlier, and toss them into a folder like "Simplified Grammar".

Basically it would entail gradually building a list of each kind of change by hand, maybe with simple pattern matches. For instance, imagine a file called simple_grammar.txt containing:

Folder:Simplified Grammar

has declared=said

proceeded to *=*ed

As we build the list in simple_grammar.txt by looking through the changes we found, we repeatedly apply it to the original folder of changes until we see that only the interesting stuff is left in the main folder. In the end, there will still be hundreds of changes or more that interested readers will go through to see what the Society changed.
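Here is a minimal Python sketch of how a rules file like simple_grammar.txt might be applied to a single change. The function names (parse_rules, rule_matches) and the reading of * as "exactly one word" are assumptions made for illustration; nothing in this thread has settled those details.

```python
import re

def parse_rules(text):
    """Parse 'old=new' lines; a 'Folder:' line names the destination folder."""
    folder = None
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between rules are ignored
        if line.startswith("Folder:"):
            folder = line[len("Folder:"):].strip()
            continue
        old, new = line.split("=", 1)
        rules.append((old.strip(), new.strip()))
    return folder, rules

def rule_matches(rule, old_text, new_text):
    """True if the change old_text -> new_text is explained by the rule."""
    old_pat, new_pat = rule
    # Assumption: each '*' stands for exactly one word.
    regex = r"(\w+)".join(re.escape(part) for part in old_pat.split("*"))
    m = re.fullmatch(regex, old_text)
    if m is None:
        return False
    expected = new_pat
    for word in m.groups():
        expected = expected.replace("*", word, 1)
    return expected == new_text
```

With the two rules above, "proceeded to call" => "called" would be filtered out, while an irregular verb such as "proceeded to go" => "went" would not match "*ed" and would stay in the main folder for a manual look.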

It will be a somewhat tedious process, but I've done far more tedious work in the past and I'm prepared to take the time to do that filtering part if we can first obtain a complete set of changes between the PDF texts.

MeanMrMustard

Apognophos,

The challenge would be to build the program to get the output you desire. Once that occurs, it should only be tedious for the computer, not you. That is, you can run it and go grab a hot pocket. I am searching for my PDF to text converter now. I wrote one for a client a while back... now if I can only find where it is.....

MMM

Apognophos

Oh, and I think the changes will need context so we can read them on their own. Each change file should probably contain an entire verse, and be saved in a file named for its location. For example, if there are two changes in Genesis 1:1, the comparison program will save two text files called Genesis001-001a and Genesis001-001b, each containing the same complete verse. Let's say that Genesis 1:1 used to read:

In the beginning, God created the heavens and the earth.

and now it reads:

In the beginning, God made the heavens and stuff.

There are two changes, but "created" => "made" isn't very interesting so we will end up filtering that change out. What we do care about is the second, rather surprising change from "the earth" to "stuff". Thus, the file Genesis001-001a could contain this:

In the beginning, God {created} the heavens and the earth.

In the beginning, God {made} the heavens and stuff.

And the 'b' file could contain:

In the beginning, God created the heavens and {the earth}.

In the beginning, God made the heavens and {stuff}.

The braces mark the changes for our filtering program (which might just be a simple grep-and-mv script in Unix) to do its thing. Does that make sense?
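To make the brace format concrete, here is a rough Python sketch that produces one marked-up pair per change using the standard difflib module. The helper names (mark_changes, change_file_name) are made up for this example, and splitting on whitespace is a simplifying assumption; a real version would have to cope with whatever the PDF extraction actually yields.

```python
import difflib
import string

PUNCT = ".,;:!?"

def mark_changes(old, new):
    """Return one (old_marked, new_marked) pair per changed span."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old_span, new_span = " ".join(a[i1:i2]), " ".join(b[j1:j2])
        # Keep punctuation shared by both sides outside the braces.
        tail = ""
        while (old_span and new_span and old_span[-1] == new_span[-1]
               and old_span[-1] in PUNCT):
            tail = old_span[-1] + tail
            old_span, new_span = old_span[:-1], new_span[:-1]
        old_line = " ".join(a[:i1] + ["{" + old_span + "}" + tail] + a[i2:])
        new_line = " ".join(b[:j1] + ["{" + new_span + "}" + tail] + b[j2:])
        pairs.append((old_line, new_line))
    return pairs

def change_file_name(book, chapter, verse, index):
    """Build a name like Genesis001-001a (index 0 -> 'a', 1 -> 'b', ...)."""
    return f"{book}{chapter:03d}-{verse:03d}{string.ascii_lowercase[index]}"
```

Each pair would then be written to the file named by change_file_name. This sketch only handles up to 26 changes per verse, and a pure insertion or deletion shows up as an empty {} on one side.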

Apognophos

"The challenge would be to build the program to get the output you desire."

There might be a challenge in teaching the initial PDF parsing program to filter out the cross-reference letters, but ultimately we're probably talking about some pretty basic code. I wrote a script once for converting XML to RTF and it was about as long as I possibly could make it, and it was only a couple hundred lines (it did use a third-party program, xsltproc, but I think we're talking about doing the same thing here, using a third-party PDF library for the basic parsing). Edit: Actually, the bigger challenge might be assembling the text in proper order, based on my simple experiment earlier where the PDF lines appeared to be out of order.

"Once that occurs, it should only be tedious for the computer, not you."

I think the tedium will come from building the list of changes to filter out in step 2, but that's fine, I think I can handle it.

Simon

Step 1:

Create sentence-per-line (maybe verse-per-line) versions of old and new. If the PDFs are already text then use those rather than OCR. The \u characters are Unicode escapes, probably for the fancy capital letters and punctuation, so they just need to be mapped to regular ones (with some removed).

Step 2:

Use a diffing program (plenty of free ones available) to compare them. This will produce a file with only the differences highlighted and the additions / changes / removals shown. These are the things we programmers use for version control and change tracking.
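The two steps above could be sketched in Python roughly like this, using only the standard library. The PLAIN mapping is a guess at which typographic characters show up in the extracted text, and the function names are invented for the example; the real list would come from inspecting the actual PDFs.

```python
import difflib
import unicodedata

# Assumed mapping of typographic characters to plain ones; extend as needed.
PLAIN = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def normalize(line):
    """Step 1: fold compatibility characters and map fancy punctuation."""
    line = unicodedata.normalize("NFKC", line)
    return line.translate(PLAIN).strip()

def diff_versions(old_lines, new_lines):
    """Step 2: diff two verse-per-line lists, keeping only changed lines."""
    old = [normalize(line) for line in old_lines]
    new = [normalize(line) for line in new_lines]
    return list(difflib.unified_diff(
        old, new, fromfile="old", tofile="new", lineterm="", n=0))
```

Lines starting with "-" come from the old text and lines starting with "+" from the new one, the same convention version-control diffs use. Setting n=0 drops the unchanged context lines, which seems reasonable when every line is already a whole verse.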

Apognophos

Yeah, that's the idea. The PDFs are thankfully textual (they'd be huge otherwise).

As far as viewing diffs, I am familiar with programs like WinMerge, but I wanted to produce something in a format that any interested party can read, without needing to install anything or have technical knowledge. So I was hoping to strain out the "boring" changes (but leave those available somewhere, perhaps, for anyone who does want to review them), and then concatenate all the remaining "interesting" change files together into maybe a single file for each book that anyone could read easily, like "Genesis.txt" containing:

1:1

In the beginning, God created the heavens and {the earth}.

In the beginning, God made the heavens and {stuff}.

1:2

The earth was {formless and void}, and darkness was over the surface of the deep, and the Spirit of God was moving over the surface of the waters.

The earth was {a big mess}, and darkness was over the surface of the deep, and the Spirit of God moved over the surface of the waters.

1:5 [the next verse with a change]

[..]
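Putting the remaining changes into one readable file per book could look something like the following Python sketch. The function name format_book and the tuple layout are assumptions for illustration only.

```python
def format_book(changes):
    """Render (chapter, verse, old_marked, new_marked) tuples as plain text.

    Each change becomes a 'chapter:verse' heading followed by the old and
    new readings; changes are ordered by chapter and then verse.
    """
    blocks = []
    for chapter, verse, old, new in sorted(changes, key=lambda c: (c[0], c[1])):
        blocks.append(f"{chapter}:{verse}\n{old}\n{new}")
    return "\n\n".join(blocks) + "\n"
```

The returned string would be written to a file such as Genesis.txt. If a verse has more than one interesting change, this version simply repeats the heading for each one.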
