Comprehensive NWT Comparison Project (calling all technically skilled members)

by Apognophos 223 Replies latest watchtower bible

  • ILoveTTATT
    ILoveTTATT

    comatose:

    Pretty much.

  • MeanMrMustard
    MeanMrMustard

    Apognophos,

    As an update, I am able to pull text from the PDF file, but there are some odd anomalies. Here is what page 44 of the PDF looks like (I chose this page because it is the first full page of biblical text, after all that garbage they put at the front of the book - 42 pages of propaganda art BLA):

    --------

    GENESIS 1:20–2:5
    20 Then God said: “Let the waters swarm with living crea­tures,1 and let .ying crea­tures .y above the earth acrossthe expanse of the heavens.”2a 21 And God created the great sea creatures2 and all living creatures1 that move and swarm in the waters according to theirkinds and every winged .ying creature according to its kind. And God saw that it was good. 22 With that God blessed them,saying: “Be fruitful and becomemany and .ll the waters of the sea,b and let the .ying crea­tures become many in the earth.”23 And there was evening and there was morning, a .fth day.
    24 Then God said: “Let the earth bring forth living crea­tures1 according to their kinds, domestic animals and creeping animals2 and wild animals of the earth according to their kinds.”c And it was so. 25 And God went on to make the wild an­imals of the earth according totheir kinds and the domestic an­imals according to their kinds and all the creeping animals of the ground according to their kinds. And God saw that it was good.
    26 Then God said: “Let usd make man in our image,e accord­ing to our likeness,f and let them have in subjection the .sh of the sea and the .ying creaturesof the heavens and the domes­tic animals and all the earth and every creeping animal that is moving on the earth.”g 27 And God went on to create the man in his image, in God’s image hecreated him; male and female hecreated them.h 28 Further, God blessed them, and God said to
    1:20, 21, 24 1 Or “souls.” 1:20 2 Or “sky.” 1:21 2 Or “monsters.” 1:24 2 Or “moving animals,” apparently includingreptiles and forms of animal life di.er­ent from the other categories.

    CHAP. 1
    a Ge 2:19
    b Ne 9:6 Ps 104:25
    c Ge 2:19
    d Pr 8:30 Joh 1:3 Col 1:16
    e 1Co 11:7
    f Ge 5:1 Jas 3:9
    g Ge 9:2
    h Ps 139:14 Mt 19:4 Mr 10:6 1Co 11:7, 9

    Second Col.
    a Ge 9:1
    b Ge 2:15
    c Ps 8:4, 6
    d Ge 9:3 Ps 104:14 Ac 14:17
    e Ps 147:9 Mt 6:26
    f De 32:4 Ps 104:24 1Ti 4:4

    CHAP. 2
    g Ne 9:6 Ps 146:6
    h Ex 31:17 Heb 4:4
    i Isa 45:18
    them: “Be fruitful and become many, .ll the eartha and sub­due it,b and have in subjectionc the .sh of the sea and the .y­ing creatures of the heavens andevery living creature that is mov­ing on the earth.”
    29 Then God said: “Here have given to you every seed-bearing plant that is on the en­tire earth and every tree with seed-bearing fruit. Let them serve as food for you.d 30 And to every wild animal of the earthand to every .ying creature of the heavens and to everything moving on the earth in which there is life,1 I have given all green vegetation for food.”e And it was so.
    31 After that God saw every­thing he had made, and look! it was very good.f And there was evening and there was morning,a sixth day.
    2
    Thus the heavens and the earth and everything in them1 were completed.g 2 And by the seventh day, God had completed the work that he hadbeen doing,1 and he began to rest on the seventh day from allhis work that he had been do­ing.1h 3 And God went on to bless the seventh day and to de­clare it sacred, for on it God hasbeen resting from all the work that he has created, all that hepurposed to make.4 This is a history of the heav­ens and the earth in the time they were created, in the daythat Jehovah1 God made earth and heaven.i
    5 No bush of the .eld was yet on the earth and no vegetation of the .eld had begun sprout­ing, because Jehovah God had
    1:30 1 Or “life as a soul; a living soul.”
    2:1 1 Lit., “and all their army.” 2:2 1 Or “making.” 2:4 1 The .rst occurrence of God’s distinctive personal name, 565 (YHWH). See App. A4.

    --------

    The first thing you may notice is that the cross reference markers are going to make this very difficult. The web version used "+" and "*" hyperlinks. The PDF version is like the printed version in that it uses normal letters, which look like odd misspellings once they run into the adjacent words. You can see the middle column lines too. As you may know, a human reading text like this is one thing, but developing an algorithm to intelligently parse it is another story. I'll have to work at it. The cross reference letters in the main text seem like the largest problem so far - oh, and the footnotes merge into the main text. That's the issue with PDFs: they are basically text positioned around a canvas, so you don't get the line breaks you would expect.

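    Edit: also note how "flying" didn't come out completely (it shows up as ".ying", and the same thing happens to "fly", "fill", "fifth", "fish", "field" and "first").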
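
    A rough sketch of how the dropped letters might be repaired, in Python. It assumes the "." stands in for a lost ligature ("fi", "fl", "ff", "ffi" or "ffl"), that the invisible breaks inside words like "creatures" are soft hyphens (U+00AD), and that an English word list is available. The names repair_token, repair_line and VOCAB are made up for illustration, and VOCAB is a tiny stand-in for a real word list.

```python
SOFT_HYPHEN = "\u00ad"
# Ligatures assumed to have been dropped by the extractor.
LIGATURES = ("fi", "fl", "ff", "ffi", "ffl")
PUNCTUATION = "“”\"',;:!?()"


def repair_token(token, vocabulary):
    """Remove soft hyphens and try to restore a ligature where a '.' appears."""
    cleaned = token.replace(SOFT_HYPHEN, "")
    if "." not in cleaned:
        return cleaned
    for ligature in LIGATURES:
        candidate = cleaned.replace(".", ligature, 1)
        # Accept the candidate only if it is a known word.
        if candidate.strip(PUNCTUATION).lower() in vocabulary:
            return candidate
    # No known word results, so the '.' is probably real punctuation.
    return cleaned


def repair_line(line, vocabulary):
    """Apply repair_token to every space-separated token in a line."""
    return " ".join(repair_token(t, vocabulary) for t in line.split(" "))


# Hypothetical word list; a real run would load a full dictionary.
VOCAB = {"flying", "fly", "fill", "fifth", "fish", "field", "first", "different"}

print(repair_line("let .ying crea\u00adtures .y above the earth", VOCAB))
# prints: let flying creatures fly above the earth
```

    This does nothing about the cross reference letters or the joined words; it only handles the missing-letter problem, and only for words the list knows.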

    MMM

  • MeanMrMustard
    MeanMrMustard

    And here is another extraction method... Fewer errors, but the text is drawn out vertically, one word per line.

    GENESIS
    1:20–2:5

    20
    Then
    God
    said:
    “Let
    the
    waters
    swarm
    with
    living
    creatures,
    1
    and
    let
    flying
    creatures
    fly
    above
    the
    earth
    acrossthe
    expanse
    of
    the
    heavens.”a
    21
    And
    God
    created

    [SNIP FOR LENGTH - IT JUST GOES ON THAT WAY]

    and
    no
    vegetation
    of
    the
    field
    had
    begun
    sprouting,
    because
    Jehovah
    God
    had

    1:30
    1
    Or
    “life
    as
    a
    soul;
    a
    living
    soul.”
    2:1
    1
    Lit.,
    “and
    all
    their
    army.”
    2:2
    1
    Or
    “making.”
    2:4
    1
    The
    first
    occurrence
    of
    God’s
    distinctive
    personal
    name,
    (YHWH).
    See
    App.
    A4.

    MMM
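
    One-word-per-line output like the above could in principle be regrouped into verses. A minimal Python sketch follows, assuming a bare number counts as a verse number only when it is exactly one more than the current verse, and that any other bare number is a footnote marker to be dropped. The function name group_verses and the first_verse parameter are hypothetical, and the sketch does not handle chapter changes or the footnote block at the bottom of the page.

```python
def group_verses(tokens, first_verse):
    """Group one-word-per-line tokens into {verse_number: text}."""
    verses = {}
    current = None
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        if token.isdecimal():
            number = int(token)
            expected = first_verse if current is None else current + 1
            if number == expected:
                # The next verse starts here.
                current = number
                verses[current] = []
            # Any other bare number is assumed to be a footnote marker.
            continue
        if current is not None:
            verses[current].append(token)
    return {number: " ".join(words) for number, words in verses.items()}


lines = ["20", "Then", "God", "said:", "creatures,", "1", "and", "let",
         "21", "And", "God", "created"]
print(group_verses(lines, first_verse=20))
# prints: {20: 'Then God said: creatures, and let', 21: 'And God created'}
```

    The heuristic breaks when a footnote marker happens to equal the next verse number (a marker "2" inside verse 1, for example), so the result would still need checking.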

  • MeanMrMustard
    MeanMrMustard

    And one more update. Here is what it looks like when the page bytes are extracted, sent back through Adobe Pro, and exported as HTML. The export is a little better, and you get to see some of the inner tags. For example, the <div> tags with class "Sect" are the paragraphs. The great thing here is that you can easily peel out the middle column. The bad thing is still that the column markers are letters, creating spelling mistakes. Also, notice that because PDFs are floating text for the most part, the extraction sometimes gets the word breaks wrong. You can see it circled in red below... kinda stinks. I would much rather pull it from the web, where you get the text all nice and neat....

    MMM

  • zound
    zound

    Great job so far. Won't they change the web version to the new Bible before long? If there are too many problems, perhaps you could just wait until then.

  • MeanMrMustard
    MeanMrMustard

    zound,

    I was thinking the same thing. I have a feeling that if I go forward with the PDF parse, we are going to end up with a bunch of errors that we won't know about until we examine each scripture. For each verse we would be asking, "Did that word bump up against the middle column so that the next word is actually placed right next to it without any whitespace?" When I pull the text from the web site, I know that the spacing is right because it's needed to format correctly on the page. But in a PDF file, you can float that text anywhere.

    MMM

  • disfellowshipped1
  • Apognophos
    Apognophos

    Sorry I was away (sleeping) while you were posting MMM. Yes, I had a feeling the PDF might be a little trouble, but it's more than I thought. We're probably better off just waiting for the 2013 web version to go up. I'm sure they're working on it.

    That being said, I am confused about the markup you are getting. For instance, is each line of text stored separately or are paragraphs single blocks of word-wrapped text? I just don't get how the words "across" and "the" could be one word in the PDF markup unless PDF breaks a word at whatever letter it hits the end of the column. Is that how it works? The HTML export also seems to have misapplied bold tags everywhere for some reason. Even the markup I got from copy-pasting the rendered PDF into RTF was better than that (as seen in my first post). The only real problem with that approach was that some lines were out of order, which is obviously a deal-breaker for using that method.

    But in any case it's clear that their 2013 web version will be much easier to work with, so I'm content to wait until that goes up.

    Another thought I had is that the final product probably can't use whole verses when it lists differences, since that is likely copyright infringement. But we'll still want to work with it internally in whole verses so I have the proper context to determine the nature of the changes that I will be categorizing or grouping.

  • MeanMrMustard
    MeanMrMustard

    Apognophos,

    No problem. sleep++;

    PDF is a strange format. You can have text fields that "float". For example, if you want to display a table in PDF, you normally just position the text fields where you want, in the shape of a grid, and then overlay lines to make it look like a grid. Something very odd is going on with that PDF when it comes to the "acrossthe" portion. Perhaps that is one text field that spans two lines with no whitespace in it. When the extraction algorithm comes across it, it looks like a single word. Note, I had Adobe Pro export it to HTML and it joined the word there too. To repeat for emphasis - the HTML in my last screenshot was generated by extracting the page and pushing it through Adobe Pro itself. That is, even Adobe is confused when it tries to export as text.
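
    Joined words like "acrossthe" could at least be flagged automatically. A minimal Python sketch, assuming a word list is available: a token that is not itself a known word but splits cleanly into two known words is reported as a suspect. The name split_joined and the small vocabulary are made up for illustration.

```python
def split_joined(token, vocabulary):
    """Return (left, right) if token looks like two known words run together.

    Returns None when the token is itself a known word or no split works.
    Only the first split found is returned, so ambiguous cases need review.
    """
    word = token.lower()
    if word in vocabulary:
        return None
    for i in range(1, len(word)):
        if word[:i] in vocabulary and word[i:] in vocabulary:
            return token[:i], token[i:]
    return None


# Hypothetical word list; a real run would load a full dictionary.
vocabulary = {"across", "the", "their", "kinds", "earth", "a"}

print(split_joined("acrossthe", vocabulary))   # ('across', 'the')
print(split_joined("theirkinds", vocabulary))  # ('their', 'kinds')
print(split_joined("across", vocabulary))      # None
```

    Note that with "a" in the word list this also reports "eartha" as ("earth", "a"), so it cannot tell a cross reference letter from a genuinely joined word. It only narrows down which tokens a person has to look at.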

    When you have just the HTML from the WT site, you know it's all in the right order and the whitespace is OK, because otherwise it wouldn't render right on their website. I think the harder task is that the cross reference symbols are plain letters. When you look at a PDF as just text (just the text of the text fields), you lose the formatting, which makes it look like the cross ref symbols are part of the words. Now there may be ways around all of this that I am not aware of yet.... who knows. Maybe we'll get lucky and the WT will put out a web version very soon.

    As to your last statement: yes... a WinMerge-like format of the differences would essentially be reproducing the text of both editions in one file. Now if you came up with categories for the changes, and could objectively categorize them and just gather stats - that, of course, is perfectly fine to pass around.
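
    A minimal Python sketch of the stats-only idea, using the standard library's difflib to compare two versions of a verse word by word and count the kinds of change without reproducing either text. The function name change_stats is hypothetical, the sample sentences are placeholders rather than text from either edition, and the three categories are just difflib's own opcode tags, not the categories the project would eventually settle on.

```python
from collections import Counter
from difflib import SequenceMatcher


def change_stats(old_verse, new_verse):
    """Count word-level replace/insert/delete runs between two versions."""
    old_words = old_verse.split()
    new_words = new_verse.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[tag] += 1
    return counts


# Placeholder sentences, not quotations from either edition.
stats = change_stats("the quick brown fox", "the slow brown fox jumps")
print(dict(stats))  # {'replace': 1, 'insert': 1}
```

    Summing the counters over every verse in a book would give per-book totals that contain no verse text at all.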

    MMM

  • MeanMrMustard
    MeanMrMustard

    @disfellowshipped1:

    Looks like others want to know the differences between NWT and NWTv2 as well. We hope to document every change by way of a custom computer program and then develop some meaningful stats. Although, that part is still pending :)

    MMM
