Comprehensive NWT Comparison Project (calling all technically skilled members)

by Apognophos 223 Replies

  • MeanMrMustard
    MeanMrMustard

    @DS211:

    Sent in PM.

    MMM

  • slii
    slii

    Hi,

    Yeah, I got distracted for a while, actually by writing some fancy new tools for reverse engineering, something I've intended to do for quite a while now but was never inspired enough by any one problem :) I guess I'm trying to figure out the format mostly out of curiosity; I agree it's likely much easier to get at the texts just by scripting.

    I still have at least one of the lookup tables to figure out; it (probably something like a Huffman tree) seems to be read from the .PUB files, but compressed in yet another way, with something called a "nibble codec". It might start to make sense once I figure out the offset it's read from.

    Anyway, here are some pieces about the .PUB format, since someone is going to be curious anyway. I've been looking at a file named km1979_e.pub from the 2006 version. It contains an article named "What has happened to love?". The sha1sum of the file is 6b4c4b0c7f93c04d8aa685fb773f6826af681d35.

    This is going to contain stuff that won't make much sense without the contents of the file in front of you; it's of value mostly if someone else wants to try to figure the format out, now or later.

    The first 16 bytes are something called the URE header and seem to be pretty fixed:

    00000000 55 52 45 53 04 00 03 00 00 00 00 00 00 00 00 00 |URES............|

    I seem to remember that the 04 here indicates this is a PUB file, as opposed to some of the other file types (indices, NWT, etc.), but my notes are hazy on this. The 03 might be the version, which the code requires to be 03.

    After that comes the PUB header, 8 bytes:

    00000010 00 00 84 00 00 00 03 09

    Here, I think the dword at 0x12, that is 0x84, is the number of entries in some structure which I for now call (rightly or wrongly) the Document Area Position Info Chunk. I think the 0x09 is probably the number of entries in the index data structure that immediately follows at offset 0x18 (the Chunk Information table).

    At offset 0x18 comes the Chunk Information table. It consists of 0x09 (from the previous header) 9-byte entries, each of which contains a one-byte tag and two dwords. The latter of the two dwords seems to (usually?) be an index into the document, and I guess it would make sense for the other to be the size of the element pointed to.

    00000018: 00 0e 02 00 00 e6 55 00 00 02 25 00 00 00 59 00 00 00...

    That is: tag 0, dwords 0x20e and 0x55e6. One of the other entries is {4, 0x261cc, 0x5ec4}. At 0x5ec4 + 0x10 (for the URE header) there seems to lie some kind of structure that lists the offsets of the chunks of the article in question; in any case, it has further pointers into the file. Inside that structure, at offset 0x6029 for example, we have the value 0x193a6, which I think points to the header of the chunk structure at something like 0x10 (for the URE header) + 0x5ec4 (for the above structure) + 0x193a6 = 0x1f27a. Anyway, at 0x1f27a + 4*0x84 + 5 (I don't know why yet...) = 0x1f48f there is some kind of header for the text chunk, with some BTEC1-encoded (whatever that is) information. After that comes what looks like a possible file name for the original text, 0902_K79.LOV. Then, at 0x1f4a2, comes what I think is a 5-byte header describing the codec used for the text, details yet unknown.
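
    For anyone who wants to poke at a file themselves, here's a minimal Python sketch of the layout as I understand it so far; the field meanings are my guesses from the notes above, not a confirmed spec.

    import struct

    def read_pub_headers(path):
        with open(path, "rb") as f:
            hdr = f.read(0x18)
            # URE header: magic, file type (04 = PUB?), required version 03.
            magic, ftype, version = struct.unpack_from("<4sHH", hdr, 0)
            assert magic == b"URES" and version == 3
            # PUB header at 0x10: dword at 0x12 = entry count of the Document
            # Area Position Info Chunk (0x84 here); byte at 0x17 = entry
            # count of the Chunk Information table (0x09 here).
            n_area = struct.unpack_from("<I", hdr, 0x12)[0]
            n_chunks = hdr[0x17]
            # Chunk Information table: 9-byte entries, a tag byte + 2 dwords.
            raw = f.read(9 * n_chunks)
            entries = [struct.unpack_from("<BII", raw, 9 * i)
                       for i in range(n_chunks)]
            return n_area, entries

    On km1979_e.pub this yields entries like (0, 0x20e, 0x55e6) and (4, 0x261cc, 0x5ec4), matching the dump above.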

    Immediately after that starts an MTEC3 (something Huffman-like) compressed text stream. I'm not quite ready to say much about the compression yet, although I have code to do the decoding given the lookup tables (I just don't understand it yet ;). This stream seems to contain only the title, in MEPS encoding, of course.
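
    To give a flavor of what "Huffman-like" decoding involves, here's a generic table-driven prefix-code decoder; the table format (a dict from (bit length, code) pairs to symbols) is a hypothetical stand-in of mine, not the actual MTEC3 layout.

    def decode_prefix_stream(data, table, max_bits=16):
        # 'table' maps (bit_length, code) -> symbol; the format is made up
        # for illustration, since the real MTEC3 LUTs are still unknown.
        bits, nbits, out = 0, 0, []
        for byte in data:
            bits = (bits << 8) | byte
            nbits += 8
            while nbits:
                for length in range(1, min(nbits, max_bits) + 1):
                    code = bits >> (nbits - length)
                    if (length, code) in table:
                        out.append(table[(length, code)])
                        nbits -= length
                        bits &= (1 << nbits) - 1  # drop the consumed bits
                        break
                else:
                    break  # need more input bits for the next code
        return out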

    At 0x1f760 (bytes 1b c0 ad 42) starts a MTEC3 encoded text block for which I have the lookup tables and which I thus can decompress with my code. It's a piece of text from "What has happened to love?":

    ". However, use your good judgment for we want the householders to know we are there because we love them and want to help them—not just to place literature. If no one is at home, leave the tract out of sight. (It is illegal to put items in the mailbox.—See Our Kingdom Service, April 1976, Announcements, page 2.) A territory can be reported as worked when we cover it with tracts." (and so on, until the end of the article.)

    MEPS-encoded, the beginning of this is

    . (43 08) SPACE (61 fb) H (07 08) o (28 08) w (30 08) e (1e 08) v (2f 08) e (1e 08) r (2b 08) , (44 08) SPACE (61 fb) u (2e 08) s (2c 08) e (1e 08) SPACE (61 fb) y (32 08) ...
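
    Just to make the mapping concrete, the pairs observed so far give a partial decode table (everything not listed is still unknown):

    # Partial MEPS table built only from the pairs above.
    MEPS_OBSERVED = {
        (0x43, 0x08): ".", (0x61, 0xfb): " ", (0x07, 0x08): "H",
        (0x28, 0x08): "o", (0x30, 0x08): "w", (0x1e, 0x08): "e",
        (0x2f, 0x08): "v", (0x2b, 0x08): "r", (0x44, 0x08): ",",
        (0x2e, 0x08): "u", (0x2c, 0x08): "s", (0x32, 0x08): "y",
    }

    def decode_meps(pairs):
        # Pairs not yet in the table come out as '?'.
        return "".join(MEPS_OBSERVED.get(tuple(p), "?") for p in pairs)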

    I guess I'll post some code once I have figured out how to read/compute the LUTs :) They certainly have found a complicated way to store text...

  • slii
    slii

    And yeah, I too have the NWT texts (I ripped them from their website and processed them into text-only format) and tried to compare the old one to the new one. I even found a bug in the 2013 version :D Actually, it will be interesting to see if they fix it when I report it here. Then again, there would be some fun in reporting it directly to them too.

    Specifically, in Exodus 16:8, the comma after the word "satisfaction" is bolded. It is the only bolded piece of text in the entire document. I doubt it is intentional.

  • jgnat
    jgnat

    uber-geek-speak.

  • MeanMrMustard
    MeanMrMustard

    "Anyway, here are some pieces about the .PUB format, since someone is going to be curious anyway. I've been looking at a file named km1979_e.pub from the 2006 version. [...]"

    I have to hand it to you - you have a sense of persistence. It would be cool to know the format of the PUB files. Ultimately, I think they will do away with the WTLIB CD and publish the whole darn thing online from now on. As mentioned in another thread, this will allow them to electronically publish their new light without leaving any trace of what was there before - except for sites that log internet history, or curious developers who want to database it for fun. :)

    "I guess I'll post some code once I have figured out how to read/compute the LUTs :) They certainly have found a complicated way to store text..."

    Go for it! When I'm ready, I'm going to dump all of my code and stats here on this site for others to look at and learn from.

    MMM

  • MeanMrMustard
    MeanMrMustard

    "And yeah, I too have the NWT texts (I ripped them from their website and processed them into text-only format) and tried to compare the old one to the new one. I even found a bug in the 2013 version :D [...]"

    You want the code that I used to do the match-patch? You can get google-diff-match-patch online, but you have to modify it to do a word-level diff. google-diff-match-patch is a character-level diff, and it works quite well. To do it for words, you have to tokenize each verse into words, map each word to a Unicode character, diff the resulting Unicode "sentence", and then map back to the words. First you have to define what a "word" is... and it's tougher than it looks. For me, each space is a word, and so is punctuation, as well as what you would consider a normal word.
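
    If anyone wants to try it in the meantime, here's a rough Python sketch of the trick (not my actual code; the tokenizer and the private-use-area mapping are just one way to do it):

    # pip install diff-match-patch
    import re
    from diff_match_patch import diff_match_patch

    def tokenize(verse):
        # A "word" here is a run of letters/digits, a space, or one
        # punctuation mark; defining this well is the hard part.
        return re.findall(r"\w+|\s|[^\w\s]", verse)

    def word_diff(old, new):
        table = {}
        def encode(tokens):
            # Map each distinct token to one private-use Unicode char so
            # the character-level differ effectively diffs whole words.
            for t in tokens:
                table.setdefault(t, chr(0xE000 + len(table)))
            return "".join(table[t] for t in tokens)
        a, b = encode(tokenize(old)), encode(tokenize(new))
        back = {c: t for t, c in table.items()}
        diffs = diff_match_patch().diff_main(a, b, False)
        # Map the placeholder characters back to the original tokens.
        return [(op, "".join(back[c] for c in chunk)) for op, chunk in diffs]

    Each tuple comes back as (-1, deleted text), (1, inserted text), or (0, unchanged text), rebuilt from whole words.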

    "Specifically, in Exodus 16:8, the comma after the word 'satisfaction' is bolded. It is the only bolded piece of text in the entire document. I doubt it is intentional."

    I am going to check this out. I didn't even pay attention to the HTML formatting when I scraped it from their site. I pulled all the formatting out and boiled it down to plain text even before it hit my database. (Note: just checked this out, and yes, that comma is definitely bold all by its lonesome...)
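
    For anyone curious, hunting that kind of stray formatting down is simple if you keep the HTML around. A quick sketch; the tag names are an assumption, since the site could just as well use <strong> or CSS classes:

    import re

    def stray_bold_runs(html, max_len=3):
        # Find suspiciously short bold spans, like a single bolded comma.
        spans = re.findall(r"<(?:b|strong)[^>]*>(.*?)</(?:b|strong)>",
                           html, re.IGNORECASE | re.DOTALL)
        return [s for s in spans if 0 < len(s.strip()) <= max_len]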

    MMM

  • MeanMrMustard
    MeanMrMustard

    " I guess I'm trying to figure out the format mostly out of curiosity; I agree it's likely to be much easier to get to the texts just by scripting."

    That is why I am doing this too: curiosity, and I want to see if I can do it. There are some things I want to do with the text as well. I did complete my automation program, and it seems to have dumped out the text for the entire thing just fine, although it took about 8 hours. I did this for the 2011 and 2012 versions. I have a preliminary diff going. Just some minor differences so far (still in the baby stages). But interestingly enough, it seems that in the 2011 index there was an entry for a man named A.T. Johnson, who was mentioned as having outstanding service in Grenada (yb89). In the 2012 version the man is still listed in the text, but his name is removed from the publication index. Wonder if he got DFed???

    In any case, that is the kind of thing I am after. I want to see what happens when I compare every entry in 2011 with 2012 and see where the differences pop up... what they remove... what they perhaps put in??? I've only done a few documents so far. When I complete the program, I'll run the entire thing through.
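
    The comparison pass itself is the easy part. Something like this sketch would do it over two dumped text trees (the paths and .txt layout are hypothetical; my real dump lives in a database):

    import filecmp
    from pathlib import Path

    def changed_documents(dump_2011, dump_2012):
        a, b = Path(dump_2011), Path(dump_2012)
        names_a = {p.relative_to(a) for p in a.rglob("*.txt")}
        names_b = {p.relative_to(b) for p in b.rglob("*.txt")}
        modified = sorted(n for n in names_a & names_b
                          if not filecmp.cmp(a / n, b / n, shallow=False))
        # (added in 2012, removed from 2012, changed between versions)
        return sorted(names_b - names_a), sorted(names_a - names_b), modified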

    In any case, the automation app turned out to be somewhat of a pain. WTLIB has a memory leak; you can see it if you watch the memory usage as you browse articles and topics in the library. The thing is, the average user never actually causes an out-of-memory crash, because you have to visit thousands of articles, and by the time you get the info you need for your stupid bible reading, you close WTLIB. It became obvious as I went through and tried to hit each node in the WTLib. I had to code a restart: when it crashed, I had to restart it, find the place where I left off, and keep going. The WT coders probably didn't want to bother figuring out how to free all the memory needed to read the format you are trying to hack! :) They are sloppy coders.
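
    The restart logic boiled down to a loop like this (the dumper executable and checkpoint file are hypothetical stand-ins for my actual automation):

    import subprocess

    CHECKPOINT = "last_node.txt"

    def last_node():
        # The dumper records the last node it visited before WTLIB's
        # leak kills it; start from scratch if there is no checkpoint.
        try:
            with open(CHECKPOINT) as f:
                return int(f.read())
        except FileNotFoundError:
            return 0

    # Keep relaunching the dumper until it finishes cleanly.
    while True:
        rc = subprocess.call(["dump_wtlib.exe", "--start", str(last_node())])
        if rc == 0:
            break  # every node visited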

    MMM

  • MeanMrMustard
    MeanMrMustard

    uber-geek-speak.

    Most definitely. You know you like it.

    MMM

  • smiddy
    smiddy

    This is / must be a fascinating topic. I only wish I knew what you're all talking about.

    Can somebody please explain in simple layman's terms what it all means?

    I'm sure us less technical / computer-illiterate folks would love to know what you're all so excited about. Pardon my ignorance.

    smiddy

  • smiddy
    smiddy

    Oh, and welcome to slii. We would love to hear your (less technical) story.

    smiddy
