@babystar_jr:
You got it!
MMM
Is anyone aware if this project is already being done by someone?
If not, perhaps we can brainstorm how to go about it, now that the PDF for the revised NWT is out.
A couple of initial impressions:
@LQ: Got it! Sendspace was giving me issues there for a while, but it finally uploaded.
Sending link now in PM...
MMM
@slii:
Great story.
As it turns out, the NWT compare project did show the full extent of the differences - and they are extensive. However, some benefits have come out of it. For example, we knew going into it that the NWT translation committee decided to drop the brackets around words they inserted for "clarity" in the 1983 version. The diff log in the database can be used to produce a list of all the words that were bracketed in the 1983 version but are no longer bracketed.
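Not the project's actual code, but a minimal sketch of how that list could be pulled straight from the raw verse texts, assuming you have each verse of both editions as plain strings (the verse pair below is made up, not actual NWT text):

import re

def formerly_bracketed_words(text_1983, text_2013):
    # words the 1983 edition printed in [brackets]; report whether each
    # still appears (now unbracketed) in the 2013 rendering of the verse
    bracketed = re.findall(r"\[([^\]]+)\]", text_1983)
    return [(word, word in text_2013) for word in bracketed]

old = "In the beginning God created the heavens and the [whole] earth."
new = "In the beginning God created the heavens and the whole earth."
print(formerly_bracketed_words(old, new))   # [('whole', True)]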
Comparing the JW texts from different years might be a useful tool here too.
And that is why I changed my focus to the WTLIB itself. From year to year, other than the new documents (because you expect those to be different), what changes? That is an interesting question indeed.
...so I proposed that we schedule the meetings beforehand. I think they just might have read a bit too much into that :-)
Ha ha... are they coming all the time now to study that book? They should - someone could report a Bible study on their time card.
MMM
@leaving_quietly:
I can upload my database again. The sendspace link expired. The DB is about 135 MB. It contains both versions and all the diff lists. It also contains the HTML markup of the diffs in verse-by-verse form as well as in book form.
It is not MySQL, however. It is SQL Server 2008. Are you ok with that?
MMM
This is / must be a fascinating topic. I only wish I knew what you're all talking about.
Can somebody please explain in simple layman's terms what it all means?
I'm sure we less technical / computer-illiterate folks would love to know what you're all so excited about. Pardon my ignorance.
smiddy
What slii is doing is a lot more complicated. But the short description is this: The thread starts out with the intent of taking the 1983 and the 2013 versions of the NWT and comparing them. We all know that they were thinking of sneaking in some changes, and we wanted to find out the full extent of the changes. After all, they might change the text of scripture in more ways to support their doctrine - we can all see that happening - so we wanted to employ the services of a computer to find the changes.
It started out as an attempt to get the 2013 text from a PDF file, but eventually we got it from the web (which was a lot easier). Until the WTB&TS put the 2013 version on the web, we were trying to get the text of the new version from the PDF and then from the Android app. The day they put the 2013 version on the web was the day I cracked the Android app's encryption and compression, which I did by following the decompiled Java code from the app.
After we had the text from both versions of the NWT, I ran it through a diff algorithm and then created the two PDFs.
Now slii is trying to crack the format of the PUB files in the WTLIB. That is definitely a difficult task, and it would be cool if he pulls it off. I took the easy way out and got the entire WTLIB content through the WTLIB itself, by creating a program that automates human-like behavior: clicking each item in the WTLIB one at a time and then copying the content (as rendered by the WTLIB program itself) into a text file, one for each topic the WTLIB lists. It's like those DVD manufacturers in the early 2000s coming up with a DVD format that could only be read by special DVD players and thinking they had thwarted the pirates - only to realize (too late, I might add) that the pirates would simply buy the new DVD player, plug it into the back of a computer, and capture the video that way.
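For the curious, a minimal sketch of what that click-and-copy automation might look like in Python with the pywinauto library. The executable path, window title, and control class here are all guesses; you would discover the real ones with a spy tool:

from pywinauto.application import Application
from pywinauto.keyboard import send_keys
import win32clipboard

def clipboard_text():
    win32clipboard.OpenClipboard()
    try:
        return win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
    finally:
        win32clipboard.CloseClipboard()

def walk(item):
    # depth-first over every node in the topic tree
    yield item
    for child in item.children():
        yield from walk(child)

app = Application(backend="win32").start(r"C:\WTLIB\wtlib.exe")   # hypothetical path
win = app.window(title_re=".*Watchtower Library.*")               # guessed title
win.wait("ready", timeout=60)
tree = win.child_window(class_name="SysTreeView32").wrapper_object()

n = 0
for root in tree.roots():
    for item in walk(root):
        item.select()          # load the article into the reading pane
        # in practice you would first give focus to the reading pane;
        # finding that control is another spy-tool job
        send_keys("^a^c")      # select all, copy
        with open("topic_%05d.txt" % n, "w", encoding="utf-8") as f:
            f.write(clipboard_text())
        n += 1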
Anyhow, more to come soon.
MMM
"uber-geek-speak."
Most definitely. You know you like it.
MMM
" I guess I'm trying to figure out the format mostly out of curiosity; I agree it's likely to be much easier to get to the texts just by scripting."
That is why I am doing this too - curiosity, and I want to see if I can do it. There are some things I want to do with the text too. I did complete my automation program, and it seems to have dumped out the text for the entire thing just fine, although it took about 8 hours. I did this for the 2011 and 2012 versions. I have a preliminary diff going. Just some minor differences so far (still in baby stages). But interestingly enough, it seems that in the 2011 index there was an entry for a man named A.T. Johnson, who was mentioned as having outstanding service in Grenada (yb89). In the 2012 version the man is still listed in the text, but his name is removed from the publication index. Wonder if he got DFed?
In any case, that is the kind of thing I am after. I want to compare every entry in 2011 with 2012 and see where the differences pop up... see what they remove... what they perhaps put in. I've only done a few documents so far. When I complete the program, I'll let it run through the entire thing.
Anyway, the automation app turned out to be somewhat of a pain. The WTLIB has a memory leak. You can see it if you watch the memory as you browse articles and topics in the library. The thing is, the average user never actually causes an out-of-memory crash, because you have to visit thousands of articles - and by the time you get the info you need for your stupid Bible reading, you close the WTLIB. It became obvious as I went through and tried to hit each node in the WTLIB. I had to code a restart: when it crashed, I restarted it, found the place where I left off, and kept going. The WT coders probably didn't want to bother figuring out how to free all the memory needed to read the format you are trying to hack! :) They are sloppy coders.
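In outline, the restart logic might look something like this; the paths, the topic count, and the dump_topic step (the click-and-copy automation sketched earlier) are all hypothetical:

import json, os, subprocess, time

CHECKPOINT = "checkpoint.json"           # where we remember our place
WTLIB_EXE = r"C:\WTLIB\wtlib.exe"        # hypothetical path
TOTAL_TOPICS = 100000                    # whatever the library index says

def dump_topic(i):
    pass  # placeholder: select topic i in the tree and copy its text out

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next"]
    return 0

def save_checkpoint(i):
    with open(CHECKPOINT, "w") as f:
        json.dump({"next": i}, f)

i = load_checkpoint()
while i < TOTAL_TOPICS:
    proc = subprocess.Popen([WTLIB_EXE])
    time.sleep(20)                       # give it time to load
    while i < TOTAL_TOPICS and proc.poll() is None:
        dump_topic(i)                    # a crash shows up as the process exiting
        i += 1
        save_checkpoint(i)
    # if topics remain, the memory leak killed it; loop around and relaunch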
MMM
And, yeah, I too have the NWT (I ripped the texts from their website and processed them into a text-only format) and tried to compare the old one to the new one. I even found a bug in the 2013 version :D Actually, it will be interesting to see if they fix it when I report it here. Then again, there would be some fun in reporting it directly to them too.
You want the code that I used to do the match-patch? You can get google-diff-match-patch online, but you have to modify it to do a word-level diff. The google-diff-match-patch is a character-level diff, and it works quite well. To do it for words, you have to tokenize each verse into words, map each word to a Unicode character, diff the resulting character strings, and then map back to the words. But first you have to define what a "word" is... and that's tougher than it looks. For me, each space is a word, and so is each punctuation mark, as well as what you would consider a normal word.
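Not my actual code, but a minimal sketch of that word-level trick, using the Python port of google-diff-match-patch (pip install diff-match-patch). The tokenizer and the private-use-area character mapping are choices made for this sketch, not anything from the library:

import re
from diff_match_patch import diff_match_patch

# a "word" is a word run, a single space, or a single punctuation mark
TOKEN = re.compile(r"\w+|\s|[^\w\s]")

def word_diff(text1, text2):
    table = {}                           # token -> stand-in character
    def encode(text):
        out = []
        for tok in TOKEN.findall(text):
            if tok not in table:
                # private-use-area chars: room for ~6400 distinct tokens,
                # plenty for a verse pair
                table[tok] = chr(0xE000 + len(table))
            out.append(table[tok])
        return "".join(out)
    c1, c2 = encode(text1), encode(text2)
    dmp = diff_match_patch()
    diffs = dmp.diff_main(c1, c2, False)   # character-level diff on the stand-ins
    dmp.diff_cleanupSemantic(diffs)        # merge ragged edits; cuts stay on token boundaries
    rev = {v: k for k, v in table.items()}
    return [(op, "".join(rev[c] for c in chars)) for op, chars in diffs]

# hypothetical verse pair, not actual NWT text
for op, text in word_diff("Jehovah is my shepherd.", "Jehovah is my Shepherd,"):
    print(op, repr(text))                  # -1 = delete, 1 = insert, 0 = equal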
Specifically, in Exodus 16:8, the comma after the word "satisfaction" is bolded. It is the only bolded piece of text in the entire document. I doubt it is intentional.
I am going to check this out. I didn't even pay attention to the html formatting when I scraped it from their site. I pulled all formatting out and boiled it down to text even before it hit my database. (note: just checked this out, and yes, that comma is definitely bold all by its lonesome...)
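For anyone who wants to repeat that check, a small sketch with BeautifulSoup; the file name is just a saved local copy of the chapter's HTML, and the tag names are the obvious guesses:

import re
from bs4 import BeautifulSoup

html = open("exodus16.html", encoding="utf-8").read()   # hypothetical local copy
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(["b", "strong"]):
    text = tag.get_text()
    if text and not re.search(r"\w", text):
        # a bold run with no word characters at all - like that lone comma
        print("stray bold punctuation:", repr(text))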
MMM
Hi,
Yeah, I got distracted for a while, actually into writing some fancy new tools for reverse engineering, something I've intended to do for quite a while now but never got inspired enough by any one problem :) I guess I'm trying to figure out the format mostly out of curiosity; I agree it's likely to be much easier to get to the texts just by scripting.
I still have at least one of the lookup tables to figure out; it seems that it (probably something like a Huffman tree) is read from the .PUB files, but compressed in yet another way, with something called "nibble codec". It might start to make sense once I figure out the offset it's read from.
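If "nibble codec" means what it sounds like, the minimal building block would be something like the following; even the nibble order is a guess:

def nibbles(data):
    # yield each byte as two 4-bit values, high nibble first (assumption)
    for b in data:
        yield (b >> 4) & 0xF
        yield b & 0xF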
Anyway, some pieces about the .PUB format, just since someone is going to be curious anyway. I've been looking at a file named km1979_e.pub from the 2006 version. It contains an article named "What has happened to love?". The sha1sum of the file is 6b4c4b0c7f93c04d8aa685fb773f6826af681d35.
This is going to contain stuff that won't make much sense without referencing the contents of the file; I guess it's of value mostly if someone else wants to try to figure it out now or later.
The first 16 bytes are something called the URE header and seem to be pretty fixed:
00000000 55 52 45 53 04 00 03 00 00 00 00 00 00 00 00 00 |URES............|
I seem to remember that the 04 here tells this is a PUB file, as opposed to some other of the files (like indices, NWT, etc.), but my notes are hazy about this. 03 might be the version, which the code requires to be 03.
After that comes the PUB header, 8 bytes:
00000010 00 00 84 00 00 00 03 09
Here, I think the dword at 0x12, that is 0x84, is the number of entries in some structure which I for now call (rightly or wrongly) the Document Area Position Info Chunk. I think the 0x09 is probably the number of entries in the index data structure that immediately follows at offset 0x18 (the Chunk Information table).
At offset 0x18 comes the Chunk Information table. It consists of 0x09 entries (the count from the previous header), each 9 bytes long: a one-byte tag followed by two dwords. The latter of the two dwords seems to (usually?) be an offset into the document, and I guess it might make sense for the other to be the size of the element pointed to.
00000018: 00 0e 02 00 00 e6 55 00 00 02 25 00 00 00 59 00 00 00...
That is, tag 0, dwords 0x20e and 0x55e6. One of the other entries is {4, 0x261cc, 0x5ec4}. At 0x5ec4 + 0x10 (for the URE header) seems to lie some kind of structure that lists offsets of the chunks of the article in question; in any case, it has further pointers into the file. Inside that structure, at offset 0x6029, for example, we have the value 0x193a6, which I think points to the header of the chunk structure at something like 0x10 (for the URE header) + 0x5ec4 (for the above structure) + 0x193a6 = 0x1f27a. Anyway, at 0x1f27a + 4*0x84 + 5 (I don't know why yet...) = 0x1f48f there is some kind of header for the text chunk, with some BTEC1 (whatever that is) encoded information. After that comes what looks like a possible file name for the original text, 0902_K79.LOV. After that, at 0x1f4a2, is what I think is a 5-byte header describing the codec used for the text, details yet unknown.
Immediately after that starts an MTEC3 (something Huffman-like) compressed text stream. I'm not quite ready to say much about the compression yet, although I have code to do the decoding given the lookup tables (I just don't understand it yet ;). This stream seems to contain only the title, in MEPS encoding, of course.
At 0x1f760 (bytes 1b c0 ad 42) starts an MTEC3-encoded text block for which I have the lookup tables and which I can thus decompress with my code. It's a piece of text from "What has happened to love?":
". However, use your good judgment for we want the householders to know we are there because we love them and want to help them—not just to place literature. If no one is at home, leave the tract out of sight. (It is illegal to put items in the mailbox.—See Our Kingdom Service, April 1976, Announcements, page 2.) A territory can be reported as worked when we cover it with tracts." (and so on, until the end of the article.)
MEPS-encoded, the beginning of this is
. (43 08) SPACE (61 fb) H (07 08) o (28 08) w (30 08) e (1e 08) v (2f 08) e (1e 08) r (2b 08) , (44 08) SPACE (61 fb) u (2e 08) s (2c 08) e (1e 08) SPACE (61 fb) y (32 08) ...
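To make the header part of the above concrete, here is a Python sketch that parses the URE header, the PUB header, and the Chunk Information table; all the field meanings are my guesses as described:

import struct

def read_pub_headers(path):
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"URES"             # URE header magic
    file_type, version = data[4], data[6]  # 04 = PUB?, 03 = format version?
    # PUB header: dword at 0x12 = entries in the "Document Area Position
    # Info Chunk", byte at 0x17 = entries in the Chunk Information table
    area_entries = struct.unpack_from("<I", data, 0x12)[0]
    n_chunks = data[0x17]
    chunks, off = [], 0x18
    for _ in range(n_chunks):
        tag = data[off]
        a, b = struct.unpack_from("<II", data, off + 1)  # size?, offset?
        chunks.append((tag, a, b))
        off += 9
    return file_type, version, area_entries, chunks

# for km1979_e.pub this should give area_entries == 0x84, n_chunks == 9,
# and chunks[0] == (0, 0x20e, 0x55e6)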
I have to hand it to you - you have a sense of persistence. It would be cool to know the format of the PUB files. Ultimately I think they will do away with the WTLIB CD and publish the whole darn thing online from now on. As mentioned in another thread, this will allow them to electronically publish their new light without leaving any trace of what was there before - except for those sites that may log internet history, or if curious developers want to database it for fun. :)
I guess I'll post some code once I have figured out how to read/compute the LUTs :) They certainly have found a complicated way to store text...
Go for it! When I get ready, I am going to dump all of my code and stats here on this site for others to look at and learn from.
MMM
"Launch an investigation on Jehovah's Witnesses' religious policy that violates human rights and abuses religious freedom."
(link)
@PoconosKnows:
" My comments and points were torn apart since my original post. I am defending them - 1 by 1 -like anyone else did - when they see unfair remarks on a topic."
But that is expected in a forum like this. And yes, you are defending them 1-by-1, and that is fine. You aren't being silenced; after all, you ARE defending yourself on each topic. But we have to be realistic. In other words, the petition's purpose is to ultimately stop the WT shunning doctrine. OK, nobody here agrees with that doctrine - not even from a Biblical perspective. However, getting the authorities involved is a horrible idea for all the reasons listed: if you give the government a hammer to bash someone, it still has the hammer afterward, and the next thing bashed might be YOU!
MMM