Thursday, March 15, 2007

lab 72 - wikipedia




I've been modifying dict to use the Wikipedia database, as I mentioned in lab 70. Here's what I've got so far. It's not beautiful; the wiki syntax parser needs a lot of work.

The general idea is to use acme-sac as a Wikipedia browser. There are other reasons too, such as gaining experience using Inferno to work with large text databases.

Acme brings some nice things to a database like Wikipedia. Because of the nature of acme you don't have to rely on people making wiki links to find other articles. You can right-select almost any text to search the index. Right-selecting single words often opens a Wikipedia disambiguation page.

If you want to get this working try following the steps below.

You need the latest acme-sac from svn. It has some fixes to support big files, including in Acme.exe; without them none of this will work.

Download the Wikipedia database. Wikipedia's database download page explains the downloads and links to the dump files. For the English version look for a file called something like pages-articles.xml.bz2. This file is about 2.1 GB. Download it and extract it.

This was the first snag I hit. I didn't have NTFS on my laptop drive, only FAT32, so I wasn't able to extract it into a single file. I started extracting it to smaller files and looked at creating a virtual big file using a Styx server; but before I got anywhere with that idea I bought an external hard drive, reformatted to NTFS to handle big files, and just went with the single file approach.
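For anyone stuck with the FAT32 limit of just under 4 GB per file, extracting into smaller pieces can be sketched from a host shell as below. This is only a sketch under assumptions: it relies on the standard bzip2 and split tools, split_stream is a hypothetical helper name, and the file names in the usage comment are placeholders. It only produces the pieces; presenting them as one big file is a separate problem.

```shell
# Hypothetical helper: cut stdin into pieces of at most $1 bytes,
# written as ${2}aa, ${2}ab, ... (split's default suffix scheme).
split_stream() {
    split -b "$1" - "$2"
}

# Usage sketch (file names are placeholders, not taken from the dump site):
#   bzip2 -dc pages-articles.xml.bz2 | split_stream 1024m wikipedia.part.
# Decompressing to stdout means the full-size file never has to exist on disk.
```

Concatenating the pieces in name order gives back the original stream.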

Extract the file somewhere and rename it or bind it to /lib/dict/wikipedia. You then need to build the /lib/dict/wpindex file.

Generate the raw index inside Inferno:

 % dict/mkindex -d wp > rawindex

I had to step outside Inferno for the next bit and used the 9pm archive for the Plan 9 sort and awk commands. Reformat and clean the index entries using /appl/cmd/dict/canonind.awk

 % awk -F' ' -f canonind.awk rawindex > junk

then sort and remove carriage-returns

 % sort -u -t' ' +0f -1 +0 -1 +1n -2 < junk |
    tr -d '\r' > /lib/dict/wpindex
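The +0f -1 style of sort key is the old positional form that the Plan 9 sort accepts; some current sort implementations only take -k. A rough equivalent with -k keys might look like the sketch below. It assumes each index line is a headword and an offset separated by a tab (the tab separator is an assumption here), and sort_index is a hypothetical helper name.

```shell
# Hypothetical helper: sort and de-duplicate a dict-style index read from stdin.
# Assumes lines of the form "headword<TAB>offset"; the tab is an assumption.
sort_index() {
    tab=$(printf '\t')
    # -k1,1f : order by headword, ignoring case   (old form: +0f -1)
    # -k1,1  : break ties by exact headword        (old form: +0 -1)
    # -k2,2n : then by numeric offset              (old form: +1n -2)
    # -u     : keep one line per distinct key combination
    LC_ALL=C sort -u -t "$tab" -k1,1f -k1,1 -k2,2n | tr -d '\r'
}

# Usage sketch: sort_index < junk > wpindex
```

LC_ALL=C is set so the ordering is byte-wise and does not depend on the host locale.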

If all went well, you can now run adict -d wp in acme-sac. In the new window, type some text and right-select it, and a result from Wikipedia should be found.


svn revision 71
