Sunday, August 07, 2005

lab 38 - Geryon's data sets

NAME

lab 38 - Geryon's data sets

NOTES

I need a large data set to work with so I can try out more ideas using Geyron. I want to use real data; one that can not be analyzed trivially using, say, a relational database.

Examples I considered,

  • crawl the web - a web page repository
  • an rss feed repository
  • web server query logs
  • click logs for a site
  • aggregate data input by users
  • system health records
  • sousveillance logs

Some of these are more difficult to collect than others. Some may contain greater possibility for surprises, and the surprises are what I hope to get by working with real data. Also, a data set where others can collect and duplicate my results would help to keep the conversation concrete.

But I need something right now, and there are two possibilites at hand. I've got the data set from the MIT Reality Mining project. This is about 200+MB uncompressed. This is big enough to test out some tools. But for this size data, Geyron is not likely to offer anything that can't be done in a relational database. However, in principle this data set could grow very large, and developing grid tools to process it might come in handy. For something larger I could use the Internet Archive's crawler to collect around 1 to 50GB of web pages. I'll go with the first option until I get a better idea of what I'm doing.

Before even collecting the data, I need to consider how to organize it. How it will be stored and processed. What file formats to use, etc. So, what are the properties of a large data set? Here's what Rob Pike et al say in their Sawzall paper about Google's repository, "A document repository holding a few billion HTML pages might be stored as several thousand files [each about 1GB in size] each storing a million or so documents of a few kilobytes each, compressed." A 1GB "file" is divided into 64MB chunks which is distributed among many disks, and each chunk has 3 copies. Here's an image of the repository format from the Backrub paper.

I'll try and put that in context of Inferno. A close analog seems to be a tar file of gzipped files. I'll make it easier on myself just for the moment and turn that around into a gzipped tar file. How would I process a .tgz file in one pass?

I wrote a shell filter gettarentry,

gunzip < file.tgz |gettarentry  { getlines {}  | etc. } 

where the command block parameter is applied to each file.

Gettarentry spawns a shell process for each file and writes the data down a pipe between the two. The environment variable file is set in the context for each command block.

After playing with this and thinking that this should be a shell builtin along the lines of sh's alphabet, I find it already is!

The fs command has bundle and unbundle, an equivalent to tar.

% fs bundle in | gzip > fs.bundle.gz

I hadn't considered the sh-alphabet yet in the context of Geryon. This simple use makes it already worthwhile, but there's much more to it. The fs command is a powerful tree walker, with gates and pattern matching and the ability to apply a command block to each file. The commands below show a simple comparison of fs walking the same file hierachy, one bundled and the other already extracted onto a kfs disk.

% time sh -c {gunzip < fs.bundle.gz | fs pipe -1 @{wc} {unbundle -}}
  22191  102136  698644
0l 4.281r 4.281t

% time sh -c { fs pipe -1 @{wc} {walk in}}
  22191  102136  698644
0l 7.188r 7.188t

So, I'm on the right track. This is way more powerful than my simple gettarentry. And it fits well witin the Geryon framework,

rcpu {gunzip < /n/d?.nb/fs.bundle.gz | fs pipe -1 @{wc} {unbundle -}} d0.nb ...

To read one small file within the bundle is not very efficient. But reading and processing the whole fs is faster if bundled and gzipped. Time improvement is gained by considerably less disk reads (1/3), and considerably less interaction with the fs for walking the hierarchy.

This format does not allow me to jump directly to a file within the archive. But if I switch back to the original suggestion, an archive of gzipped files, I get direct access and a few other things. It should be straight forward to append to the archive and know the total size of the archive as I add files.

I'll need to write another module for sh-alphabet to directly handle a repository format where each file is individually gzipped. But the fs/alphabet framework seems the way to go.

Another problem to handle is splitting the repository file among many kfs disks. If each 64MB chunk can stand alone as a bundle file, I could consider a 1GB file as just the concatenation of all the archive chunks. [1] It shouldn't matter in what order we process the chunks. If I build an index of each files location within the repository I need to track the chunk, or keep track of the order of chunks and just keep a single offset within the logical 1GB repository file.

The Internet Archive's crawler, Heritrix, stores web pages in the ARC file format. I could add this format to sh-alphabet so it can be processed by fs. The crawler splits the archive into 100MB files. So by using this I've already got a lot going for me.

FILES

The files for this lab include the gettarentry command and a command to extract the data from the Reality Mining MySQL dump into a format better suited to Inferno. caerwyn.com/lab/38

Footnotes

[1] Because I am layering file systems my terminology is getting confusing. I have chunks and files and disks at several layers.

No comments: