Monday, August 08, 2005

lab 39 - tar and gzip

NAME

lab 39 - tar and gzip

NOTES

In lab 38 I looked at file formats I could use to store a document repository. I decided I needed an archive file of individually gzipped files. After a little further digging I found that the gzip file format (RFC 1952) allows multiple gzip members to be concatenated together into a single valid gzip file. A quick test on Unix shows this to be true.

% cat file1 file2 | wc
   2   8   35
% gzip < file1 > t1.gz
% gzip < file2 >> t1.gz
% gunzip < t1.gz |wc
   2   8   35
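
The same property can be checked outside the shell. The following is a minimal Python sketch, not part of the lab itself; the sample strings are made up, and it assumes only the standard `gzip` module.

```python
import gzip

# Two independent gzip members, like t1.gz after the two shell commands.
part1 = gzip.compress(b"first file, one line\n")
part2 = gzip.compress(b"second file, one line\n")

# Byte-level concatenation, the same thing `>>` does in the shell test.
combined = part1 + part2

# gzip.decompress reads every member and returns their joined contents.
text = gzip.decompress(combined)
print(text.decode(), end="")
```

Each member carries its own header and trailer, so nothing needs to be rewritten when a new member is appended.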

But the same test on Inferno did not work. After a little hacking on /appl/lib/inflate.b I got it to work, although I'm not sure I haven't broken something else in doing so. So beware.

Appending to a gzip file is a nice feature. What about puttar? Can I append to a tar file?

% puttar file1 > t1.tar
% puttar file2 >> t1.tar
% lstar < t1.tar
file1 1123551937 15 0

No. It stops reading after the first file. I looked at the code in /appl/cmd/puttar.b and found that it outputs zeroed blocks as a sort of null terminator for the file. I'm not sure if that's required for a valid tar format file. The Inferno commands that read tar files don't seem to care, since EOF works just as well. So I edited the file to not output zeroed blocks, and I renamed the command to putwar so as not to confuse myself. Now I can append to a tar (war) file. What's more, I can put the gzip and tar together.
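
The effect of the zeroed blocks can be reproduced with Python's standard `tarfile` module. This is only a sketch of the idea, not the Limbo change made to puttar.b; the helper names `tar_bytes` and `names` and the `terminate` flag are hypothetical.

```python
import io
import tarfile

def tar_bytes(name, data, terminate=True):
    """Return a one-member tar archive, optionally without the zeroed tail."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        tf.addfile(info, io.BytesIO(data))
    raw = buf.getvalue()
    if not terminate:
        # Keep the 512-byte header plus the data rounded up to 512 bytes,
        # dropping the end-of-archive blocks (the "putwar" behaviour).
        used = 512 + -(-len(data) // 512) * 512
        raw = raw[:used]
    return raw

def names(raw):
    """List member names the way a default tar reader sees them."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        return tf.getnames()

f1, f2 = b"contents of 1\n", b"contents of two\n"
terminated = tar_bytes("file1", f1) + tar_bytes("file2", f2)
unterminated = tar_bytes("file1", f1, False) + tar_bytes("file2", f2, False)

print(names(terminated))    # reader stops at the first zeroed block
print(names(unterminated))  # reader runs on until EOF
```

With the terminator in place the reader stops after file1, as lstar did; without it both members are listed.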

% putwar file1 |gzip > t1.tgz
% putwar file2 |gzip >> t1.tgz
% gunzip < t1.tgz |lstar
file1 1123553153 15 0
file2 1123553156 20 0

I'll resurrect gettarentry from the last lab so I can apply a command to each file.

% gunzip < t1.tgz |gettarentry {echo $file; wc}
file1
   1   4   15
file2
   1   4   20
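
For comparison, here is a rough Python analogue of running `wc` over each entry of a tar stream. It is a sketch under assumptions: the function names `wc_entries` and `demo_archive` and the sample contents are invented, and it uses `tarfile` in streaming mode rather than anything from Inferno.

```python
import io
import tarfile

def wc_entries(stream):
    """Yield (name, lines, words, bytes) for each regular file in a tar stream."""
    # Mode "r|" reads strictly forward, so this also works on a pipe.
    with tarfile.open(fileobj=stream, mode="r|") as tf:
        for member in tf:
            if not member.isfile():
                continue
            data = tf.extractfile(member).read()
            yield member.name, data.count(b"\n"), len(data.split()), len(data)

def demo_archive():
    """Build a small in-memory archive to stand in for the gunzip output."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in [("file1", b"one two three\n"),
                           ("file2", b"four five six seven\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

results = list(wc_entries(demo_archive()))
for name, lines, words, size in results:
    print(name)
    print(lines, words, size)
```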

This is very close to what I want. I can process the whole archive in a stream, it is compressed, and if I know the offsets of each file I can jump directly to it and start the stream from there.
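
The offset idea can be sketched in Python as follows. The in-memory `index` dictionary and the helpers `entry`, `append`, and `read_from` are hypothetical; the lab does not say how offsets would be stored. Each entry is written as its own gzip member, so a recorded offset is always a point where decompression can start.

```python
import gzip
import io
import tarfile

def entry(name, data):
    """One tar header plus padded data, without end-of-archive blocks."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    header = info.tobuf(format=tarfile.USTAR_FORMAT)
    padding = b"\0" * (-len(data) % 512)
    return header + data + padding

def append(archive, index, name, data):
    """Append one gzipped entry and remember where it starts."""
    index[name] = len(archive)
    archive.extend(gzip.compress(entry(name, data)))

def read_from(archive, offset):
    """Decompress from a member boundary and list the entries from there."""
    raw = gzip.decompress(bytes(archive[offset:]))
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        return [(m.name, tf.extractfile(m).read()) for m in tf]

archive, index = bytearray(), {}
append(archive, index, "file1", b"one two three\n")
append(archive, index, "file2", b"four five six seven\n")

everything = read_from(archive, 0)           # whole archive as one stream
tail = read_from(archive, index["file2"])    # jump straight to file2
print([name for name, _ in everything], tail)
```

Starting at an offset that is not a member boundary would fail, because the decompressor expects a gzip header there.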

Two problems remain. I don't know what meta information to store with each file, so I'm going with tar's information by default. Also, the tar format isn't handled by the sh-alphabet, which is a pity. But that doesn't matter, because now I've got something concrete to play with, which is good enough.

Time to really get processing some data.

FILES

caerwyn.com/lab/39

3 comments:

uriel said...

Why not use Venti?

A venti implementation in Limbo would be very cool.

caerwyn said...

I'd love to see a venti on inferno too.

what i'm trying to do is parallel programming using a lot of disks. i'm not sure how i'd go about setting up a venti to work across disks and have computations work independently of one another. so i'm doing this as something simpler.

LiteStar said...

About your question concerning tar's use of NULL blocks: your assumption was correct, most versions of the tar format use NULL blocks (I usually see two) after the contents of the last file to mark the end of the archive. I've no idea how far into the tar format you have looked, but this is in part due to tar's antiquated 512-byte block size (for old tape drives).