Monday, March 24, 2008

lab 85 - stowage


lab 85 - stowage


In an earlier post I defined a venti-lite based on two shell scripts, getclump and putclump, that stored files in a content addressed repository, which in that instance was just an append-only gzip tar archive with an index.

After learning a little about the git SCM, this lab re-writes those scripts to use a repository layout more like git's. The key thing to know about the git repository is that it uses sha1sum(1) content addressing and that it stores the objects as regular files in a filesystem using the hash as the directory and filename,


In the objects directory is 256 directories named for every 2 character prefix of the sha1hash of the object. The filename is the remaining 38 characters of the hash.

Putclump calculates the hash, slices it to make the prefix and filename, tests if the file already exists, and if not writes the compressed data to the new file. Here is the important part of putclump,

 (sha f) := `{sha1sum $file}
 (a b) := ${slice 0 2 $sha} ${slice 2 40 $sha}
 if {ftest -e $hold/objects/$a/$b} {} {
  mkdir -p $hold/objects/$a
  gzip < $file > $hold/objects/$a/$b

Getclump just needs to look up the file given a hash

 sha := $1
 (a b) := ${slice 0 2 $sha} ${slice 2 40 $sha}
 files := `{ls $hold/objects/$a/$b^* >[2] /dev/null}
 if {~ $#files 1} {gunzip < $files } 

Because the git repository uses a regular file system to store objects, it makes it considerably easier to work with than the compacted file system like tar.gz, or an application specific binary format like venti. This is because instead of having to create new tools to read and write binary formats, we can re-use existing tools, like sh(1), tarfs(4), updatelog(8), and applylog(8).

For example, I wrote a script, stow, that takes a tarball and stores it in my repository, called the hold. The hold should be created first with the following directories,


Then give stow the name of a .tar or .tgz file. Files not found in the hold and that were added are printed to stdout.

 % stow acme-0.11.tgz

Stow uses updatelog(8) to create a stowage manifest file for the tarball I added. This manifest is saved under /n/stowage. The manifest records the pathname, perms, and sha1 hash of every file in the tarball.

Now that I've stowed all my tarballs I need a way of getting things out.

I built a holdfs, derived from tarfs(4), to read the stowage manifest and present files from the hold. By default the file system is mounted on /mnt/arch.

 % holdfs /n/hold/stowage/acme-0.11

The hold with its stowage is be a step up from a directory tarpit of tarballs. I can accumulate a version history based on tar.gz releases like that for acme-sac and inferno. The vitanuova downloads site contains inferno history going back to 1997. My downloads page contains snapshots of inferno from 2002 to 2006 and acme-sac after that.

My intended application for this was that I could encourage forks of a project and merge back many individuals releases into a single repository and still do useful comparisons.

Using a filled hold I should be able to do analysis of a file history based on the stowage manifests. Contained in this lab are a few experimental scripts to build out more of an SCM. For example, the script hold/diff attempts to use updatelog to compare a manifest with the current tree. And hold/difflog uses a modified applylog(8) to compare two manifests.

The nautical references suggest a distributed and loosely coupled network like that of shipping, and is also influenced by git's design. The unit of transfer is a tarball. It is stowed into the ships hold, along with the manifest. A file system interprets the manifests and gives an interface for searching the hold. There is also keep a ships log of what was stowed and when. I can extract patches, files, tarballs, or the complete stowage in my hold to share with someone else.

This is a simple system:

    669 holdfs.b
    767 total

But then most of what was needed already existed in inferno.


lab 85 code

Sunday, March 23, 2008

lab 84 - gridfs pattern (mapreduce)


lab 84 - gridfs pattern (mapreduce)


I've mentioned mapreduce in previous posts. It makes a good example application for thinking about grid computing. This lab is also about mapreduce although the point here is to illustrate an inferno pattern for grid computing. I'll call it here the gridfs pattern.

Say you have a grid of compute nodes and you want to distribute and coordinate work among them. For the gridfs pattern you construct a synthetic file system that will get exported to all the nodes. The file system is the master process and all clients to the file system are workers.

Both cpu(1) and rcmd(1) use the rstyxd(8) protocol that exports the local namespace when running remote jobs. To implement the gridfs pattern we bind our master fs into our local namespace so it gets exported when we run multiple workers across our compute grid.

A very simple example of this pattern is explained in the Simple Grid Tutorial Part 2. I export a named pipe with a provider process writing one line at a time to the pipe; then multiple worker processes running across the grid consume lines from the pipe as fast as they can do the work.

The mapreduce source files I've included in this lab are a concrete (rough and experimental) example of this pattern taken to the next level. The namespace it exports is the following,


Each worker opens the clone file and gets a unique connection to the master process, represented by a numbered directory. The open clone file becomes an open file descriptor to the ctl file of the new connection. The worker reads messages from ctl describing new work, and it can write back messages about work completed. The master process will keep the status file up to date with the progress of the worker, analogous to prog(3).

An advantage of this approach over the simpler named pipe is that the master process knows exactly when the worker has closed the connection and knows how much work they have completed based on the messages written to the ctl file. It also provides a better interface to the user; The ps(1) command can easily be adapted to read the status files from the mapreduce namespace.

To try out some examples using mapreduce I need to provide a mapper and reducer function. I wrote a module interface for a mapper,

Mapper : module {
    map: fn(key, value: string, emit: chan of (string, string));

This takes a key and value and maps it to an intermediate key and value, which it emits on a channel; it may emit many intermediate key value pairs for a single input key value pair. Here's an implementation for a mapper that takes a string input, tokenizes it, and outputs the token and '1', which will be added later for a wordcount.

# the map function may not get the whole file in one go. maybe
# just a segment, or a line.
map(nil, value: string, emit: chan of (string, string))
 if(sys == nil)
  sys = load Sys Sys->PATH;
 if(str == nil)
  str = load String String->PATH;
 (nil, f) := sys->tokenize(value, 
               "[]{}()!@#$%^&*?><\":;.,|\\-_~`'+=/ \t\n\r");
 for ( ; f != nil; f = tl f) {
  ss := str->tolower(hd f);
  emit <-= (ss, "1");

There is also an interface for a reducer,

Reducer : module {
    reduce: fn(key: string, input: chan of string, emit: chan of string);

This takes all the intermediate values for a key and emits a value. Here's the adder, used by the wordcount.

reduce(nil: string, v: chan of string, emit: chan of string)
 value := 0;
 while((s :=<- v) != nil)
  value += int s;
 emit <-= string value;

The mapper and reducer interfaces are known by a worker process that loads them on demand. An intermediate process that combines values of the same keys and sorts them is also implemented in the worker process (See the Google MapReduce paper for a good explanation.) This implementation of mapreduce knows only how to walk directory hierarchies and print the file names to all the worker processes. Here's an example of a mapreduce command line that counts words in all files below /lib/legal.

  % mkdir /mnt/mapreduce
  % mapreduce -M4 -R3 wordcount adder /lib/legal
  % ls /mnt/mapreduce

Mapreduce should launch and manage all its own processes. However, for the code checked into this lab, to illustrate what is going on, I have it launching nothing. It just mounts the file system on /mnt/mapreduce. The arguments '-M4 -R3' say to expect 4 Mapper processes and 3 Reducer processes. As workers connect it will configure it to be a mapper or reducer depending on whether work remains. Therefore, after running the above command and doing a cat(1) on /mnt/mapreduce/clone we should see the config line then the pathnames for the first worker.

  % cat /mnt/mapreduce/clone
  worker -m -R 3 -d wordcount -i 1
  /lib/legal/GPL 0 17982

The pathnames are divided up among the workers as fast as they can process them. So in this implementation mapreduce functions almost the same as the named pipe in the simple grid tutorial. The cat of the first clone file will return all pathnames!

Mapreduce however is still expecting more workers. Cat the clone file three more times to see the input to the next 3 workers. The next cat after that you should see the config and input to the reducer. For example from a remote node,

  % rcmd ${nextcpu} cat /n/client/mnt/mapreduce/clone

Doing a listing on the /mnt/mapreduce path should show you the current workers connected (if any). After all reducers have disconnected, the mapreduce filesystem will report it's done and exit.

Lets run it again for real using the mapreduce worker processes.

  % mapreduce -m /mnt/mapreduce -M4 -R3 wordcount adder /lib
  % for i in 1 2 3 4 {mapreduce/worker /mnt/mapreduce/clone&}
  % for i in 1 2 3 {mapreduce/worker /mnt/mapreduce/clone&}

You should see the result files in /tmp/out.*

For the GSoC 2008 I suggested a project where the student implement a splitjoin file system. Create a coarse grained splitjoin service as defined loosely here (PDF) (see slides 18 on for fine grained task parallelism). This suggested implementation is really another concrete example of the gridfs pattern. It would allow control over how messages are passed round robin to all the workers. It would permit different configurations of how many to push to each node, how many to join from each node, how many commands to duplicate. E.g.,

filter | splitjoin -d10 -m5 -n3 {cmd} | filter 

creates 10 duplicates of the cmd, take input from a pipeline and distributes m=5 records at a time round robin to each node and join the output n=3 records at a time from each task back out to the pipeline.

Splitjoin would take care of launching the task, and monitoring the task for completion. (Ideally, it would interact with the registry to decide where to launch services.)

Because Plan 9/Inferno is not participating this year in GSoC I will probably have a crack at this.


lab 84 code

Tuesday, March 11, 2008

lab 83 - lcmd local cpu


lab 83 - lcmd local cpu


While thinking of the Simple Grid Tutorial Part 1 and Part 2, I wondered whether I could implement the equivalent of rcmd(1) but for a local emu launched using os(1). For example,

 lcmd math/linbench 100

would launch a new emu, export the local fs to it through a pipe rather than a network socket, and run the command in that namespace. The idea seemed simple, no apparent obstacles, but it actually took me a couple of evenings to get it to work. So I'm posting it more because of the effort rather than its value.

First lets look at what rcmd does, ignoring the networking. Given its arguments it builds a string, calculates its length + 1, and writes the length then the string to a file descriptor, then exports the local namespace to the same file descriptor. Well that part is easy to do in sh(1). Here it is as a braced block assuming all work is done on file descriptor 1.

fn lcmd {
 load expr string
 args := $*
 s := sh -c ${quote $"args}
 echo ${expr ${len $"s} 1 + } 
 echo $s
 export / /fd/1

We can test that,

 lcmd {ls } | auxi/rstyxd

Now, instead of running rstyxd in the current VM, I want to run another instance and run in it that. This is where it gets complicated. You might think this might work,

 lcmd {ls} | os emu auxi/rstyxd

It doesn't because os treats stdin as read only, stdout as write only. Because export(1) needs to read and write on one file descriptor, and so does rstyxd(8), we need to setup extra pipes, both on the local end and the remote end.

Another problem presents itself in emu. At startup rstyxd will see /dev/cons as stdin. But I'd need to bypass the keyboard handling and get the direct stdin from the pipe. We see the answer to that in /dev,

% ls -l /dev/host*
--rw-r--r-- c 0 caerwyn caerwyn 0 Oct 30 22:43 /dev/hostowner
---w--w--w- c 0 caerwyn caerwyn 0 Oct 30 22:43 /dev/hoststderr
--r--r--r-- c 0 caerwyn caerwyn 0 Oct 30 22:43 /dev/hoststdin
---w--w--w- c 0 caerwyn caerwyn 0 Oct 30 22:43 /dev/hoststdout

This looks good but when I tried them they were not fully implemented in the current emu. The details are not interesting. I fixed that in the acme-sac tree and committed it.

Finally, we can build our full lcmd

fn lcmd {
 load std expr string
 pctl forkns
 args := $*
 s := sh -c ${quote $"args}
 bind '#|' /tmp/lpipe
  echo ${expr ${len $"s} 1 + }  >/fd/0;  
  echo $s >/fd/0; 
  export / /fd/0
 } <>/tmp/lpipe/data    &
 os -d 'd:/acme-sac' d:/acme-sac/sys/Nt/386/bin/icell.exe -c1  sh -c '
  bind  ''#|'' /tmp/pipes; 
  cat /tmp/pipes/data > /dev/hoststdout& 
  cat /dev/hoststdin > /tmp/pipes/data& 
  auxi/rstyxd <>/tmp/pipes/data1 >[2] /dev/null;  
  echo halt > /dev/sysctl' < /tmp/lpipe/data1 >/tmp/lpipe/data1


I'm using icell.exe built using the cell config from acme-sac. This is a really small emu configuration. The directories /tmp/pipes, /tmp/lpipe are assumed to exist.

From this definition we can replace rcmd with lcmd in the commands for rsplit and lk in the Grid Tutorial Part 2 and get emu tools for multicores without the setup required for the grid.