Showing posts from August, 2005

lab 42 - channels in shell

NAME

lab 42 - channels in shell

NOTES

This is a quickie. I ripped this code off from Rob Pike's squint, an interpreter for his Newsqueak language, a predecessor of Limbo. It's an implementation of the Sieve of Eratosthenes using communicating processes and channels, which I translated to Inferno shell. This was just to try out the channels in shell, since somehow I've overlooked them until now.

	load tk
	load expr

	fn counter {
		ch := $1
		{i:=2; while {} {send $ch $i; i=${expr $i 1 +}}}
	}

	fn filter {
		prime := $1
		listen := $2
		s := $3
		while {} {
			i := ${recv $listen}
			if {ntest ${expr $i $prime %}} {send $s $i}
		}
	}

	fn sieve {
		prime := $1
		chan count
		c := count
		counter count &
		{while {} {
			p := ${recv $c}
			send prime $p
			chan newc^$p
			filter $p $c newc^$p &
			c = newc^$p
		}} &
	}

	chan prime
	sieve prime &
	echo ${recv prime}
	echo ${recv prime}
	echo ${recv prime}

Watch your CPU meter spike when you get a hundred primes or so.
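For comparison, the same pipeline of communicating filters can be sketched in Python, with daemon threads and bounded queues standing in for sh's processes and channels. This is my own translation for illustration, not part of the original lab:

```python
import threading
from queue import Queue

def counter(ch):
    # Generate the integers 2, 3, 4, ... forever, like fn counter.
    i = 2
    while True:
        ch.put(i)
        i += 1

def filter_out(prime, listen, send):
    # Pass along only the numbers not divisible by prime, like fn filter.
    while True:
        i = listen.get()
        if i % prime != 0:
            send.put(i)

def sieve(primes, n):
    # For each prime received, chain a new filter onto the pipeline,
    # exactly as the sh version does with chan newc^$p.
    c = Queue(maxsize=1)
    threading.Thread(target=counter, args=(c,), daemon=True).start()
    for _ in range(n):
        p = c.get()
        primes.append(p)
        newc = Queue(maxsize=1)
        threading.Thread(target=filter_out, args=(p, c, newc),
                         daemon=True).start()
        c = newc

primes = []
sieve(primes, 5)
print(primes)  # -> [2, 3, 5, 7, 11]
```

The bounded queues (maxsize=1) keep the counter from running ahead of the filters, which is also why the shell version burns CPU: every prime adds one more relay process that every candidate number must pass through.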

lab 41 - venti lite

NAME

lab 41 - venti lite

NOTES

I've taken another look recently at venti and the ideas from the venti paper. Venti is a data log and an index. The index maps the SHA1 hash of a clump of data to the offset of that clump in the log. Clumps are appended to the log after being compressed and wrapped with some meta-data. A sequence of clumps makes an arena, and all the arenas, possibly spread over several disks, make up the whole data log. There is enough information in the data log to recreate the index, if necessary.

The above is my understanding of venti so far, after reading the code and the paper. There is a lot more complexity in its implementation: details about the caches, the index, the compression scheme, the blocking and partitioning of disks, and so on. I will ignore these details for now. Although the whole of venti could be ported to Inferno, I want to look at it without getting bogged down in too many details too early. Reasoning about Venti in the context of I
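The core of that description, an append-only log of wrapped clumps plus an index from SHA1 score to log offset that can be rebuilt from the log alone, can be sketched in a few lines of Python. This is my toy model of the idea, not venti's actual on-disk format (the real clump wrapper, arena layout, and caches are all elided):

```python
import hashlib
import struct
import zlib

class VentiLite:
    """Toy venti: an in-memory append-only log of compressed,
    length-prefixed clumps, plus an index score -> offset.
    One arena, no caches, no real disk layout."""

    def __init__(self):
        self.log = bytearray()   # the data log
        self.index = {}          # sha1 score -> offset of clump in log

    def write(self, data):
        score = hashlib.sha1(data).digest()
        if score in self.index:              # duplicate clump: store once
            return score
        clump = zlib.compress(data)
        offset = len(self.log)
        # minimal meta-data wrapper: 20-byte score + 4-byte clump length
        self.log += score + struct.pack("<I", len(clump)) + clump
        self.index[score] = offset
        return score

    def read(self, score):
        off = self.index[score]
        n, = struct.unpack_from("<I", self.log, off + 20)
        return zlib.decompress(bytes(self.log[off + 24:off + 24 + n]))

    def rebuild_index(self):
        # the log alone is enough to recreate the index
        idx, off = {}, 0
        while off < len(self.log):
            score = bytes(self.log[off:off + 20])
            n, = struct.unpack_from("<I", self.log, off + 20)
            idx[score] = off
            off += 24 + n
        self.index = idx

v = VentiLite()
s = v.write(b"a clump of data")
assert v.read(s) == b"a clump of data"
```

Writing the same clump twice returns the same score without growing the log, and throwing the index away and calling rebuild_index recovers it, which is the property that makes the index disposable.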

first year

first year anniversary

A year ago today I started keeping this notebook. The original idea was for it to be a social thing. Instead of keeping a private notebook of my computing experiments, I published everything in a blog to make it part of a conversation. Who were the other voices?

A pattern emerged in how I went about finding ideas and exploring them: I'd hit upon an interesting paper--often from a regular source like Google Labs, Bell Labs, or the MIT PDOS group--and then try to reason about the paper's subject in the context of Inferno. So the other voices I started listening to were the authors of these great papers. Sketching out a program from something I'd read in a paper helped me understand the paper better and generate ideas that went beyond it. This was not just useful, but also a lot of fun. Much as writing is more than communication but an extension of the process of thinking, programming is not merely a corporate activity,

lab 40 - distributing data

NAME

lab 40 - distributing data

NOTES

The first problem I had to deal with, once I had extracted the data from the Reality Mining data set, was how to distribute it across all the disks. This lab describes how I did it.

The Geryon grid I have at home is a simulation grid. I'm just pretending that I have 16 real disks, when I actually only have two. [1] The sizes of things are also scaled down, so each disk is only 64MB instead of a more typical 40GB. I divided each disk into 1MB chunks for this simulation. On each disk I created 60 empty files, numbered 0.ckfree ... 59.ckfree. Each empty file is a record that an available chunk is on this disk. I don't use all 64 because I leave room for slop; each chunk is going to be slightly over 1MB. I do all this easily enough using my friend, the registry:

	% for (i in `{ndb/regquery -n resource kfs}) {
		mount $i /n/remote
		for (j in `{seq 0 59}) { > /n/remote/$j.ckfree }
	}

The idea behind this, as suggested in the google f
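The bookkeeping behind those marker files is easy to sketch in Python: one empty N.ckfree file per available chunk, and allocating a chunk means claiming (removing) its marker. This is only my simulation of what the shell loop sets up; the function names and layout are assumptions, not actual Geryon code:

```python
import os
import tempfile

CHUNKS_PER_DISK = 60   # 60 of the 64 1MB chunks; the rest is slop

def init_disk(disk):
    # one empty marker file per available chunk, as in the sh loop
    os.makedirs(disk, exist_ok=True)
    for j in range(CHUNKS_PER_DISK):
        open(os.path.join(disk, "%d.ckfree" % j), "w").close()

def alloc_chunk(disk):
    # claim a free chunk by removing its marker; return its number
    for name in sorted(os.listdir(disk)):
        if name.endswith(".ckfree"):
            os.remove(os.path.join(disk, name))
            return int(name.split(".")[0])
    return None   # no free chunks left on this disk

root = tempfile.mkdtemp()
disk = os.path.join(root, "kfs0")
init_disk(disk)
print(alloc_chunk(disk))   # -> 0
```

Using the file system itself as the free list is the point: any client that can mount the disk can see, claim, or return chunks with plain file operations, no separate allocator service needed.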

lab 39 - tar and gzip

NAME

lab 39 - tar and gzip

NOTES

In lab 38 I looked at file formats I could use to store a document repository. I kind of decided I needed an archive file of individually gzipped files. After a little further digging I found that the gzip file format (rfc) supports multiple gzip files concatenated together as a valid gzip file. A quick test on Unix shows this to be true.

	% cat file1 file2 | wc
	2 8 35
	% gzip < file1 > t1.gz
	% gzip < file2 >> t1.gz
	% gunzip < t1.gz | wc
	2 8 35

But the same test on Inferno did not work. After a little hacking on /appl/lib/inflate.b I got it to work, although I'm not sure I haven't broken something else in doing so. So beware.

Appending to a gzip file is a nice feature. What about puttar? Can I append to a tar file?

	% puttar file1 > t1.tar
	% puttar file2 >> t1.tar
	% lstar < t1.tar
	file1 1123551937 15 0

No. It stops reading after the first file. I looked at the code /appl/cmd/puttar.b and find it
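The same concatenation property can be checked outside the shell; Python's gzip module, for one, accepts multi-member files. A quick sanity check of the claim, not part of the original lab:

```python
import gzip

# compress two "files" separately and concatenate the results
member1 = gzip.compress(b"hello\n")
member2 = gzip.compress(b"world\n")
combined = member1 + member2

# a multi-member gzip stream decompresses to the concatenated data
data = gzip.decompress(combined)
assert data == b"hello\nworld\n"
print(len(data))  # -> 12
```

This mirrors the gzip/gunzip transcript above: appending a fresh gzip stream to an existing .gz file yields a file that still decompresses cleanly, end to end.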

lab 38 - Geryon's data sets

NAME

lab 38 - Geryon's data sets

NOTES

I need a large data set to work with so I can try out more ideas using Geryon. I want to use real data: data that cannot be analyzed trivially using, say, a relational database. Examples I considered:

- crawl the web - a web page repository
- an rss feed repository
- web server query logs
- click logs for a site
- aggregate data input by users
- system health records
- sousveillance logs

Some of these are more difficult to collect than others. Some may contain greater possibility for surprises, and the surprises are what I hope to get by working with real data. Also, a data set where others can collect and duplicate my results would help to keep the conversation concrete. But I need something right now, and there are two possibilities at hand. I've got the data set from the MIT Reality Mining project. This is about 200+MB uncompressed, which is big enough to test out some tools. But for this size of data, Geryon is not likely to offer a