Thursday, October 21, 2004

lab 14 - map reduce

NAME

lab 14 - map reduce functional grid programming

DESCRIPTION

I read about Google's MapReduce from Rob Pike's interview at slashdot. At the same time I've been studying Inferno's alphabet-grid(1) and am wondering if I can implement map reduce using alphabet.

Here's an imaginary example

          % - {rng , | remote |
           mapreduce "{tq -1rm /n/tick/tick.bt} "{tock } |
           /create /tmp/result
          }

Suppose that tick.bt is a log of time spent on tasks where each record is the timestamp, task and number of seconds spent on the task that instance. Rng produces 1 or more date ranges. Remote converts type /fd to an endpoint. Mapreduce will then split a date range, such as one year, into M smaller date ranges. For each subrange it calls rexec passing it the address of an available node, the subrange and map function as parameters.

The output from all the map functions is directed to R endpoints. The R parition function could be hash(key) mod R as suggested in the paper. Then mapreduce rexec's a reduce worker, which reads in all the data from the endpoint, sorts it, and for each key calls the reduce function with the key and list of values (or /fd) as parameter. In this example tock, the reduce function, sums all the time values for a task and outputs the total.

I've made the example specific to tickfs and the use of triads merely because I already have these tools and makes it easier for me to grasp. The google paper uses key, value pairs. I'm ignoring all the other factors they consider, such as fault tolerance, locality, and much else.

Here's another example. In the distribution on my homepage I include a command nsearch for searching a tickfs index. The command is given a list of keywords. Given the first keyword, which might be a date range, it builds an initial set of keys. It then partitions this set among a fixed number of threads. Each thread test the record coming in on a channel against the index and the search term given as parameter to the thread. The reduce function would be an identity function, simply passing through it's input. This is a map, filter, reduce pipeline. Alphabet seems to provide the tools to express this whole query and more on the command line, including distributing the processing among nodes.

The implementation needs somewhere to lookup the available list of nodes. Administering all the nodes would need some fancy fs that managed the status of all executing workers. I'd keep this to an absolute minimum for now.

CONLUSION

This all sounds very promising but I don't know how to implement it yet. Here are some more notes while I think this through.

The revelation for me is the importance of functional programming to distributed computing. It wasn't long ago (lab 1) that I discovered limbo shell supported functional programming. Alphabet takes this to the next level by defining types. Alphabet-grid provides the connection between processing modules on distributed nodes. Altogether it provides a framework for distributed computing I'm still coming to grips with. It is a different way of thinking about computing than I am used to.

REFERENCES

MapReduce Interview

No comments: