Semantic Linguine

Friday, July 18, 2008

Java MapReduce

Ok, this isn't quite MapReduce in its entirety. It lacks any facility for distributed computing across multiple machines. But what it does do is simplify concurrent programming by distributing your MapReduce throughout multiple threads, which means multiple processors on most SMP machines.

It pretty nicely abstracts away all the threading and synchronizing issues so you can focus only on the problem at hand.

To use it, you create a class to encapsulate your mapping and reducing functions. This class must extend class MapperThread and implement class Reducer, which means you must implement two methods,

public Object map(Object data)
public Object getReduction(Collection results)

Then you call the static method MapReduce.mapReduce() passing in your data, as a Vector, and the name of your mapper-reducer class and it will return your results inside an object of your choice.

It's time for an example. Let's use a classic MapReduce example, the word counter. We'll put some "documents" as Strings into a Vector, pass this into mapReduce() which will concurrently process each document and return to us a HashMap where each key is a word and each value is the total count of that word in all our documents.

For example if our documents are {"This is document 1", "This is another document", "Document 3"}
then our output should be {this=2, is=2, document=3, 1=1, another=1, 3=1} (assuming case-insensitivity).

Here's the code to do it:

WordCount.java


package test.misc;

import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Vector;
import java.util.Map.Entry;

import test.misc.MapReduce;
import test.misc.MapperThread;
import test.misc.Reducer;

public class WordCount extends MapperThread implements Reducer {

    static final String doc1 = "This is document 1";
    static final String doc2 = "This is another document";
    static final String doc3 = "Document 3";
   
    public Object map(Object data) {
        String doc = (String) data;
        String[] tokens = doc.trim().split("\\s+");
        HashMap results = new HashMap();
        for (int i = 0; i < tokens.length; i++) {
            accumulate(tokens[i], results);
        }
        return results;
    }
   
    void accumulate(String s, HashMap acc) {
        String key = s.toLowerCase();
        if (acc.containsKey(key)) {
            Integer I = (Integer) acc.get(key);
            int newval = I.intValue() + 1;
            acc.put(key, new Integer(newval));
        } else {
            acc.put(key, new Integer(1));
        }
    }
   
    public Object getReduction(Collection c) {
        HashMap h = new HashMap();
        for (Iterator i = c.iterator(); i.hasNext();) {
            Collection entries = ((HashMap) i.next()).entrySet();
            for (Iterator j = entries.iterator(); j.hasNext();) {
                Entry e = (Entry) j.next();
                Object key = e.getKey();
                Integer val = (Integer) e.getValue();
                if (h.containsKey(key)) {
                    Integer oldval = (Integer) h.get(key);
                    h.put(key, new Integer(val.intValue() + oldval.intValue()));
                } else {
                    h.put(key, val);
                }
            }
        }
        return h;
    }
   
    public static void main(String[] args) {
        Vector docs = new Vector();
        docs.add(doc1);
        docs.add(doc2);
        docs.add(doc3);
       
        HashMap results = null;
       
        try {
            results = (HashMap) MapReduce.mapReduce(docs, "test.misc.WordCount", "test.misc.WordCount");
        } catch (Exception e) {
            e.printStackTrace();
        }
       
        System.out.println(results.toString());
    }
}

Notice what we're NOT doing. We're not starting any threads or worrying about synchronization.

We are processing documents concurrently but all the threading is abstracted away to the MapReduce and MapperThread classes.

We can re-use these classes again and again to perform any number of tasks in parallel. We could use them to make concurrent http requests, for instance, and all the sudden we are doing concurrent DISTRIBUTED programming.

There's also an optional argument to mapReduce, maxThreads, that limits the number of concurrently active threads. You'll need to use this limit when working with large data sets.

Here are the three classes that handle the dirty work. You'll have to configure your own package and class paths of course. Enjoy!

MapReduce.java


package test.misc.util;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Vector;

public class MapReduce {

    //
    // If you get OutOfMemoryError, try setting maxThreads to something like 500 or 1000, or even lower if you're
    // manipulating very large objects.
    //
    public static Object mapReduce(Vector datalist, String mapperClass, String reducerClass, int maxThreads) throws Exception {
        Vector lists = new Vector();
        splitList(datalist, maxThreads, lists);
        HashMap map = new HashMap();
        for (Iterator iter = lists.iterator(); iter.hasNext();) {
            Vector list = (Vector) iter.next();
            Vector threads = createThreads(list.size(), mapperClass);
            initializeMap(map, threads);
            startThreads(threads, list, map);
            waitForThreads(threads);
        }
        Object results = runReducer(map, reducerClass);
        return results;
    }
   
    public static Object mapReduce(Vector datalist, String mapperClass, String reducerClass) throws Exception {
        return mapReduce(datalist, mapperClass, reducerClass, 1000000);
    }
   
    static void initializeMap(HashMap results, Vector threads) {
        for (int i = 0; i < threads.size(); i++) {
        results.put(threads.get(i), null);
        }
    }
   
    static Vector createThreads(int numThreads, String mapperClass) throws Exception {
        Vector threads = new Vector(numThreads);
        for (int i = 0; i < numThreads; i++) {
            try {
                MapperThread thread = (MapperThread) Class.forName(mapperClass).newInstance();
                threads.add(thread);
            } catch (java.lang.OutOfMemoryError e) {
                System.err.println("Error: thread " + i);
                throw e;
            }
        }
        return threads;
    }
   
    static Vector startThreads(Vector threads, Vector data, HashMap map) throws Exception {
        for (int i = 0; i < data.size(); i++) {
            try {
                Object item = data.get(i);
                MapperThread thread = (MapperThread) threads.get(i);
                thread.setParams(item, map);
                thread.start();
            } catch (java.lang.OutOfMemoryError e) {
                System.err.println("Error: thread " + i);
                throw e;
            }
        }
        return threads;
    }
   
    static void waitForThreads(Vector threads) {
        for (Iterator i = threads.iterator(); i.hasNext();) {
            Thread thread = (Thread) i.next();
            try {
                thread.join();
            } catch (InterruptedException e) {}
        }
    }
   
    static Object runReducer(HashMap map, String reducerClass) throws Exception {
        Reducer reducer = (Reducer) Class.forName(reducerClass).newInstance();
        return reducer.getReduction(map.values());
    }

    static Vector splitList(Vector in, int max, Vector acc) {
        if (in.size() <= max) {
            acc.add(in);
            return acc;
        } else {
            acc.add(new Vector(in.subList(0, max)));
            return splitList(new Vector(in.subList(max, in.size())), max, acc);
        }
    }
}

MapperThread.java


package test.misc.util;

import java.util.HashMap;

public class MapperThread extends Thread {
    protected Object data;
    protected HashMap mappings;
   
    public void setParams(Object data, HashMap map) {
        this.data = data;
        this.mappings = map;
    }

    /**
     * This is the method your subclass must override.
     * @param data - This is taken from successive elements of the list
     *               you pass in to mapReduce.
     * @return - Whatever is returned here will be added to the collection passed in to
     *           your Reducer.getReduction() method.
     */
    public Object map(Object data) {
        return data;
    }
   
    public void run() {
        mappings.put(this, map(data));
    }
}

Reducer.java



package test.misc.util;

import java.util.Collection;

public interface Reducer {

    /**
     * You'll override this in your own implementation.
     * @param results - A collection of objects returned from your MapperThread subclass.
     * @return - The return from this method is what is returned by mapReduce().
     */
    public Object getReduction(Collection results);
}

Thursday, May 15, 2008

COMET: A Hack for the Broken Web

This post provides insight into some ingenious engineering at Facebook. Erlang is the perfect language to implement the server side of COMET's long-polling with all its simultaneous connections.

However, all of that and COMET itself, is only necessary because of the fact that the World Wide Web is unfinished! In its current state the Web is BROKEN!

Most of the clients out there are not network nodes. They are half-nodes. They can initiate connections out, but do not accept incoming connections (on the WWW port, 80). Unlike the good old telephone network where every phone can call every other phone, the Web is like a television network; you can request and receive their programming, but they are not interested in receiving your programming!

What does this mean for web applications? Well, a chat application is a perfect example. Each client wants real-time updates of the status of its buddies. But since a server can't really push data to a typical client, the client has to poll. Of course polling is so inefficient that brilliant engineers have had to come up with things like long-polling, COMET, hugely distributed web servers, and the like--all of which are still pretty inefficient.

Wouldn't it be nice if each client were a real node? It could PUT its status to each of its buddy nodes any time its status changed. It could POST new chat messages directly to the buddy node. The centralized server would have far fewer responsibilities, like maintaining your buddy list and the address of the node you're logged in at.

What would be necessary to get the Web finished? Among other things, IPV6 or some other addressing scheme to get all the nodes uniquely addressed; new firewall and security rules and protocols to keep the bad guys at bay, and renewed interest in using HTTP to its full potential (using REST of course).

Friday, May 9, 2008

Want to Be a Better/Faster/Smarter Programmer? Learn a New Language!

For years most of the projects I worked on were written in C. I thought I was a pretty good C programmer. Then I had to move fully into the Java world. I grumbled a little because I was no Java expert, but once I learned it I found I would much rather write Java code than C. But you know what else I discovered? When I did go back to writing or maintaining C code, I did it a little differently. I took a more object-oriented approach and wrote more re-usable libraries. (Like my vector.c module that emulated Java's Vector class.)

After a while I began to learn Ruby and that's when several ideas that should have already been ingrained really started to click. I realized that handling exceptions like this:


try {
 some_method()
} catch (Exception e) {
 write_to_logfile(e);
 return false;
}

is almost always WRONG. You should be probably be throwing that error up the stack to be handled at a high level.

Ruby's awesome Test::Unit module woke me up to value of automated unit testing. Instead of doing unit testing as an afterthought, I began thinking about unit tests much earlier in the process. I've learned to use JUnit for Java and even written tests first and the class second a couple of times!

Coding Ruby has got me wanting to write small, easy to test classes made of small, easy to test methods, and modularize to a degree I haven't in the past.

As a result of better modularization and unit testing, my Java and Ruby classes are much more reliable from the moment of release and require far less debugging (and apologizing to users) post-release.

What next? Erlang! What a great languange. And what a great book Joe Armstrong has written in Programming Erlang: Software for a Concurrent World.

I'm not yet writing production code in Erlang, but now, whatever the language, I'm seeking more elegant solutions using recursion, thinking more about concurrency and shared data, doing actual top-down coding, and trying to think of solutions functionally.

Thursday, March 27, 2008

How Software Development is Like Finishing a Basement

Last year Carrie and I bought and moved into a new house. Our strategy was to move in then finish the unfinished basement ourselves, building up some sweat equity. Now we're about halfway finished and I see some obvious similarities between building a basement and building a software application.

We decided to be agile in our basement finish, rather than work from a predefined blueprint. Some decisions would be put off until necessary; like the colors of carpet and paint, the style of lighting fixtures, and even the placement of some doorways and size of closets. But some things have to be decided right at the start. What will the end product look like generally? (Modern.) What features are definitely required? (Bar, pool table, salon, bathroom, two bedrooms, toy room.) This is like deciding the overall look and feel of your app and its required features.

Once you've got a handle on the overall look and feature set, it's time for the strategy of how to make it happen. How will you design it? You can design the whole project, creating a detailed blueprint before starting work, or wait to design the bathroom until you are ready to start work on it. But keep dependencies in mind. If you cover the main heating duct with drywall before you run a duct to the bathroom, you'll be redoing your ceiling. Likewise, if you write some Javascript that runs on IE then find out you've got some Linux users, you'll be right back into that code.

We ended up doing the basement much like I write software. Start with a good idea of what features must be in the end product, but save the details for later. Know what has to be done first, middle, and last, but break each part into its details when you come to it.

Test each part well before moving on to the next. You want to find out that one of the electrical outlets didn't get a wire run to it BEFORE you put drywall on the walls. And you want to know that your sql stored procedure works correctly before you start calling it from several different web apps.

For me the greatest lesson learned from finishing a basement is that everything doesn't have to be perfect. A wall stud can be a half inch off and you can still make it work. A heating duct can be a few inches too long or short and be fine. Likewise, sometimes the quick and dirty code is good enough. No function will ever be elegant enough and fast enough and memory efficient enough for me to consider it perfect. Put down the level and tape measure, nail the board into place and move on.

Monday, February 25, 2008

Test::Unit Wins in an Upset

Over the last couple of days I decided to write a small program without the benefit of Ruby's Test::Unit module. Why? Well, it's only been in the last year and a half or so that I've been rigorously testing code with frameworks like JUnit and Test::Unit, and I wanted reassure myself that I'm really better off now.

The thing is, even if you haven't created test cases, you still have to test your code. So my coding in the last two days has gone like this:

Make a change
Run the whole program (It worked, yea!)
Run it again with different input (Failed. Boo.)
Interpret the stack trace
Correct the error
Run the whole program again (It worked, yea!)
Run it again with different input (This time it worked, yea!)
Run it again with still another input (Oh, shit.)
Interpret the stack trace
Correct the error
Repeat over and over with every change and every kind of input

By the end of the project I was running the program seven times, to test seven different inputs, each time I made a change. Instead, I should have just run a unit test that tests only the module that was changed.

The time to run the whole program takes a few seconds longer than running a test case of a module. The time to run it seven times to test different inputs takes much longer than running a test suite. The time to type in the different inputs for each run of the program is far more than the time it would take to create the test cases. All these seconds really add up over several hours of coding.

Bottom line: I'm more convinced than ever that by taking the time to create good test cases I'm saving tons of time over the life of a project. Suite!

Thursday, February 21, 2008

Email is Efail?

My take on Tantek's Thoughts http://tantek.com/log/2008/02.html#d19t2359.

While I agree there is great benefit to pushing a lot of communication to 1:many forms like blogs, twitter, etc., I find email still works for me in many respects. There are too many kinds of communication to pigeon-hole everything into just a few media.

Some things are best done over the phone, some over IM, some over 1:many media, and some over email.

A friend pointed out that email bloat is almost never a problem, and in fact lack of detail in emails is much more prevalent. I agree. It usually takes an email correspondence of several messages back and forth to get enough detail to proceed with the issue at hand.

Using a great email client, like Gmail, is also a great way to maximize email efficiency.

Thursday, January 24, 2008

Flex 3 Pre-release Tour

Last night I attended the Flex and AIR Pre-release Tour hosted by the Salt Lake City Flex/Flash User Group. The presenter was Adobe Evangelist, Kevin Hoyt. He was very good and the meeting was very worthwhile. Hoyt knows his stuff and presents it well and with a sense of humor.

Watching Hoyt use Illustrator to create and style a button confirmed for me just how tedious developing user interfaces is, no matter what tools you use. After several minutes he ended up with a really nice looking button. But thinking about having to go through the same steps to put a skin on each interface widget made my head hurt. I'd rather debug kernel code. I really would. But that's just me.

But it got me thinking, why is the developer taking the time to stylize or skin the app anyway? Why don't developers just create the basic framework of the interface and let the users skin the app the way they want it to look? Well, of course that's not as easy as it sounds and some new tools would be required to make it simple and fun for users to do so. And maybe for another reason which I thought was Hoyt's most interesting idea.

Hoyt's idea, which I had never heard before, is about the way economies evolve. He explained that when the U.S. was young it had an agrarian economy. This evolved into a goods-based economy. Next came the service-based economy, where most 'products' in the economy are really services.

I had heard all this before, but then he said we are moving into an Experience-based economy.

You go to Starbucks and pay $5 not for a cup of coffee, but for the Experience of getting a Starbucks coffee. The smells, atmosphere, music, all that sort of thing. Or you go into an Apple store to get the gadget-buying Experience that you can't get at Best Buy.

I'm not sure I fully buy into the Experience-based economy idea. But, it makes more sense why a developer would spend so much time on the look-and-feel, the skin, of an application. You may want to provide a specific experience to the user. A memorable way of branding your product or company. I'll definitely be thinking more about the overall experience of the user as I develop applications from now on.

Oh yeah, and Flex 3 and Air look like good products to me. I think we made the right decision in our company to develop our next major app for the Air platform.