Wednesday, November 26, 2008

Google sorts one petabyte in 6 hours

Google has sorted a record 1 terabyte of data on 1,000 computers in only 68 seconds, which breaks the previous mark of 209 seconds established in July by Yahoo.

Team leader Grzegorz Czajkowski wrote that the team followed the rules of a standard terabyte sort benchmark and used Google's MapReduce software framework that supports parallel computations over large (multiple petabyte) data sets on clusters of computers. Yahoo's effort had featured a 910-node cluster, and used Hadoop, an open-source MapReduce implementation.

The sort benchmark, which was created in 1998 by computer scientist Jim Gray, specifies the input data (10 billion 100-byte records in uncompressed text files), which must be completely sorted and written to disk. Not content with just rewriting the record book, the Google team then decided to up the ante in sorting massive volumes of data.

One petabyte is a thousand terabytes. One way to put that amount in perspective, according to Czajkowski, is to consider that the aggregate size of data processed by all instances of MapReduce at Google was, on average, 20 PB per day in January 2008. A paper explaining MapReduce on the Google labs site says that the upwards of one thousand MapReduce jobs are executed on Google's clusters every day. So the infrastructure team's MapReduce job that extended the benchmark factors out to 50 typical MapReduce jobs, or one-twentieth the total of all daily MapReduce jobs run on Google's clusters.