Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.
We'll examine the implementation of a much more sophisticated driver program that automatically runs MapReduce programs on large-scale clusters of machines in a moment, but first, let's consider a few other problems and how they can be solved using Map-Reduce:
Distributed grep
The Map function emits a line if it matches a supplied regular expression pattern. The Reduce function is an identity function that just copies the supplied intermediate data to the output.
Reverse web-link graph
A forward web-link graph is a graph that has an edge from node URL1 to node URL2 if the web page found at URL1 has a hyperlink to URL2. A reverse web-link graph is the same graph with the edges reversed. MapReduce can easily be used to construct a reverse web-link graph. The Map function outputs <target, source> pairs for each link to a target URL found in a document named source. The Reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list of source URLs>.
Term vector per host
A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The Map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The Reduce function is passed all per-document term vectors for a given host. It adds these term vectors, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.
Inverted index
An inverted index is a data structure that maps from each unique word to a list of documents that contain the word (where the documents are typically identified with a numeric identifier to keep the inverted index data relatively compact). The Map function parses each document and emits a sequence of <word, docid> pairs. The Reduce function accepts all docids for a given word, sorts the corresponding document IDs, and emits a <word, list of docids> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions within each document.
Distributed sort
MapReduce can also be used to sort data by a particular key. The Map function extracts the key from each record, and emits a <key, record> pair. The Reduce function emits all pairs unchanged (i.e., the identity Reduce function). This computation depends on the partitioning facilities and ordering properties described later in this chapter.