Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.

  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • PrintPrint

Binning

Pattern Description

The binning pattern, much like the previous pattern, moves the records into categories irrespective of the order of records.

Intent

For each record in the data set, file each one into one or more categories.

Motivation

Binning is very similar to partitioning and often can be used to solve the same problem. The major difference is in how the bins or partitions are built using the MapReduce framework. In some situations, one solution works better than the other.

Binning splits data up in the map phase instead of in the partitioner. This has the major advantage of eliminating the need for a reduce phase, usually leading to more efficient resource allocation. The downside is that each mapper will now have one file per possible output bin. This means that, if you have a thousand bins and a thousand mappers, you are going to output a total of one million files. This is bad for NameNode scalability and follow-on analytics. The partitioning pattern will have one output file per category and does not have this problem.

Structure

  • This pattern’s driver is unique in using the MultipleOutputs class, which sets up the job’s output to write multiple distinct files.

  • The mapper looks at each line, then iterates through a list of criteria for each bin. If the record meets the criteria, it is sent to that bin. See Figure 4-3.

  • No combiner, partitioner, or reducer is used in this pattern.

The structure of the binning pattern

Figure 4-3. The structure of the binning pattern

Consequences

Each mapper outputs one small file per bin.

Caution

Data should not be left as a bunch of tiny files. At some point, you should run some postprocessing that collects the outputs into larger files.

Resemblances

Pig

The SPLIT operation in Pig implements this pattern.

SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

Performance analysis

This pattern has the same scalability and performance properties as other map-only jobs. No sort, shuffle, or reduce needs to be performed, and most of the processing is going to be done on data that is local.

Binning Examples

Binning by Hadoop-related tags

We want to filter data by tag into different bins so that we can run follow-on analysis without having to run over all of the data. We care only about the Hadoop-related tags, specifically hadoop, pig, hive, and hbase. Also, if the post mentions Hadoop anywhere in the text or title, we’ll put that into its own bin.

The following descriptions of each code section explain the solution to the problem.

Problem: Given a set of StackOverflow posts, bin the posts into four bins based on the tags hadoop, pig, hive, and hbase. Also, create a separate bin for posts mentioning hadoop in the text or title.

Driver code

The driver is pretty much the same boiler plate code, except that we use MultipleOutputs for the different bins. MultipleOutputs takes in a name, bins, that is used in the mapper to write different output. The name is essentially the output directory of the job. Output counters are disabled by default, so be sure to turn those on if you don’t expect a large number of named outputs. We also set the number of reduce tasks to zero, as this is a map-only job.

...
// Configure the MultipleOutputs by adding an output called "bins"
// With the proper output format and mapper key/value pairs
MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,
        Text.class, NullWritable.class);

// Enable the counters for the job
// If there are a significant number of different named outputs, this
// should be disabled
MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);
...

Mapper code

The setup phase creates an instance of MultipleOutputs using the context. The mapper consists of several if-else statements to check each of the tags of a post. Each tag is checked against one of our tags of interest. If the post contains the tag, it is written to the bin. Posts with multiple interesting tags will essentially be duplicated as they are written to the appropriate bins. Finally, we check whether the body of the post contains the word “hadoop”. If it does, we output it to a separate bin.

Be sure to close the MultipleOutputs during cleanup! Otherwise, you may not have much output at all.

Caution

The typical file names, part-mnnnnn, will be in the final output directory. These files will be empty unless the Context object is used to write key/value pairs. Instead, files will be named bin_name-mnnnnn. In the following example, bin_name will be, hadoop-tag, pig-tag, hive-tag, hbase-tag, or hadoop-post.

Note that setting the output format of the job to a NullOutputFormat will remove these empty output files when using the mapred package. In the newer API, the output files are not committed from their _temporary directory into the configured output directory in HDFS. This may be fixed in a newer version of Hadoop.

public static class BinningMapper extends
    Mapper<Object, Text, Text, NullWritable> {

    private MultipleOutputs<Text, NullWritable> mos = null;

    protected void setup(Context context) {
        // Create a new MultipleOutputs using the context object
        mos = new MultipleOutputs(context);
    }

    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
                .toString());

        String rawtags = parsed.get("Tags");

        // Tags are delimited by ><. i.e. <tag1><tag2><tag3>
        String[] tagTokens = StringEscapeUtils.unescapeHtml(rawtags).split(
                "><");

        // For each tag
        for (String tag : tagTokens) {
            // Remove any > or < from the token
            String groomed = tag.replaceAll(">|<", "").toLowerCase();

            // If this tag is one of the following, write to the named bin
            if (groomed.equalsIgnoreCase("hadoop")) {
                mos.write("bins", value, NullWritable.get(), "hadoop-tag");
            }
            if (groomed.equalsIgnoreCase("pig")) {
                mos.write("bins", value, NullWritable.get(), "pig-tag");
            }
            if (groomed.equalsIgnoreCase("hive")) {
                mos.write("bins", value, NullWritable.get(), "hive-tag");
            }
            if (groomed.equalsIgnoreCase("hbase")) {
                mos.write("bins", value, NullWritable.get(), "hbase-tag");
            }
        }

        // Get the body of the post
        String post = parsed.get("Body");

        // If the post contains the word "hadoop", write it to its own bin
        if (post.toLowerCase().contains("hadoop")) {
            mos.write("bins", value, NullWritable.get(), "hadoop-post");
        }
    }

    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        // Close multiple outputs!
        mos.close();
    }
}
  • Safari Books Online
  • Create BookmarkCreate Bookmark
  • Create Note or TagCreate Note or Tag
  • PrintPrint