Bixo (see http://openbixo.org/) is an open source web mining toolkit based on Hadoop, the dominant open source implementation of the MapReduce programming model. Bixo uses the Cascading open source project (see http://www.cascading.org/) to define the web crawling workflow. Using Cascading lets Bixo focus on the mechanics of web crawling and the associated data flow rather than on Hadoop/MapReduce implementation details.
Cascading provides a rich API for defining and implementing scale-free and fault-tolerant data processing workflows on a Hadoop cluster. In the Cascading workflow model, operations are connected via “pipes,” much like classic Unix tools. Bixo consists of a number of Cascading operations and subassemblies, which can be combined to form a data processing workflow that (typically) starts with a set of URLs to be fetched and ends with some results extracted from parsed HTML pages. The entire Bixo workflow is shown in figure 15.2.
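To make the pipe model concrete, here is a minimal sketch in plain Python of operations chained so that each one consumes the output of the previous one, starting from URLs and ending with data extracted from parsed HTML. It does not use the Cascading or Bixo APIs. The names `FAKE_WEB`, `fetch`, `parse`, and `run_workflow` are hypothetical, and the "fetch" step reads from an in-memory dictionary rather than the network so that the example runs anywhere.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/a": "<html><head><title>Page A</title></head><body>hi</body></html>",
    "http://example.com/b": "<html><head><title>Page B</title></head><body>yo</body></html>",
}


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch(urls):
    """Operation 1: turn URLs into (url, html) tuples, skipping unknown URLs."""
    for url in urls:
        html = FAKE_WEB.get(url)
        if html is not None:
            yield url, html


def parse(pages):
    """Operation 2: turn (url, html) tuples into (url, title) tuples."""
    for url, html in pages:
        parser = TitleParser()
        parser.feed(html)
        yield url, parser.title.strip()


def run_workflow(source, *operations):
    """Connect the operations in order, feeding each one the previous output."""
    stream = source
    for operation in operations:
        stream = operation(stream)
    return list(stream)


results = run_workflow(
    ["http://example.com/a", "http://example.com/missing", "http://example.com/b"],
    fetch,
    parse,
)
```

Each operation is a generator, so tuples stream through the chain one at a time, loosely analogous to data flowing through Unix pipes. A real Cascading flow would additionally distribute this work across a Hadoop cluster, which the sketch makes no attempt to model.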