15.2. The Bixo web crawler

Bixo (see http://openbixo.org/) is an open source web mining toolkit based on Hadoop, the dominant open source implementation of the MapReduce programming model. Bixo uses the open source Cascading project (see http://www.cascading.org/) to define its web crawling workflow. Using Cascading lets Bixo focus on the mechanics of web crawling and the associated data flow rather than on Hadoop/MapReduce implementation details.

Cascading provides a rich API for defining and implementing scalable, fault-tolerant data processing workflows on a Hadoop cluster. In the Cascading workflow model, operations are connected via “pipes,” much like classic Unix tools. Bixo consists of a number of Cascading operations and subassemblies, which can be combined to form a data processing workflow that (typically) starts with a set of URLs to be fetched and ends with some results extracted from parsed HTML pages. The entire Bixo workflow is shown in figure 15.2.
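The idea of operations connected by pipes can be illustrated with a small, single-process sketch. This is not the Cascading or Bixo API, and it involves no Hadoop cluster; the names `FAKE_WEB`, `fetch`, `parse`, `extract`, and `connect` are hypothetical, and the "web" is an in-memory dictionary so the example runs without network access. It only shows the shape of a workflow that starts with URLs and ends with results extracted from parsed HTML.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/a": "<html><head><title>Page A</title></head><body>hello</body></html>",
    "http://example.com/b": "<html><head><title>Page B</title></head><body>world</body></html>",
}


class _TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(urls):
    """Stage 1: turn URLs into (url, html) pairs, skipping unknown URLs."""
    for url in urls:
        if url in FAKE_WEB:
            yield url, FAKE_WEB[url]


def parse(pages):
    """Stage 2: parse each HTML page and pull out its title."""
    for url, html in pages:
        parser = _TitleParser()
        parser.feed(html)
        parser.close()
        yield url, parser.title


def extract(parsed):
    """Stage 3: shape the parsed data into result records."""
    for url, title in parsed:
        yield {"url": url, "title": title}


def connect(*stages):
    """Chain stages so each one consumes the previous stage's output."""
    def run(records):
        for stage in stages:
            records = stage(records)
        return records
    return run


workflow = connect(fetch, parse, extract)
results = list(workflow([
    "http://example.com/a",
    "http://example.com/missing",
    "http://example.com/b",
]))
```

Each stage is a generator, so records stream from one stage to the next much as text streams through a Unix pipeline. A real Cascading workflow would additionally distribute the work across a cluster and recover from task failures, which this sketch does not attempt.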


  
