Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.
You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS (which you’ll learn about in the next chapter), to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let’s see how this works.