The normal way of preparing data for clustering is to determine a common set of numerical attributes that can be used to compare the items. This is very similar to what was shown in Chapter 2, when critics’ rankings were compared over a common set of movies, and when the presence or absence of a bookmark was translated to a 1 or a 0 for del.icio.us users.
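As a rough illustration of the bookmark case, the sketch below turns each user's set of bookmarks into a row of 1s and 0s over a shared list of links. The user names, URLs, and the `to_binary_rows` helper are hypothetical and chosen only for this example; they are not taken from Chapter 2.

```python
# Hypothetical data: the set of links each user has bookmarked.
user_bookmarks = {
    "alice": {"http://example.com/a", "http://example.com/b"},
    "bob": {"http://example.com/b"},
    "carol": {"http://example.com/a", "http://example.com/c"},
}


def to_binary_rows(bookmarks):
    """Return (columns, rows); every row uses the same ordered columns."""
    # The common attribute set is the union of all links, in a fixed order.
    columns = sorted(set().union(*bookmarks.values()))
    # Presence of a bookmark becomes 1, absence becomes 0.
    rows = {
        user: [1 if link in links else 0 for link in columns]
        for user, links in bookmarks.items()
    }
    return columns, rows


columns, rows = to_binary_rows(user_bookmarks)
```

Because every row is measured over the same columns, any two users can be compared position by position, which is what a clustering algorithm needs.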
This chapter will work through a couple of example datasets. In the first dataset, the items to be clustered are 120 of the top blogs, and the data they’ll be clustered on is the number of times each word in a particular set appears in each blog’s feed. A small subset of this data is shown in Table 3-1.
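To make the shape of this data concrete, here is a minimal sketch that counts a fixed word list in a few pieces of text. The blog names, sample text, word list, and the `count_words` helper are all made up for illustration; a real run would use the text of each blog's feed instead.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the text of each blog's feed.
feeds = {
    "Blog A": "Python code and more Python code",
    "Blog B": "Music news and music reviews",
}

# The common set of words that every blog is measured against.
words = ["python", "code", "music"]


def count_words(text, vocabulary):
    """Count how often each vocabulary word appears in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return [counts[word] for word in vocabulary]


# One row per blog, one column per word.
matrix = {blog: count_words(text, words) for blog, text in feeds.items()}
```

Each blog ends up as a row of counts in the same word order, so rows can be compared directly.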