316 CHAPTER 7 Data Transformations

Unsupervised Discretization

There are two basic approaches to the problem of discretization. One is to quantize each attribute in the absence of any knowledge of the classes of the instances in the training set: so-called unsupervised discretization. The other is to take the classes into account when discretizing: supervised discretization. The former is the only possibility when dealing with clustering problems where the classes are unknown or nonexistent.

The obvious way of discretizing a numeric attribute is to divide its range into a predetermined number of equal intervals: a fixed, data-independent yardstick. This is frequently done at the time when data is collected. But, like any unsupervised discretization method, it runs the risk of destroying distinctions that would have turned out to be useful in the learning process, either by using gradations that are too coarse or by making unfortunate choices of boundary that needlessly lump together many instances of different classes.

Equal-width binning often distributes instances very unevenly: some bins contain many instances while others contain none. This can seriously impair the ability of the attribute to help build good decision structures. It is often better to allow the intervals to be of different sizes, choosing them so that the same number