into a new dataset. Reorder alters the order of the attributes in the data; the new order is specified by supplying a list of attribute indices. By omitting or duplicating indices it is possible to delete attributes or make several copies of them.

Changing Values

SwapValues swaps the positions of two values of a nominal attribute. The order of values is entirely cosmetic (it does not affect learning at all), but if the class is selected, changing the order affects the layout of the confusion matrix. MergeTwoValues merges values of a nominal attribute into a single category. The new value's name is a concatenation of the two original ones, and every occurrence of either of the original values is replaced by the new one. The index of the new value is the smaller of the original indices. For example, if you merge the first two values of the outlook attribute in the weather data (in which there are five sunny, four overcast, and five rainy instances), the new outlook attribute will have values sunny_overcast and rainy; there will be nine sunny_overcast instances and the original five rainy ones.

One way of dealing with missing values is to replace them globally before applying a learning scheme. ReplaceMissingValues replaces each missing value by the mean for numeric attributes and the mode for nominal ones. If a class is set, missing values of that attribute are not replaced by default, but this can be changed.

NumericCleaner replaces the values of numeric attributes that are too small, too large, or too close to a particular value with default values. A different default can be specified for each case, along with thresholds for what is considered to be too large or small and a tolerance value for defining too close.

AddValues adds any values that are not already present in a nominal attribute from a user-supplied list. The labels can optionally be sorted. ClassAssigner can be used to set or unset a dataset's class attribute.
The user supplies the index of the new class attribute; a value of 0 unsets the existing class attribute.

Conversions

Many filters convert attributes from one form to another. Discretize uses equal-width or equal-frequency binning (see Section 7.2, page 316) to discretize a range of numeric attributes, specified in the usual way. For the former method the number of bins can be specified or chosen automatically by maximizing the likelihood using leave-one-out cross-validation. It is also possible to create several binary attributes instead of one multivalued one. For equal-frequency discretization, the desired number of instances per interval can be changed. PKIDiscretize discretizes numeric attributes using equal-frequency binning; the number of bins is the square root of the number of values (excluding missing values). Both these filters skip the class attribute by default.

MakeIndicator converts a nominal attribute into a binary indicator attribute and can be used to transform a multiclass dataset into several two-class ones. It substitutes a binary attribute for the chosen nominal one; for each instance, the new attribute's value is 1 if a particular original value was present and 0 otherwise. The new attribute is declared to be numeric by default, but it can be made nominal if desired.
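To make the difference between the two binning strategies used by Discretize and PKIDiscretize concrete, here is a minimal sketch in plain Python. It is not the filters' actual code: the function names are hypothetical, the sample values are made up, ties are handled naively (equal values can fall into different equal-frequency bins), and rounding the square root down for the PKI bin count is an assumption, since the text only says "the square root of the number of values."

```python
import math


def equal_width_bins(values, num_bins):
    """Assign each value a bin index using intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum would otherwise land one past the last bin, so clamp it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def equal_frequency_bins(values, num_bins):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * num_bins // len(values), num_bins - 1)
    return bins


def pki_num_bins(values):
    """Bin count for the PKI rule: square root of the non-missing count.

    Missing values are represented as None (an assumption of this sketch),
    and the square root is rounded down (also an assumption).
    """
    present = [v for v in values if v is not None]
    return max(1, math.isqrt(len(present)))


# Made-up, deliberately skewed values for illustration only.
sample = [1, 2, 3, 4, 10, 20, 30, 100, 200]
print(equal_width_bins(sample, 3))      # [0, 0, 0, 0, 0, 0, 0, 1, 2]
print(equal_frequency_bins(sample, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(pki_num_bins(sample))             # 3
```

On skewed data like this, equal-width binning puts most values into the first interval, whereas equal-frequency binning spreads them evenly across intervals.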
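The behavior described earlier for MergeTwoValues and ReplaceMissingValues can also be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the filters' implementation: the function names are hypothetical, an attribute is modeled as a list of labels plus a list of per-instance values, missing values are represented as None, the merged name is joined with an underscore in index order (matching the sunny_overcast example), and each column is assumed to contain at least one non-missing value.

```python
from collections import Counter


def merge_two_values(labels, column, first, second):
    """Merge the nominal values at indices `first` and `second`.

    The merged value takes the smaller index and a concatenated name;
    every occurrence of either original value is replaced by it.
    """
    lo, hi = sorted((first, second))
    merged = f"{labels[lo]}_{labels[hi]}"
    new_labels = [merged if i == lo else name
                  for i, name in enumerate(labels) if i != hi]
    originals = {labels[lo], labels[hi]}
    new_column = [merged if v in originals else v for v in column]
    return new_labels, new_column


def replace_missing(column, numeric):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v is not None]
    if numeric:
        fill = sum(present) / len(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


# The outlook attribute of the weather data: 5 sunny, 4 overcast, 5 rainy.
labels = ["sunny", "overcast", "rainy"]
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5

new_labels, new_outlook = merge_two_values(labels, outlook, 0, 1)
print(new_labels)            # ['sunny_overcast', 'rainy']
print(Counter(new_outlook))  # Counter({'sunny_overcast': 9, 'rainy': 5})

print(replace_missing([1.0, None, 3.0], numeric=True))  # [1.0, 2.0, 3.0]
```

The merge reproduces the counts given in the text: nine sunny_overcast instances and the original five rainy ones.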