CHAPTER 17 Tutorial Exercises for the Weka Explorer

17.2 NEAREST-NEIGHBOR LEARNING AND DECISION TREES

In this section you will experiment with nearest-neighbor classification and decision tree learning. For most of it, a real-world forensic glass classification dataset is used.

We begin by taking a preliminary look at the dataset. Then we examine the effect of selecting different attributes for nearest-neighbor classification. Next we study class noise and its impact on predictive performance for the nearest-neighbor method. Following that we vary the training set size, both for nearest-neighbor classification and for decision tree learning. Finally, you are asked to interactively construct a decision tree for an image segmentation dataset.

Before continuing you should review in your mind some aspects of the classification task:

· How is the accuracy of a classifier measured?
· To make a good classifier, are all the attributes necessary?
· What is class noise, and how would you measure its effect on learning?
· What is a learning curve?
· If you, personally, had to invent a decision tree classifier for a particular dataset, how would you go about it?

The Glass Dataset

The glass dataset glass.arff from the U.S. Forensic Science Service contains data on six types of glass. Glass is described by its refractive index and the chemical elements that it contains; the aim is to classify different types of glass based on these features. This dataset is taken from the UCI datasets, which have been collected by the University of California at Irvine and are freely available on the Web. They are often used as a benchmark for comparing data mining algorithms.

Find the dataset glass.arff and load it into the Explorer interface. For your own information, answer the following exercises, which review material covered in the previous section.

Exercise 17.2.1. How many attributes are there in the dataset? What are their names? What is the class attribute?
Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the Generic Object Editor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1; this field sets the number of neighboring instances to use when classifying.
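To make concrete what this experiment computes, here is a minimal sketch in plain Python of k-nearest-neighbor classification evaluated by 10-fold cross-validation. It is not Weka's implementation: the function names (knn_predict, cross_validate, make_toy_data) and the two-class toy data are hypothetical, and the sketch neither stratifies the folds nor normalizes attribute values, both of which Weka is understood to do by default. The parameter k plays the role of the KNN field.

```python
import random
from collections import Counter


def knn_predict(train, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training instances."""
    # Rank training instances by squared Euclidean distance to the query.
    neighbors = sorted(
        train,
        key=lambda inst: sum((a - b) ** 2 for a, b in zip(inst[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def cross_validate(data, k=1, folds=10, seed=1):
    """Estimate accuracy by (unstratified) cross-validation with the given number of folds."""
    data = list(data)
    random.Random(seed).shuffle(data)
    correct = 0
    for f in range(folds):
        # Every folds-th instance, starting at f, forms the test fold;
        # all remaining instances form the training set.
        test = data[f::folds]
        train = [inst for i, inst in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


def make_toy_data(n_per_class=20, seed=0):
    """Hypothetical data: two well-separated clusters with two numeric attributes each."""
    rng = random.Random(seed)
    data = []
    for label, center in (("a", 0.0), ("b", 10.0)):
        for _ in range(n_per_class):
            point = (center + rng.uniform(-1, 1), center + rng.uniform(-1, 1))
            data.append((point, label))
    return data


toy = make_toy_data()
accuracy = cross_validate(toy, k=1, folds=10)
print(f"10-fold CV accuracy with k=1: {accuracy:.2f}")
```

Each instance is used for testing exactly once and for training in the other nine folds, so the reported accuracy is the fraction of all instances classified correctly when held out. On the glass data the figure reported by the Explorer will differ, since the classes there are far less cleanly separated than in this toy example.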