

CHAPTER 17 Tutorial Exercises for the Weka Explorer

Not all of the attributes (i.e., terms) are important when classifying documents. The reason is that many words are irrelevant for determining an article's topic. Weka's AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, can eliminate less useful attributes. As before, FilteredClassifier should be used to transform the data before passing it to AttributeSelectedClassifier.

Exercise 17.5.11. Experiment with this, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of the most informative attributes that are selected from the information gain-based ranking by changing the value of the numToSelect field in the Ranker. Record the AUC values you obtain. How many attributes give the best AUC for the two datasets discussed before? What are the best AUC values you managed to obtain?

17.6 MINING ASSOCIATION RULES

In order to get some experience with association rules, we work with Apriori, the algorithm described in Section 4.5 (page 144). As you will discover, it can be challenging to extract useful information using this algorithm.

Association-Rule Mining

To get a feel for how to apply Apriori, start by mining rules from the weather.nominal.arff data that was used in Section 17.1. Note that this algorithm expects data that is purely nominal: If present, numeric attributes must be discretized first. After loading the data in the Preprocess panel, click the Start button in the Associate panel to run Apriori with default options. It outputs 10 rules, ranked according to the confidence measure given in parentheses after each one (they are listed in Figure 11.16). As we explained in Chapter 11 (page 430), the number following a rule's antecedent shows how many instances satisfy the antecedent; the number following the conclusion shows how many instances satisfy the entire rule (this is the rule's "support").
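
The two counts just described, and the confidence derived from them, can be reproduced by hand. The following is a minimal Python sketch, not Weka code: the 14 instances are transcribed manually from the weather data as it is commonly distributed (check them against your own copy), and the helper names count_matching and rule_stats are made up for this illustration.

```python
# The 14 weather instances, transcribed by hand (an assumption: verify
# against your own copy of weather.nominal.arff).
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
DATA = [dict(zip(ATTRS, row)) for row in ROWS]


def count_matching(data, conditions):
    """Count instances that satisfy every attribute=value pair in conditions."""
    return sum(
        all(inst[attr] == value for attr, value in conditions.items())
        for inst in data
    )


def rule_stats(data, antecedent, consequent):
    """Return (antecedent count, whole-rule count, confidence) for a rule."""
    n_antecedent = count_matching(data, antecedent)
    n_rule = count_matching(data, {**antecedent, **consequent})
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return n_antecedent, n_rule, confidence


# A rule whose antecedent and whole-rule counts coincide has confidence 1.
print(rule_stats(DATA, {"outlook": "overcast"}, {"play": "yes"}))  # (4, 4, 1.0)
# A rule for which the two counts differ has confidence below 1.
print(rule_stats(DATA, {"humidity": "high"}, {"play": "no"}))
```

The first number returned corresponds to the count printed after a rule's antecedent, the second to the count printed after its conclusion, and the third is their ratio.
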
Because the two numbers are equal for each of the 10 rules, the confidence of every rule is exactly 1.

In practice, it can be tedious to find minimum support and confidence values that give satisfactory results. Consequently, as explained in Chapter 11, Weka's Apriori runs the basic algorithm several times. It uses the same user-specified minimum confidence value throughout, given by the minMetric parameter. The support level is expressed as a proportion of the total number of instances (14 in the case of the weather data), that is, as a value between 0 and 1. The minimum support level starts at a certain value (upperBoundMinSupport, default 1.0). In each iteration the support is decreased by a fixed amount (delta, default 0.05, i.e., 5% of the instances) until either a certain number of rules has been generated (numRules, default 10 rules) or the support reaches a certain "minimum minimum" level (lowerBoundMinSupport,