Free Trial

Safari Books Online is a digital library providing on-demand subscription access to thousands of learning resources.

Share this Page URL

Chapter 1 What's It All About? > 1.6DATA MINING AND ETHICS - Pg. 32

CHAPTER1 What's It All About? problem described previously. The problem is that if disjunction is allowed, useless concept descriptions that merely summarize the data become possible, whereas if it is prohibited, some concepts are unlearnable. To get around this problem, it is common to search the concept space starting with the simplest concept descriptions and proceeding to more complex ones: simplest-first order- ing. This biases the search toward simple concept descriptions. Using a simplest-first search and stopping when a sufficiently complex concept description is found is a good way of avoiding overfitting. It is sometimes called forward pruning or prepruning because complex descriptions are pruned away before they are reached. The alternative, backward pruning or postpruning, is also viable. Here, we first find a description that fits the data well and then prune it back to a simpler description that also fits the data. This is not as redundant as it sounds: often the only way to arrive at a simple theory is to find a complex one and then simplify it. Forward and backward pruning are both a kind of overfitting- avoidance bias. In summary, although generalization as search is a nice way to think about the learning problem, bias is the only way to make it feasible in practice. Different learning algorithms correspond to different concept description spaces searched with different biases. This is what makes it interesting: different description languages and biases serve some problems well and other problems badly. There is no universal "best" learning method--as every teacher knows! 1.6 DATA MINING AND ETHICS The use of data--particularly data about people--for data mining has serious ethical implications, and practitioners of data mining techniques must act respon- sibly by making themselves aware of the ethical issues that surround their par- ticular application. When applied to people, data mining is frequently used to discriminate--who gets the loan, who gets the special offer, and so on. Certain kinds of discrimina- tion--racial, sexual, religious, and so on--are not only unethical but also illegal. However, the situation is complex: everything depends on the application. Using sexual and racial information for medical diagnosis is certainly ethical, but using the same information when mining loan payment behavior is not. Even when sensitive information is discarded, there is a risk that models will be built that rely on variables that can be shown to substitute for racial or sexual characteristics. For example, people frequently live in areas that are associated with particular ethnic identities, so using an area code in a data mining study runs the risk of building models that are based on race--even though racial information has been explicitly excluded from the data. It is widely accepted that before people make a decision to provide personal information they need to know how it will be used and what it will be used for, what steps will be taken to protect its confidentiality and integrity, what the