
... subset of the relevant attributes in the hierarchy specification. For example, instead of including all of the hierarchically relevant attributes for location, the user may have specified only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together. In this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however, should have the option to override this feature, as necessary.

Example 3.8 Concept Hierarchy Generation Using Prespecified Semantic Connections. Suppose that a data mining expert (serving as an administrator) has pinned together the five attributes number, street, city, province_or_state, and country, because they are closely linked semantically regarding the notion of location. If a user were to specify only the attribute city for a hierarchy defining location, the system can automatically drag in all of the preceding five semantically related attributes to form a hierarchy. The user may choose to drop any of these attributes, such as number and street, from the hierarchy, keeping city as the lowest conceptual level in the hierarchy.

3.7 SUMMARY

Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction.

Descriptive data summarization provides the analytical foundation for data preprocessing. The basic statistical measures for data summarization include mean, weighted mean, median, and mode for measuring the central tendency of data and range, quartiles, interquartile range, variance, and standard deviation for measuring the dispersion of data.
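
The measures of central tendency and dispersion listed above can be illustrated with a minimal Python sketch that uses only the standard library. The sample values, the weights, and the helper names weighted_mean and summarize are hypothetical and chosen purely for illustration. The sketch assumes the population forms of variance and standard deviation, and quartile values depend on the interpolation convention used (here, the default of statistics.quantiles).

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


def summarize(xs):
    """Return basic central-tendency and dispersion measures for xs."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # Other quartile conventions can give slightly different values.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "mode": statistics.multimode(xs),  # list, since data may be multimodal
        "range": max(xs) - min(xs),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        # Population variance and standard deviation (divide by N).
        "variance": statistics.pvariance(xs),
        "std_dev": statistics.pstdev(xs),
    }


if __name__ == "__main__":
    for name, value in summarize(values).items():
        print(f"{name}: {value}")
    # Hypothetical weights: the last observation counts twice.
    print("weighted mean:", weighted_mean(values, [1] * 11 + [2]))
```

The mode is returned as a list because a data set may have more than one most frequent value; the sample above is bimodal.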
Graphical representations, such as histograms, boxplots, quantile plots, quantile-quantile plots, scatter plots, and scatter-plot matrices, facilitate visual inspection of the data and are thus useful for data preprocessing and mining.

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation.

Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity contribute toward smooth data integration.
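
As one illustration of how correlation analysis can support data integration, the following Python sketch flags pairs of numeric attributes whose Pearson correlation coefficient is close to +1 or -1, which may indicate that one attribute is redundant with the other. The attribute names, the sample data, the function names, and the 0.95 threshold are all hypothetical, and a high correlation is only a hint of redundancy, not proof of it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (dev_x * dev_y)


def redundant_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for attribute pairs with |r| >= threshold.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical attributes merged from two sources.
columns = {
    "price_usd": [10, 20, 30, 40],
    "price_cents": [1000, 2000, 3000, 4000],
    "quantity": [3, 1, 4, 1],
}

if __name__ == "__main__":
    for a, b, r in redundant_pairs(columns):
        print(f"{a} and {b} may be redundant (r = {r:.3f})")
```

In this sample, price_usd and price_cents carry the same information in different units, so only that pair is flagged.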