Preface

Data mining is concerned with the generalized problem of digging out "the hidden gold", in the form of knowledge patterns, from massive amounts of data. The information overload that characterizes the digital era we are living in is further exacerbated by the textual nature of the majority of data available in existing information sources. Moreover, text data have a "semistructured" nature in most such sources, primarily on the Web but also in digital libraries, company data repositories, and scientific databases.

Semistructured text data is the connection point between natural language text and rigidly structured tuples of typed data. For example, a news article may contain a few structured fields (such as news channel, headline, author, location, and publication date) but also a largely unstructured text component (the article body). Semistructured data also enables the representation and description of complex real-life objects and their relationships, thus opening up a potentially unlimited number of possibilities for human-machine-human communication.

XML is the preeminent form of representation of semistructured data. In contrast to HTML, in which most Web pages are encoded, XML is well-defined and flexible, and its markup is used to structure and model data and to encode semantics, rather than to address presentation and layout issues. While HTML is designed primarily for human-readable documents, XML supports the exchange of machine-readable data. With XML, information representation is separated from information rendering, so that the same document can be presented through different views. XML makes it possible to define complex document structures, such as unbounded nesting and object-oriented hierarchies, and to specify not only data but also the data structures, how elements are nested, and their content models.

The flexible nature of XML syntax simplifies the definition and deployment of arbitrary languages for domain-specific markup, enabling automatic authoring and processing of networked data. It has been recognized that an important role of XML vocabularies is their ability to model a large variety of data types and their many interrelationships, while remaining flexible enough to support new information as it is discovered. XML is indeed conceived to couple data with its context (metadata) through an extensible, hierarchical tag structure, which is essential for handling taxonomies (as in the life sciences) or other conceptual structures. As a consequence, XML has rapidly become the preferred meta-language for disseminating information in on-line databases, digital libraries, scientific and financial data repositories, multimedia, and many other settings.

All these features, and many more, have made the impact of XML significant not only in research contexts but also in industry: publishing information sources in XML is increasingly attractive for organizations that want to interoperate easily and provide their information in a format that other applications can process, especially on the Web. XML technologies have also been coupled with relational databases to solve business problems. As a matter of fact, encoding data in XML reduces the translation overhead of communication within and between organizations.
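As a minimal illustration of the news article example above, the following Python sketch encodes one article in XML and separates its structured fields from its unstructured body. The element names, the sample values, and the helper split_article are hypothetical, chosen only for illustration; they do not come from any particular XML vocabulary. The sketch relies only on the standard library module xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical news article: element names and values are illustrative only.
ARTICLE_XML = """
<article>
  <channel>Science</channel>
  <headline>Example headline</headline>
  <author>A. Writer</author>
  <location>Rome</location>
  <date>2011-01-15</date>
  <body>Free-running text of the article goes here. It has no fixed
  internal structure and would be the target of text mining.</body>
</article>
""".strip()

# The fields assumed to be "structured" (typed, tuple-like) in this sketch.
STRUCTURED_FIELDS = ("channel", "headline", "author", "location", "date")


def split_article(xml_text):
    """Return (structured fields as a dict, unstructured body as a string)."""
    root = ET.fromstring(xml_text)
    # Each structured field is a direct child element holding a short value.
    fields = {name: root.findtext(name) for name in STRUCTURED_FIELDS}
    # The body is free text; collapse line breaks and repeated whitespace.
    body = " ".join(root.findtext("body", default="").split())
    return fields, body


fields, body = split_article(ARTICLE_XML)
print(fields)
print(body)
```

The same markup serves both sides of the semistructured divide: the short child elements can be read like the columns of a tuple, while the body remains plain text that an application is free to render or analyze as it sees fit.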