This Book and the XML Specifications
HTML, SGML, and XML: History and Influences
XML, XLink, XPointer, and XSL
The Annotations
Where did XML come from? What are all those other acronyms that people keep mentioning with it? If there's an official specification, what makes it so official? And what good are the strange, numbered pieces of cryptic syntax throughout the specification?
This chapter answers these questions. It gives you some context to help you understand where XML came from, how the specification specifies what it does, and what you can do with it.
There are many books about XML, but the official W3C specification is the single most important document of all. The official XML specification, as written, edited, and promulgated by the W3C (the organization created to develop common World Wide Web protocols such as HTML) is available for free on the Web. For the reader who lacks a strong background in computer science and SGML, however, the specifications are difficult to understand.
By providing background on the SGML concepts, the computer science vocabulary (such as "grammar productions", "big-endian", "name spaces", "nonterminal", and "schema"), and the myriad other standards bodies and specifications alluded to (for example, ISO 8879, 639, 10744, 10646, 8859, 3166; UTF-8, UTF-16; RFCs 1738, 1808, and 1766) this book explains exactly what the specs mean.
In addition to the "what", the book explains the "why"—the debate among the W3C's XML Working Group that led to the final wording of the spec.
If you are accustomed to learning computer languages directly from their formal specifications, this book will be all you need to learn XML. My annotations and my 170 new examples will make the task a lot easier than it would be if you read the spec alone.
However, even if you have such skills—and especially if you don't—I can recommend two other books in this Series. You might want to explore either or both of them before tackling this book.
The XML Handbook, by Series Editor Charles F. Goldfarb and Paul Prescod, contains user-friendly (but technically rigorous) tutorials on XML and related standards, plus extensive coverage of XML applications and tools.
XML by Example: Building E-Commerce Applications, by Sean McGrath, is a programmer's introduction to XML. It uses examples from electronic commerce to illustrate both the XML language and programming techniques for use with XML data.
What is XML's relationship to SGML and HTML? They have one obvious thing in common: the "ML" in their names. It stands for "markup language", a collection of markup codes and rules for adding those codes to documents in order to indicate the structure or appearance of those documents.[1] Using a text editor such as Windows' Notepad or the Macintosh's BBEdit (or, more likely, a specialized editor that automates the use of your markup language), you add markup to files so that a program processing the document knows what to do with each part of it.
Early markup languages were usually invented by the companies that sold document processing software. For example, if you used the PageMania program to typeset your books for a printing house, the PageMania documentation would tell you the markup that was developed at PageMania, Inc. to specify margins, fonts, and other page layout details in your books.
As an alternative to these proprietary systems, SGML, or "Standard Generalized Markup Language", was issued as an International Standard in 1986. The "Standard Generalized" part meant that document developers were no longer tied to a particular vendor's markup language, but could instead develop their own markup, ideally in such a way that their documents would easily convert into other formats. Using SGML, they could define new "document types" by specifying certain details about documents of each type:
The names of their potential components.
The documents' structure—that is, the allowable ordering of those components.
This makes documents more versatile, because a document processing application that knows a document's structure can do more useful things with it more efficiently, just as a relational database management program can do sophisticated manipulation of a database once it knows the names, sizes, and data types of each column in a database's tables.
HTML was developed in 1991 as a specific markup language to identify the structure of Web documents. For the Web browsing programs then under development, HTML markup identified which parts of the document were headings, subheadings, bulleted and numbered lists, hypertext links, and other components of a technical document.
HTML is an SGML application, defining a document type by using SGML syntax to indicate the purpose of each part of an HTML Web page. The "tags" that mark the start and end of HTML document components (or "elements") each begin and end with angle brackets (<>) and include a name identifying the type of element represented. End-tags have a slash after their opening angle bracket to distinguish them from start-tags.
For example, you would code an HTML first-level heading (an h1 element) as shown in Example 1, and a second-level header with the h2 tags shown in Example 2.
<h1>HTML, SGML, and XML: History and Influences</h1>
<h2>HTML and SGML</h2>
In addition to the name of the element's type, a start-tag often has one or more labeled pieces of information about the element known as attributes. For example, HTML's a element (like the one in Example 3, which identifies hypertext anchors and links) has an href attribute to tell browsers the Web address of the link's destination.
<a href="http://www.w3.org">World Wide Web Consortium</a>
The text between the start- and end-tags of an element describes the element's content. Some elements have no content text; these are called "empty" elements.
HTML's most popular type of empty element was not part of the original HTML, but was added by the designers of Mosaic, the first widely popular HTML browser. The empty img element type's src attribute identifies a file that has a picture for the browser to display, and optional attributes specify details such as the alignment of the image in the browser window. Example 4 shows a typical img element; it has no content text or end-tag.
<img src="treegraf.gif" align="right">
Character-based markup languages have a unique problem: if the "<" character begins tags, what if you really want a "<" to show up in your document? Special characters like this have names assigned to them in HTML (in this case, lt for "less-than"). By using this name in a document, preceded by an ampersand and followed by a semicolon, you tell the processing application to replace this "entity reference" with the actual entity. This is why a Web browser displays "<" when it sees &lt;.
HTML uses entity references to represent many of the characters of Western alphabets not found on the typical keyboard, such as &ntilde; for ñ, &Egrave; for È and &egrave; for è. What if you want to actually display an ampersand? Use its character entity reference: &amp;.
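As a quick illustration of this substitution (not part of the original examples), Python's standard html module performs the same replacements in both directions. The sample strings are made up for the demonstration.

```python
import html

# Escaping: turn markup-significant characters into entity references
# so that they display literally instead of being read as markup.
escaped = html.escape("if a < b & b > c", quote=False)
print(escaped)

# Unescaping: roughly what a browser does when it meets entity references.
decoded = html.unescape("Espa&ntilde;a, &Egrave;, &egrave;, &lt; and &amp;")
print(decoded)
```

The first call produces text that is safe to place inside an HTML document; the second recovers the characters that a reader would actually see.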
To see how all of these elements and entity references look when assembled, let's look at the simple HTML file shown in Example 5.
<html>
<head><title>Title of Web Page</title></head>
<body>
<h1>A Sample Web Page</h1>
<p>Here is the first paragraph of text in the sample Web page. It has enough text to wrap to the next line so that we can see what the left indentation is like.
<h2>Bulleted Lists</h2>
<p>Here is an unordered list. The browser prefixes a bullet to each item:
<ul>
<li>Here is the first item of the bulleted list. The Spanish word for "Spain" is "Espa&ntilde;a" and the German word for "brew" is "br&auml;u".
<li>Here is the second item of the bulleted list.
</ul>
<h2>Images</h2>
<p>Here is a picture: <img src='sample.jpg' alt='face'>
<p>This concludes our test.
</body>
</html>
Figure 1 shows how Netscape displays the document from Example 5.
The HTML document's <head> start-tag (line 2) is right after line 1's <html> start-tag. The </head> end-tag comes at the end of line 2, well before the </html> end-tag on the last line, which shows that the head element is inside of (or, "is a sub-element of") the html element. The head element has only one sub-element: line 2's title element.
It's easier to understand the different levels of container relationships in the document if we look at the tree graph in Figure 2, which shows each element branching off into the "child" elements it contains.
After the html element's head element, it only has a body element. The body element's structure is fairly flat—of its eight elements, only ul and one p element have child sub-elements of their own, and those children (the li sub-elements and the p element's img child) have no children themselves. Although the h1 and h2 elements should represent different hierarchical levels of information in the document (a head and a subhead), in HTML they're both on the same level of the document "tree", one level below the body element.
Although HTML was an SGML application, the browsers that displayed HTML documents ignored one important aspect of SGML: valid document structure. HTML came with guidelines for the use of its elements, but popular browsers didn't enforce any particular element ordering. For example, HTML uses h1, h2, and h3 elements as the titles of main sections, subsections, and the subdivisions of those subsections in an HTML Web page. It uses the p element to represent regular paragraphs of document text. (If this chapter was an HTML Web page, this paragraph would be a single p element.) The document in Example 6 has no head element, no p paragraphs of text anywhere, and it has section headings in a meaningless order—and none of the popular Web browsers would have a problem with it.
<html><body>
<h2>Florida</h2>
<h1>United States</h1>
<h3>Miami</h3>
<h2>California</h2>
</body></html>
Was this laxity bad? Was it good? Yes and yes. It was bad because, as with database programs, a program that can rely on a regular structure in its input can do much more with it. For example, an outlining program could generate a table of contents from HTML files with properly ordered headings, but would only make a mess of the "document" in Example 6.
This flexibility was good because, by minimizing the number of document-crippling mistakes that HTML authors could make, it let them create Web documents with the simplest of tools. This ease of document creation played a big role in the Web's tremendous growth in the mid- and late-nineties.
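To make the outlining-program point concrete, here is a minimal sketch of a heading-order check written with Python's standard html.parser module. The rule it applies (a heading may go at most one level deeper than the heading before it) is an assumption chosen for illustration, not something HTML itself defines, and the function names are hypothetical.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Records the level (1-6) of every h1..h6 start-tag, in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; accept only h1 through h6.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))

def heading_problems(html_text):
    """Return complaints about headings that skip a level (assumed rule)."""
    collector = HeadingCollector()
    collector.feed(html_text)
    problems = []
    previous = 0  # pretend a level-0 root sits above the first heading
    for level in collector.levels:
        if level > previous + 1:
            problems.append(f"h{level} skips a level (previous level was {previous})")
        previous = level
    return problems

# The headings of Example 6, in their original order.
example_6 = ("<html><body><h2>Florida</h2><h1>United States</h1>"
             "<h3>Miami</h3><h2>California</h2></body></html>")
for problem in heading_problems(example_6):
    print(problem)
```

For Example 6 the check complains twice: the document opens with an h2, and the h3 for Miami sits directly under an h1. A browser, by contrast, displays the page without complaint.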
Some HTML editing programs do enforce structure, but there's not much to enforce. As the tree graph of Example 5's HTML document (Figure 2) showed, h1, h2, p, and ul are all at the same hierarchical level. In more structured documents, subsections are components of their parent sections, instead of just more text preceded by a different header, so that treating a specific chapter or section as a unit (for example, to extract it for use in another publication) means simply finding a particular start-tag and its matching end-tag.
As an example, let's compare a well-structured document (Example 7) with an equivalent HTML document (Example 8). As SGML and XML let you do, I've made up my own element type names and structure for the document in Example 7, and I indented its elements to show the structure more clearly.
<chapter>
  <title>Here is the Chapter Title</title>
  <para>This paragraph introduces the chapter.</para>
  <section>
    <title>Section 1</title>
    <para>Here is the first section's first paragraph.</para>
    <para>It has two paragraphs.</para>
    <subsection>
      <title>Section 1's First Subsection</title>
      <para>Here is Section 1's first subsection.</para>
    </subsection>
    <subsection>
      <title>Section 1's Second Subsection</title>
      <para>Here is Section 1's second subsection.</para>
    </subsection>
  </section>
  <section>
    <title>Section 2</title>
    <para>Here is the introduction to section 2.</para>
    <subsection>
      <title>Subsection of Section 2</title>
      <para>Here is a subsection of section 2.</para>
    </subsection>
  </section>
</chapter>
For reasons we'll see below, many HTML end-tags are optional, but I included them in Example 8 to make comparison easier. I also added line numbers so that the discussion that follows the example could more easily refer to specific lines.
1. <html><head><title>Sample Structured Document</title>
2. <body>
3. <h1>Here is the Chapter Title</h1>
4. <p>This paragraph introduces the chapter.</p>
5. <h2>Section 1</h2>
6. <p>Here is the first section's first paragraph.</p>
7. <p>It has two paragraphs.</p>
8. <h3>Section 1's First Subsection</h3>
9. <p>Here is Section 1's first subsection.</p>
10. <h3>Section 1's Second Subsection</h3>
11. <p>Here is Section 1's second subsection.</p>
12. <h2>Section 2</h2>
13. <p>Here is the introduction to section 2.</p>
14. <h3>Subsection of Section 2</h3>
15. <p>Here is a subsection of section 2.</p>
16. </body></html>
If we look at a tree diagram of the non-HTML version (Figure 3) we see more groupings of elements into useful units than we see in the HTML version (Figure 4).
Why are these groupings valuable? Let's say I want to pull out the structured version's first section for use in another document. It would be easy to write a program that finds the first <section> tag in that chapter and then copies everything from there up to its corresponding </section> end-tag. With the HTML version, the first section begins at the chapter's first <h2> tag (line 5), but the corresponding </h2> end-tag on the same line only shows the end of the section's title. Our program must look for the next <h2> start tag (line 12) to know where the target section ends: just before that <h2> tag at the end of line 11.
If the program was pulling out that section's first subsection, it would have to look from that subsection title's <h3> start-tag (line 8) to just before the next <h3> start-tag (line 10), but what if we want to pull out that second subsection? If we grab everything from its <h3> tag (line 10) to just before the next <h3> tag (line 14) the way we did when we extracted the first subsection, we'd get the following section's title and introductory paragraph as well (lines 12 and 13)—lines that are outside of the subsection we want. Our program that extracts subsections from the HTML file needs extra rules and logic to figure out what lines it needs, but when extracting from the non-HTML file, it only needs one simple rule: to extract a subsection, get everything from its <subsection> start-tag to that tag's corresponding </subsection> end-tag.
In terms of the tree pictures, a document processing program can extract a useful subset of the document by simply clipping a particular branch and taking all of its descendants. This is much simpler than hunting through a series of HTML sibling elements and looking for the ones that meet certain complicated conditions. (This branch clipping approach is less abstract than it sounds—many software development kits for XML, SGML, and other kinds of applications read the data into the equivalent of a tree so that you can manipulate that tree to get what you want.)
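The difference between the two extraction strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production tool: the sample data is abridged from Examples 7 and 8, the helper name extract_flat is hypothetical, and the flat version assumes one element per line with headings at the start of a line.

```python
import re
import xml.etree.ElementTree as ET

structured = """<chapter><title>Here is the Chapter Title</title>
<section><title>Section 1</title>
<para>Here is the first section's first paragraph.</para>
<subsection><title>Section 1's First Subsection</title>
<para>Here is Section 1's first subsection.</para></subsection>
<subsection><title>Section 1's Second Subsection</title>
<para>Here is Section 1's second subsection.</para></subsection>
</section>
<section><title>Section 2</title>
<para>Here is the introduction to section 2.</para></section>
</chapter>"""

# Structured version: one rule. Find the branch and take all of it.
chapter = ET.fromstring(structured)
second_sub = chapter.findall("./section/subsection")[1]
print(ET.tostring(second_sub, encoding="unicode"))

flat = """<h2>Section 1</h2>
<p>Here is the first section's first paragraph.</p>
<h3>Section 1's First Subsection</h3>
<p>Here is Section 1's first subsection.</p>
<h3>Section 1's Second Subsection</h3>
<p>Here is Section 1's second subsection.</p>
<h2>Section 2</h2>
<p>Here is the introduction to section 2.</p>
<h3>Subsection of Section 2</h3>"""

def extract_flat(lines, start, level):
    """Copy lines from a heading up to the next heading of the same or a higher level."""
    stop = re.compile(r"<h([1-%d])>" % level)
    chunk = [lines[start]]
    for line in lines[start + 1:]:
        if stop.match(line):
            break  # a sibling or ancestor heading ends the unit
        chunk.append(line)
    return chunk

lines = flat.splitlines()
print(extract_flat(lines, 4, 3))
```

The flat version needs the extra rule discussed above: stopping only at the next h3 would drag in the title and introduction of Section 2, so the scan must also stop at any higher-level heading.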
Why would you want to extract anything? What's the advantage of moving through a document and pulling out pieces? Because information is an asset, and information that can be reused for multiple purposes has more value than information that is only available for one purpose. In business terms, the ability to make multiple products from the same body of information means more revenue from the same assets. For example, if you have all the details about a car's engine stored in a format that can be easily searched and manipulated, you can write programs that pull out:
A complete reference work on the engine.
A tutorial on fixing it.
A quick reference card on troubleshooting engine problems.
And that's just paper publications. Each of these three could be published on other media:
CD-ROM versions of the complete reference, the tutorial, and the quick reference card, giving you three more products.
Hand-held computer versions of all three for mechanics crawling under the car.
Web versions of all three.
This makes a total of twelve products from the same information. When the engine is upgraded, instead of undertaking twelve revisions, you update the main body of information once and then kick off the automated processes that create updated versions of the twelve publications.
Because HTML lacks the ability to represent such structure, it's not the best way to store a huge amount of information that you're going to put to multiple purposes. It became popular because of its simplicity and one particular application that used HTML files. This application, a free product, became the "killer app" for the Internet. "Killer app" is a computer industry term for an application that convinces huge numbers of people that they need a certain new technology. The killer app Long Playing record for hi-fi record players in the late nineteen-fifties was the soundtrack to the Broadway show "South Pacific"; the killer app that got millions of people to buy IBM's first personal computer in the early nineteen-eighties was the Lotus 1-2-3 spreadsheet; the killer app for the Internet was Mosaic.
From 1990 on, various experimental Web browsers were developed and released for public use. In February of 1993, the National Center for Supercomputing Applications (NCSA) released the first alpha, or early test version, of Marc Andreessen's Mosaic browser for UNIX machines. By running in a graphical interface environment (at first, UNIX XWindows), this program could display different elements in different fonts and even display pictures, and color pictures at that! It wasn't necessarily the first program to do this, but once the NCSA released Windows and Macintosh versions, nearly everyone with access to a computer that was connected to the Internet could download Mosaic for free, view Web pages graphically, and follow the hypertext links to other Web sites around the world. Anyone with an account on a UNIX machine that was running a Web server (a program that could send Web pages in response to requests made by browsers via the Internet) could create personal Web pages that were accessible by all these browsers. Familiarity with HTML was not required for this; plain text files also worked, but Mosaic displayed these in a boring, typewriter Courier font, and HTML wasn't that hard to learn, so soon everyone was cranking out HTML markup for their Web pages. Although the name "World Wide Web" had been around since 1991, it had finally become world-wide.
In March of 1994, Andreessen and some of his colleagues left the NCSA to form their own company: Mosaic Communications Corporation. Their new company, which later changed its name to Netscape Communications Corporation, made its browser available over the Internet for people to download for free. Another company known as Spyglass used much of Mosaic's code to develop its own browser, which Microsoft snapped up and turned into its Internet Explorer product after it realized in 1995 that it should take the Internet seriously after all. While the Web's growth didn't really explode until after the name Mosaic had faded away to a trivia question for old computer hackers, the two dominant Web browsers as we finish up the century started off as spinoffs of this historic application that made millions of people aware of the power of HTML and hypertext documents.
For many people, however, Mosaic wasn't an example of what you could do with HTML; Mosaic, HTML, the Web, and even the Internet were a seamless whole. For these people, h1 went from meaning "first level header" to "24 point bold Times Roman, suitable for first level headers". If they wanted slightly smaller bold Times Roman text to appear on their Web page, they tagged it with h2 tags. If they wanted a word in the middle of a sentence bolded, they put b tags around it, and if they wanted it italicized, they put i tags around it. The possibility that someone might read their Web page using a Braille Web browser, or listen to an automated voice system read it, never occurred to them; they were creating pages to be read using Mosaic (and eventually Netscape and Internet Explorer), so whether they italicized a word because it was a foreign term or a C++ keyword was irrelevant. Since the blockquote tag added a wider left and right margin to text tagged as a blockquote, it became common practice to put a <blockquote> start-tag at the beginning of a document and the </blockquote> end-tag at the end to give the whole thing a more professional-looking margin. This element's original purpose—identifying a quotation that was more than one or two lines long—became as irrelevant as the effect of b and i tags on Braille browsers.
People were confusing structure and appearance. Document designers who thought "h1 is how you make text appear in 24-point bold Times Roman" misunderstood the distinction between structure and appearance, a misunderstanding that made it difficult to fully exploit either. Even the img element type showed at least one symptom of this confusion: its align attribute did not provide information about the image specified by the element's src attribute, but instead described how the image should appear visually on the screen.
This emphasis on using HTML tags to control visual design led to a wish for even greater control over document appearance. (While you could reset typefaces and fonts for your own copy of the browser, so that all h1 elements sent to your browser displayed on your screen using a font of your choice, this wasn't good enough; people wanted control over how their documents appeared to other people on the Internet.)
Eventually, the font tag was added, which let you specify the typeface and size of the text outright. (For example, the word "hello" after the font start-tag <font face="arial" size="4"> will be displayed in the Arial font a little larger than normal body text. Don't ask about the details of specifying font size this way; it's one of HTML's messier details, and the Cascading Style Sheets described below let you describe font size in points, a much more natural way to do it.)
Using this element let you design an attractive Web page if you had a knack for selecting appropriate fonts, but for applications that treated Web pages as structured data, it was useless. A program that extracts particular sections or subsections of a chapter can't possibly know their boundaries based on markup for font sizes and typeface names used in place of more structural elements such as h1, h2, and p.
I mentioned earlier that markup indicates the structure or appearance of documents. A collection of markup tags that indicate some structure and some appearance, which is how HTML came to be used by 1997, is like jazz-rock fusion: in trying to be two things at once, it does both badly. People needed a way to indicate structure and appearance separately so that they would have greater control over both.
One approach that gained popularity around this time was Cascading Style Sheets (CSS). At its simplest level CSS lets you define, at the beginning of your document, how you want each of your element types to look. As an alternative to scattering font tags throughout your document, a Cascading Style Sheet lets you use the HTML structural tags and be confident that the document's elements will look the way you want with a minimum of markup.
For example, after you include the style element shown in Example 9 inside of a Web page's head element, a browser will display the body, h1, h2, and pre elements according to these instructions instead of using the default font faces and sizes.
<style><!--
body { background: whitesmoke; margin: 20px 50px 20px 50px }
h1 { font-family: arial,helvetica; font-size:140% }
h2 { font-family: arial,helvetica; font-size:110% }
pre { font-family: system; font-size:75% }
--></style>
This approach gives you easier control over a document's appearance by isolating presentation information at the document's beginning. This way, you can rely on structural tags for the remainder of the document. This makes the document more valuable to applications that use it for other purposes, because these applications won't need to scan through font, b, and i tags to find the content they want. (Identifying the ending of hierarchical sections, however, will still be the same problem it always was in HTML.)
HTML had another problem: while an official standard defining HTML existed, it had competition. Since 1994, the standard has been maintained by the World Wide Web Consortium (or "W3C"), a collection of companies and universities around the world interested in developing and promoting common protocols for the Web's evolution. Before then, Web development activity was centered in the European Laboratory for Particle Physics (CERN) in Geneva, Switzerland, where the Web was first developed around the ideas of CERN's Tim Berners-Lee. Being more concerned with particle physics than the Web, CERN (with MIT's Laboratory for Computer Science) helped to found the W3C to oversee standards development.
Mosaic's derivatives supported the W3C's HTML standard, but they had big ideas about going beyond that standard. Any software vendor tries to outdo its competition by adding new features to its applications that are unavailable in competing products, and browser vendors often did this by inventing flashy new HTML element types that your documents could use if people viewed them with the vendor's latest browser. Sometimes these features would eventually be incorporated into the W3C standard, but the vendors' attempts to outpace each other have always sent them along paths that differed slightly from each other and from the standard.
Cascading Style Sheets encouraged a return to the use of HTML's structural tags, which was great for programs creating a table of contents from Web pages with properly arranged h1 and h2 elements. SGML people, however, knew that designing their own classes of elements, or "element types", let them create documents that were useful to a wider variety of applications.
For example, plenty of Web sites offered HTML pages of recipes; if ingredients were tagged as ingredient elements instead of HTML li list items, and the total preparation time was tagged as a totpreptime element instead of as an italicized paragraph, an application could easily search a cookbook Web site for a recipe that used chicken and lemon grass but took less than 40 minutes to cook. (Web sites that did offer this kind of capability didn't search through HTML pages of recipes—they searched SGML documents or relational databases and then temporarily converted any found recipes to HTML for retrieval by the browser that sent the query.)
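A search like the one just described becomes a few lines of code once the markup names what each piece of text is. The sketch below uses Python's standard xml.etree.ElementTree module. The element type names ingredient and totpreptime come from the discussion above; the cookbook, recipe, and title elements, the sample recipes, and the convention that totpreptime holds a whole number of minutes are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the document layout and the
# minutes-as-integer convention are assumptions.
cookbook = ET.fromstring("""
<cookbook>
  <recipe>
    <title>Lemon Grass Chicken</title>
    <ingredient>chicken</ingredient>
    <ingredient>lemon grass</ingredient>
    <totpreptime>35</totpreptime>
  </recipe>
  <recipe>
    <title>Slow Chicken Stew</title>
    <ingredient>chicken</ingredient>
    <ingredient>lemon grass</ingredient>
    <totpreptime>120</totpreptime>
  </recipe>
  <recipe>
    <title>Carrot Soup</title>
    <ingredient>carrots</ingredient>
    <totpreptime>25</totpreptime>
  </recipe>
</cookbook>
""")

def find_recipes(root, wanted, max_minutes):
    """Titles of recipes that use every wanted ingredient in under max_minutes."""
    titles = []
    for recipe in root.iter("recipe"):
        ingredients = {i.text.strip().lower() for i in recipe.iter("ingredient")}
        minutes = int(recipe.findtext("totpreptime"))
        # wanted <= ingredients is a subset test on the two sets.
        if wanted <= ingredients and minutes < max_minutes:
            titles.append(recipe.findtext("title"))
    return titles

print(find_recipes(cookbook, {"chicken", "lemon grass"}, 40))
```

The same query against HTML list items and italicized paragraphs would have to guess which li elements were ingredients and which paragraph held the preparation time.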
So why not put SGML pages on the Web instead of HTML pages? SGML was designed—a decade before the Web even reached the experimental stage—to be very flexible so as to accommodate the widest possible range of uses. A proper SGML application must account for all of this flexibility, which makes for a difficult, complicated program to write.
For example, SGML lets you redefine the tag opening and closing characters so that you can write an ingredient start-tag as (ingredient) instead of <ingredient> and its end-tag as (!ingredient) instead of </ingredient>. You can redefine the characters that delimit an entity reference, so that instead of entering &lt; when you want the processor to output a "<" you could enter {lt}.
Although a negligible number of SGML documents actually redefine tag and entity reference delimiters, proper conforming SGML software must be prepared to handle these redefinitions, and that makes for bigger, more complex programs.
SGML also allows the omission of start- and end-tags when a system can figure out the omitted tags from the included markup. HTML uses this often; for example, while the end of an ordered list must almost always be clearly specified with an ol end-tag (</ol>), HTML paragraph start-tags (<p>) rarely need a matching </p> end-tag because a paragraph's end is usually obvious. If a <p> start-tag is followed by text and then another <p> start-tag, it wouldn't make sense to consider the second p element as being inside the first one. A browser can therefore assume that the first p element ends just before the second one begins.
For example, if line 6 of Example 8 (lines 5 through 8 are reproduced in Example 10) had no </p> end-tag, a browser wouldn't treat line 7's p element as being inside of the p element begun on line 6; it would assume that line 6's p element was finished.
5. <h2>Section 1</h2>
6. <p>Here is the first section's first paragraph.</p>
7. <p>It has two paragraphs.</p>
8. <h3>Section 1's First Subsection</h3>
(An ordered list usually needs its </ol> end-tag because many element types are legal inside of a list item. Therefore, the beginning of a particular new element after the start of the last item of the ordered list may not indicate the end of the list.)
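The kind of deduction a browser makes for omitted </p> end-tags can be sketched with Python's standard html.parser module. This is a simplified illustration, not how any particular browser works: the sets CLOSES_P and BLOCK_ENDS are assumed rule lists covering only a handful of element types, and the sketch does not preserve comments or entity references in its output.

```python
from html.parser import HTMLParser

# Assumed rules for this sketch: these start-tags end an open p element,
# and these end-tags close a container that an open p must be inside.
CLOSES_P = {"p", "h1", "h2", "h3", "ul", "ol"}
BLOCK_ENDS = {"body", "html", "li", "ul", "ol"}

class ImpliedEndTags(HTMLParser):
    """Rewrites simple input so every p element has an explicit end-tag."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False

    def close_p(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P:
            self.close_p()  # a new block cannot sit inside a paragraph
        self.out.append(self.get_starttag_text())
        if tag == "p":
            self.p_open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.close_p()  # explicit end-tag: nothing to infer
            return
        if tag in BLOCK_ENDS:
            self.close_p()  # the enclosing element is ending too
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

def add_p_end_tags(text):
    parser = ImpliedEndTags()
    parser.feed(text)
    parser.close()
    parser.close_p()  # a paragraph still open at end of input
    return "".join(parser.out)

print(add_p_end_tags("<p>Here is the first section's first paragraph."
                     "<p>It has two paragraphs."
                     "<h3>Section 1's First Subsection</h3>"))
```

The rules are hard-coded because HTML's element types are fixed. A general SGML tool would have to derive the equivalent of CLOSES_P from each document type's own declarations, which is the harder task described in the next paragraph.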
Deducing the location of missing end-tags is not difficult for HTML Web browsers because they always deal with the same limited set of element types and assumptions. SGML software must calculate a new set of assumptions each time it comes across a new set of element types and element ordering rules, and writing software to do this is no trivial task.
SGML also offers many markup abbreviation techniques that may not be widely used but must still be supported by SGML applications. SGML's designers included these features to reduce the amount of typing required when entering SGML markup into text editing programs. Tag omission is one such technique; others have been forgotten or never learned by those SGML users who create SGML documents using editing tools that automate much of this markup entry for them.
For example, with the right setup of your SGML environment, entering an ingredient start-tag as <ingredient/ lets you enter its end-tag as a simple slash. Instead of marking up the word "carrots" as <ingredient>carrots</ingredient> you could just enter <ingredient/carrots/. People rarely need such a trick when using an SGML editing program that lets you pick "ingredient" from an element type menu and then enters the complete ingredient start- and end-tag for you. You only need to type the element's content: the word "carrots".
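To show what software has to do with this abbreviation, here is a rough Python sketch that expands the short form back into full tags. It is a deliberate oversimplification: a real SGML parser decides how to read the slash from the document's declarations, while this regular expression simply assumes that element names are word characters and that the content contains no slash or markup of its own. The function name and the second sample ingredient are made up.

```python
import re

# Simplified assumption: <name/content/ with no slash or "<" in the content.
NET_SHORT_TAG = re.compile(r"<(\w+)/([^/<]*)/")

def expand_short_tags(text):
    """Rewrite <name/content/ as <name>content</name>."""
    return NET_SHORT_TAG.sub(r"<\1>\2</\1>", text)

print(expand_short_tags("Add <ingredient/carrots/ and <ingredient/onions/."))
```

Ordinary end-tags such as </ingredient> are left alone, because the pattern requires a name between the opening angle bracket and the first slash.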
An SGML Web browser needs one more thing, a requirement unrelated to the SGML standard: instructions for displaying the various elements. Any HTML browser knows that h2 is a second level heading, and should look similar to a first level heading but a little smaller, and that text between b tags should be bolded and text between i tags should be italicized.
When an SGML browser receives a document using a customized set of element types called recipe, ingredient, totpreptime, and step, how does it know the font to use for each of these? Which elements does the browser display on their own line, and which get displayed on the same line as their parent element? If any elements are hypertext links, how does the browser know this, and where does it look for information about their link destinations?
All this information must be supplied as a stylesheet. There was no real standard for stylesheet syntax until ten years after SGML became a standard, and when DSSSL (the Document Style Semantics and Specification Language) did come along, it was powerful enough to present many of its own intimidating requirements to developers who wanted to create software for it.
For an SGML browser to be prepared for all of this, it had to be a large, complex program. Gloomy as this sounds, some SGML Web browsers were developed. They each had a unique syntax for specifying document appearance, they were often difficult to set up and use, and any free versions had limited features and availability. (After all the work that went into developing them, asking their developers to give away the product of this labor would be a bit much.)
To summarize (and perhaps oversimplify), full SGML was too much for what many needed, and HTML wasn't enough. Some tried to remedy the latter by adding features to HTML such as new tags, embedded programming languages to manipulate data, and new ways to specify element appearance. All these expanded the range of an HTML Web page's possibilities; unfortunately, the lack of coordination of these efforts and their accumulation on a foundation that was unprepared for so much all added up to a mess.
If full SGML was too much, and HTML wasn't enough, and adding on to HTML didn't work well, how about a version of SGML that stripped away the fancier, little-used parts to leave something easier to implement? In 1986, SGML offered many, many features to account for the wide variety of directions that publishing technologies might take and to accommodate the specialized needs of users with enormous document collections. Ten years later, the World Wide Web had emerged as a unique user environment, well-defined enough to make it easier to decide which SGML features were superfluous to its needs.
In the summer of 1996, Jon Bosak of Sun Microsystems recruited a group of SGML experts and formed the XML Working Group. Working under the auspices of the W3C (with Dan Connolly, a key figure in HTML's early history, serving as the Working Group's contact with the W3C), they hammered out a version of SGML that was simpler to implement, especially for delivering documents over the Web.
In the course of numerous meetings, conference calls, and over three megabytes of e-mailed discussion and proposals among the Working Group and the XML Special Interest Group (an advisory body of SGML experts), they managed to put together release 0.01 of the specification's initial working draft by November of that year. They told the world about it at the SGML '96 conference in Boston, the largest group of SGML people ever assembled up to that time.
XML's customization of SGML had several important properties:
It made specific choices of syntax characters which must be used in all documents. For example, all XML tags must begin with the less-than (<) character and end with the greater-than (>) character, and all entity references must begin with an ampersand (&) and end with a semicolon (;).
Empty XML elements can either have a normal start- and end-tag with no content between them or an empty-element tag, which looks like a start-tag with a slash character just before the closing bracket, like this:
<img src="largeglass.jpg"/>
The Document Type Definition (DTD) that spells out the structure of a particular class of SGML documents need not be present, or even identified. Because no tags or attribute names can be omitted from an XML document, applications can accomplish plenty with a document that has no DTD. (They can do even more using the DTD, which plays an important role in the more powerful XML applications.)
These simplifications mean less work for programs that have to parse (that is, determine the structure of) XML documents, which means that software should be much easier to develop.
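As a rough illustration of these properties, the following Python sketch reads a document that has no DTD at all. The parser (the standard library's xml.etree.ElementTree) is my choice for illustration, and the recipe markup and the summarize function are hypothetical; nothing here comes from the specification itself.

```python
import xml.etree.ElementTree as ET

# A hypothetical document with no DTD and no document type declaration.
# It contains both forms of an empty element and one entity reference.
RECIPE = """<recipe>
  <ingredient>carrots</ingredient>
  <img src="largeglass.jpg"/>
  <img src="largeglass.jpg"></img>
  <note>salt &amp; pepper</note>
</recipe>"""


def summarize(xml_text):
    """Return (tag, attributes, text) for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib), child.text) for child in root]


if __name__ == "__main__":
    for row in summarize(RECIPE):
        print(row)
```

The parser reports the same tag and attributes for both forms of the img element, and it resolves the &amp; entity reference without any help from a DTD.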
Although XML is essentially a stripped-down version of SGML, and an understanding of full SGML provides interesting perspectives on XML's heritage, no knowledge of full SGML is required to use XML. The XML specification (far, far briefer than the SGML specification) stands on its own as a self-contained document. Along with the XLink, XPointer, and XSL specifications, it provides you with the power to create rich, valuable documents that will someday make HTML Web pages look like primitive early attempts at Internet publishing.
We've seen that XML provides a way to define your own structure for documents, and we've seen the advantage of keeping structure and presentation information separate. Pushing presentation out of the picture gives us more flexible and therefore more valuable documents, but what happens when we want to present our documents? If XML is supposed to provide a superior alternative to HTML, it had better look at least as good as HTML looks with Cascading Style Sheets. The documents should let you do more with them, too, if XML is more sophisticated than HTML.
This is where XLink, XPointer, and XSL come in. The XML Linking Language (XLink) does much more than define ways to offer hypertext links: it provides a mechanism for defining relationships between elements and offers multiple ways to represent those relationships. Underlined text that jumps you to some point in the current document or another document, the way HTML's anchor (a) element does, is only one way to represent such a relationship. Its companion, the XML Pointer Language (XPointer), provides ways to point to specific elements, character strings, or even individual characters of an XML document, even if that target has nothing identifying it as part of a link.
The Extensible Stylesheet Language (XSL) lets you define the appearance of your XML elements: fonts, text size, bolding, italicizing, line spacing, and other aspects of a document's visual design. It goes beyond this (and Cascading Style Sheets) by offering a scripting language that allows you to rearrange document content and to conditionally execute instructions based on evaluation of the document's data, structure, or other properties—for example, to format a report one way if the report element's stage attribute has a value of "draft" and another way if it is "final".
Hypertext has a long, distinguished history of investigation into different ways to represent and follow relationships among electronic document content. HTML implemented a simple way to indicate that a word or phrase was related to another element in the same document, to an element in a different document, or to an entire separate document: you enclose the word or phrase with start- and end-tags for the a element and use this element type's href attribute to indicate where to jump when the user clicks the tagged text.
For example, the code in Example 11 tells an HTML Web browser to display the phrase "World Wide Web Consortium" differently from the other text and to jump to the W3C's home page when that text is clicked.
The <a href="http://www.w3c.org">World Wide Web Consortium</a> was founded in 1994.
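To make the two parts of such a link concrete (the destination in the href attribute and the tagged text that the user clicks), here is a minimal Python sketch that pulls both out of an HTML fragment. It uses the standard library's html.parser module; the LinkCollector class and the extract_links function are hypothetical names introduced for this illustration, and the sketch ignores nested a elements.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from a elements that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the a element currently open, if any
        self._text = []     # text seen so far inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html_text):
    """Return a list of (destination, clickable text) pairs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = ('The <a href="http://www.w3c.org">World Wide Web Consortium</a> '
              'was founded in 1994.')
    print(extract_links(sample))
```

Everything the HTML link knows about itself fits in that one pair: a single source phrase and a single destination.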
Six years after SGML became a standard, the Hypermedia/Time-based Structuring Language, or "HyTime", became an ISO standard: ISO/IEC 10744. As a hypermedia structuring language, it offers ways to define relationships (whether implemented as links or not) between components of document content. As a time-based structuring language, it lets you describe relationships between more than just elements of text and static pictures; you can specify a link between the fourth second of an audio clip and a particular point on an image ten seconds into a video clip. HyTime was designed as a set of standardized constructs for using SGML to represent hypermedia and multimedia information.
Like SGML, HyTime gained a reputation for being powerful but difficult to implement. Even so, the many features in its broad scope gave the XML Working Group a lot of good ideas about ways to represent relationships between elements. The Working Group was also influenced by the Text Encoding Initiative (TEI), an international organization developing a set of standards for representing classic works of literature for scholarly study (in particular, developing DTDs for the SGML markup of this literature).
The TEI's concept of extended pointers provided the XML Working Group with the model for the XML Pointer Language, or XPointer. XPointer offers ways to define link targets that are much more flexible than HTML's a href links, and the TEI's use and refinement of extended pointers over several years gave XPointer the advantage of being based on a practical reality instead of mere theory.
XLink describes syntax for defining relationships between objects, or the "participating resources" of a link. According to the XLink specification, XLink "uses XML syntax to create structures that can describe the simple unidirectional hyperlinks of today's HTML as well as more sophisticated multi-ended and typed links". These more sophisticated links can say far more about a linking element and its anchor resources than the HTML anchor (a) element can say about a link's source and target.
More information in a linking element means that a processing application can determine an appropriate way to represent it. A link resource might be implemented as:
A pop-up window.
Highlighted text in a secondary "References" browser window.
An e-mail message to your mailbox.
One of the HTML ways: by replacing a linking element that is also a participating resource with another resource—either a document or an a element with a name attribute.
XLink takes from HyTime the ability to describe different links by their purpose (for example, a footnote reference, a citation, or a glossary reference) and to then keep the implementation details of each link type separate from its meaning, much as XML and SGML keep an element's purpose separate from its presentation details.
As a companion to XLink's greater flexibility in describing the nature of a link, XPointer offers many powerful new capabilities to describe participating link resources such as link targets. (Well, they're new to the HTML user—researchers involved in serious hypertext have known about them for a while.)
While an HTML a href element links either to an entire document or to a specific a element with a name attribute specified, XPointer offers ways to point at anything you want in a target document without requiring the target document to have predefined link anchor elements. XPointer offers syntax that lets you say "link to the third bulleted item in the second bulleted list in the fourth chapter", or even to the third letter in that bulleted item. Instead of forcing you to link to either an entire document or to a single element in that document, you can specify two points anywhere within a target document as the beginning and end of a link resource. This lets you link to a particular quotation, description, single word, or any range of text you like.
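The idea of addressing a target by its position, with no anchor placed in the target document, can be sketched in Python. Note the assumptions: the path syntax below is the small XPath subset supported by the standard library's xml.etree.ElementTree, not XPointer syntax, and the book, chapter, list, and item element type names and the helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical target document. Nothing in it is marked as a link anchor.
BOOK = """<book>
  <chapter/><chapter/><chapter/>
  <chapter>
    <list><item>one</item></list>
    <list>
      <item>alpha</item>
      <item>beta</item>
      <item>gamma</item>
    </list>
  </chapter>
</book>"""


def resolve(xml_text, path):
    """Return the first element matching a positional path, or None."""
    return ET.fromstring(xml_text).find(path)


def char_range(element, start, end):
    """Return the characters of an element's text from start up to end."""
    return element.text[start:end]


if __name__ == "__main__":
    # "The third item in the second list in the fourth chapter."
    item = resolve(BOOK, "chapter[4]/list[2]/item[3]")
    print(item.text)                 # the whole item
    print(char_range(item, 2, 3))    # only its third letter
    print(char_range(item, 1, 4))    # a range between two points
```

The target document needs no preparation; the pointing is done entirely from outside it.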
XLink's extended links (not to be confused with the TEI extended pointers that influenced the XPointer language) let you link one piece of information to several others. Such links might pop up a list of the choices, or send you to all the resources in some order of your choosing, or let you search the set of participating resources for something—the possibilities are up to the imaginations of developers creating XML applications.
Once browsers are available that support these new ways of linking information, XLink's features will provide many of the most exciting aspects of XML systems. Why? Because while XML and XSL offer sensible ways to do things that proprietary publishing systems have already done for years, XLink gives us ways to do entirely new things, creating Web pages that are a huge step beyond HTML's capabilities.
The Extensible Stylesheet Language (XSL) lets you define the visual appearance of the elements in your XML documents. Formatting may be based, among other criteria, on:
An element's position: for example, assigning one style to the first element of a bulleted list and a different style to the list's remaining elements.
An element's ancestors: for example, assigning one style to a title element inside of a chapter element and another to a title in a section element.
An element's attribute values: for example, assigning one style if an element's status attribute has a value of "overdue" and another if it equals "pending".
It also provides for generated text, the use of a scripting language for more sophisticated tasks, the definition of reusable macros, and many other powerful features.
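The three criteria listed above can be sketched as an ordinary Python function. This is not XSL syntax, only an illustration of the kind of decision a stylesheet processor makes; the style names, the task element type, and both functions are hypothetical, and for simplicity the sketch checks only an element's parent rather than all of its ancestors.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter>
  <title>Soups</title>
  <section>
    <title>Stock</title>
    <list><item>a</item><item>b</item></list>
    <task status="overdue"/>
    <task status="pending"/>
  </section>
</chapter>"""


def choose_style(element, parent, index):
    """Pick a (hypothetical) style name for one element."""
    # Position: the first item of a list versus the remaining items.
    if element.tag == "item":
        return "first-item" if index == 0 else "other-item"
    # Ancestry (simplified to the parent): chapter title versus section title.
    if element.tag == "title":
        return "chapter-title" if parent.tag == "chapter" else "section-title"
    # Attribute value: the status attribute decides the style.
    if element.tag == "task":
        status = element.get("status")
        if status == "overdue":
            return "alert"
        if status == "pending":
            return "muted"
    return "default"


def styled(xml_text):
    """Return (tag, style) pairs for every element below the root."""
    result = []

    def walk(parent):
        for index, child in enumerate(parent):
            result.append((child.tag, choose_style(child, parent, index)))
            walk(child)

    walk(ET.fromstring(xml_text))
    return result


if __name__ == "__main__":
    for pair in styled(DOC):
        print(pair)
```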
XSL is also based on a standard designed to work in conjunction with SGML: the Document Style Semantics and Specification Language, or DSSSL. DSSSL (another ISO standard, ISO/IEC 10179) defines processing of SGML documents. By "processing", this certainly means the specification of fonts, type sizes, text spacing and colors, but it also means any useful manipulation of SGML documents.
DSSSL, like HyTime, is considered too much to implement in a browser receiving documents over the Web. Early efforts to create a simpler version called "DSSSL-O" (for "DSSSL-online") involved several of the same people working on XML, and DSSSL-O evolved into XSL.
At first glance, XSL doesn't at all resemble DSSSL, whose roots in the Scheme and LISP programming languages mean that it groups related programming structures by using lots and lots of parentheses. XSL looks like XML, in accordance with its second design goal: "XSL should be expressed in XML syntax".
The fifth design goal is that "XSL will be a subset of DSSSL with the proposed amendment" (that is, once the DSSSL specification is amended to make it a proper superset of XSL). Although XSL and DSSSL look different, they're semantically similar enough that one of the first pieces of XSL software available was Henry Thompson's xslj program, which converts an XSL stylesheet to a DSSSL specification.
While XSL addresses many of the same problems as Cascading Style Sheets, it doesn't compete with CSS, a W3C standard you can use with XML documents as well as with HTML documents. XSL design goal six states that "A mechanical mapping of a CSS stylesheet into an XSL stylesheet should be possible". In other words, programs that convert a CSS stylesheet to an XSL one should be easy to write.
Why convert a CSS stylesheet to XSL? CSS stylesheets can do quick formatting with very little code and often make an ideal starting point for formatting a particular class of documents for Web delivery. However, as an online publishing system grows more complex, a developer trying to squeeze too much out of CSS will eventually hit a wall. For example, CSS offers no way to reorder elements, an essential capability when creating customized publications from subsets of a large document collection.
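To show what reordering means in practice, here is a minimal Python sketch that rearranges the children of a document's root element. It is plain Python rather than XSL, and it stands in only for the kind of transformation that a style language limited to appearance cannot express. The element type names are borrowed from the recipe example earlier in this chapter; the reorder function is hypothetical.

```python
import xml.etree.ElementTree as ET


def reorder(xml_text, order):
    """Return a new root whose children follow the given order of tags.

    Children whose tags are not listed in order are left out, which is
    how a customized publication might select a subset of its source.
    """
    source = ET.fromstring(xml_text)
    result = ET.Element(source.tag, dict(source.attrib))
    for tag in order:
        result.extend(source.findall(tag))
    return result


if __name__ == "__main__":
    recipe = ("<recipe><step>Boil</step><totpreptime>10</totpreptime>"
              "<ingredient>carrots</ingredient></recipe>")
    rearranged = reorder(recipe, ["totpreptime", "ingredient", "step"])
    print(ET.tostring(rearranged, encoding="unicode"))
```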
Converting an XSL stylesheet to a CSS one isn't so easy, because XSL can do so much more than CSS. In addition to macros and element reordering it offers ECMAScript, a nonproprietary specification of Netscape's JavaScript programming language developed by the European Computer Manufacturers Association. ECMAScript is the key to going beyond the specification of element appearance to real application development, because it lets you evaluate document content and then conditionally execute different script instructions based on the results.
The W3C's Web site (http://www.w3.org) has special sections that are only accessible to members and invited experts taking part in W3C work. However, the Notes, Working Drafts, Proposed Recommendations, and Recommendations are available for the world to see and read at http://www.w3.org/TR. This includes the XML specification, so why publish a book of them?
Design goal 8 of the XML specification (1.1, "Origin and Goals") tells us that "the design of XML shall be formal and concise". Webster's New World College Dictionary has over ten definitions of "formal" as an adjective; the most common uses of the word are defined by definitions 4b ("stiff in manner; not warm or relaxed") and 5a ("designed for use or wear at ceremonies, elaborate parties, etc. [formal dress]"). XML design goal 8 uses the word in the sense of Webster's definition 8, a less commonly used meaning: "done or made according to the forms that make explicit, definite, valid, etc. [a formal contract]".
Specifying XML according to forms that make it explicit and definite means using the concise language of computer science, which often means using well-known English words in a sense very different from their commonly understood meanings. "Formal" is one example; others include "grammar", "production", "token", "deterministic", and "terminal". Someone with a computer science background knows that "production" is unrelated to show business and that "terminal" doesn't describe a place where one can use public transportation, or even a keyboard and monitor hooked up to a computer somewhere. They'll also understand terms such as "nonterminal", "big-endian", and "Extended Backus-Naur Form", which mean little outside of the world of computer science.
Why doesn't the specification explain these terms? Because of the other part of design goal 8: the specification should be concise. The background and definitions necessary to understand these terms are out there, available for looking up. That's still a lot of work to ask of your average Web page designer who's interested in new technology but lacks a computer science degree. The goal of this book is to save that designer all that work by adding all the necessary explanations to the specs themselves.
But this book adds more than that: in addition to background on the "what", this book explains the "why" of much of the XML specification. After the XML Working Group debated a point and then decided how XML would handle a certain issue, they just laid it out in the spec: "here's how to do this part". They didn't need to justify themselves, explaining the alternatives they considered and the relative merits and deficiencies of each approach; they wanted to be concise, and we trust that this group of some of the most expert people in the SGML world carefully reasoned out their decisions.
Well, sure, trust is great, but besides satisfying our curiosity, knowing the "why" of various decisions helps us to better understand XML's strengths. If SGML offered four ways to accomplish some task and the XML Working Group picked one and threw out the other three, knowing the reasoning that led to this decision helps you to make better use of that feature of XML.
To learn this reasoning, I read through many megabytes of the debates and discussions that led to the XML specification. It was a tremendous learning experience for me and I've done my best to assure that what I learned is reflected in this book's annotations.
I chose to show commas and periods outside of quotation marks, according to the British style, "like this", instead of putting them inside as is usually done in American publishing, "like this." The main reason was to stay consistent with the style used in the specification itself. The British style makes more sense anyway when writing about computers and software; if I use the American style to tell you that your password is "swordfish," it is ambiguous whether the comma is part of the password. If I use the British style to tell you that it's "swordfish", it's much clearer exactly which characters make up your password.
In keeping with this theme of American work in a British style, most example text that I didn't make up myself is from T.S. Eliot's "The Waste Land", a work whose recent seventy-fifth birthday has put it into the public domain.