François Chahuneau: Beyond the SGML DTD


Date: Mon, 02 Feb 1998 06:13:33 +0100 (MET)
Resent-from: w3c-xml-sig@w3.org
From: fcha@ais.Berger-Levrault.fr (Francois Chahuneau)
Subject: Religious Wars (Re: Beyond the DTD...)

[Note: The following posting was submitted to the W3C WG, and has been reproduced here with the express permission of the author. The context of this posting is a discussion about XML-Data, use of architectures to express and check semantic constraints, overlapping roles of DTDs and new schema languages, etc. The point of interest below ('DTD serving as grammar and schema') was elaborated more fully in Chahuneau's SGML/XML '97 presentation "SGML and Meta-information: From SGML DTDs to XML-DATA."]


Let me add my two cents to this discussion. First, a disclaimer:

Therefore, I obviously have no immediate interest in seeing DTDs die. . .

However, I can see many reasons why Microsoft and others may have felt the need to propose something new in the context of XML. I am only considering "intellectual" reasons here. . . (:-).

The big problem with DTDs is that they reflect a serious confusion between two totally different things, which has plagued SGML since its origins and made it appear esoteric to many people. Let me explain this (note: this argument is more or less the content of the speech I gave at SGML '97 in Washington).

SGML DTDs were designed to be two things at the same time:

I am carefully avoiding the terms "syntactic constraints" and "semantic constraints" here, because these terms may be interpreted in various ways (especially the second one) and should be defined first. I have defined the way in which I was using the terms "grammar" and "schema" here. I understand that Eliot's definition of "syntactic constraints" has to do with matching the "tagset" (element and attribute names, weak data typing provided by attribute types) but, given my definition of the term "grammar" in this context, I would rather see such constraints as part of the schema information (or a very weak schema of their own if they are the only constraints). In my definition, the "grammar" does not express any constraints; it's only needed (with SGML) to be able to parse the instance and correctly abstract the information it conveys (of course, with SGML, part of this information can be in the SGML declaration rather than the DTD).

Because in XML the syntax is "fixed" (no SGML declaration, no minimization, explicit markup of empty elements, etc.), you no longer need the "grammar" role of DTDs to properly access the information content of a document. As long as it is well-formed, you can represent it as a tree of typed nodes with attributes, process it, store it, etc. The question of whether the observed "de facto" structure complies or not with some set of generic rules (a schema) may be quite essential for your application... especially if its purpose is precisely to ensure it! But it belongs to a totally different scope than the question of knowing whether or not you grabbed document information correctly (i.e., in line with the originator's intent).
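As an illustration of the point above, here is a minimal Python sketch that reads a well-formed document with no DTD at all and abstracts it as a tree of named nodes with attributes. The sample document and the helper name `outline` are assumptions made up for this example; only the standard library parser is used.

```python
import xml.etree.ElementTree as ET

# A well-formed document with no DOCTYPE declaration (hypothetical sample).
DOC = """<memo priority="high">
  <to>Eliot</to>
  <body>DTDs as <em>grammar</em> and schema.</body>
</memo>"""

def outline(elem, depth=0):
    """Return one line per element: indentation, tag name, sorted attributes."""
    attrs = " ".join(f'{k}="{v}"' for k, v in sorted(elem.attrib.items()))
    lines = ["  " * depth + elem.tag + (" " + attrs if attrs else "")]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

# Parsing succeeds on well-formedness alone; no grammar is consulted.
root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Whether this observed structure complies with some schema is a separate question, which the parser above never asks.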

The current DTD syntax expresses this dual role: grammar + schema. In SGML, these two notions are not clearly separated. The SGML DTD syntax does reflect this confusion and, to some extent, encourages it. When teaching SGML to computer science students at the University, I always expect this question from the most clever ones: "but what does this '- O' thing appearing in the element declaration have to do with the notion of generic structure you explained before?" Well... not much! Therefore, adopting a new syntax might be felt as necessary to move beyond this state of things.

DTDs, as a data modelling language, have limited semantics, which XML-Data schemas try to expand by adding things such as strong data typing and lexical constraints. There are still some holes (for instance you can say that an element should appear one or several times, but you have no way to say "no more than five"), but the fact that XML-Data did not make much effort in this direction is just one more indication that it does not particularly target document modelling. (And, anyway, any semantics is always limited at some point . . .).
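To make the "no more than five" hole concrete, the sketch below checks a bounded occurrence count in application code, which is where such a rule has to live when the schema language cannot state it. The function name `occurs_within` and the `authors`/`author` element names are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

def occurs_within(parent, child_tag, min_occurs, max_occurs):
    """True if parent has between min_occurs and max_occurs direct children named child_tag."""
    count = len(parent.findall(child_tag))
    return min_occurs <= count <= max_occurs

# Two hypothetical instances: one with 2 authors, one with 6.
short_list = ET.fromstring("<authors><author/><author/></authors>")
long_list = ET.fromstring("<authors>" + "<author/>" * 6 + "</authors>")

# A content model such as (author+) accepts both; the bound 1..5 rejects the second.
short_ok = occurs_within(short_list, "author", 1, 5)
long_ok = occurs_within(long_list, "author", 1, 5)
print(short_ok, long_ok)
```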

Concerning the introduction of proper inheritance and related things, I don't see this as an extension of semantic power, but more as something needed to bring schema design and representation more in line with modern, object oriented programming practice (modular design etc.).

Last thing: XML-Data schemas, being XML fragments, can be manipulated, stored, indexed, etc. . . using the same techniques as those developed to manipulate, store, index XML instances, which is a big practical advantage.

Therefore, welcoming XML-Data schema as some evidence of scientific progress is an issue of common sense, and I don't really understand why this "DTDs versus XML-Data discussion" begins to look like a religious war. I am perfectly aware of the obvious political aspects behind all this, but this is not enough to make me say that it is a bad idea. It is a good idea that has to be managed. . .

As far as I am concerned, the fact that XML-Data schemas encompass the semantic expressiveness of SGML DTDs (at least, at the XML level, i.e., without exceptions) is something required for them to be of any use to me in a document modelling context... but it's there. And I see many uses, including in the document business, for the additional capabilities.

I understand Eliot's concern about the fact that it is not "standard", at least not yet. However, in practice, I have seen many customers write Balise code to express and check lexical constraints in text content because, for their application, complying with these rules which a DTD could not express was as important as complying with the other rules expressed by the DTD. And although I like the Balise language, I must admit that this is not very "standard" either. . .
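A rough idea of what such a lexical check looks like, written here in Python rather than Balise: each rule is a regular expression that the text content of a given element must match in full. The element names `date` and `partno`, the patterns, and the helper `lexical_errors` are all assumptions invented for this sketch, not rules taken from any real application.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lexical rules: element name -> pattern its text must match entirely.
LEXICAL_RULES = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "partno": re.compile(r"[A-Z]{2}-\d{4}"),
}

def lexical_errors(root):
    """Return (tag, text) pairs for elements whose text violates its rule."""
    errors = []
    for elem in root.iter():
        pattern = LEXICAL_RULES.get(elem.tag)
        if pattern is None:
            continue  # no lexical rule for this element type
        text = (elem.text or "").strip()
        if pattern.fullmatch(text) is None:
            errors.append((elem.tag, text))
    return errors

doc = ET.fromstring(
    "<order><date>1998-02-02</date><partno>ab-12</partno></order>"
)
errors = lexical_errors(doc)
print(errors)
```

A DTD can require that `date` and `partno` be present in `order`, but it has no means of stating what their character content should look like.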

Mechanically converting an XML DTD to an XML-Data schema and vice-versa is something easy as soon as you have some "high level" representation of DTDs at hand (such a thing exists, among other places, inside all SGML editors, and in the Balise DTD API). It is this high-level representation which is important in practice, and which is either missing or inaccessible in too many SGML tools.
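The sketch below shows why the high-level representation is the important part: once element declarations are held as plain data, emitting DTD syntax and emitting an XML form are both a few lines, and the XML form can be read back with an ordinary XML parser. Everything here is an assumption for illustration. The `MODEL` structure is not the Balise DTD API, the `model`, `element-decl` and `child` names are invented and are not the XML-Data vocabulary, and only simple sequence content models are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical high-level representation: element name -> ordered list of
# (child name, occurrence indicator), the indicator being "", "?", "*" or "+".
MODEL = {
    "memo": [("to", "+"), ("body", "")],
    "body": [("para", "*")],
}

def to_dtd(model):
    """Emit one <!ELEMENT ...> declaration per entry (sequence models only)."""
    decls = []
    for name, children in model.items():
        content = ", ".join(child + occurs for child, occurs in children)
        decls.append(f"<!ELEMENT {name} ({content})>")
    return "\n".join(decls)

def to_xml(model):
    """Emit the same information as an XML tree (invented vocabulary)."""
    root = ET.Element("model")
    for name, children in model.items():
        decl = ET.SubElement(root, "element-decl", {"name": name})
        for child, occurs in children:
            ET.SubElement(decl, "child", {"name": child, "occurs": occurs})
    return root

def from_xml(root):
    """Rebuild the high-level representation from the XML form."""
    return {
        decl.get("name"): [
            (c.get("name"), c.get("occurs")) for c in decl.findall("child")
        ]
        for decl in root.findall("element-decl")
    }

dtd_text = to_dtd(MODEL)
xml_text = ET.tostring(to_xml(MODEL), encoding="unicode")
round_trip = from_xml(ET.fromstring(xml_text))
print(dtd_text)
```

The XML form is handled with the same parser used for document instances, while the DTD text would need its own dedicated parser to be read back.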

Cheers,

François Chahuneau



François CHAHUNEAU                phone: [+33] 1 40 64 43 00
Directeur Général/General Manager
AIS S.A.                          FAX: [+33] 1 40 64 43 10
15-17 rue Rémy Dumoncel           email: fcha@ais.berger-levrault.fr
75014, Paris, FRANCE              WWW: http://www.berger-levrault.fr