SGML: SGML and Context Free Grammars

Subject: Re: Various questions
Date: 6 Feb 1997 15:49:29 -0800
From: jenglish@crl.com (Joe English)
Newsgroups: comp.text.sgml
------------------------------------------------------------------------

Lars Marius Garshol <larsga@ifi.uio.no> wrote:
>eric.kidd@pobox.com (Eric M. Kidd) writes:
>> An SGML document cannot be described with a context-free grammar.
>> [...]
>> The element structure permitted by a given DTD could be recognized by a
>> sufficiently complex deterministic pushdown automaton. I don't know if
>> it's possible to be more specific than that.
>
>The set of languages that can be produced by context-free grammars is
>identical to the set of languages that can be recognized with pushdown
>automata. (As far as I can remember that includes non-deterministic
>ones as well.)

So which one is it? :-)

The basic problem here is that an SGML document begins with its own
"grammar" (the DTD).  For any particular DTD, the set of _instances_
can be described with a context-free grammar [*], but the language of
complete SGML _documents_ is the set

    { (SGML-declaration + _prolog_ + _instance_) :
      _instance_ conforms to the grammar defined by _prolog_ }

and that takes a Turing machine.

[*] Converting a DTD to a CFG is _possible_, but in most cases not
_practical_ when you take into account things like start-tag omission
and CDATA/RCDATA declared content.

>And as for SGML (and C++) not being either LL(k) or LALR I'm a bit
>surprised, as that must make parsing very difficult.

That only means it's difficult to parse SGML with an LL or LALR
parser...  There are other parsing techniques, you know :-)

From an algorithmic standpoint, parsing SGML is in some ways easier
than the traditional lex/yacc approach.  (By this I mean that,
although writing an SGML parser is considerably more difficult than
*using* lex and yacc, it's easier than *writing* lex and yacc.)

The first tricky part is entity management.  The lex/yacc approach
assumes that the input is a single stream of characters (perhaps, as
in the case of C/C++, preprocessed to handle macro expansion and file
inclusion).  An SGML document is composed of a hierarchical tree of
entities, so the parser has to handle a stack of input sources.

Delimiter recognition corresponds roughly to the "lex" phase of a
lex/yacc parse.  This is in some ways more complicated than a
lex-based scanner because of the number of delimiter-recognition
modes and their interaction with the other parsing phases, but in
other ways it's simpler: since all the delimiters are fixed-length
strings [*], you can use a trie instead of a full-blown DFA like lex
and flex do.

[*] With the exception of "B" sequences in SHORTREF maps; these are a
pain.

Markup parsing -- interpreting markup declarations, start- and
end-tags, etc. -- can be done with a simple recursive-descent parser,
since this aspect of the ISO 8879 grammar is all LL(1).  The main
difficulty here is the interaction with the entity manager and the
delimiter recognizer.

The next phase is building the element hierarchy.  (I suspect that
this is where most people who try to use yacc to parse SGML get into
trouble; it is really best treated as a separate parsing phase,
rather than trying to account for the markup and the element
hierarchy with a single grammar.)  This is *considerably* easier than
LL or LALR parsing: you can match element tokens against individual
content models with regular-expression techniques, while LL or LALR
parsing requires that the entire grammar be processed to build the
parse tables.  The main difficulties in content-model matching are
dealing with OMITTAG (mainly start-tag omission; omitted end-tags are
easier) and handling AND groups.
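A rough Python sketch of that last step, under heavy simplifying
assumptions (the content model arrives as a string, the children are
just the child generic identifiers in document order, and OMITTAG,
AND groups, inclusions/exclusions, #PCDATA, and declared content are
all ignored; the element type in the example is invented):

# Sketch: check an element's children against its content model by
# treating both as regular expressions over element names.  This is a
# toy, not real SGML parsing: OMITTAG, AND groups, inclusions and
# exclusions, #PCDATA, and declared content are all ignored.

import re

NAME = re.compile(r'[A-Za-z][A-Za-z0-9.-]*')

def model_to_regex(model):
    """Turn a content model like '(title, (para | list)+)' into a
    Python regex over comma-terminated element names."""
    out, i = [], 0
    while i < len(model):
        c = model[i]
        if c.isspace() or c == ',':
            i += 1                      # ',' (seq) is just concatenation
        elif c in '()|?*+':
            out.append(c)               # groups, OR, occurrence indicators
            i += 1                      # map directly onto regex syntax
        else:
            m = NAME.match(model, i)
            tok, i = m.group(0) + ',', m.end()       # 'para' -> 'para,'
            if i < len(model) and model[i] in '?*+':
                tok = '(?:%s)%s' % (tok, model[i])   # 'para+' -> '(?:para,)+'
                i += 1
            out.append(tok)
    return re.compile(''.join(out) + r'\Z')

def content_matches(model, children):
    """children: the child generic identifiers, in document order."""
    text = ''.join(gi + ',' for gi in children)
    return bool(model_to_regex(model).match(text))

# Invented element type:  <!ELEMENT chapter - - (title, (para | list)+) >
print(content_matches('(title, (para | list)+)', ['title', 'para', 'list']))  # True
print(content_matches('(title, (para | list)+)', ['para', 'title']))          # False

A real validator would compile each content model once into an
automaton instead of rebuilding a regex every time, and -- as noted
below -- it can count on that automaton being deterministic; the
regex trick just shows the shape of the problem.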
Handling AND groups is made somewhat easier by the rule prohibiting
ambiguous content models (if you build an NFA from a valid content
model, it's guaranteed to be deterministic, which allows some
important optimizations), but AND groups are still a pain.

XML forbids AND groups, OMITTAG, SHORTREF, and R/CDATA declared
content; these seem to be the hardest aspects of SGML parsing.
(Actually, the hardest aspect is the astonishing number of minor
points and small details in ISO 8879, and since the Standard is
written in "legalese" rather than "computerese" it takes a great deal
of effort to account for all of them.)

--Joe English   jenglish@crl.com

-----------------------------------------------------------------------

Subject: Re: DTD vs BNF Questions
Date: 23 May 1997 11:54:47 -0700
From: jenglish@crl2.crl.com (Joe English)
Newsgroups: comp.text.sgml
References: <feustelEAJKpu.Ksu@netcom.com>
Dave Feustel <feustel@netcom.com> wrote:
>
>Are DTD and BNF (Backus-Naur Form) equivalent?

No.

>If not, is one a subset of the other?
>
>If neither specification language is a proper subset of the other,
>what parts of the DTD specification are not expressible in BNF?
>
>Is it correct that XML *is* equivalent to a BNF?

No.  It is, however, possible to create an equivalent context-free
(BNF) grammar from any XML DTD, where the terminals are start-tags,
end-tags, and #PCDATA, and the productions correspond to element
types and content-model positions.  The reverse is not possible in
general, so XML DTDs are (in a sense) a subset of BNF.

The same *might* be true of SGML, but when you consider things like
declared content, exceptions, AND groups, and OMITTAG, converting a
general SGML DTD to a CFG is a much more difficult problem.  "Do SGML
DTDs define context-free languages?" is still an open question AFAIK.
(I suspect the answer is "yes", but even if so it would not be a very
useful result; the derived CFG would be intractably large in many
cases.)

>Are there any articles on-line which address these questions?

There's a section on the "Special Topics" page of Robin Cover's SGML
Web site: <URL: http://www.sil.org/sgml/topics.html>.  Also look for
papers by Derick Wood and Anne Brueggemann-Klein; they've done a lot
of research in this area.

--Joe English   jenglish@crl.com
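To make the XML-DTD-to-BNF conversion above concrete, here is a small
invented example (the element names and the exact production style
are illustrative only).  The DTD

    <!ELEMENT chapter (title, para+)>
    <!ELEMENT title   (#PCDATA)>
    <!ELEMENT para    (#PCDATA)>

corresponds, roughly, to the context-free grammar

    chapter  ::=  "<chapter>" title paras "</chapter>"
    paras    ::=  para  |  para paras
    title    ::=  "<title>" #PCDATA "</title>"
    para     ::=  "<para>" #PCDATA "</para>"

where the start-tags, end-tags, and #PCDATA are the terminals;
chapter, title, and para come from the element types, and paras comes
from the 'para+' position in chapter's content model.  (Attributes
and the possibility of empty character data are glossed over here.)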