SGML: Sperberg-McQueen review of Makoto Murata


Date:         Fri, 18 Nov 1994 18:21:14 CST
Reply-To:     "C. M. Sperberg-McQueen" <U35395@UICVM.BITNET>
Sender:       Text Encoding Initiative public discussion list
              <TEI-L@UICVM.BITNET>
From:         "C. M. Sperberg-McQueen" <U35395@UICVM.BITNET>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject:      SGML '94 Trip Report

                              Trip Report [extract]

                SGML '94 and SGML/Open Technical Meeting

             (Tyson's Corner, Virginia, 7-11 November 1994)


                         C. M. Sperberg-McQueen

                      November 18, 1994 (17:36:06)

. . .

Fortunately, two later talks the same morning provided shining examples of how to encourage technical discussion. Jean Paoli, of Grif, spoke about SGML Objects and the issues involved in defining behaviors for them. And Makoto Murata, of Fuji Xerox, gave what I thought was the most substantial technical paper of the conference. Under the unprepossessing title FILE FORMAT FOR DOCUMENTS CONTAINING BOTH LOGICAL STRUCTURES AND LAYOUT STRUCTURES, Murata described the formal problems confronting any attempt to record both the logical (or: a logical) and the (or: a) physical structure of the same text. Since these problems have been a constant looming presence in the TEI, especially in the work groups for textual criticism, manuscript transcription, and dictionaries, and since the TEI was never able to devise a fully satisfactory general solution to them, I was particularly interested in his summary. (In this summary, I will, like Murata, speak of the logical and the physical view; the problems, however, also occur when more than one logical view, or more than one physical view, are to be encoded.) In brief, the problems include:

duplication of data (e.g. in a running head, which appears once in the logical view and several times in the physical view)
removal of data (e.g. annotations in the logical view which are not present in the physical view)
addition of data (e.g. page numbers in the physical view, not present in the logical view)
need for explicit expression of the alignment between elements in the two views
reordering of data (e.g. migration of footnote or endnote text away from its point of attachment to the bottom of the page or to the end of the chapter or volume); this Murata calls DISTORTION. Fine-grained alignments of parallel documents in one or more languages also exhibit this problem.

The optional SGML feature CONCUR was intended to enable the simultaneous encoding of multiple views of the document (in particular, of both a logical and a layout view), but CONCUR has only awkward methods for handling duplication, suppression, and addition of data, and no methods at all, that I know of, for handling duplication and distortion. The standard is silent on whether parsers which support CONCUR must support simultaneous parsing with more than one DTD, so such parsers may or may not support explicit linkage between nodes in different document trees.

Borrowing concepts from other work on document processing and document formatting, Murata defined an algorithmic process for augmenting the logical and physical trees of the document with specialized node types, which enable him to handle duplication, addition, suppression, and distortion without having to store any portion of the text more than once. The augmented trees have explicit links expressing the correspondences of their nodes, and each tree can be reconstructed in a straightforward way, undoing, as necessary, the effects of addition, duplication, etc. In many respects, the specialized node types introduced by Murata resemble the PTR, LINK, and JOIN elements of the TEI encoding scheme; I need to study his work further before knowing how far these TEI element types can be used to exploit his insights in a TEI context.
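
The general idea of two trees over text that is stored only once can be illustrated with a small sketch. This is emphatically not Murata's algorithm or file format, which this report does not reproduce; it is a toy illustration in Python under my own assumptions. The names STORE, Ref, Generated, Node, render, and keys are all hypothetical, as is the sample text. Each view is a tree whose leaves either point into a shared store (so duplication and reordering cost no extra copies) or carry text that exists in that view only (addition); suppression is simply the absence of a pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Assumption: each portion of text is stored exactly once, under a key.
STORE: Dict[str, str] = {
    "head": "Chapter 1",
    "p1": "First paragraph.",
    "fn1": "A footnote.",
    "note": "An annotation.",
    "p2": "Second paragraph.",
    "p3": "Third paragraph.",
}

@dataclass
class Ref:
    """Hypothetical leaf: points at stored text instead of copying it."""
    key: str

@dataclass
class Generated:
    """Hypothetical leaf: text present in one view only (e.g. a page number)."""
    text: str

@dataclass
class Node:
    """Interior node of either tree."""
    name: str
    children: List[Union["Node", Ref, Generated]] = field(default_factory=list)

Tree = Union[Node, Ref, Generated]

def render(tree: Tree) -> List[str]:
    """Reconstruct the text of one view by walking its tree in order."""
    if isinstance(tree, Ref):
        return [STORE[tree.key]]
    if isinstance(tree, Generated):
        return [tree.text]
    return [s for child in tree.children for s in render(child)]

def keys(tree: Tree) -> List[str]:
    """List the store keys a view uses; shared keys align the two views."""
    if isinstance(tree, Ref):
        return [tree.key]
    if isinstance(tree, Generated):
        return []
    return [k for child in tree.children for k in keys(child)]

# Logical view: the footnote sits at its point of attachment, and the
# annotation is present.
logical = Node("chapter", [
    Ref("head"),
    Node("p", [Ref("p1"), Node("fn", [Ref("fn1")])]),
    Node("note", [Ref("note")]),
    Node("p", [Ref("p2")]),
    Node("p", [Ref("p3")]),
])

# Physical view: running head on every page (duplication), footnote moved
# to the foot of the page (distortion), page numbers (addition), and no
# annotation (suppression).
physical = Node("book", [
    Node("page", [Ref("head"), Ref("p1"), Ref("p2"), Ref("fn1"), Generated("1")]),
    Node("page", [Ref("head"), Ref("p3"), Generated("2")]),
])

if __name__ == "__main__":
    print(render(logical))
    print(render(physical))
```

In this sketch the alignment between the views is carried implicitly by the shared keys: keys(logical) and keys(physical) can be intersected to find corresponding nodes. A real scheme of the kind described above would need explicit, typed links between nodes rather than bare keys.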

Murata's work has, I think, critical implications for anyone concerned with document formatting, with systematic encoding of text layout or physical presentation, with multiple versions of a text (text displacement, Murata's DISTORTION, is one of the hardest problems of text criticism, not only in electronic form, but also in paper forms), or with synchronization of multilingual corpora. I warmly recommend its close study.

[. . . ]