SGML: Groves (Kimber)

Subject: Re: Groves?
Date: Fri, 30 Aug 1996 11:59:39 -0500
From: "W. Eliot Kimber" <kimber@passage.com>
Newsgroup: comp.text.sgml
--------------------------------------------------------------------
> Steve Tinney <s@eryops.walnut.lane> wrote in article <m2zq3d90g5.fsf@eryops.walnut.lane>...
>
> [len bullard wrote]
> The work on groves is significant.
>
> I don't understand this or other references to `groves' in your
> interesting article. Since you invited questions to CTS, could
> you or someone else explain, or give a useful URL?

Here is my attempt to explain groves with a minimum of detail. As
Steve Newcomb mentioned in his related post, the HyTime conference
materials will be available from the GCA web site (www.gca.org).
This will include my presentation on groves, which is excerpted from
the current draft of my forthcoming HyTime book.

A number of us are still struggling to find the most effective way
to explain groves to different audiences. Therefore, if you find the
following explanation unhelpful, let me know and I'll try again.
Groves seem to make intuitive sense to programmers but not to
non-programmers. The following explanation is intended primarily for
non-programmers.

At the conference, Murray Maloney asked the question "who needs to
know about groves?" My answer is:

A. If you are just an author of SGML documents, you probably don't
   need to know about groves at all, even if you will be using basic
   HyTime linking and addressing in your documents.

B. If you design SGML document types and document architectures, you
   don't necessarily need to know anything about groves and property
   sets, but you might find it useful to be able to define the
   semantic objects your document types or architectures represent
   in grove terms.

C. If you are an author who will be doing out-of-the-ordinary things
   with location addressing in HyTime (or another grove-based
   addressing scheme), then you need to understand the basics of
   groves because all location addresses ultimately address nodes in
   groves.
   In particular, you need to understand how to predict a node's
   address given the rules for constructing the grove you are
   addressing. This is especially true if you will be using queries
   for addressing. In addition, there are features of HyTime that
   appear to be "magic" unless you understand their underlying
   grove-based definition. If you aren't willing to take it on faith
   that this magic is not arbitrary, you need to understand groves
   in order to understand the definition of the magic. This is
   analogous to having to learn number theory to understand why it's
   really the case that 1+1=2.

D. If you are creating DSSSL style sheets "by hand" (as opposed to
   through some interface that hides the details), you need to
   understand groves and property sets because DSSSL processing
   operates on groves. This is analogous to having to understand how
   DynaText represents documents internally so that you can build
   complex property value functions.

E. If you are trying to understand the functional relationship
   between HyTime and DSSSL, you need to understand groves because
   it is through the grove mechanism that these two standards have
   been rationalized.

F. If you are building SGML tools that you expect to interoperate
   with other tools (editors, browsers, document managers, etc.),
   you need to understand groves because groves can act as a common
   data API by which different tools communicate about the documents
   they operate on.

G. If you are designing query languages or specialized addressing
   notations for use with HyTime or DSSSL, you need to understand
   groves so that you can define your language in grove terms.

In brief: if it is enough for you to be told that "groves make it
all work" then you don't need to know more. If you want to
understand the formalism that does make it work, read on.

NOTE: In the discussion that follows, I refer to the "corrected
HyTime standard". This is the HyTime standard as corrected by the
forthcoming Technical Corrigendum (TC).
In ISO procedures, a TC is a correction, rather than a revision,
being essentially a list of errata. Thus the HyTime standard is
being corrected rather than revised.

A Brief Introduction to Groves

================
What Is A Grove?

To oversimplify, a "grove" is the in-memory result of parsing a
document. Groves consist of "nodes" and each node is simply a set of
properties. For example, an "element" node has properties like "gi",
"children", "markup", and "attribute list". The values of some
properties may serve to connect nodes in the grove together into a
graph. Each node has a defined class. Each grove has a single "grove
root" node. Groves may be related together to form a "hypergrove".

Groves provide an abstraction that lets us talk about the processing
SGML tools do without worrying about implementation details. Groves
and their construction let us describe general processing models
from which actual implementations can be derived without having to
worry that the implementations have accidentally overlooked
important details during optimization. In particular, the grove
abstraction gives system developers a formalism for saying "this is
the information we care about and this is the information we don't
care about" because groves define the total set of possible
information an SGML application might work on.

Groves are called groves because they are collections of trees. We
tend to think of SGML documents as being trees, but in fact, there
are a number of different trees within SGML documents. For example,
there is the tree rooted at the document element (what we usually
think of as the "document tree"). The document type declaration can
also be thought of as a tree. There are other parts of documents
that are not trees, but are simply objects of different types
related to their origin objects. For example, an SGML document
consists of a prolog and an instance, but we don't normally think of
the prolog as being a child of the SGMLDOC object.
Rather, the prolog is simply a property of the SGML document.

In other words, it is useful to be able to say that some parts of
the grove form trees and that others don't. In particular, it
enables the use of tree-based addressing without having to worry
about whether or not the inclusion of optional objects or properties
will change the tree position of objects. For the tree rooted at the
document element, we usually think of the tree as consisting only of
elements and their data content but not things like processing
instructions (PIs) or comments. The grove mechanism lets us say
explicitly that PIs and comments are not part of the tree even if
they are included in the grove.

When you "construct" a grove, you can choose to include or omit
certain objects and properties from the grove. If your processing
never needs to look at comments, for example, you can simply omit
comments from your grove entirely, saving processing time and memory
space. Because you can also say that PIs are not part of the
instance tree, you know that excluding PIs will not change the tree
locations of elements in the tree.

Thus, you can think of a grove as a "parse tree," but it's a bit
more than that because parsing an SGML document doesn't just produce
a tree, it produces a lot of other things that we don't normally
think of as being in a tree.

[That said, there is a view of a grove in which the grove *is* a
tree in the formal sense of being an acyclic directed graph where
all the arcs relate children to their parents. This is called the
"subnode tree". The definition of subnode trees is beyond the scope
of this discussion.]

Thus, a grove is the in-memory data structure that reflects the
objects recognized in source data by a parser. It does what ESIS
tried to do, which is to define what the result of parsing is, but
it does so completely and in a way that is tailorable to the needs
of particular processors.

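For programmers, the points above can be illustrated with a small
sketch. This is only an illustration under stated assumptions: the
Node class, the "children" and "pis" property names, and the
tree_path helper are hypothetical and are not taken from the SGML
property set or from any standard. It models nodes as sets of
properties in which only "children" forms the tree, so that leaving
PIs out of the grove does not change the tree location of an
element.

```python
# Illustrative sketch only: the class and property names below are
# hypothetical and are not taken from the SGML property set.

class Node:
    """A grove node: a class name plus a set of named properties."""

    def __init__(self, node_class, **properties):
        self.node_class = node_class
        self.properties = dict(properties)


def tree_path(root, target, path=()):
    """Return the 1-based child positions leading from root to target,
    following only the "children" property, or None if not found."""
    if root is target:
        return path
    for position, child in enumerate(root.properties.get("children", []), 1):
        found = tree_path(child, target, path + (position,))
        if found is not None:
            return found
    return None


def build(include_pis):
    """Build a tiny grove; PIs hang off a separate, non-tree property."""
    title = Node("element", gi="title", children=[])
    para = Node("element", gi="para", children=[])
    doc = Node("element", gi="doc", children=[title, para])
    if include_pis:
        # The PI is recorded on the element but is not one of its children.
        doc.properties["pis"] = [Node("pi", system_data="page-break")]
    grove_root = Node("sgml-document", document_element=doc)
    return grove_root, doc, para


for include_pis in (True, False):
    grove_root, doc, para = build(include_pis)
    # The path to "para" is (2,) whether or not PIs are in the grove.
    print(include_pis, tree_path(doc, para))
```

In this toy model the document element is a property of the grove
root rather than a child of it, which mirrors the distinction drawn
above between tree relationships and plain properties.
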
You can think of groves as a sort of data API for SGML processing
systems where the grove becomes an agreed-upon representation of the
data all SGML processors need to operate on. If you imagine that a
grove exists as a data structure that multiple applications can read
at the same time, you can see that groves provide a common data
abstraction that allows different applications to talk about the
data in SGML documents the same way. Thus, if an editor wants to
tell a browser to view a particular element, the editor can talk to
the browser in grove terms, knowing that the browser will always
understand it, regardless of how the browser works internally.

The key things to remember are:

o A grove is an in-memory representation of the result of parsing.
  This is an abstract representation in the sense that it is not
  intended that real systems actually use the grove structure
  literally for their own internal memory.

o The grove acts as a *lingua franca* by which different
  applications can communicate with each other about documents.

o Groves can be used to model the processing of SGML documents
  independent of implementation issues. Different implementations
  can then be derived that have whatever optimizations they need but
  that reflect the general model and for which the mapping from
  internal structures to grove structures is explicit.

========================
Groves and Property Sets

Each node in a grove has a defined class and serves as a container
of properties. This raises the question of where these classes and
properties are defined. In order to actually work with groves you
have to know what classes and properties to refer to. For example,
to create a query that will find elements with a particular
attribute value, you have to know first how to refer to the list of
all the elements in a document and then how to refer to the
attributes of elements. While it may seem intuitive to say "elements
have attributes" and leave it at that, it's not that simple in
practice.

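To make the query example concrete, here is a minimal sketch of
finding elements with a particular attribute value. The names it
relies on ("class", "element", "attributes", "children") are
hypothetical stand-ins chosen for illustration, not the names
defined by the SGML property set. The point is only that the query
cannot be written until such names are agreed upon.

```python
# Illustrative sketch only: "element", "attributes" and "children" are
# hypothetical names, not those defined by the SGML property set.

def all_elements(node):
    """Yield node and every element below it via the "children" property."""
    if node["class"] == "element":
        yield node
    for child in node.get("children", []):
        yield from all_elements(child)


def elements_with_attribute(root, name, value):
    """Return elements whose attribute `name` has the given value."""
    return [
        el for el in all_elements(root)
        if el.get("attributes", {}).get(name) == value
    ]


doc = {
    "class": "element", "gi": "doc", "attributes": {},
    "children": [
        {"class": "element", "gi": "para",
         "attributes": {"status": "draft"}, "children": []},
        {"class": "element", "gi": "para",
         "attributes": {"status": "final"}, "children": []},
    ],
}

matches = elements_with_attribute(doc, "status", "draft")
print([el["gi"] for el in matches])
```
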
The classes and properties that an SGML document represents are not,
for the most part, explicit in the SGML standard. Rather, the
standard defines the rules for recognizing delimiters and data in
the SGML document string and what those delimiters represent. The
SGML standard lacked a formal definition of the abstract objects and
properties that the syntax of an SGML document represents. Some
aspects of SGML documents, like elements and their attributes, were
pretty obvious, but many other aspects were either not obvious or
could be interpreted in different ways. In particular, the SGML
standard says nothing about what things in an SGML document instance
are to be considered part of the "document tree". There are also
places where things could be related together in a number of
different ways. Thus, arbitrary choices needed to be made about how
to relate some things together.

Before groves, DSSSL and HyTime each had a notion of an "SGML
property set" that defined formally what the objects and properties
were. It became clear that there needed to be a single, common
property set for SGML documents from which DSSSL and HyTime (and any
other SGML processor) could work. This property set was defined by
the SGML, DSSSL, and HyTime teams.

[The property set is based on much deep thought done by Lynn Price,
Dave Peterson, and other long-time members of the SGML team. James
Clark was instrumental in working out the details of property sets
and groves and tested the designs in his SP parser, the results of
which you can now see in the SP code James makes available
(www.jclark.com). Peter Newcomb and Derek Denny-Brown of
TechnoTeacher have done much to test the grove design through their
work on a new version of the HyMinder product, which will be fully
grove based.]

The SGML property set is published in the DSSSL standard (ISO/IEC
10179:1996) but used by both DSSSL and HyTime.
The property set defines formally all the object classes and their
properties. The output of any SGML parser can now be defined in
terms of these objects and properties. You can think of this as
ESIS++: a formal definition of what things a parser will tell you
about an SGML document. Unlike ESIS, the SGML property set is
complete, reflecting everything explicit or implicit in the SGML
standard. The SGML property set will become part of the formal
specification of SGML when it is revised.

Property sets are defined using an SGML document conforming to the
property set document type defined in annex C of the corrected
HyTime standard. A property set document consists of a set of class
definitions. Each class then has a set of associated property
definitions. Property definitions may refer to new data types and
lexical types defined in the property set. Property sets are
organized into one or more modules. Modules typically group related
classes or properties together to make it easier to include or
ignore groups of classes or properties in a particular grove.

Groves are said to be "constructed". A grove constructor is a
program that takes as input the output of a parser (or other
processor), a property set, and a "grove plan". The grove plan says
what classes and properties from the full property set to include in
the constructed grove. For example, if your processor doesn't care
about PIs, you could define a grove plan that simply excludes the PI
object class (and therefore its properties as well) from the grove.
How a grove plan is specified may be different for different
processors. For example, in HyTime, grove plans are specified using
an element type form defined in the HyTime standard but other
processors could use a different mechanism. For every property set
there is an implicit "maximum grove plan" that includes all classes
and properties.

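The relationship among parser output, property set, and grove plan
can be sketched roughly as follows. Everything here is an assumption
made for illustration: the dictionary shapes, the class and property
names, and the construct_grove function are hypothetical and do not
reflect the property set document type or the HyTime grove plan
form. The result is also a flat list of nodes rather than a
connected grove, which keeps the filtering step easy to see.

```python
# Illustrative sketch only: the data shapes and names are hypothetical,
# not the property set or grove plan syntax defined in any standard.

# A toy "property set": class name -> property names defined for it.
PROPERTY_SET = {
    "element": {"gi", "attributes", "children"},
    "pi": {"system_data"},
    "comment": {"text"},
}

# The implicit "maximum grove plan" includes every class and property.
MAXIMUM_PLAN = {name: set(props) for name, props in PROPERTY_SET.items()}


def construct_grove(parser_events, property_set, grove_plan):
    """Keep only the classes and properties that the grove plan includes."""
    grove = []
    for node_class, properties in parser_events:
        if node_class not in grove_plan:
            continue  # whole class excluded, so its properties go too
        allowed = property_set[node_class] & grove_plan[node_class]
        kept = {k: v for k, v in properties.items() if k in allowed}
        grove.append((node_class, kept))
    return grove


# Stand-in for parser output: (class name, property values) pairs.
events = [
    ("element", {"gi": "doc", "attributes": {}, "children": []}),
    ("pi", {"system_data": "page-break"}),
    ("comment", {"text": "draft"}),
]

# A plan for a processor that does not care about PIs or comments.
no_pi_plan = {"element": {"gi", "attributes", "children"}}

print(len(construct_grove(events, PROPERTY_SET, MAXIMUM_PLAN)))
print([c for c, _ in construct_grove(events, PROPERTY_SET, no_pi_plan)])
```

With the maximum plan all three nodes survive. With the narrower
plan only the element node does.
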
=============================================
Application-Specific Property Sets and Groves

Groves and property sets are not limited to the task of parsing SGML
documents and their subsequent processing. Applications can define
property sets that reflect their own unique needs and then create
groves reflecting that data.

For example, a HyTime engine processes SGML documents and then
creates new objects unique to HyTime, such as hyperdocuments,
hyperlinks, location addresses, and so on. These objects reflect the
*semantics* of HyTime, rather than the syntax of SGML documents.
Thus, the HyTime standard defines a unique HyTime property set that
reflects the semantic objects a HyTime engine works with. This
property set is distinct from the SGML property set. The HyTime
property set is then used by a HyTime engine to create "HyTime
semantic groves" from the documents in a HyTime hyperdocument.

You can imagine that a HyTime engine takes as input one or more SGML
document groves (produced by an SGML parser/grove constructor) and
then produces its own HyTime semantic grove. This process is
analogous to what any hyperlink management system does today when it
processes a bunch of documents and creates its own data structures
for representing the links among the documents. For example, many
such systems create relational database tables that record the links
between documents. The HyTime semantic grove is simply a formalized,
abstract representation of this, separated from implementation
details.

Note that given the HyTime property set, a HyTime semantic grove,
and the grove plan by which it was constructed, any HyTime-aware
processor would be able to use that grove to do HyTime-specific
processing. In particular, you would be able to query it using a
grove-aware query language (such as SDQL from the DSSSL standard) to
ask questions about links and location addresses *because the
objects and properties are standardized*.
You would not need to know how a particular HyTime engine had chosen
to structure its internal data.

Note also that any HyTime engine can provide a "grove view" of its
internal data structures if it wants to. This means that the grove
can again act as a common data access API among different HyTime
applications, independent of how they are implemented and optimized
under the covers.

Any application or architecture can define its own semantic property
set and therefore define its processing in terms of groves
constructed according to that property set.

Grove construction is defined in a "grove construction document".
The grove construction document is intended primarily for
consumption by human programmers. For each class and property in the
property set, it defines the rules by which they are constructed.
These could be simple prose instructions for a programmer to read
and follow or they could be more programmatic. In the abstract, the
construction of semantic groves could be defined by specifying the
SDQL queries to be applied to the input groves where the results of
the queries would be objects in the new grove (SDQL queries could
include functions that perform arbitrary processing).

Thus, for a complex document type or architecture, property sets and
grove construction documents can be useful simply as a formalism for
design specification, even if there is no intention to actually
process these specifications programmatically.

===========================
What More is There To Know?

I have not talked about the details of how classes and properties
are defined. I have not talked about how the nodes in groves are
related to each other or how the values of properties are defined
and used. These are important details that only implementors and
people doing sophisticated queries need to understand. I have not
talked about views on groves, derived and auxiliary groves,
hypergroves, and the like. I do explain these details in my
forthcoming book.

--
<Address HyTime=bibloc homepage="http://www.squirrel.com/squirrel/drmacro">
W. Eliot Kimber, kimber@passage.com
Senior SGML Consultant and HyTime Specialist
Passage Systems, Inc., 10596 N. Tantau Ave., Cupertino, CA 95014-3535
(408) 366-0300 (Cupertino), (512) 339-1400 (Austin), http://www.passage.com
</Address>
"If I never had existed, would you still remember me?..."
--Austin Lounge Lizards, "1984 Blues"
(http://www.webcom.com/~yeolde/all/lllhome.html)