Re: [dita] attribute specialization as foundation


Subject: Re: [dita] attribute specialization as foundation

From: Dana Spradley <dana.spradley@oracle.com>

To: Erik Hennum <ehennum@us.ibm.com>

Date: Thu, 27 Apr 2006 08:58:18 -0700

I've only had time to skim it so far, but I can already see that Erik's paper clearly and systematically outlines what attribute extension in general should look like - in addition to many other valuable DITA extensions that would make unnecessary the kind of customization I've resigned myself to in any workable implementation.

But this just makes me wonder why I've been getting so much resistance over what, unlike some of these extensions, should be a no-brainer - allowing DITA implementers to add arbitrary, implementation-specific attributes to elements, attributes that will be ignored on generalization.

Erik recognizes the need for this:

Attribute addition
Adding new properties to existing elements without losing interoperability with others. Many problem domains have special metadata that must be represented in the document instance. For instance, warning notes for hardware might have an attribute that identifies the regulation that motivates the warning.

He also provides the use case and justification, along the same object-oriented lines I've been arguing:

The Object Oriented perspective and addition of properties to discourse

In the Object Oriented approach, a specialized class inherits all of the properties of the base class. These properties are often accessed through extensible behaviors, but that nuance doesn't alter the basic principle. The specialized class introduces variation by adding new properties (in XML Schema parlance, through extension by addition).

The following example shows the specialization of a class for generic structural nodes to define a class for tree nodes.
General class:

Node
    data: Object
    next: Node

Specialized class:

TreeNode
    data:   Object
    next:   Node
    parent: TreeNode

A program can treat objects of the specialized type as objects of the general type through a casting operation that hides the added properties. For instance, the parent property of the TreeNode class isn't visible when a program is treating a TreeNode object as a Node object. Such casting makes it easy for a program to process objects in shared or distinct ways as appropriate.
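
The casting idea in the paragraph above can be sketched in Python. This is only an illustration under assumptions: the paper does not name a language, Python has no cast operator, and `general_view` is a hypothetical helper standing in for the cast that hides the added property.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """General class: a generic structural node."""
    data: object
    next: Optional["Node"] = None


@dataclass
class TreeNode(Node):
    """Specialized class: inherits data and next, adds parent."""
    parent: Optional["TreeNode"] = None


def general_view(node: Node) -> dict:
    """Hypothetical stand-in for a cast: expose only what Node declares."""
    return {"data": node.data, "next": node.next}


root = TreeNode(data="root")
child = TreeNode(data="child", parent=root)

# A TreeNode is accepted wherever a Node is expected.
is_node = isinstance(child, Node)

# The general view carries no parent property.
view = general_view(child)
```

Code written against `Node` keeps working when handed a `TreeNode`, which is the behavior generalization is meant to preserve for documents.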

Adding properties to a discourse object can be important for metadata processing and for hybrid documents that include record data as well as discourse text. For instance, a lab report type might need metadata about the institution that produced the report or record data expressing the raw data analyzed in the report. If added content is restricted to properties outside the main flow of discourse, the standard object-oriented strategy of hiding the additions can maintain the validity of the discourse when generalizing to a type that doesn't have the added properties. That is, after the added properties are hidden, the remaining discourse remains a valid instance of the general type.

One strategy is to put the addition inside a processing instruction that occupies the position of the hidden content during generalization. It should even be possible to add properties to a specialization of an empty element because, when generalized, the empty element should be able to contain the processing instruction for the hidden addition.
Special type:

<fig>
  <title>Quantum engines</title>
  <labloc>B52-FA-RA13</labloc>
  <image href="qengines.jpg"/>
</fig>

General type after hiding the addition:

<fig>
  <title>Lab report</title>
  <?HIDDEN-ELEMENT <labloc>B52-FA-RA13</labloc> ?>
  <image href="qengines.jpg"/>
</fig>

Thus, addition complements substitution by supporting extensible properties about discourse.¹
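
The processing-instruction strategy can be sketched with Python's standard `xml.etree.ElementTree`. The function name `hide_additions` and the idea of passing the added element names as a set are assumptions made for this illustration; the paper specifies neither. The sketch also assumes the hidden markup never contains the sequence `?>`, which would end the processing instruction early.

```python
import xml.etree.ElementTree as ET


def hide_additions(elem, added_tags):
    """Replace each direct child named in added_tags with a HIDDEN-ELEMENT PI."""
    for index, child in enumerate(list(elem)):
        if child.tag in added_tags:
            tail = child.tail
            child.tail = None  # keep trailing text out of the hidden markup
            markup = ET.tostring(child, encoding="unicode")
            hidden = ET.ProcessingInstruction("HIDDEN-ELEMENT", markup)
            hidden.tail = tail
            elem.remove(child)
            elem.insert(index, hidden)  # the PI takes the position of the addition
    return elem


special = ET.fromstring(
    "<fig><title>Quantum engines</title>"
    "<labloc>B52-FA-RA13</labloc>"
    '<image href="qengines.jpg"/></fig>'
)
general = hide_additions(special, {"labloc"})
general_xml = ET.tostring(general, encoding="unicode")
print(general_xml)
```

After the call, the `<fig>` has no `<labloc>` child, so it can validate against the general type, while the original markup survives inside the processing instruction for a later respecialization.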

So what's the fuss? Why so much resistance?

--Dana

Erik Hennum wrote:

Hi, DITA Committee Folk:

Since it came up, I'd like to summarize some ideas that have been brewing offline for a while now. Maybe the ambition for more significant attribute capabilities in the future can provide motivation for progress on attribute specialization now.

FWIW, a paper at last year's Extreme has more detail:

http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Hennum01/EML2005Hennum01.html

Of course, the issues summarized herein require more thought and many perspectives to get right.

1. Specializing an attribute that takes a single value (not an enumeration)

If an element contains a value (that is, only text), a designer in DITA 1.0 can specialize that element by changing the name and restricting the value to specify a more precise semantic. For instance, we can specialize <apiname> as <javaClassName> or specialize <msgnum> to <httpErrorCode>.

In principle, the same kind of specialization should be possible for an attribute that takes a single value. For instance, a designer should be able to distinguish and enforce formats for the version, release, and modification attributes on <vrm>, for the id on <resourceid>, the content on <othermeta>, or the value on <state>.

In the same way that the specialized <parml> element can mandate a specialized <plentry> in its substructure, a specialized element should be able to mandate a specialized attribute. That ability to specialize an attribute as part of element substructure might be something to take on after DITA 1.1.

2. Interoperability of a model over variant XML syntax

More fundamentally, could specialization allow mutability between a single-value attribute and a text-only subordinate element (a possibility that Bruce raised with respect to the <data> element)? For instance, could DITA recognize the following forms as identical?

<p owner="bjorn">It all began...</p>
<p><owner>bjorn</owner>It all began...</p>
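
One way to picture that equivalence is a normalization step that folds the text-only child into an attribute, so both forms compare equal afterward. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `fold_child_into_attribute` is hypothetical, and nothing here is a DITA-defined behavior.

```python
import xml.etree.ElementTree as ET


def fold_child_into_attribute(elem, name):
    """Move a text-only child called `name` into an attribute of `elem`."""
    child = elem.find(name)
    if child is None or len(child) > 0:
        return elem  # nothing to fold, or the child is not text-only
    elem.set(name, child.text or "")
    # Preserve the text that followed the child element.
    siblings = list(elem)
    position = siblings.index(child)
    if position == 0:
        elem.text = (elem.text or "") + (child.tail or "")
    else:
        previous = siblings[position - 1]
        previous.tail = (previous.tail or "") + (child.tail or "")
    elem.remove(child)
    return elem


attribute_form = ET.fromstring('<p owner="bjorn">It all began...</p>')
element_form = ET.fromstring('<p><owner>bjorn</owner>It all began...</p>')

folded = fold_child_into_attribute(element_form, "owner")
same = (ET.tostring(folded, encoding="unicode")
        == ET.tostring(attribute_form, encoding="unicode"))
```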

Building on that, could DITA recognize equivalence between the subdivision of a value into fields via a pattern and fields in the content delimited by subordinate elements? For instance, could a base instance of

<bookinfo publisher="Bjornsen, Bjorn"/>

be specialized via a field pattern of "'(\w+),\s+(\w+)', lastname, firstname" as

<bookinfo><publisherIndividual>
.... <lastname>Bjornsen</lastname>
.... <firstname>Bjorn</firstname>
</publisherIndividual></bookinfo>

Similarly, could a different base instance of

<bookinfo publisher="AMLW - Amalgamated Widgets"/>

be specialized via a different field pattern of "'(\w+) - (\w+)', stock, company" as

<bookinfo><corporatePublisher>
.... <stock>AMLW</stock>
.... <company>Amalgamated Widgets</company>
</corporatePublisher></bookinfo>

This account is only a sketch of a direction, but this capability would let designers specify text for general content and still allow specialized elements for precision.
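
To make the field-pattern idea concrete, here is a rough sketch in Python using the standard `re` and `xml.etree.ElementTree` modules. The function `specialize_by_pattern` and its parameters are hypothetical. One deliberate deviation: the second group of the corporate pattern is widened to `(.+)`, because `\w+` does not match the space in "Amalgamated Widgets" and so the pattern as written cannot match the whole value.

```python
import re
import xml.etree.ElementTree as ET


def specialize_by_pattern(base, attr, wrapper, pattern, fields):
    """Split base's attribute value into fields and emit them as elements.

    Returns None when the value does not fit the pattern in full.
    """
    match = re.fullmatch(pattern, base.get(attr, ""))
    if match is None:
        return None
    result = ET.Element(base.tag)
    container = ET.SubElement(result, wrapper)
    for field, text in zip(fields, match.groups()):
        ET.SubElement(container, field).text = text
    return result


individual = specialize_by_pattern(
    ET.fromstring('<bookinfo publisher="Bjornsen, Bjorn"/>'),
    "publisher", "publisherIndividual",
    r"(\w+),\s+(\w+)", ["lastname", "firstname"],
)

corporate_base = ET.fromstring('<bookinfo publisher="AMLW - Amalgamated Widgets"/>')
corporate = specialize_by_pattern(
    corporate_base, "publisher", "corporatePublisher",
    r"(\w+) - (.+)", ["stock", "company"],  # widened second group (assumption)
)

# The pattern exactly as written above does not match the full value.
as_written = specialize_by_pattern(
    corporate_base, "publisher", "corporatePublisher",
    r"(\w+) - (\w+)", ["stock", "company"],
)
```

Generalizing would run the other way: join the field values back into the attribute string using the same pattern's literal separators.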

3. Bridging between definitions of controlled values and citations of controlled values

How might adopters define the controlled values for an enumeration -- especially in a way that permits extensibility of those values?

One possibility would be to use the key feature proposed for DITA 1.2 (credit to Mr. Priestley for that lightbulb):

- Use a specialized DITA topic to define the meaning of the controlled value (a meta topic, if you will).

- Use a specialized DITA map both to combine these definitional topics in groups (like operating system platform, machine type platform, audience education, and audience job) and to indicate semantic hierarchies within each group ("RedHat" is a special kind of "Linux," "appdev" is a special kind of "programmer").

- Assign a key (effectively, a local name) to each definitional topic.

- Use the keys as values in metadata attributes.

Benefits: The enumeration can be maintained by content creators without having to modify a schema definition. A process can still validate the enumeration (that is, check that the controlled values in topics have corresponding definitions). Where a controlled value without any definition might be ambiguous, a defined controlled value can be clarified by drilling down into the definitional topic. The definitions of controlled values can be shared easily between adopters and allow adopters to use different local names for the same thing (for instance, "linux" and "LinuxOS" and "unices.linux"). The taxonomic relationships can be maintained without forcing classification changes in the content. Definitional topics can be reused as content topics where the user would benefit from a definition of an unfamiliar concept. Finally, adopters can scale the formality of their practice from single controlled values to formal taxonomies without any change in their authoring infrastructure. (In fact, the DITA taxonomy specialization provides an implementation of the first two bullets above.)
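
As a rough sketch of the validation and hierarchy checks described above, the Python below models the key definitions as a plain dictionary. Everything in it is an assumption for illustration: the table layout, the lowercase key spellings, and the helper names are hypothetical, and a real implementation would read the definitions from the specialized map rather than hard-code them.

```python
# Hypothetical key definitions: key -> (group, parent key or None).
KEY_DEFS = {
    "linux": ("os", None),
    "redhat": ("os", "linux"),
    "programmer": ("job", None),
    "appdev": ("job", "programmer"),
}

# Hypothetical local names that different adopters use for the same key.
ALIASES = {"LinuxOS": "linux", "unices.linux": "linux"}


def resolve(key):
    """Map an adopter's local name onto the shared key."""
    return ALIASES.get(key, key)


def undefined_values(values):
    """Return the attribute values that have no definitional topic."""
    return [v for v in values if resolve(v) not in KEY_DEFS]


def is_kind_of(key, ancestor):
    """True if key equals ancestor or sits below it in the hierarchy."""
    key, ancestor = resolve(key), resolve(ancestor)
    while key is not None:
        if key == ancestor:
            return True
        key = KEY_DEFS.get(key, (None, None))[1]
    return False


missing = undefined_values(["redhat", "LinuxOS", "solaris"])
redhat_is_linux = is_kind_of("redhat", "unices.linux")
linux_is_redhat = is_kind_of("linux", "redhat")
```

Adding a value here means adding a definition entry, not editing a schema, which is the maintenance benefit the list of benefits opens with.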

4. Specializing an attribute that takes an enumeration

The two sides debating attribute specialization seem to focus on different things.

One side has a focus on the semantics of the attributes, submitting that, if you analyze your audience by education, by job role, or by both, you are still analyzing your audience.

The other side has a focus on the values, submitting that any enumeration of audience education values requires additional information to merge with any other enumeration of audience values.

The second side has a point. Where the base and specialized attributes have a clear semantic relationship, the base enumeration would include values that are compounds of the values from the specialized enumerations. As a result, even in the best case, the mapping is likely to be complex and partial.

For example, if operating system and machine are special kinds of platform, adopters might need mappings similar to the following:

adopter 1 (base) .......... platform = ( bigiron | openserver | wintel | handheld )

adopter 2 (specialized) ... os = ( linux | macosx | windows )
........................... machine = ( macintosh | mainframe | pc | server )

mapping ................... platform( bigiron ) MATCHES machine( mainframe )
........................... platform( openserver ) MATCHES os( linux ) OR machine( server )
........................... platform( wintel ) MATCHES os( windows ) OR machine( pc )
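
Read as data, the mapping above is a table of alternatives per base value. The Python sketch below is a hypothetical encoding of just those three rules, meant only to show how partial such a table is; the names `MAPPING`, `matches`, and `unmapped` are made up for the illustration.

```python
# Each base platform value lists the specialized (attribute, value) pairs
# that satisfy it; any one alternative is enough (the OR in the rules).
MAPPING = {
    "bigiron": [("machine", "mainframe")],
    "openserver": [("os", "linux"), ("machine", "server")],
    "wintel": [("os", "windows"), ("machine", "pc")],
}


def matches(platform, specialized):
    """True if the specialized attribute values satisfy the base platform value."""
    return any(
        specialized.get(attr) == value
        for attr, value in MAPPING.get(platform, [])
    )


def unmapped(base_values):
    """Return the base values for which no mapping rule exists."""
    return [v for v in base_values if v not in MAPPING]


linux_pc = matches("openserver", {"os": "linux", "machine": "pc"})
mac = matches("wintel", {"os": "macosx", "machine": "macintosh"})
gaps = unmapped(["bigiron", "openserver", "wintel", "handheld"])
```

Even in this tiny case one base value has no rule at all, and a Linux PC satisfies "openserver" through the OS alternative alone, which hints at how quickly such mappings become debatable.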

So far as I know, no one wants to address that mapping challenge now. Besides, the DITA practice thus far has been to enable vocabulary agreements within communities in advance rather than to try to reconcile arbitrary vocabularies after the fact. So, let's acknowledge that we won't automate the mapping of values from different enumerations and thus won't automate integration of enumerations for conditional processing.

All that said -- does the first side have a point, too? If I need to enumerate the audience by education and know that I am providing an analysis of the audience, why should I be forced to treat audience education as if it were completely unrelated to audience? If I can declare that audienceEducation specializes audience, processes other than conditional processes that operate on audience semantics can recognize the values of audienceEducation as indicating something about the audience. For instance, a process might build an index of content by audience or by platform:

.......... bigiron
.......... handheld
.......... linux
.......... macintosh
.......... macosx
.......... mainframe
.......... openserver
.......... pc
.......... server
.......... windows
.......... wintel

Otherwise, each attribute that has processing that is sensitive to semantics will require a custom process.
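
A small Python sketch of that kind of semantics-aware process follows. It assumes a hypothetical declaration table saying which attribute specializes which; the topic ids and the "graduate" value are invented for the example. No values are mapped between enumerations: the index simply files each value under the base attribute.

```python
from collections import defaultdict

# Hypothetical declarations: specialized attribute -> base attribute.
SPECIALIZES = {
    "os": "platform",
    "machine": "platform",
    "audienceEducation": "audience",
}


def base_of(attr):
    """Follow the specialization declarations up to the base attribute."""
    while attr in SPECIALIZES:
        attr = SPECIALIZES[attr]
    return attr


def build_index(topics):
    """Index topic ids by base attribute and value.

    topics maps a topic id to its {attribute: value} metadata.
    """
    index = defaultdict(lambda: defaultdict(list))
    for topic_id, attrs in topics.items():
        for attr, value in attrs.items():
            index[base_of(attr)][value].append(topic_id)
    return index


topics = {
    "t1": {"platform": "bigiron"},
    "t2": {"os": "linux", "machine": "server"},
    "t3": {"audienceEducation": "graduate"},
}
index = build_index(topics)
platform_values = sorted(index["platform"])
```

One generic indexer covers `platform`, `os`, and `machine` alike, because the declaration table, not the code, carries the relationship.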

Hoping that's useful,

Erik Hennum
ehennum@us.ibm.com
