DITA Proposed Feature 9

Add a <data> element for representing machine-processable values within DITA topics and maps.

Longer description

The problem: Currently, DITA provides limited extensibility for properties as well as embedded data (such as the form fields that word processors can embed in content).

For the topic as a whole, a designer can only specialize single values from the <othermeta> element. The designer can't define complex data structures comparable to the existing <audience> and <prodinfo> properties. As a result, specializers are forced to specialize body content to create complex data structures. (For examples, see the eNote record-oriented demo and the bookinfo data for the bookmap demo in the DITA Open Toolkit.) These data structures are not part of the topic content and thus don't belong in the body.

Within the topic content, a designer can specialize data from the <state> element. Again, the <state> element supports only single values. As a result, specializers are forced to abuse discourse elements such as <ph> for data that isn't a textual phrase.

The solution: Add a <data> element for values intended to be consumed primarily by automated processes. Typical applications would include both complex metadata structures and hybrid documents with both discourse and data values. You can nest <data> elements for structures and specialize the <data> element for more precise semantics and for constraints on structures and values.
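
As a minimal sketch of the nesting described above, an unspecialized structure might look like the following. The property names and values are hypothetical. The name and value attributes are the ones proposed under Technical Requirements, and placing <data> inside <prolog> is an assumption, since this note does not fix the list of contexts.

```xml
<topic id="sample">
  <title>Sample topic</title>
  <prolog>
    <!-- Hypothetical review record: the outer <data> groups, the inner <data> elements hold single values -->
    <data name="reviewRecord">
      <data name="reviewer" value="Sachiko"/>
      <data name="reviewStatus" value="approved"/>
    </data>
  </prolog>
  <body>
    <p>Discourse content; base formatting would skip the review record.</p>
  </body>
</topic>
```

A specialization could replace each generic <data> element with a named element to constrain which values may appear and in what order.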

A process could harvest the data values for a machine-processable representation such as RDF. Formatting for discourse should skip the <data> element by default. A specialization could, however, extend processing to include data values in some formatted outputs (again, similar to form fields in word processor formats).
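
To illustrate the kind of harvest described above, the sketch below shows RDF/XML that a hypothetical process might emit for a topic named sometopic.dita that carries a maintainer property. The ex namespace URI and the mapping from property name to RDF predicate are assumptions made for illustration; this proposal does not define such a mapping.

```xml
<!-- Hypothetical harvest output for <maintainer>Sachiko</maintainer> applied to sometopic.dita -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.com/properties#">
  <!-- Subject: the topic that the property describes -->
  <rdf:Description rdf:about="sometopic.dita">
    <!-- Predicate taken from the property name; the object is the literal value -->
    <ex:maintainer>Sachiko</ex:maintainer>
  </rdf:Description>
</rdf:RDF>
```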

Note: It is an abuse of the architecture to specialize the <data> element for content within the discourse flow, such as a special kind of paragraph within the topic body. When generalized and formatted with base processes, the paragraph would be skipped, mangling the discourse flow.

The <data> element is a more powerful alternative to the <state> and <othermeta> elements. The <state> element could, in fact, be specialized from the <data> element and deprecated.

The following references are pertinent:

Original proposal
http://lists.oasis-open.org/archives/dita-comment/200407/msg00006.html - modified in this proposal.
Discussion thread
http://www.oasis-open.org/apps/org/workgroup/dita/email/archives/200507/msg00093.html - incorporated into this revised proposal.
RDF/A
http://www.w3.org/MarkUp/2004/rdf-a - proposal for adding data extensibility to XML vocabularies and especially XHTML.

Scope

Major only because the implications need to be considered carefully. The actual design and implementation should be small.

Use Case

Here are some specific examples of potential uses of the <data> element:

Technical Requirements

Design impact: a new element would be added with a definition similar to the following (in DTD syntax):

<!ELEMENT data      (#PCDATA|%keyword;|%term;|%image;|%object;|%ph;|%data;)*>
<!ATTLIST data    %univ-atts;
                  name        CDATA #IMPLIED
                  label       CDATA #IMPLIED
                  typeid      CDATA #IMPLIED
                  abouthref   CDATA #IMPLIED
                  abouttype   CDATA #IMPLIED
                  aboutformat CDATA #IMPLIED
                  value       CDATA #IMPLIED
                  href        CDATA #IMPLIED
                  type        CDATA #IMPLIED
                  format      CDATA #IMPLIED
                  outputclass CDATA #IMPLIED
>
contents
Available to supply the value for the property, especially when the value has substructure that requires subordinate elements. Because values are atomic, a specialized <data> element would typically not support mixed content but would instead restrict the content model to specific subordinate elements, to text, or to an empty model (nothing). The nested <ph> element allows properties containing small discourse units (for instance, for an author blurb), but the <data> element doesn't contribute to the main discourse flow for the topic.
univ-atts attribute list
Optional id, conref, metadata selection, xml:lang, and translate attributes. Specialized data can be shared by conref across multiple documents.
name attribute
Available to identify the semantic of the value. This attribute would often be defaulted in the specialized element.
label attribute
Available to provide a label for the value in input editors and output reports. This attribute would often be defaulted in the specialized element.
typeid attribute
Available to supply an established public identifier (such as a URI) for the semantic of the property. This attribute would often be defaulted in the specialized element.
abouthref attribute
Available for explicit specification of the element instance that is the subject of the property. If this attribute is not supplied, the default subject is the element that contains the <data> element. For consistency with existing DITA properties, this default rule has a few exceptions. First, the <prolog>, <metadata>, and <topicmeta> elements pass through their properties to their containers. Second, where the container of <topicmeta> specifies a reference (as with a <topicref> or <navref> element), the properties apply to the target of the reference.
abouttype attribute
Available to declare the type of the subject of the property (as with the type attribute of the <xref> element) where the subject is an element instance within DITA content. Processors should report an error when the type of the actual subject differs from the declared type. This attribute could be defaulted in the specialized element to impose a constraint.
aboutformat attribute
Available to declare the format of the subject of the property (as with the format attribute of the <xref> element). The default format is DITA content.
value attribute
Available to supply a textual value for a property. In a specialized element, the value can be made mandatory, limited to an enumeration of tokens, or set to a fixed value.
href attribute
Available to supply a reference value for a property. The reference could specify a DITA element instance, a non-DITA resource such as an HTML page or web service, or a URI identifier such as a concept published in RDF.
type attribute
Available to declare the type of a referenced value for the property where the value is a referenced element instance within DITA content. As with abouttype, this attribute could be defaulted in the specialized element to impose a constraint.
format attribute
Available to declare the format of a referenced value for the property. As with aboutformat, the default format is DITA content.
outputclass attribute
Available to specify a style.
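
The attribute descriptions above mention defaulting and constraining attributes in a specialized element. A minimal sketch of that idea, written as a self-contained document with an internal DTD subset, might look like the following. The element name <realEstateLot> is borrowed from the examples under Benefits. The label text is hypothetical, and the %univ-atts; attributes and the class attribute that a real specialization module would declare are omitted to keep the sketch short.

```xml
<?xml version="1.0"?>
<!DOCTYPE realEstateLot [
  <!-- Empty content model: the value is carried by the value attribute only -->
  <!ELEMENT realEstateLot EMPTY>
  <!-- name is fixed, label is defaulted, value is mandatory -->
  <!ATTLIST realEstateLot
            name   CDATA #FIXED "realEstateLot"
            label  CDATA "Lot number"
            value  CDATA #REQUIRED
  >
]>
<realEstateLot value="4003"/>
```

A validating parser would reject an instance that omits the value attribute, which is one way a specialization can enforce the constraints described for the value attribute.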

The new element would be added to the following contexts:

Processing impact: a default rule to ignore the <data> element silently.
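
One way to express this default rule, assuming an XSLT-based formatter, is an empty template. The sketch assumes that <data> and its specializations carry the token topic/data in their class attribute, following the usual DITA class convention; this proposal does not itself spell out the class value.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: <data> and every element specialized from it has ' topic/data ' in @class -->
  <!-- The empty body suppresses the element and everything nested inside it -->
  <xsl:template match="*[contains(@class, ' topic/data ')]"/>
</xsl:stylesheet>
```

A specialization that wants selected values to appear in formatted output would override this rule with a higher-priority template for its own elements.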

Costs

Because the <data> element is optional in all contexts, there are no migration costs.

As noted above, design impacts are small and processing impacts are minimal.

Editors and content management systems must implement support for nested values.

Benefits

The benefit of the <data> element is primarily for specialization. Without the <data> element, data structures have to be implemented in the topic body because topic metadata doesn't support complex structures. Semantically, this workaround (usually with the <ph> or <keyword> element) makes the false promise that the data content is discourse.

A clean basis for extensible data provides benefits to everyone who works with complex metadata and opens up the potential for DITA form-like and transactional documents. The rest of this section lists some examples of potential specializations of the data element in different subject areas.

The following example identifies the properties of a book (where <bkrights> and everything it contains are specialized from the <data> element):

<bookmap>
  <bkrights>
    <bkcopyrfirst><year>2003</year></bkcopyrfirst>
    <bkcopyrlast><year>2005</year></bkcopyrlast>
    <bkowner>
      <organization>
        <orgname>XYZ, Inc</orgname>
        <phone>123-456-7890</phone>
        <resource href="http://www.xyz.com/"/>
      </organization>
    </bkowner>
  </bkrights>
  ...
</bookmap>

The following example specifies source code delimiters for automatic refresh of a code fragment (where the <sourceFile>, <startDelimiter>, and <endDelimiter> elements are specialized from <data> but the <codeFragment> is specialized from <codeblock>):

<example>
  <title>An important coding technique</title>
  <codeFragment>
    <sourceFile     value="helloWorld.java"/>
    <startDelimiter value="FRAGMENT_START_1"/>
    <endDelimiter   value="FRAGMENT_END_1"/>
    ...
  </codeFragment>
</example>

The following example identifies a real estate property for a house description (where the <realEstateProperty> and everything it contains are specialized from <data> but <houseDescription> is specialized from <section>):

<houseDescription>
  <title>A great home for sale</title>
  <realEstateProperty>
    <realEstateBlock value="B7"/>
    <realEstateLot   value="4003"/>
    ...
  </realEstateProperty>
  <p>This elegant....</p>
  <object data="B7_4003_tour360Degrees.swf"/>
</houseDescription>

The following example identifies the maintainer of the topic (where <maintainer> is specialized from the <data> element):

<topicref href="sometopic.dita">
  <topicmeta>
    <maintainer>Sachiko</maintainer>
  </topicmeta>
  ...
</topicref>

Time Required

Assuming no changes to this proposal, 4 person-hours to implement the DTD and Schema changes, add the default processing rule, and expand this note into a formal specification.