
relax-ng message



Subject: Datatype assignment for TREX


I've been thinking about how to do datatype assignment for TREX.  By
"datatype assignment" I mean, given a TREX pattern and a document,
assigning a datatype to each text node and attribute value
(<anyString/> can be considered the ur-type).

For some patterns and some documents, this requires lookahead (or
multiple passes).

  <choice>
    <element name="b">
      <element name="c">
         <data type="xsd:int"/>
      </element>
      <element name="d">
        <empty/>
      </element>
    </element>
    <element name="b">
      <element name="c">
         <anyString/>
      </element>
      <element name="e">
        <empty/>
      </element>
    </element>
  </choice>

For example, in

 <b><c>1</c><d/></b>

c has datatype xsd:int, but in

 <b><c>1</c><e/></b>

c has no datatype (i.e. anyString).

The approach I have taken is to come up with a constraint on patterns
that can be checked independently of any source document and which is
sufficient to ensure that an implementation can always easily tell what
the datatype of a text node or attribute value is.

The basic idea is to require that it always be possible to determine the
datatype of a text node or attribute value just using the names of the
element and attribute ancestors.

To describe the algorithm we need the concept of the "direct
descendants" of a pattern.  The "direct descendants" of a pattern P are
all the descendants of P that would be visited by a walk of the
descendants of P that follows <ref> elements but does not look inside
<element> and <attribute> elements and only looks at the first
subpattern of a <concur> element.  For example, assuming the following
definition

<define name="x">
  <element name="x"><empty/></element> <!-- 1 -->
</define>

the direct descendants of this pattern:

<choice>
  <ref name="x"/>
  <anyString/> <!-- 2 -->
  <element name="y"> <!-- 3 -->
     <element name="z"><empty/></element> <!-- 4 -->
  </element>
</choice>

are the patterns labelled 1, 2 and 3, but not the pattern labelled 4.
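
As a rough illustration of the walk described above, here is a minimal
Python sketch.  It assumes the definitions and the pattern sit inside a
<grammar> wrapper with no XML namespaces, and it replaces each <ref> by
the body of its definition rather than reporting the <ref> itself; the
function name direct_descendants and the "seen" set (which stops a
definition being expanded twice) are choices of the sketch, not
anything TREX prescribes.

```python
import xml.etree.ElementTree as ET

# Tags whose content the walk must not enter.
NO_DESCENT = {"element", "attribute"}

def direct_descendants(pattern, defines, seen=None):
    """Return the direct descendants of `pattern`, in document order."""
    seen = set() if seen is None else seen
    children = list(pattern)
    if pattern.tag == "concur":
        children = children[:1]          # only the first subpattern of <concur>
    found = []
    for child in children:
        if child.tag == "ref":
            name = child.get("name")
            if name not in seen:         # expand each definition at most once
                seen.add(name)
                found.extend(direct_descendants(defines[name], defines, seen))
            continue
        found.append(child)
        if child.tag not in NO_DESCENT:
            found.extend(direct_descendants(child, defines, seen))
    return found

GRAMMAR = """
<grammar>
  <define name="x">
    <element name="x"><empty/></element>
  </define>
  <start>
    <choice>
      <ref name="x"/>
      <anyString/>
      <element name="y">
        <element name="z"><empty/></element>
      </element>
    </choice>
  </start>
</grammar>
"""

root = ET.fromstring(GRAMMAR)
defines = {d.get("name"): d for d in root.findall("define")}
pattern = root.find("start/choice")
found = [(d.tag, d.get("name")) for d in direct_descendants(pattern, defines)]
```

Here "found" holds the element named x, the <anyString/> and the
element named y; the element named z is never reached.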

The elements in TREX that can match sequences of characters are
<anyString>, <string>, <data> and any element with a
trex:role="datatype" attribute.  Let's call these character elements.
We say that a set of character elements is ambiguous if any of the
following conditions apply:

1. it contains two distinct data or trex:role="datatype" elements

2. it contains both a data or trex:role="datatype" element and a
<anyString> element

3. (i) it contains both a data or trex:role="datatype" element and a
<string> element, and (ii) the content of the <string> element may be a
value of the datatype
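
The three conditions can be written down fairly directly.  The Python
sketch below is only an illustration under some assumptions: character
elements are modelled as small records rather than XML nodes, "data"
stands for both <data> and trex:role="datatype" elements, "distinct" in
condition 1 is read as "naming different datatypes", and could_be_value
is a hypothetical stand-in for a real datatype library that only knows
xsd:int and otherwise answers "maybe".

```python
import re
from typing import NamedTuple

class CharElem(NamedTuple):
    kind: str         # "anyString", "string" or "data"
    detail: str = ""  # datatype name for "data", literal text for "string"

INT_RE = re.compile(r"[+-]?[0-9]+")

def could_be_value(datatype, text):
    """Hypothetical check: might `text` be a value of `datatype`?"""
    if datatype == "xsd:int":
        s = text.strip()
        return bool(INT_RE.fullmatch(s)) and -2**31 <= int(s) <= 2**31 - 1
    return True  # unknown datatype: conservatively assume it might be

def is_ambiguous(elems):
    datatypes = {e.detail for e in elems if e.kind == "data"}
    # Condition 1: two distinct datatypes.
    if len(datatypes) > 1:
        return True
    # Condition 2: a datatype together with <anyString>.
    if datatypes and any(e.kind == "anyString" for e in elems):
        return True
    # Condition 3: a datatype together with a <string> whose content
    # may be a value of that datatype.
    return any(
        e.kind == "string" and could_be_value(dt, e.detail)
        for dt in datatypes
        for e in elems
    )
```

With this reading, an xsd:int next to <string>abc</string> is not
ambiguous, while an xsd:int next to <string>1</string> is.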

Now we can describe the constraint on patterns, which I will call "easy
datatype assignment".

A pattern P has "easy datatype assignment" if the following conditions
are all satisfied.

1. The set of direct descendants of P that are character elements is
not ambiguous

2. For any name x, take the direct descendants of P that are <element>
elements with a name class that x matches; the pattern consisting of a
choice of the content patterns of all such <element> elements must also
have "easy datatype assignment"

3. Same as 2, but for <attribute> elements instead of <element>
elements.

For example, in determining whether the example above has easy datatype
assignment, we would look at (by applying step 2 with x="b"):

  <choice>
    <group>
      <element name="c">
         <data type="xsd:int"/>
      </element>
      <element name="d">
        <empty/>
      </element>
    </group>
    <group>
      <element name="c">
         <anyString/>
      </element>
      <element name="e">
        <empty/>
      </element>
    </group>
  </choice>

and (by applying step 2 with x="c"):

  <choice>
    <data type="xsd:int"/>
    <anyString/>
  </choice>

which does not have easy datatype assignment, because the set of direct
descendant character elements is ambiguous.

There are a few subtleties to the implementation (to avoid infinite
recursion and deal with wildcards), but it's basically quite
straightforward.
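
To make the recursion concrete, here is a minimal Python sketch of the
check for the simplest case only.  It assumes patterns without <ref>
and without wildcard name classes (every <element> and <attribute>
carries a plain name attribute), so the subtleties just mentioned do
not arise.  It also treats any <string> next to a <data> as ambiguous,
which is stricter than condition 3 requires, and it ignores
trex:role="datatype" elements.  The function names are illustrative.

```python
import xml.etree.ElementTree as ET

CHAR_TAGS = {"anyString", "string", "data"}

def direct_descendants(pattern):
    """Walk below `pattern` without entering <element> or <attribute>."""
    children = list(pattern)
    if pattern.tag == "concur":
        children = children[:1]      # only the first subpattern of <concur>
    found = []
    for child in children:
        found.append(child)
        if child.tag not in ("element", "attribute"):
            found.extend(direct_descendants(child))
    return found

def ambiguous(chars):
    datatypes = {c.get("type") for c in chars if c.tag == "data"}
    if len(datatypes) > 1:                                  # condition 1
        return True
    if datatypes and any(c.tag == "anyString" for c in chars):
        return True                                         # condition 2
    # Condition 3, conservatively: any <string> beside a <data>.
    return bool(datatypes) and any(c.tag == "string" for c in chars)

def has_easy_datatype_assignment(pattern):
    """`pattern` should be a container such as <choice> or <group>."""
    dd = direct_descendants(pattern)
    if ambiguous([d for d in dd if d.tag in CHAR_TAGS]):
        return False
    for tag in ("element", "attribute"):    # steps 2 and 3
        by_name = {}
        for d in dd:
            if d.tag == tag:
                by_name.setdefault(d.get("name"), []).append(d)
        for same_name in by_name.values():
            # Build the choice of the content patterns, one <group> each.
            merged = ET.Element("choice")
            for e in same_name:
                ET.SubElement(merged, "group").extend(list(e))
            if not has_easy_datatype_assignment(merged):
                return False
    return True

EXAMPLE = ET.fromstring("""
<choice>
  <element name="b">
    <element name="c"><data type="xsd:int"/></element>
    <element name="d"><empty/></element>
  </element>
  <element name="b">
    <element name="c"><anyString/></element>
    <element name="e"><empty/></element>
  </element>
</choice>
""")
result = has_easy_datatype_assignment(EXAMPLE)
```

For the example pattern the recursion goes through x="b" and then
x="c", exactly as in the walk-through above, and "result" is False.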

The main limitations are

1. It doesn't deal well with some uses of concur.

2. Suppose I have a "foo" element which can contain either ints or
strings according to the value of a "type" attribute:

<choice>
  <element name="foo">
     <attribute name="type">
        <string>string</string>
     </attribute>
     <data type="xsd:string"/>
  </element>
  <element name="foo">
     <attribute name="type">
        <string>int</string>
     </attribute>
     <data type="xsd:int"/>
  </element>
</choice>

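This would not satisfy my constraint: applying step 2 with x="foo"
gives a choice whose direct descendants include two distinct <data>
elements, which is ambiguous by condition 1. (On the other hand, if the
datatypes are explicit in the instance, then datatype assignment needn't
involve schema processing at all.)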

James


