CDATA Confusion

Joe English
Last updated: Fri Oct 6 19:13:06 PDT 1995



1 Introduction

The keyword CDATA has (by my count) at least five different meanings in SGML. This tends to cause a great deal of confusion. CDATA is commonly misunderstood to mean ``no markup is recognized'' or ``verbatim text'', but this is not always the case.

2 Attribute declared values

Attributes may have CDATA as their declared value. For example (from the HTML 2.0 DTD):

<!ATTLIST IMG
        ...
        ALT CDATA #IMPLIED
        ...
>

This means that the attribute value may contain arbitrary character data (as opposed to an ID, a NAME or NAMES, a NUMBER or NUMBERS, et cetera). Unlike the other attribute types, CDATA attribute values are not folded to upper case and are not tokenized.
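
To make the contrast with tokenized attribute types concrete, here is a small sketch. The fig element and its align and caption attributes are hypothetical (they are not taken from the HTML DTD), and the result described assumes the reference concrete syntax, in which NAMECASE GENERAL is YES.

```sgml
<!ELEMENT fig - O EMPTY>
<!ATTLIST fig
        align   NAME  #IMPLIED
        caption CDATA #IMPLIED
>

<fig align=" left " caption=" Left  panel ">
```

Under those assumptions a parser should report align as LEFT (surrounding spaces stripped, name folded to upper case), while caption keeps the case and spacing it was written with.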

Note that attribute value literals are always parsed as replaceable character data, regardless of the attribute's declared value. This means that references (&xxx;, &#yyy;) are recognized and replaced in attribute specifications, even for CDATA attributes.

For example, this HTML fragment:

    <IMG SRC="eqn1.gif" ALT = "A &lt; B">

will be displayed as

    A < B

in a text-mode browser or with image loading turned off (assuming the browser is working properly, of course).

3 Internal entity declarations

The second most common source of confusion is in entity declarations:

<!ENTITY amp CDATA "&#38;"	-- ampersand -- >

Here, the CDATA keyword signals that the entity is a character data entity (as opposed to a text entity, an SDATA entity, or a PI entity).

In this case, no markup is recognized in the replacement text when the entity is referenced. Note, however, that character references (&#nnn;) and parameter entity references (%name;) are recognized in parameter literals, so some references are expanded when the entity is declared.

For example,

<!ENTITY foo       "BAR"  >
<!ENTITY e1        "&foo;">
<!ENTITY e2  CDATA "&foo;">
<!ENTITY e3        "&#38;foo;"  -- &#38; = ampersand or ERO delimiter -->
<!ENTITY e4  CDATA "&#38;foo;" >

will be replaced as follows:

    foo: BAR
    e1:  BAR
    e2:  &foo;
    e3:  BAR
    e4:  &foo;

The entities e1 through e4 all have the same replacement text, namely &foo;. The difference is that when e1 and e3 are referenced, the parser treats the replacement text as if it had appeared in the document directly, so &foo; is itself parsed as an entity reference. On the other hand, since e2 and e4 are data entities, the parser inserts the replacement text literally.
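
The declaration-time expansion of parameter entity references can be sketched in the same way. The entity names version and banner are made up for illustration; foo is the entity declared above.

```sgml
<!ENTITY % version "2.0">
<!ENTITY banner CDATA "Version %version; of &foo;">
```

Here %version; is replaced while the declaration is being parsed, so the replacement text of banner is Version 2.0 of &foo;. Because banner is a CDATA entity, a reference to it delivers exactly those characters, and &foo; is not expanded.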

4 External entity declarations

External entities may be declared as CDATA, with an associated data content notation:

<!NOTATION some-notation SYSTEM>
<!ENTITY foo1 SYSTEM "foo.sgml">
<!ENTITY foo2 SYSTEM "foo.sgml" CDATA some-notation>

Here, CDATA means much the same thing as it does for internal entities: the entity's replacement text is to be treated as literal character data, and the parser does not scan for markup.

(In fact, ESIS-producing parsers such as SGMLS don't even examine the content of data entities at all, and simply report the reference.)

External entities may also be declared as SDATA, NDATA, or SUBDOC.
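
For comparison, the other external entity types might be declared along these lines. The entity names and system identifiers are hypothetical, the notation is the some-notation declared above, and SUBDOC entities additionally require the SUBDOC feature to be enabled in the SGML declaration.

```sgml
<!ENTITY fig1  SYSTEM "fig1.gif"   NDATA some-notation>
<!ENTITY sym1  SYSTEM "sym1.txt"   SDATA some-notation>
<!ENTITY chap1 SYSTEM "chap1.sgml" SUBDOC>
```

Note that SUBDOC, unlike the three data entity types, takes no notation name.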

5 Marked sections

CDATA may appear as a status keyword in a marked section declaration:

<![ CDATA [  blah, blah, blah. ]]>

The only markup that is recognized in a CDATA marked section is the marked section end, an MSC (]]) delimiter followed by an MDC (>), which closes the marked section. CDATA marked sections are the preferred method for entering ``verbatim text'' in an SGML document.
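
As a small illustration (the sample text is invented), everything between the brackets below is passed through as data, including the characters that would otherwise start a tag or an entity reference:

```sgml
<![ CDATA [<P>Use &lt; for "<" and &amp; for "&".</P>]]>
```

The catch is that the text itself cannot contain the sequence ]]>, since that would end the marked section early.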

Other marked section status keywords are

RCDATA
Replaceable character data -- recognize references, but not tags or other markup.
IGNORE
Skip the marked section entirely.
INCLUDE
The opposite of IGNORE; useful for making ``conditional text''.
TEMP
Treated the same as INCLUDE; it merely flags the section as temporary, so that it can be found and removed later.
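
A sketch of how these keywords are typically used follows. The parameter entity name draft is hypothetical, and the RCDATA line assumes that lt is declared as in the HTML DTD.

```sgml
<!-- in the DTD or the internal declaration subset -->
<!ENTITY % draft "IGNORE">

<!-- in the document instance -->
<![ RCDATA [A &lt; B, and <B> is not a tag here]]>
<![ %draft; [ Reviewers: the figures in this section are not final. ]]>
```

The RCDATA section yields the data A < B, and <B> is not a tag here. The second section is skipped as long as draft is declared as IGNORE; changing that one declaration to INCLUDE makes the note appear.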

6 Element declared content

Elements may have a declared content of CDATA.

Don't use this feature if you're designing a DTD. It's evil. In fact, you're better off forgetting about CDATA, RCDATA, and ANY declared content altogether. SGML is much less confusing if you ignore them.

In case you want to know the whole story, there are five choices for an element's declared content:

  1. A model group;
  2. CDATA;
  3. RCDATA;
  4. ANY;
  5. EMPTY (the element has no content at all).

The first option is the normal case; for example:

<!ELEMENT letter - - (recipient, salutation, body, closing, (attach|cc)*) >

The other keywords may appear instead of a content model; they are not legal in model groups:

<!ELEMENT badnews1	- - CDATA 	>
<!ELEMENT badnews2	- - RCDATA 	>
<!ELEMENT badnews3	- - ANY		>

CDATA declared content means that, when the start-tag for that element is seen, the parser switches to a delimiter recognition mode in which no markup is recognized except for the ETAGO ("</") delimiter-in-context. RCDATA declared content is similar, except that general entity references and character references are also recognized (and replaced). In both cases, as soon as the parser encounters an ETAGO followed by a letter, the delimiter recognition mode changes back (even if the end-tag is invalid).
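
A hypothetical instance shows why this is fragile. It uses the badnews1 and badnews2 declarations above, assumes that amp is declared as in section 3, and uses oops as a made-up tag name:

```sgml
<badnews1>if a < b &amp; c then print "</oops>"</badnews1>
<badnews2>if a < b &amp; c then print "ok"</badnews2>
```

In the first line, &amp; stays as five literal characters, but the </o inside the quoted string is recognized as the start of an end-tag, so the parser reports an error and stops treating the rest as character data. In the second line nothing looks like an end-tag until the real one, and &amp; is replaced by a single ampersand.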

ANY declared content means that the parser does not attempt to validate the content of the element; it may contain any subelements or character data, in any order. (The subelements must be declared in the DTD, and their content is validated.)

Another source of confusion is the distinction between CDATA and RCDATA declared content and the #PCDATA content model token:

<!ELEMENT badnews1 - -  CDATA >
<!ELEMENT phrase   - -	(#PCDATA) >
<!-- this is illegal:
    <!element phrase - - #PCDATA > 
-->

Notice that CDATA does not (and cannot) appear inside a parenthesized model group, and it is not prefixed with an RNI (#) delimiter like #PCDATA is.
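
The practical difference can be seen by putting the same text in both elements. This is only a sketch; it assumes that lt is declared as in the HTML DTD.

```sgml
<phrase>A &lt; B</phrase>
<badnews1>A &lt; B</badnews1>
```

The content of phrase is parsed, so the application sees A < B. The content of badnews1 is not, so the application sees the eight characters A &lt; B.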