Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>
This Standard defines XML encodings of the 19 standard character entity sets defined in Non-normative Annex D of [SGML].
This is a working draft constructed by the editor. It is not an official committee work product and may not reflect the consensus opinion of the committee.
Please send comments on this specification to the <docbook@lists.oasis-open.org> list. To subscribe, send an email message to <docbook-request@lists.oasis-open.org> with the word "subscribe" as the body of the message.
Copyright © 2001, 2002 The Organization for the Advancement of Structured Information Standards [OASIS]. All Rights Reserved.
This Standard defines XML encodings of the standard SGML character entity sets.
Non-normative Annex D of [SGML] defines 19 standard SGML character entity sets: Added Latin 1, Added Latin 2, Greek Letters, Monotoniko Greek, Russian Cyrillic, Non-Russian Cyrillic, Numeric and Special Graphic, Diacritical Marks, Publishing, Box and Line Drawing, General Technical, Greek Symbols, Alternative Greek Symbols, Added Math Symbols: Ordinary, Added Math Symbols: Binary Operators, Added Math Symbols: Relations, Added Math Symbols: Negated Relations, Added Math Symbols: Arrow Relations, Added Math Symbols: Delimiters. The SGML declarations for these entities use the specific character data (SDATA) entity type, which is not supported in XML, so alternative XML declarations are necessary.
In XML, the specific character data of most entities can be expressed as a [Unicode] character.
The character entity sets defined by this Standard are summarized in Appendix A through Appendix S.
Before these entities can be used in a document, they must be declared. Entities can be declared in the external subset or the internal subset, as described in [XML]. An example document, with the declaration in the internal subset, is shown in Example 1.
Example 1. Declaring and Using the ISO Latin 1 Character Entity Set
<!DOCTYPE doc [
<!ENTITY % iso-lat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
"http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-lat1.ent">
%iso-lat1;
]>
<doc>
<p>This document declares the ISO Latin 1 Character Entity Set, providing
access to the ISO Latin 1 entities, such as "&eacute;" and "&copy;".</p>
</doc>
Non-validating XML parsers may choose not to process externally declared entities. This Standard does not alter the semantics of XML processors. If a processor does not see the declaration for an entity, it will not be able to report the correct replacement text for that entity.
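The effect of a missing declaration can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser, which does not fetch external entities, so the two declarations are written directly in the internal subset rather than pulled in through `%iso-lat1;`. The helper name `root_text` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: the two declarations are inlined here instead of being
# supplied by the %iso-lat1; parameter entity shown in Example 1.
DECLARED = """<!DOCTYPE doc [
<!ENTITY eacute "&#xE9;">
<!ENTITY copy "&#xA9;">
]>
<doc>&eacute; and &copy;</doc>"""

# The same content with no declarations at all.
UNDECLARED = "<doc>&eacute; and &copy;</doc>"


def root_text(document: str) -> Optional[str]:
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        # The parser never saw a declaration, so it cannot supply
        # replacement text for the reference.
        return None
```

With the declarations present the references resolve to their replacement text; without them this particular parser rejects the document. Other processors may behave differently when declarations are missing.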
The replacement text of some entities includes more than a single Unicode character. Some characters are composed with the "combining reverse solidus overlay" (20E5) and some are composed with a variation selector (FE00, FE01, …).
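One practical consequence is that an entity reference may expand to more than one code point. The following sketch, using Python's standard `unicodedata` module, lists the code points in a replacement text. The base characters are chosen only for illustration and are not taken from the entity sets; the helper name `describe` is hypothetical.

```python
import unicodedata
from typing import List

# Hypothetical replacement texts: an arbitrary base character followed by
# the combining overlay or a variation selector named in the text.
OVERLAID = "\u2220\u20E5"  # base + U+20E5
VARIANT = "\u2268\uFE00"   # base + U+FE00


def describe(replacement: str) -> List[str]:
    """Return one 'U+XXXX NAME' string per code point in the replacement text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in replacement]
```

Code that consumes replacement text should therefore not assume that one entity corresponds to exactly one character.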
Historically, the inodot entity is multiply defined in iso-lat2.ent and iso-amso.ent. If both entity sets are included, some parsers will warn about redefinition of this entity. The warning can be ignored.
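The warning is harmless because, under [XML], the first declaration of an entity is binding and later ones are ignored. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser and two inlined declarations standing in for the copies in the two entity sets (the value U+0131 and the deliberately different second value are assumptions made for illustration):

```python
import xml.etree.ElementTree as ET

# Assumption: both declarations are inlined to stand in for the copies in
# iso-lat2.ent and iso-amso.ent. The second value is deliberately different
# so that the outcome is visible.
DOCUMENT = """<!DOCTYPE doc [
<!ENTITY inodot "&#x131;">
<!ENTITY inodot "SECOND">
]>
<doc>&inodot;</doc>"""

# The first declaration wins; the second is silently ignored by this parser.
resolved = ET.fromstring(DOCUMENT).text
```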
There are a small number of entities that have no [Unicode] representation. These entities are all mapped to the Unicode character "FFFD", the "replacement character".
| Entity Name | Entity Set | Description |
|---|---|---|
| fjlig | iso-pub.ent | Small fj ligature |
| gnap | iso-amsn.ent | Greater, not approximate |
| jnodot | iso-amso.ent | Small j, no dot |
| lnap | iso-amsn.ent | Less, not approximate |
| lpargt | iso-amsc.ent | Greater than, left arc |
| nsmid | iso-amsn.ent | Negated short mid |
| prnE | iso-amsn.ent | Precedes, not double equals |
| rpargt | iso-amsc.ent | Right paren, greater than |
| scnE | iso-amsn.ent | Succeeds, not double equals |
| smid | iso-amsr.ent | Short mid |
| vsubnE | iso-amsn.ent | Subset not double equals, variant |
Users needing these characters will have to rely on the private use area or other non-portable mechanisms to access them.
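Because these entities all expand to the same replacement character, the original entity name cannot be recovered after parsing. One possible approach is to check the source text before parsing. The sketch below is a simple, hypothetical lint helper in Python; it scans raw text with a regular expression and does not account for comments or CDATA sections.

```python
import re
from typing import List

# Entity names listed in the table above as having no Unicode representation.
UNMAPPED = frozenset({
    "fjlig", "gnap", "jnodot", "lnap", "lpargt", "nsmid",
    "prnE", "rpargt", "scnE", "smid", "vsubnE",
})

# General entity references look like "&name;" in source text.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def unmapped_references(source: str) -> List[str]:
    """Return the unmapped entity names referenced in source, in order of first use."""
    seen: List[str] = []
    for name in ENTITY_REF.findall(source):
        # Entity names are case sensitive, so the lookup is exact.
        if name in UNMAPPED and name not in seen:
            seen.append(name)
    return seen
```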
There are a few more for which there is no specific [Unicode] representation but where a reasonable substitution has been used:
| Entity Name | Entity Set | Substitution | Description |
|---|---|---|---|
| bepsi | iso-amsr.ent | 220D | Back epsilon: such that |
| ges | iso-amsr.ent | 2265 | Greater-or-equal, slanted |
| gvnE | iso-amsn.ent | 2269 | Gt, vert, not double equals |
| iff | iso-tech.ent | 21D4 | If and only if |
| les | iso-amsr.ent | 2264 | Less-than-or-equal, slanted |
| lozf | iso-pub.ent | 2726 | Lozenge, filled |
| lvnE | iso-amsn.ent | 2268 | Less, vert, not double equals |
| nge | iso-amsn.ent | 2271 | Neither greater-than nor equal to |
| nle | iso-amsn.ent | 2270 | Not less-than-or-equal |
| npre | iso-amsn.ent | 22E0 | Not precedes, equals |
| nsce | iso-amsn.ent | 22E1 | Not succeeds, equals |
| nspar | iso-amsn.ent | 2226 | Not short parallel |
| pre | iso-amsr.ent | 227C | Precedes, equals |
| spar | iso-amsr.ent | 2225 | Short parallel |
| ssetmn | iso-amsb.ent | 2216 | Small set minus (reverse solidus) |
| star | iso-pub.ent | 22C6 | Star operator |
| starf | iso-pub.ent | 2605 | Black star |
| thkap | iso-amsr.ent | 2248 | Thick approximate |
| thksim | iso-amsr.ent | 223C | Thick similar |
| vsubne | iso-amsn.ent | 228A | Subset, not equals, variant |
| vsupnE | iso-amsn.ent | 228B | Superset, not double equals, variant |
| vsupne | iso-amsn.ent | 228B | Superset, not equals, variant |
| xhArr | iso-amsa.ent | 2194 | Long left and right double arr |
| xharr | iso-amsa.ent | 2194 | Long left and right arr |
| xlArr | iso-amsa.ent | 21D0 | Long left double arrow |
| xrArr | iso-amsa.ent | 21D2 | Long right double arr |
| ssmile | iso-amsr.ent | 2323 | Small smile |
| sfrown | iso-amsr.ent | 2322 | Small frown |
Users who need alternate glyphs for these characters will have to redefine the entities to use the private use area, or rely on other non-portable mechanisms.
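One way to redefine an entity is to declare it locally before the standard declaration is read, since the first declaration is binding under [XML]. The following Python sketch generates such declarations and checks the result with the standard `xml.etree.ElementTree` parser. The helper name `override_declarations` and the choice of U+E000 are hypothetical, and the second declaration is inlined to stand in for the one that iso-amsr.ent would supply.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def override_declarations(overrides: Dict[str, int]) -> str:
    """Build entity declarations mapping each name to a single code point."""
    return "\n".join(
        f'<!ENTITY {name} "&#x{code_point:X};">'
        for name, code_point in sorted(overrides.items())
    )


# Hypothetical choice: U+E000 is simply the first private use code point.
local = override_declarations({"ges": 0xE000})

# The local declaration comes first; the second one stands in for the
# declaration from iso-amsr.ent (U+2265, per the table above).
document = f"""<!DOCTYPE doc [
{local}
<!ENTITY ges "&#x2265;">
]>
<doc>&ges;</doc>"""

resolved = ET.fromstring(document).text
```

Documents that rely on such overrides are not portable: the meaning of a private use code point depends on an agreement between producer and consumer.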
Named XML entities (except for the five predefined entities) cannot be used if they are not declared. Entity declaration requires either an external or an internal subset. Some classes of applications forbid the occurrence of markup declarations in documents. For these documents, named character entities are inaccessible.
In this section, we introduce an XML vocabulary with the semantics of character entity references. This Standard defines the semantics of elements and attributes declared in the "http://www.oasis-open.org/docbook/xmlcharent/names" namespace.
This namespace contains exactly one element, char. The char element has two attributes, entity and name. They are mutually exclusive.
The entity attribute identifies characters by their character entity names. (The set of valid names is the closed set of names associated with character entity sets defined by this Standard.) Case is significant in entity names.
The name attribute identifies characters by their Unicode character names. (The set of valid names is the set of character names published in the [Unicode] specification, or any later version of that specification.) Case is insignificant in character names.
The [RELAX NG] definition of this namespace is shown in Figure 1.
Figure 1. The RELAX NG Definition of the http://www.oasis-open.org/docbook/xmlcharent/names Namespace
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/0.9"
         ns="http://www.oasis-open.org/docbook/xmlcharent/names">
  <start>
    <element name="char">
      <choice>
        <attribute name="entity">
          <ref name="EntityNames"/>
        </attribute>
        <attribute name="name">
          <ref name="UnicodeNames"/>
        </attribute>
      </choice>
    </element>
  </start>
  <define name="EntityNames">
    <!-- Logically, this is the list of ISO 9573 Character Entity Names -->
    <!-- For now, just text. -->
    <text/>
  </define>
  <define name="UnicodeNames">
    <!-- Logically, this is the list of Unicode Character Names -->
    <!-- For now, just text. -->
    <text/>
  </define>
</grammar>
Example 2 shows a sample document using this mechanism.
Example 2. Using the Character Names Element
<doc xmlns:e="http://www.oasis-open.org/docbook/xmlcharent/names">
<p>This document uses the character names element to access character
entities, such as "<e:char entity="eacute"/>" and
"<e:char name="COPYRIGHT SIGN"/>".</p>
</doc>
The character names element is limited to contexts where elements may occur. In particular, elements may not occur in XML attribute values. Note, however, that internationalization requirements such as bidirectional language support and Ruby already require structure in arbitrary contexts. It is probably an error to use attributes for human-readable content.
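A processor that understands this vocabulary might replace each char element with the character it identifies. The following Python sketch shows one possible approach using the standard `xml.etree.ElementTree` and `unicodedata` modules. The function names and the two-entry `ENTITY_TABLE` are hypothetical; a real implementation would carry the full set of entity names defined by this Standard.

```python
import unicodedata
import xml.etree.ElementTree as ET
from typing import Optional

NAMES_NS = "http://www.oasis-open.org/docbook/xmlcharent/names"
CHAR_TAG = f"{{{NAMES_NS}}}char"

# Assumption: a two-entry stand-in for the full, closed set of entity names.
ENTITY_TABLE = {"eacute": "\u00E9", "copy": "\u00A9"}


def resolve_char(entity: Optional[str], name: Optional[str]) -> str:
    """Return the character identified by exactly one of the two attributes."""
    if (entity is None) == (name is None):
        raise ValueError("char requires exactly one of 'entity' or 'name'")
    if entity is not None:
        return ENTITY_TABLE[entity]          # case is significant
    return unicodedata.lookup(name.upper())  # case is insignificant


def expand_chars(root: ET.Element) -> None:
    """Replace every char element in the tree with its character, in place."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != CHAR_TAG:
                continue
            # The character takes the element's place, followed by its tail.
            text = resolve_char(child.get("entity"), child.get("name"))
            text += child.tail or ""
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + text
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + text
            parent.remove(child)


doc = ET.fromstring(
    f'<doc xmlns:e="{NAMES_NS}"><p>"<e:char entity="eacute"/>" and '
    '"<e:char name="copyright sign"/>".</p></doc>'
)
expand_chars(doc)
```

After the call, the `p` element contains only text. Because the mechanism depends on elements, it cannot be applied to attribute values, as noted above.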
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Added Latin 1//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-lat1.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Added Latin 2//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-lat2.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Greek Letters//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-grk1.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Monotoniko Greek//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-grk2.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Russian Cyrillic//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-cyr1.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Non-Russian Cyrillic//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-cyr2.ent |
The following character entities are defined in this entity set:
Identifiers for this entity set:
| Public identifier: ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML |
| System identifier: http://www.oasis-open.org/docbook/xmlcharent/0.3/iso-num.ent |
The following character entities are defined in this entity set: