Extended Reference Concrete Syntax (ERCS)
From: http://www.allette.com.au/sgml/ercs/text/ercs.txt
Extended Reference Concrete Syntax (ERCS)
& ISO/IEC 10646-1 (Unicode 1.1) Public Character Entities
Nov 30, 1995
-------------------------------------------------------------------------------
The major technical challenge for SGML at the current time is how to support
the SGML documents of languages that require more than just ISO 646 (ASCII):
East Asian and CJK (China/Japan/Korea) documents in particular.
The proposed Extended Reference Concrete Syntaxes (ERCS) address the issues of
native-language tagging, tagging for interchange between different character
sets, and the representation of extended characters.
-------------------------------------------------------------------------------
Contents
Which? (Issues)
Why? (Goals)
What? (Overview)
Example of SGML Declaration using ERCS
Relationship to ISO/IEC 10646-1:1993 (Unicode 1.1)
ISO/IEC 10646-1 Public Character Entities
How? (Rules)
CJK Issues
European Application
Restricted Zone "compatibility" characters
Vendor's Strategy
Key Terms
Subset Repertoires and Draft Declarations
Status
-------------------------------------------------------------------------------
Request for Comments
Comments are sought on the ERCS. Email Rick Jelliffe at ricko@allette.com.au.
Who? (Acknowledgments)
Where?
Change Log
-------------------------------------------------------------------------------
NOTE: all example SGML declarations are not current: I'll fix them in the next
week or so.
Which Issues Does ERCS Address?
November 30, 1995
-------------------------------------------------------------------------------
The existing concrete syntaxes of ISO 8879, including the default Reference
Concrete Syntax (RCS), only allow certain ISO 646 (ASCII) characters to be used
for markup (tag names, short references, etc.). This is satisfactory for the
English language and for many international predefined DTDs such as DocBook and
HTML, but may not be enough for users and documents of non-Latin-based
character sets.
-------------------------------------------------------------------------------
Native-Language Markup
Much of the value of using SGML markup, especially for structure-based searches
in hypertext, is that the tag names and other markup can have meaning to the
user rather than being cryptic mnemonics. This is most true for SGML documents
that contain fielded data. So the provision of native-language tagging is a key
facility that SGML will need to supply to be successful.
So the best concrete syntax for a given character set is one that does not
artificially or gratuitously restrict what characters are available for use as
markup. In the absence of other factors, if a character appears in words in the
native language, it should be available for use in NAMEs. And similarly, if a
symbol character is readily available from the keyboard, it should be available
for use in short references.
-------------------------------------------------------------------------------
Document Interchange
In the CJK nations in particular, there are dozens of character sets. As Ken
Lunde's book Understanding Japanese Information Processing (O'Reilly 1993)
shows though, there is some order and subset correspondence between them.
However, even the largest sets are not enough for many uses; furthermore,
character sets are often augmented by user-defined characters.
When SGML documents are interchanged between systems with different character
sets, only the common characters may be used as markup and attribute values.
The excess characters must be transported in data content (PCDATA and RCDATA
data content, and CDATA attributes) through the use of entity references. This
complicates the natural approach for supporting native-language tagging: to
merely make all characters available for markup.
There are two approaches:
* "lowest-common-denominator" just uses the ISO 8879 RCS: this is the only
standardized approach at the moment;
* "greatest-common-denominator" uses the characters available from the
intersection set repertoire of all the character sets in use.
ERCS allows a standardized greatest-common-denominator approach.
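The greatest-common-denominator repertoire is simply a set intersection. The
following is a minimal Python sketch, assuming each character set in use is
available as a Python codec; the function names and the choice of the
"shift_jis" and "euc_jp" codecs are illustrative assumptions, not part of ERCS,
and the scan is limited to the BMP.

```python
def repertoire(codec, limit=0x10000):
    """Return the set of BMP code points that the given codec can encode."""
    chars = set()
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # not in this character set
        chars.add(cp)
    return chars


def common_repertoire(codecs):
    """Intersect the repertoires of all character sets in use."""
    return set.intersection(*(repertoire(c) for c in codecs))


# Hypothetical interchange between a Shift-JIS system and an EUC-JP system.
common = common_repertoire(["shift_jis", "euc_jp"])
```

Only characters in `common` would be safe for markup on both systems; anything
else would have to travel as an entity reference.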
-------------------------------------------------------------------------------
Character Representation
Users of smaller CJK character sets often may need to represent other
characters, especially Han ideographs (Kanji). ERCS proposes a three-tier
system: the document character set characters, plus an ISO 10646 public
character entity set to cover most needs, with any other characters being named
with respect to national character dictionaries.
-------------------------------------------------------------------------------
ISO 8879 SGML Declaration Improvement
The SGML Standard ISO 8879:1986 does not support CJK needs conveniently: in
fact, it is a hindrance. ERCS gives more experience in what reforms would be
useful.
I presented the syntax changes suggested by ERCS at the WG8 meeting at Broomfield
in October 1995, as Australian delegate, and with the support of Japan. This is
note N 1815 and is available as an RTF file.
-------------------------------------------------------------------------------
Goals of ERCS
March 18, 1995
-------------------------------------------------------------------------------
ERCS has the following simple goals:
* East Asians should be able to mark up documents using native-language
characters and conventions;
* East Asian languages should be supported just as simply as English
currently is;
* East Asians should be able to store and maintain documents in national
character sets;
* only solutions consistent with existing and proposed standards should be
used;
* ERCS must be readily comprehensible and implementable by vendors, and meet
basic user needs: in particular ISO 8879 should not be made more
complicated to understand or use;
* a solution should be derived "backwards" from the desirable qualities of
SGML documents and markup rather than "forwards" from a consideration of
character encoding issues. The proper subject is not "how to handle East
Asian character sets" but "how to handle East Asian documents";
* ERCS should be locale-free and character-set independent, and be flexible
enough to be useful for national character sets of emerging nations.
ERCS is primarily concerned with the needs of East Asian and CJK SGML
documents. However, ERCS should also be useful for most non-Latin-based
languages: indeed perhaps for most major non-English languages.
-------------------------------------------------------------------------------
Overview of ERCS
November 30, 1995
-------------------------------------------------------------------------------
The various parts of ERCS are:
* a large catalog of characters, giving their SGML character class and
roles;
* concrete syntaxes based on subsets of the catalog character repertoire;
* a large public entity set, based on the catalog character repertoire;
* guidelines for vendors for selecting the most useful ERCS syntaxes.
ERCS is designed to provide a simple & standard basis for vendors, and to be
convenient and powerful for users.
In essence, the ERCS says this: "if your document character set has character
X, and if you want character X to be used in markup, then this should be its
class and role."
-------------------------------------------------------------------------------
SGML Declaration using ERCS
March 30, 1995
-------------------------------------------------------------------------------
To demonstrate how much ERCS could simplify the SGML declaration, here is an
example, using a fictitious system character set that contains all the same
repertoire as ISO/IEC 10646-1.
As can be seen, it is just as simple as an English-language DTD.
%unilatsup;
%unilatext2;
%uniipa;
%unisml;
%unidcm;
%unibasbrk;
%unigrkcop;
%unicyr;
%uniarm;
%uniheba;
%uniheb;
%unihebb;
%uniarab;
%uniarabext;
%unidev;
%uniben;
%unigur;
%uniguja;
%unioriya;
%unitamil;
%unitelu;
%unikann;
%unimal;
%unithai;
%unilao;
%unigeo;
%unigeoext;
%unijamo;
%unilatextadd;
%unigrkext;
%unipunct;
%unisupsub;
%unibuck;
%unicdmfs;
%uniquasi;
%uninum;
%uniarrow;
%unimath;
%unitech;
%unicont;
%uniocr;
%unienc;
%unibox;
%uniblock;
%unishapes;
%unimisc;
%dingbats;
%unicjksym;
%unihira;
%unikata;
%unibopo;
%unijamo-c;
%unicjkmisc;
%unicjkenc;
%unicjk-c;
%unihangul;
%unihangula;
%unihagulb;
%unihan;
%uniprivate;
%unihan-c;
%unialpha-r;
%uniarab-r;
%unihalf-r;
%unicjk-r;
%unismall-r;
%uniarabb-r;
%unihalffull-r;
%unispecial;
There have been some other proposals following on from these, for new character
reference mechanisms in SGML, in particular, an external numeric character
reference and a CHAR markup declaration. Both these would provide a better
solution than the above mechanism.
-------------------------------------------------------------------------------
ISO/IEC 10646-1 Public Entity Set
June 15, 1995
-------------------------------------------------------------------------------
This is a public entity set to make all ISO 10646-1 1993 BMP (with
implementation level 3) characters available for use in any document of any
character set.
The names are based on the UCS-2 encoding, which is also available as Version
1.1 of the Unicode Character Standard. Hex numbers are used; names are not
memorable after the first few hundred, and many Han ideographs do not have
descriptive names. The Unicode 1.1 character set has about 34 168 characters:
20 192 of them are named by their code number.
The Unicode Consortium publishes lists of characters with representative glyphs
that are more convenient for users.
The convention for referring to Unicode 1.1 characters is U+xxxx. The ERCS
ISO/IEC 10646-1:1993 Public Entity Set version of this character may be invoked
with &Uxxxx;, for example &U4000;.
The public identifier for this entity set is currently:
"-//SPREAD ERCS//ENTITIES ISO/IEC 10646-1:1993 BMP UCS-2//EN"
Because this entity set is so large, the intention is that vendors should build
it into their products rather than read in the set from a file. They could use
the number in the name as an index for example, rather than using the name as a
key.
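As a rough illustration of treating the number in the name as an index, here
is a minimal Python sketch. The helper names are hypothetical, and the sketch
only handles the &Uxxxx; form with exactly four hex digits described above; it
does not model a real SGML entity manager.

```python
import re

# Matches references of the form &Uxxxx; with four hex digits.
_UCS2_ENTITY = re.compile(r"&U([0-9A-Fa-f]{4});")


def entity_name(code_point):
    """Build the entity name (without & and ;) for a BMP code point."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only BMP (UCS-2) code points have a Uxxxx name")
    return "U%04X" % code_point


def resolve_entities(text):
    """Replace each &Uxxxx; reference by the character it names."""
    return _UCS2_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)
```

For example, `entity_name(0x4000)` gives the name used in `&U4000;`, and
`resolve_entities` computes the character from the digits without any table
lookup.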
Only CJK needs this full set. A subset may be more suitable for the rest of
East Asia and other 8-bit countries.
If an explicit entity set is required, it can be constructed by repeating the
following line 60 000 times (actually, far fewer are needed), incrementing
xxxx:
The previous form follows that suggested in ISO/IEC SC18 WG9 CD 14755 on the
canonical form for displaying characters in the ISO 10646 repertoire when the
appropriate glyph is not available. An alternate form could be possible, to fit
in for Unicode fans:
A longer form of content is possible, using the ISO/IEC 10646/Unicode standard
name:
Most explicitly, a quasi-formal identifier form is possible:
And finally, the character U+FFFD REPLACEMENT CHARACTER, or its alternative "*"
could be used.
Entity Subsets
Including the full set is overkill, especially for most non-CJK applications. So
subset entity sets have been defined.
-------------------------------------------------------------------------------
ISO WG8 N 1816
The above was presented to ISO WG8. An RTF version is available.
-------------------------------------------------------------------------------
General Rules for Determining Class and Role
Dec 10, 1995
-------------------------------------------------------------------------------
The following are the general rules for classifying characters, in order.
1. All roles assigned by ISO 8879 SGMLREF are kept.
2. Otherwise, the class follows from the character's category in
ftp://unicode.org/MappingTables/UnicodeData-1.1.4.txt, which assigns a
category to every character. DELMCHAR implies usable as a short reference
delimiter. Class Lu includes Han ideographs: U+4E00 to U+9FFF and U+F900
to U+FAFF.
o Lu = Uppercase Letter = ERCS UCNMSTART
o Ll = Lowercase Letter = ERCS LCNMSTART
o Lm = Modifier Letter = ERCS UCNMCHAR
o Lo = Other Letter = ERCS UCNMSTART
o Mn = Non-Spacing Mark = ERCS UCNMCHAR (20D0-20E1 UNMCHAR but
deprecated)
o Mc = Combining Mark = ERCS UCNMCHAR
o Nd = Decimal Number = ERCS UCNMCHAR (future: DIGIT?)
o No = Other Number = ERCS UCNMCHAR
o Pd = Dash Punctuation = ERCS DELMCHAR
o Ps = Open Punctuation = ERCS DELMCHAR
o Pe = Close Punctuation = ERCS DELMCHAR
o Po = Other Punctuation = ERCS DELMCHAR
o Sm = Math Symbol = ERCS DELMCHAR
o Sc = Currency Symbol = ERCS DELMCHAR
o So = Other Symbol = ERCS DELMCHAR
o Zs = Space Separator = ERCS SEPCHAR (+ allowable SHORTREF)
o Zl = Line Separator = ERCS SEPCHAR (+ allowable SHORTREF, not to be
confused with RE or RS?)
o Zp = Paragraph Separator = ERCS SEPCHAR (+ allowable SHORTREF)
o Cc = Control or Format Character = U+0000-U+009F ERCS CONTROL
U+200C-U+206F ERCS UCNMSTART (+ allowable SHORTREF)
o Co = Other Character (e.g. private use) = ERCS SHORTREF
o Cn = Non-Character (i.e. not part of Unicode 1.1) = ERCS NSGML
3. Unicode characters in the "restricted" compatibility zone (FB00-FFEF)
present many problems. While they are, in some respects, equivalent to
other characters in the normal part of ISO 10646, ERCS does not implement
any equivalence. They should only be used with caution in markup,
especially because of the visual misinterpretation and spelling errors
they allow.
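The category-to-class table in rule 2 can be sketched in Python as follows.
This is only an approximation under stated assumptions: Python's `unicodedata`
module reflects a much later Unicode version than 1.1, so categories that do
not appear in the list above return `None`, the format characters in
U+200C-U+206F are matched by range because later versions classify them as Cf,
and rule 1 (keeping the SGMLREF roles) is not modelled. The function name is
hypothetical.

```python
import unicodedata

# Category -> ERCS class, transcribed from the list in rule 2.
ERCS_CLASS = {
    "Lu": "UCNMSTART", "Ll": "LCNMSTART", "Lm": "UCNMCHAR", "Lo": "UCNMSTART",
    "Mn": "UCNMCHAR", "Mc": "UCNMCHAR", "Nd": "UCNMCHAR", "No": "UCNMCHAR",
    "Pd": "DELMCHAR", "Ps": "DELMCHAR", "Pe": "DELMCHAR", "Po": "DELMCHAR",
    "Sm": "DELMCHAR", "Sc": "DELMCHAR", "So": "DELMCHAR",
    "Zs": "SEPCHAR", "Zl": "SEPCHAR", "Zp": "SEPCHAR",
    "Co": "SHORTREF", "Cn": "NSGML",
}


def ercs_class(ch):
    """Return the ERCS class for one character, or None if rule 2 is silent."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat in ("Cc", "Cf"):
        if cp <= 0x9F:
            return "CONTROL"
        if 0x200C <= cp <= 0x206F:
            return "UCNMSTART"
        return None  # not covered by the list above
    return ERCS_CLASS.get(cat)
```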
General Block Results
This is a general guide to what characters are most commonly found in each area
when the ERCS rules are applied.
Start  Stop  Block Name                               General Result
0020   007E  BASIC LATIN                              SGMLREF
00A0   00FF  LATIN-1 SUPPLEMENT                       SHORTREF or NAME
0100   017F  LATIN EXTENDED-A                         NAME
0180   024F  LATIN EXTENDED-B                         NAME
0250   02AF  IPA EXTENSIONS                           NAME
02B0   02FF  SPACING MODIFIER LETTERS                 NAME
0300   036F  COMBINING DIACRITICAL MARKS              NAME
0370   03CF  BASIC GREEK                              NAME
03D0   03FF  GREEK SYMBOLS AND COPTIC                 NAME
0400   04FF  CYRILLIC                                 NAME
0530   058F  ARMENIAN                                 NAME
0590   05CF  HEBREW EXTENDED-A                        NAME
05D0   05EA  BASIC HEBREW                             NAME
05EB   05FF  HEBREW EXTENDED-B                        NAME
0600   0652  BASIC ARABIC                             NAME
0653   06FF  ARABIC EXTENDED                          NAME
0900   097F  DEVANAGARI                               NAME
0980   09FF  BENGALI                                  NAME
0A00   0A7F  GURMUKHI                                 NAME
0A80   0AFF  GUJARATI                                 NAME
0B00   0B7F  ORIYA                                    NAME
0B80   0BFF  TAMIL                                    NAME
0C00   0C7F  TELUGU                                   NAME
0C80   0CFF  KANNADA                                  NAME
0D00   0D7F  MALAYALAM                                NAME
0E00   0E7F  THAI                                     NAME
0E80   0EFF  LAO                                      NAME
10D0   10FF  BASIC GEORGIAN                           NAME
10A0   10CF  GEORGIAN EXTENDED                        NAME
1100   11FF  HANGUL JAMO                              NAME
1E00   1EFF  LATIN EXTENDED ADDITIONAL                NAME
1F00   1FFF  GREEK EXTENDED                           NAME
2000   206F  GENERAL PUNCTUATION                      SHORTREF
2070   209F  SUPERSCRIPTS AND SUBSCRIPTS              SHORTREF
20A0   20CF  CURRENCY SYMBOLS                         SHORTREF
20D0   20FF  COMBINING DIACRITICAL MARKS FOR SYMBOLS  DATACHAR
2100   214F  LETTERLIKE SYMBOLS                       SHORTREF
2150   218F  NUMBER FORMS                             SHORTREF
2190   21FF  ARROWS                                   SHORTREF
2200   22FF  MATHEMATICAL OPERATORS                   SHORTREF
2300   23FF  MISCELLANEOUS TECHNICAL                  SHORTREF
2400   243F  CONTROL PICTURES                         SHORTREF
2440   245F  OPTICAL CHARACTER RECOGNITION            SHORTREF
2460   24FF  ENCLOSED ALPHANUMERICS                   SHORTREF
2500   257F  BOX DRAWING                              SHORTREF
2580   259F  BLOCK ELEMENTS                           SHORTREF
25A0   25FF  GEOMETRIC SHAPES                         SHORTREF
2600   26FF  MISCELLANEOUS SYMBOLS                    SHORTREF
2700   27BF  DINGBATS                                 SHORTREF
3000   303F  CJK SYMBOLS AND PUNCTUATION              SHORTREF or NAME
3040   309F  HIRAGANA                                 NAME
30A0   30FF  KATAKANA                                 NAME
3100   312F  BOPOMOFO                                 NAME
3130   318F  HANGUL COMPATIBILITY JAMO                NAME
3190   319F  CJK MISCELLANEOUS                        SHORTREF
3200   32FF  ENCLOSED CJK LETTERS AND MONTHS          SHORTREF
3300   33FF  CJK COMPATIBILITY                        SHORTREF
3400   3D2D  HANGUL                                   NAME
3D2E   44B7  HANGUL SUPPLEMENTARY-A                   NAME
44B8   4DFF  HANGUL SUPPLEMENTARY-B                   NAME
4E00   9FFF  CJK UNIFIED IDEOGRAPHS                   NAME
E000   F8FF  PRIVATE USE AREA                         SHORTREF
F900   FAFF  CJK COMPATIBILITY IDEOGRAPHS             NAME (deprecated)
FB00   FB4F  ALPHABETIC PRESENTATION FORMS            NAME (deprecated)
FB50   FDFF  ARABIC PRESENTATION FORMS-A              NAME (deprecated)
FE20   FE2F  COMBINING HALF MARKS                     NAME (deprecated)
FE30   FE4F  CJK COMPATIBILITY FORMS                  NAME (deprecated)
FE50   FE6F  SMALL FORM VARIANTS                      NAME (deprecated)
FE70   FEFE  ARABIC PRESENTATION FORMS-B              NAME (deprecated)
FF00   FFEF  HALFWIDTH AND FULLWIDTH FORMS            NAME (deprecated)
FFF0   FFFD  SPECIALS                                 SEPCHAR
-------------------------------------------------------------------------------
On Reconciling Gaiji by Short References to a Character Catalogue
November 8, 1995
Abstract: Japanese corporations frequently extend the standard character sets.
The compatibility problems that this causes can be removed for SGML documents
by treating these characters as short references to character entities.
-------------------------------------------------------------------------------
Introduction
Computers can share text data only if they agree on the characters being used.
The general method of this is for the computers to use a common character set:
national standards bodies promulgate such sets. By default, SGML uses the ISO
646 character set.
Outside ISO 646-using countries, this default is not so useful. SGML allows the
character set of a document to be explicitly declared in an SGML declaration.
If characters are needed in a document that are not found in the document's
character set, SDATA character entity references can be used. This allows the
document to use names for the characters. These names must be resolved, perhaps
by human intervention, into the forms known by each specific system, when the
character is required.
Gaiji
Japanese users sometimes need to add extra characters to the standard character
sets. This type of extra character is called a "gaiji" in Japanese and a
"user-defined character" in English. Japanese corporations frequently also
extend the standard character sets. Gaiji can prevent text sharing between
computers of different companies; the computers do not agree on characters,
except for the national standard subset. (Gaiji create superset character sets;
I am not using the term in any general sense such as "other characters needed
but not found", I mean gaiji as actual extensions to registered character
sets.)
If all the user-defined characters are declared in the SGML declaration as
short reference delimiters, and then if short reference maps are defined for
them in the DTD, then when the document is parsed, the gaiji will be
effectively removed from the document character data! An application will only
see standard characters; the gaiji have been replaced by SDATA character entity
references.
Furthermore, if the character entity references are themselves references to a
larger standard character set, in particular to ISO 10646, then the SGML system
can resolve the references automatically.
In other words, by using a character catalogue, such as ISO 10646, that can be
known by all the computers, the situation is reached again where all the
computers (i.e. the SGML applications) agree on the characters being used. Data
sharing is possible, just as if all computers were using the same character
set.
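The effect on the parsed data can be imitated outside an SGML parser with a
few lines of Python. This is only a sketch of the outcome, not of the short
reference mechanism itself: the gaiji table is a hypothetical example (a
private-use code point standing for a corporate extension character), and the
&Uxxxx; form is the ISO/IEC 10646-1 public entity naming described earlier.

```python
# Hypothetical catalogue: gaiji (here a private-use code point) -> ISO 10646
# code point of the catalogue character it stands for.
GAIJI_TO_UCS = {
    "\ue000": 0x9AD9,
}


def replace_gaiji(text, table=GAIJI_TO_UCS):
    """Replace each gaiji by a &Uxxxx; character entity reference."""
    return "".join(
        "&U%04X;" % table[ch] if ch in table else ch
        for ch in text
    )
```

After the replacement an application sees only standard characters and
references into the shared catalogue.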
Impact
The importance of this is that there is no need to deprecate the use of gaiji
in SGML document character sets: by treating them as short references to
character entity references to characters in a large character catalogue, text
sharing is possible entirely from within the current SGML model.
So rather than trying to suppress the use of gaiji, a better strategy is to
promote the use of ISO 10646 as a character catalogue on all Japanese SGML
systems.
Of course, gaiji not found in the catalogue cannot be resolved in such a
system-independent way; but they will still be SGML SDATA character entity
references.
-------------------------------------------------------------------------------
Restricted Zone "Compatibility" Characters
Zenkaku & Hankaku
December 8, 1995
-------------------------------------------------------------------------------
What are they?
In ISO 10646-1 BMP UCS-2, there are a few hundred repeated characters, put into
the "Restricted Zone" from U+FE30 to U+FFE6. Examples are the halfwidth
katakana and fullwidth Latin characters.
In Japan, the Shift-JIS and EUC-J encodings also have these repeated characters.
Korean character sets have the same issue.
-------------------------------------------------------------------------------
When are they used?
UNIX users in Japan often do not use the half-width katakana form. However, PC
and Macintosh users do. Fullwidth and halfwidth characters tend to be used in
different places. For example, an English phrase will be typed in half-width.
But a Roman letter that is part of a Japanese word (especially contractions)
may use full-width.
Software may use this "implicit markup" to key which typesetting rules and
glyphs to use. The SGML model does not recognise this kind of markup.
There is a partial method to get this second usage in SGML: if all the
compatibility zone characters were short reference delimiters. Then for
example:
HZZH
(where H is half-width "H" and Z is fullwidth "Z") can be marked up to mean
HZZH
which might be useful sometimes (in particular, to translate from EUC or Shift
JIS to plain JIS, or to a set that doesn't have the duplication).
-------------------------------------------------------------------------------
What Could ERCS Do?
There are several ways the compatibility zone characters could be handled:
1. make them illegal;
2. treat them as DATACHAR;
3. treat them as short-ref delimiters DELMCHAR (see above);
4. make them NAME and delimiter characters, with no overlap;
5. treat the LETTERS (LATIN + KATAKANA) as lower-case equivalents of the main
Upper-case characters (e.g. half-width katakana "KA" is equal to fullwidth
katakana "KA"; and fullwidth Latin "a" is equal to halfwidth Latin "A");
6. alter SGML to allow the compatibility zone characters to be equal in
significance to their 'proper' versions, including converting the composed
character sequences to their single character equivalent.
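The kind of equivalence in the last option can be tried out in Python. The
sketch below assumes that Unicode compatibility normalization (NFKC), a
mechanism standardized after this text was written, is an acceptable stand-in
for the width equivalence discussed here; it also folds other compatibility
characters, which may be more than is wanted. The function name is
hypothetical.

```python
import unicodedata


def fold_width(name):
    """Map fullwidth Latin and halfwidth katakana to their ordinary forms."""
    return unicodedata.normalize("NFKC", name)


def same_name(a, b):
    """Compare two markup names, ignoring width differences."""
    return fold_width(a) == fold_width(b)
```

With this, a name typed as halfwidth "H", fullwidth "Z", fullwidth "Z",
halfwidth "H" compares equal to plain "HZZH", while the original characters
remain available to typesetting software.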
-------------------------------------------------------------------------------
What is the best solution?
1. Good, simple, encourages explicit markup, but ignores issues 1 & 2 above;
2. Good, but characters have no significance: easiest;
3. Good, but only solves usage 2) above; also, I don't want any LETTERS or
KANA to be delimiters: it is confusing;
4. Bad: encourages the use of these characters;
5. Doesn't really solve anything: all compatibility zone characters should be
treated the same way;
6. Best, but only solves usage 1 above. Also, this requires a big change to
ISO 8879 and to parsers. However, the character differences are preserved,
so typesetting software can still make use of the "implicit markup".
The ERCS earlier drafts proposed & DOCP provisionally endorsed #6. However,
later discussion and discussion at WG8 in Broomfield, October 1995, decided
that the third option was most practical. Though, should further experience
(and requests from other national bodies) demand it, the meeting was not
antagonistic to increasing the SGML parser's notion of equivalence.
-------------------------------------------------------------------------------
SGML declaration for EUC
March 18, 1995
-------------------------------------------------------------------------------
Here is an SGML declaration for the system.
--
0xA1C4 43 0xA1C4
0xA2A1 94 0xA2A1
-- 0xA3A1 94 0x00A1 map zenkaku Roman not allowed pending thought--
0xA4A1 94 0xA4A1
0xA5A1 94 0xA5A1
0xA6A1 94 0xA6A1
0xA7A1 94 0xA7A1
0xA8A1 94 0xA8A1
0xA9A1 94 0xA9A1
0xA0A1 94 0xA9A1
0xAAA1 94 0xAAA1
0xABA1 94 0xABA1
0xACA1 94 0xACA1
0xADA1 94 0xADA1
0xAEA1 94 0xAEA1
0xAFA1 94 0xAFA1
0xB1A1 94 0xB1A1
...
0xFEA1 94 0xFEA1
-- BASESET "ISO Registration Number 13//CHARSET JIS X 0201-1986//EN"
DESCSET 161 63 0xA5A1 --
-- hankaku katakana not allowed pending resolution of problem --
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
FULLSPACE SEPCHAR 0xA1A1
NAMING LCNMSTRT -- not yet finished --
161 ... 223 -- substitute: never used --
0xA521 ... 0xA2FE
...
0xFE21 ... 0xFEFE
UCNMSTRT
161 ... 223 -- substitute: never used --
0xA521 ... 0xA2FE
...
0xFE21 ... 0xFEFE
LCNMCHAR ".-"
UCNMCHAR ".-"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
0xA1A2 ... 0xA1FE
0xA221 ... 0xA2FE
0xA8A1 ... 0xA8CF
NAMES SGMLREF
QUANTITY SGMLREF
ATTSPLEN 1920 -- ?? --
LITLEN 240 -- ?? --
NAMELEN 64 -- ?? --
PILEN 1920 -- ?? --
TAGLEN 1920 -- ?? --
-------------------------------------------------------------------------------
European Application
Nov 30, 1995
-------------------------------------------------------------------------------
ERCS has been prompted by East Asian needs in general and CJK document needs in
particular. Nevertheless, it should be applicable to European national &
Economic Community needs as well.
For a country that uses ISO 8859-2 as its national character set, for example,
native-language tagging dictates that more than ISO 8879:1986 RCS be used: so
"-//SPREAD ERCS//SYNTAX Extended (ISO 8859-2 repertoire)//EN" would be useful.
For multilingual, multiple-character-set documents, each could be created with
the ERCS of its character set's repertoire. In other words, one document of an
ISO 8859-2 nation could be created using "-//SPREAD ERCS//SYNTAX Extended (ISO
8859-2 repertoire)//EN" while another document of an ISO 8859-3 nation could be
created using "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-3 repertoire)//EN".
But both could be then incorporated into the same document (as SUBDOC documents
for example) merely by giving the full "-//SPREAD ERCS//SYNTAX Extended
(ISO/IEC 10646-1:1993 repertoire)//EN" as the document syntax (the same syntax
has to apply to subdocuments as well as documents). The Entity Manager would
have to convert the incoming documents into the appropriate composite or
unified character set of course, for example ISO/IEC 10646-1 (Unicode).
This seems to present a workable alternative to the current
"lowest-common-denominator" approach of using ISO 646 as the syntax reference
character set for RCS. While using RCS is certainly useful for many such
things, it seems strange that, for a fictitious example, in a document from
Bulgaria that might have both Cyrillic and Greek subdocuments, both would have
to be marked up in ISO 646: in other words, because not all multinational
documents involve Latin characters, it seems otiose to invoke them merely for
markup.
A better example perhaps, might be a Middle East document containing both
Hebrew and Arabic subdocuments. Using ERCS, the markup need not be in ISO 646,
but could use words in the same language as the documents: words presumably
more familiar to the users of the documents.
-------------------------------------------------------------------------------
Suggested Strategy for Implementing ERCS in Products
Dec 8, 1995
-------------------------------------------------------------------------------
* Vendors should support ellipses in the SGML declaration syntax declaration,
for NAMING and SHORTREFs. (Proposed for ISO 8879)
* Vendors should support NMCHAR and NMSTRT character classes in NAMING
declaration. (Proposed for ISO 8879)
* Vendors should support invocation of syntax declaration through public
identifiers.
* Vendors should specify (in the system declaration) whether their system
allows 16-bit markup characters, and whether there are restrictions on the
number of shortrefs or SDATA entities. Any 8-bit dependencies should be noted.
* Vendors should make sure that their products support long LITLEN.
* Include a copy of "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993
repertoire)//EN"
* Include copies of "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-1
repertoire)//EN" to "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-9
repertoire)//EN" as they are developed with distributions of the SGML
application.
* Implement the algorithmic processing of ISO/IEC 10646-1:1993 public
entity characters (i.e. those beginning with U), and build them in so they
don't need to be read in from a file. (If the proposed CHAR declarations
are not accepted into ISO 8879.)
* Consider building or bundling tools to help clean up incoming data; in
error handling of name misspelling, consider the half-width full-width
equivalence and the small and normal katakana equivalence for Japanese;
for Indic and Middle Eastern languages consider canonical ordering of
characters; for European and accented languages consider the equivalence
of combined characters and of base+accent sequences. This will help
robustness, without sacrificing validatability.
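For the last point, the equivalence of combined characters and base+accent
sequences can be checked with Unicode canonical normalization. The following
Python sketch assumes NFC (defined in later versions of Unicode than 1.1) is a
reasonable model of that equivalence; the function name is hypothetical, and
the check is meant for error recovery, not for changing what the parser
accepts.

```python
import unicodedata


def likely_same_name(typed, declared):
    """True if two names differ only in composed vs. base+accent spelling."""
    return (unicodedata.normalize("NFC", typed)
            == unicodedata.normalize("NFC", declared))
```

A tool could use such a test to suggest the declared name when an undeclared
one is encountered, leaving validation itself strict.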
-------------------------------------------------------------------------------
Subset Repertoires
November 30, 1995
-------------------------------------------------------------------------------
Where a document is to be parsed on a system with a system character set that
is a subset of ISO/IEC 10646-1, or where a document will be stored or
transmitted using such a character subset, it is prudent to restrict the
repertoire of characters that can be used for markup and CDATA data content.
Because this is likely to be the normal case, several subsets have been
defined: if a user on a Unicode 1.1 system is making SGML documents for a
Western European country, the ISO 8859-1 subset is appropriate. Or a Japanese
user creating a document in EUC, but with a target of a shift-JIS computer,
would be wise to adopt the discipline of only using JIS X 0208:1990 characters
for markup and CDATA data content (perhaps using the ISO/IEC 10646-1 Public
Character Entities to refer to other characters):
* "-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (ISO 646 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-1 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (PC 1252 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (ISO 8859-n repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 + JIS X 0212:1990
repertoire)//EN"
Note: for the sake of illustration, the formal public identifiers "-//SPREAD
ERCS//SYNTAX Extended (xxx repertoire)//EN" are being used. (The "xxx" should be
filled in later.) The public identifier and many details may change.
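The discipline of keeping markup within a target repertoire can be checked
mechanically. The sketch below is a minimal Python illustration, assuming the
target character set is available as a Python codec; Python's "shift_jis"
codec is used here as an approximation of the JIS X 0208 repertoire, and the
function name is hypothetical.

```python
def outside_repertoire(name, codec):
    """Return the characters of a markup name that the target set lacks."""
    missing = []
    for ch in name:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)  # would need an entity reference instead
    return missing
```

An empty result means the name is safe to use as markup on the target system;
any character listed would have to be avoided in markup or carried in data
content through the ISO/IEC 10646-1 Public Character Entities.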
-------------------------------------------------------------------------------
"-//SPREAD ERCS//SYNTAX Extended (ISO/IEC 10646-1:1993 repertoire)//EN"
March 18, 1995
-------------------------------------------------------------------------------
This is the syntax for use with Version 1.1 of the Unicode Character Standard
and ISO/IEC 10646-1 BMP character sets.
See Rules for Determining Syntax for information on how this has been
generated. See also Syntax Extensions needed for External Syntax Entities for
information on the non-standard syntax used below.
SHUNCHAR CONTROLS
BASESET "ISO Registration Number 176//CHARSET
ISO/IEC 10646-1:1993 UCS-2 with implementation level 3//ESC 2/5 2/15 4/5"
DESCSET 0 65536 0 -- 16 bit --
FUNCTION RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
"NO-BREAK-SPACE" SEPCHAR 160
"EN-QUAD" SEPCHAR 8192
"EM-QUAD" SEPCHAR 8193
"EN-SPACE" SEPCHAR 8194
"EM-SPACE" SEPCHAR 8195
"THREE-PER-EM-SPACE" SEPCHAR 8196
"FOUR-PER-EM-SPACE" SEPCHAR 8197
"SIX-PER-EM-SPACE" SEPCHAR 8198
"FIGURE-SPACE" SEPCHAR 8199
"PUNCTUATION-SPACE" SEPCHAR 8200
"THIN-SPACE" SEPCHAR 8201
"HAIR-SPACE" SEPCHAR 8202
"ZERO-WIDTH-SPACE" SEPCHAR 8203
"IDEOGRAPHIC-SPACE" SEPCHAR 12288
"ZERO-WIDTH-NO-BREAK-SPACE" SEPCHAR 65279
NAMING
LCNMSTRT
12353 -- hiragana small vowels A, I, U, E, O --
12355
12357
12359
12361
12387 -- hiragana small TU, YA, YU, YO, WA --
12419
12421
12423
12430
12449 -- katakana small vowels --
12451
12453
12455
12457
12483 -- katakana small TU, YA, YU, YO, WA --
12515
12517
12519
12526
224 ... 246
248 ... 255
257
259
261
263
265
267
269
271
273
275
277
279
281
283
285
287
289
291
293
295
297
299
301
303
307
309
311
314
316
318
320
322
324
326
328
331
333
335
337
339
341
343
345
347
349
351
353
355
357
359
361
363
365
367
369
371
373
375
378
380
382
387
389
392
396
402
409
417
419
421
424
429
432
436
438
441
445
453
454
456
457
459
460
462
464
466
468
470
472
474
476
479
481
483
485
487
489
491
493
495
498
499
501
507
509
511
513
515
517
519
521
523
525
527
529
531
533
535
595
596
599 ... 601
603
608
611
616
617
623
626
643
648
650
651
658
940 ... 943
945 ... 961
963 ... 974
976
977
981
982
995
997
999
1001
1003
1005
1007 ... 1009
1072 ... 1103
1105 ... 1116
1118
1119
1121
1123
1125
1127
1129
1131
1133
1135
1137
1139
1141
1143
1145
1147
1149
1151
1153
1169
1171
1173
1175
1177
1179
1181
1183
1185
1187
1189
1191
1193
1195
1197
1199
1201
1203
1205
1207
1209
1211
1213
1215
1218
1220
1224
1228
1233
1235
1237
1239
1241
1243
1245
1247
1249
1251
1253
1255
1257
1259
1263
1265
1267
1269
1273
1377 ... 1414
7681
7683
7685
7687
7689
7691
7693
7695
7697
7699
7701
7703
7705
7707
7709
7711
7713
7715
7717
7719
7721
7723
7725
7727
7729
7731
7733
7735
7737
7739
7741
7743
7745
7747
7749
7751
7753
7755
7757
7759
7761
7763
7765
7767
7769
7771
7773
7775
7777
7779
7781
7783
7785
7787
7789
7791
7793
7795
7797
7799
7801
7803
7805
7807
7809
7811
7813
7815
7817
7819
7821
7823
7825
7827
7829
7841
7843
7845
7847
7849
7851
7853
7855
7857
7859
7861
7863
7865
7867
7869
7871
7873
7875
7877
7879
7881
7883
7885
7887
7889
7891
7893
7895
7897
7899
7901
7903
7905
7907
7909
7911
7913
7915
7917
7919
7921
7923
7925
7927
7929
7936 ... 7943
7952 ... 7957
7968 ... 7975
7984 ... 7991
8000 ... 8005
8017
8019
8021
8023
8032 ... 8039
8048 ... 8061
8064 ... 8071
8080 ... 8087
8096 ... 8103
8112
8113
8115
8131
8144
8145
8160
8161
8165
8179
8560 ... 8575
9424 ... 9449
UCNMSTRT
12354 -- hiragana normal vowels AIUEO --
12356
12358
12360
12362
12388 -- hiragana normal TU, YA, YU, YO, WA --
12420
12422
12424
12431
12450 -- katakana normal vowels --
12452
12454
12456
12458
12484 -- katakana normal TU, YA, YU, YO, WA --
12516
12518
12520
12527
192 ... 214
216 ... 222
376
256
258
260
262
264
266
268
270
272
274
276
278
280
282
284
286
288
290
292
294
296
298
300
302
306
308
310
313
315
317
319
321
323
325
327
330
332
334
336
338
340
342
344
346
348
350
352
354
356
358
360
362
364
366
368
370
372
374
377
379
381
386
388
391
395
401
408
416
418
420
423
428
431
435
437
440
444
452
452
455
455
458
458
461
463
465
467
469
471
473
475
478
480
482
484
486
488
490
492
494
497
497
500
506
508
510
512
514
516
518
520
522
524
526
528
530
532
534
385
390
394
398 ... 400
403
404
407
406
412
413
425
430
433
434
439
902
904 ... 906
913 ... 929
931 ... 939
908
910
911
914
920
934
928
994
996
998
1000
1002
1004
1006
922
929
1040 ... 1071
1025 ... 1036
1038
1039
1120
1122
1124
1126
1128
1130
1132
1134
1136
1138
1140
1142
1144
1146
1148
1150
1152
1168
1170
1172
1174
1176
1178
1180
1182
1184
1186
1188
1190
1192
1194
1196
1198
1200
1202
1204
1206
1208
1210
1212
1214
1217
1219
1223
1227
1232
1234
1236
1238
1240
1242
1244
1246
1248
1250
1252
1254
1256
1258
1262
1264
1266
1268
1272
1329 ... 1366
7680
7682
7684
7686
7688
7690
7692
7694
7696
7698
7700
7702
7704
7706
7708
7710
7712
7714
7716
7718
7720
7722
7724
7726
7728
7730
7732
7734
7736
7738
7740
7742
7744
7746
7748
7750
7752
7754
7756
7758
7760
7762
7764
7766
7768
7770
7772
7774
7776
7778
7780
7782
7784
7786
7788
7790
7792
7794
7796
7798
7800
7802
7804
7806
7808
7810
7812
7814
7816
7818
7820
7822
7824
7826
7828
7840
7842
7844
7846
7848
7850
7852
7854
7856
7858
7860
7862
7864
7866
7868
7870
7872
7874
7876
7878
7880
7882
7884
7886
7888
7890
7892
7894
7896
7898
7900
7902
7904
7906
7908
7910
7912
7914
7916
7918
7920
7922
7924
7926
7928
7944 ... 7951
7960 ... 7965
7976 ... 7983
7992 ... 7999
8008 ... 8013
8025
8027
8029
8031
8040 ... 8047
8122
8123
8136 ... 8139
8154
8155
8184
8185
8170
8171
8186
8187
8072 ... 8079
8088 ... 8095
8104 ... 8111
8120
8121
8124
8140
8152
8153
8168
8169
8172
8188
8544 ... 8559
9398 ... 9423
223
304
305
312
329
383
384
393
397
405
410
411
414
415
422
426
427
442
443
446 ... 451
477
496
592 ... 594
597
598
602
604 ... 607
609
610
612 ... 615
618 ... 622
624
625
627 ... 642
644 ... 647
649
652 ... 657
659 ... 680
912
944
962
978 ... 980
986
988
990
992
1010
1011
1216
1415
1488 ... 1514
1520 ... 1522
1569 ... 1594
1600 ... 1610
1649 ... 1719
1722 ... 1726
1728 ... 1742
1744 ... 1747
1749
2309 ... 2361
2392 ... 2401
2437 ... 2444
2447
2448
2451 ... 2472
2474 ... 2480
2482
2486 ... 2489
2524
2525
2527 ... 2529
2544
2545
2565 ... 2570
2575
2576
2579 ... 2600
2602 ... 2608
2610
2611
2613
2614
2616
2617
2649 ... 2652
2654
2693 ... 2699
2701
2703 ... 2705
2707 ... 2728
2730 ... 2736
2738
2739
2741 ... 2745
2784
2821 ... 2828
2831
2832
2835 ... 2856
2858 ... 2864
2866
2867
2870 ... 2873
2908
2909
2911 ... 2913
2949 ... 2954
2958 ... 2960
2962 ... 2965
2969
2970
2972
2974
2975
2979
2980
2984 ... 2986
2990 ... 2997
2999 ... 3001
3056 ... 3058
3077 ... 3084
3086 ... 3088
3090 ... 3112
3114 ... 3123
3125 ... 3129
3168
3169
3205 ... 3212
3214 ... 3216
3218 ... 3240
3242 ... 3251
3253 ... 3257
3294
3296
3297
3333 ... 3340
3342 ... 3344
3346 ... 3368
3370 ... 3385
3424
3425
3585 ... 3632
3634
3635
3648 ... 3654
3663
3674
3675
3713
3714
3716
3719
3720
3722
3725
3732 ... 3735
3737 ... 3743
3745 ... 3747
3749
3751
3754
3755
3757
3758
3760
3762
3763
3773
3776 ... 3780
4256 ... 4293
4304 ... 4342
4352 ... 4441
4447 ... 4514
4520 ... 4601
7830 ... 7834
8016
8018
8020
8022
8114
8116
8118
8119
8130
8132
8134
8135
8146
8147
8150
8151
8162 ... 8164
8166
8167
8178
8180
8182
8183
8204 ... 8207
8234 ... 8238
8298 ... 8303
12295
12321 ... 12329 -- hangzhou numerals --
12363 ... 12386 -- hiragana (smalls & equivs earlier) --
12389 ... 12418
12425 ... 12429
12431 ... 12538
12459 ... 12482 -- katakana (not smalls & UC equivs) --
12484 ... 12514
12521 ... 12525
12528 ... 12538
12549 ... 12588 -- bopomofo --
12593 ... 12686 -- hangul elements --
13312 ... 40869 -- Han ideographs --
63744 ... 64046 -- Han ideographs compatibility --
LCNMCHAR
45
46
UCNMCHAR
45
46
168
175
180
184
688 ... 734
736 ... 745
768 ... 837
864
865
890
900
901
1155 ... 1158
1369
1456 ... 1465
1467 ... 1469
1471
1473
1474
1611 ... 1618
1632 ... 1641
1648
1750 ... 1768
1770 ... 1773
1776 ... 1785
2305 ... 2307
2364
2366 ... 2381
2385 ... 2388
2402
2403
2406 ... 2415
2433 ... 2435
2492
2494 ... 2500
2503
2504
2507 ... 2509
2519
2530
2531
2534 ... 2543
2562
2620
2622 ... 2626
2631
2632
2635 ... 2637
2662 ... 2673
2689 ... 2691
2748
2750 ... 2757
2759 ... 2761
2763 ... 2765
2790 ... 2799
2817 ... 2819
2876
2878 ... 2883
2887
2888
2891 ... 2893
2902
2903
2918 ... 2927
2946
2947
3006 ... 3010
3014 ... 3016
3018 ... 3021
3031
3047 ... 3055
3073 ... 3075
3134 ... 3140
3142 ... 3144
3146 ... 3149
3157
3158
3174 ... 3183
3202
3203
3262 ... 3268
3270 ... 3272
3274 ... 3277
3285
3286
3302 ... 3311
3330
3331
3390 ... 3395
3398 ... 3400
3402 ... 3405
3415
3430 ... 3439
3633
3636 ... 3642
3655 ... 3662
3664 ... 3673
3761
3764 ... 3769
3771
3772
3784 ... 3789
3792 ... 3801
8125 ... 8129
8141 ... 8143
8157 ... 8159
8173 ... 8175
8189
8190
8400 ... 8417
9332 ... 9340
9352 ... 9360
9450
12330 ... 12335
12441 ... 12446 -- hiragana marks --
12540 ... 12542 -- katakana marks --
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
" " ... "§"
"©" ... "®"
"°" ... "³"
"µ" ... "·"
"¹" ... "¿"
"×"
"÷"
"ʹ"
"͵"
";"
"·"
"҂"
"՚" ... "՟"
"։"
"־"
"׀"
"׃"
"׳"
"״"
"،"
"؛"
"؟"
"٪" ... "٭"
"۔"
"۩"
"ऽ"
"ॐ"
"।"
"॥"
"॰"
"৲" ... "৺"
"ੲ" ... "ੴ"
"ઽ"
"ૐ"
"ଽ"
"୰"
"฿"
"ຯ"
"ໆ"
"ໜ"
"ໝ"
"჻"
" " ... ""
"‰" ... "⁆"
"" ... "⁰"
"⁴" ... "₎"
"₠" ... "₪"
"℀" ... "ℸ"
"⅓" ... "ↂ"
"←" ... "⇪"
"∀" ... "⋱"
"⌀"
"⌂" ... "⍺"
"␀" ... ""
"⑀" ... "⑊"
"①" ... "⑳"
"⑽" ... "⒇"
"⒑" ... "ⓩ"
"─" ... "▕"
"■" ... "◯"
"☀" ... "☓"
"☚" ... "♯"
"✁" ... "✄"
"✆" ... "✉"
"✌" ... "✧"
"✩" ... "❋"
"❍"
"❏" ... "❒"
"❖"
"❘" ... "❞"
"❡" ... "❧"
"❶" ... "➔"
"➘" ... "➯"
"➱" ... "➾"
" " ... "〆"
"〈" ... "〠"
"〰" ... "〷"
"〿"
"・"
"㆐" ... "㆟"
"㈀" ... "㈜"
"㈠" ... "㉃"
"㉠" ... "㉻"
"㉿" ... "㊰"
"㋀" ... "㋋"
"㋐" ... "㋾"
"㌀" ... "㍶"
"㍻" ... "㏝"
"㏠" ... "㏾"
"﬩"
"﴾"
"﴿"
"︰" ... "﹄"
"﹉" ... "﹒"
"﹔" ... "﹦"
"﹨" ... "﹫"
""
"!" ... "/"
":" ... "@"
"[" ... "`"
"{" ... "~"
"。" ... "・"
"ー"
"゙"
"゚"
"¢" ... "₩"
"│" ... "○"
"�"
NAMES SGMLREF
QUANTITY SGMLREF -- To be determined --
ATTSPLEN 1920 -- ?? --
LITLEN 240 -- ?? --
NAMELEN 240 -- ?? --
PILEN 1920 -- ?? --
TAGLEN 1920 -- ?? --
-------------------------------------------------------------------------------
Status of ERCS
November 30, 1995
-------------------------------------------------------------------------------
The ERCS was endorsed by the China/Japan/Korea Document Processing ad hoc
committee (liaison to ISO WG8). I am using SPREAD ERCS in the current
public identifier.
The ERCS is being developed for future SGML Open consideration, and has
not been adopted, recommended or authorized by SGML Open. Members of SGML
Open have been involved, and SGML Open kindly hosted earlier drafts of ERCS
on its website.
-------------------------------------------------------------------------------
TITLE : Liaison to Mr. R. Jelliffe from CJK DOCP
SOURCE : CJK DOCP
DATE : March 3, 1995
With regard to +//SGML Open:1995//SYNTAX Extended//EN, East Asian Document
Issues: A proposal for an extended reference concrete syntax by Rick Jelliffe,
January 7, 1995.
It is the consensus of the participants at the 7th CJK DOCP meeting (Kanazawa
Japan, 2-3 March 1995) that Mr. Jelliffe's proposal appears to solve many of
the problems with using ISO/IEC 10646 with SGML for East Asian document
processing.
The CJK DOCP working project is interested in working on the development of
this proposal with the goal of its eventual incorporation into a revision of
ISO 8879 standard.
-------------------------------------------------------------------------------
In my capacity as Australian delegate to ISO WG8 I presented the
suggestions for changes to ISO 8879 syntax and system declaration to the
Broomfield, USA meeting of October 1995. The proposals were accepted in
principle, and had the support of Japan in particular.
These are available in RTF files as notes WG8 N 1815 and N 1816.
-------------------------------------------------------------------------------
CJK Issues
November 30, 1995
-------------------------------------------------------------------------------
Han Ideographic Characters
All Han ideographic characters are NAME characters.
-------------------------------------------------------------------------------
Spaces
All spaces are SEPCHARs.
-------------------------------------------------------------------------------
Gaiji and User-defined Characters
All user-defined characters should be SHORTREFs.
-------------------------------------------------------------------------------
Numbers
All native numbers and digits except for the DIGITs (0..9) should be NMSTRT
characters. (This does not apply to superscript, subscript, numbers inside
circles, and other characters that don't represent simple numbers: these are
just SHORTREFs.)
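
The rules above can be sketched as a classifier. This is only an illustration in Python: it uses the general categories of Python's unicodedata module (a later Unicode version than 1.1) as a stand-in for the ERCS rules, and takes the Han ranges from the declaration above. The function name ercs_class and the "UNCLASSIFIED" fallback are hypothetical.

```python
import unicodedata

# Han ideograph ranges as listed in the declaration above (decimal code positions).
HAN_RANGES = [(13312, 40869), (63744, 64046)]


def ercs_class(ch):
    """Return a rough ERCS class for one character, following the CJK rules."""
    cp = ord(ch)
    if any(low <= cp <= high for low, high in HAN_RANGES):
        return "NAME"          # all Han ideographic characters are NAME characters
    cat = unicodedata.category(ch)
    if cat == "Zs":
        return "SEPCHAR"       # all spaces are SEPCHARs
    if cat == "Co":
        return "SHORTREF"      # user-defined (gaiji) characters
    if cat == "Nd":
        # The DIGITs 0..9 stay DIGIT; other native digits become NMSTRT.
        return "DIGIT" if ch in "0123456789" else "NMSTRT"
    if cat == "No":
        return "SHORTREF"      # superscripts, circled numbers and similar
    return "UNCLASSIFIED"      # assumption: everything else is left to other rules
```

For example, ercs_class("\u3000") (IDEOGRAPHIC SPACE) gives "SEPCHAR", and ercs_class("\u0e51") (a Thai digit) gives "NMSTRT".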
-------------------------------------------------------------------------------
Repeat Characters
The KANA REPEAT character is a NMCHAR character, but is best avoided.
-------------------------------------------------------------------------------
Half-width and Full-width characters (Zenkaku/Hankaku) and Character
Substitution
For a detailed discussion on compatibility zone characters, see Compatibility
Zone Characters. The CJK DOCP group recommends that the half-width and
full-width characters should not be treated as equivalent for purposes of
markup. However, use of half-width katakana and full-width roman letters and
numbers in markup is deprecated.
The ISO 8879:1986 model could be extended, while maintaining backwards
compatibility, by generalizing the case-mapping mechanism into a
string->character mapping mechanism. This would resolve the
half-width/full-width problem properly, and also meet similar needs in several
other languages.
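
A string->character mapping of this kind might look like the following Python sketch. It is not a proposed syntax: the table SAMPLE_MAP and the function fold_name are hypothetical, and the longest-match-first rule is an assumption about how overlapping entries would be resolved. Ordinary case folding appears as the special case where the source string is one character long.

```python
# Hypothetical sample table; a real one would come from the SGML declaration.
SAMPLE_MAP = {
    "a": "A",                  # ordinary case folding
    "\uff21": "A",             # FULLWIDTH LATIN CAPITAL LETTER A -> A
    "\uff76": "\u30ab",        # HALFWIDTH KATAKANA KA -> KATAKANA KA
    "\uff76\uff9e": "\u30ac",  # halfwidth KA + voiced sound mark -> KATAKANA GA
}


def fold_name(name, mapping=SAMPLE_MAP):
    """Replace each longest matching source string with its target character."""
    keys = sorted(mapping, key=len, reverse=True)  # try longer strings first
    out = []
    i = 0
    while i < len(name):
        for key in keys:
            if name.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(name[i])  # no rule applies: keep the character as-is
            i += 1
    return "".join(out)
```

With this table, half-width KA followed by the half-width voiced sound mark folds to the single full-width character GA, which a one-to-one case mapping cannot express.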
A related problem that some languages have in some character sets (Thai,
Hebrew, Arabic, and perhaps Vietnamese) is that the same glyph-character can be
spelled in several different ways. This problem is particularly present in
ISO/IEC 10646-1:1993 (Unicode 1.1). This may be regarded as a normalisation
problem for the entity manager (or, for example, FIND rules in an OmniMark
context-translation) to handle.
Vendors of tools would do well to add some proprietary spelling correctors for
this specific problem.
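
As an illustration of such normalisation, the sketch below compares two names after composing them. It relies on Python's unicodedata.normalize with the NFC form, which is a later Unicode development and not part of Unicode 1.1 or of this proposal; it stands in here for whatever normalisation an entity manager would apply. The function name same_name is hypothetical.

```python
import unicodedata


def same_name(a, b):
    """Compare two names after normalising both to composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Vietnamese example: "e" + combining circumflex + combining acute,
# versus the single precomposed character at code position 7871.
spelled_out = "e\u0302\u0301"
precomposed = "\u1ebf"
equivalent = same_name(spelled_out, precomposed)
```

Here the two spellings compare equal after normalisation, although they differ code position by code position.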
-------------------------------------------------------------------------------
Japanese Examples
* Example of a simple SGML declaration
* Example of an SGML declaration for EUC
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 repertoire)//EN"
* "-//SPREAD ERCS//SYNTAX Extended (JIS X 0208:1990 + JIS X 0212:1990
repertoire)//EN"
-------------------------------------------------------------------------------
Change Log
and Significant Queries
Dec 12, 1995
-------------------------------------------------------------------------------
Dec 12, 1995
(Tony Graham) Combining Diacritical Marks for Symbols can't be SHORTREF, if we
want to keep only single character SHORTREFs. Changed.
Nov 30, 1995
(Rick Jelliffe) simplify & homologate. Use "SPREAD ERCS" instead of "SGML
Open:TR95xx" for convenience.
Sept 4, 1995
(Rick Jelliffe) user-defined are SHORTREF. Page on Gaiji added.
(Prof Eiji Matsuoka, SGML Asia/Pacific '95) Use Heiwa Kanten etc as character
entity catalogue for characters that can't be found in ISO 10646 (+locale).
(following WG8 Broomfield) Page on ERCS system identifiers removed: superseded
by new HyTime FSI, which is essentially the same. External numeric character
references & CHAR declaration mooted. Add notes presented at WG8.
(CJK DOCP Alaska meeting) don't fold half- & fullwidth, and small and normal
katakana. NMCHAR and NMSTRT classes endorsed. No need for NUMBER to include
native digits.
June 15, 1995
(Rick Jelliffe) entity invoked as &Uxxxx; rather than &U-xxxx; to fit in with
ISO WG9 CD 14755 and TERENA. Page on European application added.
March 30, 1995
(Rick Jelliffe) Ignore restricted zone compatibility characters while
justifications are gathered (user's SGML decl decides if DATACHAR or NSGML).
Entity subsets. Fix rules. Add more info on restricted zone. Dump PIs for
SYSTEM identifier prefix.
(Glenn Adams) Standard names, terminology & phrasing improvements. Extended
digits?
(James Clark) Standard names, terminology & phrasing improvements. Extended
digits? Attributes can be replaceable character data. Refer to new Unicode
character classes for SGML character classes. Suggestions for syntax changes
(provides custom version of SP for testing and demonstrating these): SGML
production 189 extension in Handbook regarded as essential minimum. Use SYSTEM
identifier instead of PI? Simplify spaces. Compatibility zone handling method
requires too much change to parsers? Announcement of details of ERCS-support
features for SP.
(Gavin Nicol) Character equivalence discussion. Applicability to WWW.
March 8, 1995
First public request for comment.
To Do:
Complete shift-JIS & EUC declarations
-------------------------------------------------------------------------------
(C) 1995 Rick Jelliffe. May be freely copied and translated.
Email comments to ricko@allette.com.au
Recent development version at http://www.allette.com.au/sgml/ercs/ercs-home.html