Subject: RE: [ubl-dev] Two new documents re: UBL Methodology for Code List and Value Validation
At 2007-04-24 07:00 -0700, David RR Webber (XML) wrote:
>Good work - this is always a task to ensure everything is consistent.
>
>I noticed this little innocuous line ;-)
>
> - use of UTF-8 encoding for all XML artefacts
>
>Which way did you go on country names? Some had accented characters
>- are those now just plain characters?

I didn't change the repertoire of characters, David, only the encoding.

>Just curious....

This came to me the last time I taught my XML syntax class just recently.

An XML document without an XML declaration with encoding is interpreted by a receiving system in whatever default encoding is triggered by a higher-level protocol (if present). It is a surprise to students to hear that UTF-8 is *not* necessarily the default encoding when an XML document does not have an XML declaration, though the default happens to be UTF-8 in most cases because higher-level protocols are not in play:

http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding

So the default encoding can be at the whim of any higher-level protocols that might be engaged to transmit the document (I'm thinking here perhaps of a Shift-JIS assumption in a Japanese transmission). So there is an infinitesimal but not impossible risk of mismatch if I published committee artefacts without an XML declaration.

An XML document with an XML declaration with encoding declares that the document is in the specific encoding mentioned ... so having this removes any risk in that regard.

But conformant XML processors are only required to support UTF-8 and UTF-16. Some of the encodings I was using for convenience and for manual data entry were US-ASCII and ISO-8859-1. While I'm sure most XML processors would support these, as an international standards artefact I thought it best to make no assumptions about the XML processors that might be working with the documents. Again, an infinitesimal but not impossible risk of a user not being able to function with the artefacts.
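
The effect of the declaration can be seen in a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The element name `Name`, the helper `serialize` and the sample country name are illustrative only and are not taken from the UBL artefacts. The same characters are written in two encodings, each with its own declaration, and then read back:

```python
import xml.etree.ElementTree as ET

# Sample value with an accented letter (illustrative, not from the code lists).
COUNTRY = "Côte d'Ivoire"


def serialize(text: str, encoding: str) -> bytes:
    """Return a tiny XML document whose declaration names its own encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    return (declaration + f"<Name>{text}</Name>").encode(encoding)


latin1_doc = serialize(COUNTRY, "ISO-8859-1")
utf8_doc = serialize(COUNTRY, "UTF-8")

# The bytes differ between the two encodings ...
bytes_differ = latin1_doc != utf8_doc

# ... but a parser that honours each declaration recovers the same characters.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

# A receiver that assumes the wrong encoding sees garbled text instead.
misread = utf8_doc.decode("ISO-8859-1")

print(bytes_differ, latin1_text == utf8_text == COUNTRY, COUNTRY in misread)
```

This parser happens to accept ISO-8859-1, but that is a property of the parser and not something the XML Recommendation guarantees, which is the risk described above.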
So, summed together, that indicates that committee-published artefacts:

 (1) - should have an XML declaration for encoding; and
 (2) - should declare the use of UTF-8 or UTF-16.

Coming to that conclusion on my own reminded me (Doh!) that there are two "additional document constraints" in UBL that say much the same thing, and I didn't have to go figuring this all out on my own:

http://docs.oasis-open.org/ubl/os-UBL-2.0/UBL-2.0.html#d0e3610

  [IND2] All UBL instance documents MUST identify their character encoding within the XML declaration.

  [IND3] In conformance with ISO IEC ITU UN/CEFACT eBusiness Memorandum of Understanding Management Group (MOUMG) Resolution 01/08 (MOU/MG01n83) as agreed to by OASIS, all UBL XML SHOULD be expressed using UTF-8.

So ... an XML document with accented letters in country names, expressed in UTF-8 encoding, still has accented letters in country names, because the UTF-8 encoding encodes the entire Unicode repertoire and those accented letters are in the repertoire.

Thankfully, encoding is orthogonal to repertoire, and the encoding decision does not impose any restrictions on the characters needed in an XML document.

I'll continue to use US-ASCII and ISO-8859-1 for my internal work (my editing software doesn't support UTF-8), but the more documents I produce for external consumption, the more careful I'm trying to be about explicitly having an XML declaration for UTF-8 encoding. I can write in my own encoding and use an XSLT identity transform to convert any of my files into UTF-8 for publishing as a committee document.

I hope this helps explain my rationale. I apologize if it sounds pedantic, but I felt it necessary to spell out the reasoning for the archive so that readers understand the decision was not made lightly.

. . . . . . . . . . . . . . . . . Ken

--
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/u/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Aug'05  http://www.CraneSoftwrights.com/u/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal