Subject: RE: [office-formula] Summary 2010-09-07 - IRI vs URI
Andreas,

Thanks for pointing out that language in [RFC3987]. Here's my extended analysis of what is involved with that.

I believe the statement is patently incorrect if read to mean that IRIs and URIs are co-extensive and there is nothing that needs to be done. Were it literally true, the IRI specification would be very short and would not have to pay attention to mappings and when and how they apply. Nor would it require a separate grammar for IRIs.

Whatever the case, I believe it is necessary to say exactly how we expect IRIs to be mapped to URIs, what the admissible IRIs are in the case of relative references to package files and subdocuments, and what the corresponding manifest:full-path and Zip directory file names are.

GRANDFATHERED URIS VS. UNICODE IRIS

Here is the context in section 3.1, revealingly entitled "Mapping of IRIs to URIs":

"The above mapping from IRIs to URIs produces URIs fully conforming to [RFC3986]. The mapping is also an identity transformation for URIs and is idempotent; applying the mapping a second time will not change anything. Every URI is by definition an IRI."

There is no question the mapping is idempotent as defined in [RFC3987], because the resulting URI has no disallowed ASCII-character encodings and so running the mapping again changes nothing. That is to say, the mapping is an identity transformation for IRIs that are already well-formed URIs.

It is in that sense that I say IRIs are subsets of URIs or, put better, the image of admissible IRIs that are not already syntactically well-formed URIs is a subset of the URIs.

Also, there is the usual problem with mappings of this nature: there is no assured inversion from an IRI-mapped URI back to an IRI that is not the URI.
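To make the idempotence point concrete, here is a minimal Python sketch of the character-level part of the section 3.1 mapping. It is an illustration under simplifying assumptions, not the full procedure in [RFC3987]: it only %-encodes non-ASCII characters as their UTF-8 bytes, it skips any normalization or validation, and the function name iri_to_uri is hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 bytes.

    Simplified sketch: ASCII characters (including existing
    %-encodings) are passed through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # Already in the URI character repertoire; leave it alone.
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 byte of the character.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


once = iri_to_uri("Pictures/caf\u00e9.png")
twice = iri_to_uri(once)
print(once)
print(once == twice)  # the second application changes nothing
```

Because the output contains only ASCII, a second application has nothing left to encode, which is all that the idempotence claim amounts to.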
This is also the sense in which I believe the statement "Every URI is by definition an IRI" is at best misleading and at worst simply incorrect, since there are well-formed URIs that can never be produced from IRIs that are entirely in Unicode and that only %-encode Basic Latin characters that are not permitted as single-character <pchar>s.

It is further misleading if taken to mean that a syntactical IRI can be used where URIs are required. There are many places where well-formed URIs are required (e.g., the XML Schema anyURI datatype and elsewhere).

WHY FUSS ABOUT THIS?

To ensure that the mapping can be accurately inverted, it is necessary to restrict which %-encoded bytes are allowed in URIs and which are to be employed in reconstructing a non-URI IRI, if any, that is presumably the inverse mapping of the URI in hand. This is a strong constraint, because it signifies to me that only Unicode is carried by IRIs (whereas URIs do not necessarily have that limitation on how %-encoded bytes with values greater than %7f are to be understood). The considerations for assuring an inverse mapping are reflected in section 3.2 and perhaps elsewhere in [RFC3987].

I assume that, to satisfy the requirement that ODF support IRIs at all, one needs to ensure that an IRI using characters not allowed in URIs is recoverable, so as to satisfy whatever use cases are in mind for those who require that IRIs be supported in the naming of package files and in URI references generally.

Since the requirement came from JTC1 National Body Japan, I presume that it is desirable to see the IRIs with the actual CJK characters whose Unicode code points are IRI-encoded in the naming of package files and in the introduction of URIs in various ways in ODF documents.
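The inversion problem can be sketched in Python as well. The following is a rough approximation of the kind of conversion section 3.2 of [RFC3987] is concerned with, under assumptions of my own: the name uri_to_iri is hypothetical, a run of %-encoded bytes is converted only when the whole run is valid UTF-8, %-encoded ASCII stays encoded, and the additional character exclusions in the RFC are not modelled.

```python
import re

# One or more consecutive %XX triplets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Turn %-encoded UTF-8 sequences back into Unicode characters.

    Runs of bytes that are not valid UTF-8 are left exactly as found,
    because no Unicode character can be recovered from them.
    """
    def decode_run(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not Unicode-carrying; keep the URI form
        # Keep ASCII %-encoded; emit only non-ASCII characters literally.
        return "".join(c if ord(c) >= 0x80 else f"%{ord(c):02X}" for c in text)

    return _PCT_RUN.sub(decode_run, uri)


print(uri_to_iri("caf%C3%A9"))  # bytes form valid UTF-8, so a character is recovered
print(uri_to_iri("caf%E9"))     # a lone 0xE9 byte is not valid UTF-8; left unchanged
```

The second case is a well-formed URI that no Unicode-only IRI maps to under the UTF-8 rule, which is the kind of URI the paragraphs above have in mind.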
How packages could be extracted into file systems allowing CJK character encodings in file and directory names is, of course, outside of our control beyond providing interested parties a consistent way of interpreting the Zip file names that ODF restricts itself to.

It is on behalf of that requirement that I believe it is important to ascertain, within ODF, when an IRI must be mapped to a URI and the form of URI to be used. This matters, in particular, for any relative references that involve segments that are part of ODF Package manifest:full-path values and that need to match what is used in the Zip directory entry for a package file and/or the ODF notion of an identified package subdocument.

My recommendation is that the manifest:full-path always be fully IRI encoded (even though it is neither IRI nor URI) and likewise for the corresponding Zip directory entry, when there is one. Furthermore, the only %-encodings should be what is required for this purpose and the only unencoded characters should be a limited subset of what is used in URIs.

My recommendation is to allow only non-empty segment names having only <pchar>s without ":", perhaps without "@", and with "/" as the segment separator. There should be no "." and ".." segments, since these only exist in URI references and are inappropriate in Zip directory file names.

- Dennis

-----Original Message-----
From: Andreas J. Guelzow [mailto:andreas.guelzow@concordia.ab.ca]
Sent: Tuesday, September 07, 2010 17:07
To: dennis.hamilton@acm.org
Cc: dwheeler@dwheeler.com; office-formula@lists.oasis-open.org
Subject: RE: [office-formula] Summary 2010-09-07 - IRI vs URI

[ ... ]

Of course more importantly: http://www.ietf.org/rfc/rfc3987.txt states in the third last paragraph on page 10: "Every URI is by definition an IRI."

And I think we should consider RFC3987 as authoritative on this matter.
Andreas