Subject: [Fwd: [xml-dev] String interning (Was: [xml-dev] Binary XML == "spawn of the devil" ?)]
Please see XML-DEV plugs for Core Components and CAM below. Joe
- From: "Chiusano Joseph" <chiusano_joseph@bah.com>
- To: "Hunsberger Peter" <Peter.Hunsberger@stjude.org>
- Date: Mon, 04 Aug 2003 14:43:51 -0400
<Quote>
Essentially we have a large project that reuses (and reuses) various XML fragments from many different sources in many different combinations many times (controlled by some small number of parameters). Think of this as a cache of parsed XML that can subsequently be consumed by XSLT transforms. On occasion some portions of this cache will be invalidated and replaced.
</Quote>

This sounds very much like a combination of several emerging standards, the first being UN/CEFACT Core Components [1]. If you're interested in learning more, please also reference a recent presentation I gave to a federal working group on "Core Components and ebXML Registry" [2], which also discusses the incorporation of Core Components into the ebXML Registry architecture (a process which I am heading up - see slides 43/44).

The second standard is the OASIS Content Assembly Mechanism (CAM) specification [3], described on slide 41 of the presentation.

Kind Regards,
Joe Chiusano
Booz | Allen | Hamilton

[1] http://xml.coverpages.org/CCTS-V1pt85-20020930.pdf
[2] http://xml.gov/presentations/bah/ebXMLcore.ppt
[3] http://www.oasis-open.org/committees/cam

"Hunsberger, Peter" wrote:
>
> On Thursday, July 31, 2003 8:58 PM Tyler Close <tyler@waterken.com> wrote:
> >
> > On Thursday 31 July 2003 19:36, Mike Champion wrote:
> > > On Thu, 31 Jul 2003 17:46:32 -0400, Tyler Close <tyler@waterken.com> wrote:
> > > > For an example of a binary format that supports efficient string
> > > > interning, without a penalty to generality, see:
> > > >
> > > > http://www.waterken.com/dev/Doc/code/
> > >
> > > Very interesting point/idea.
> >
> > Thanks.
> >
> > > AFAIK much of the overhead of XML text parsing that the binary
> > > infoset advocates complain about is in the Unicode encoding/decoding
> > > and raw string processing (e.g, looking at every character to see
> > > where an element ends rather than having a stored length).
> >
> > The Waterken(TM) Doc code format uses a chunked representation for
> > encoding a string. This provides the speed benefits of a length
> > prefix without creating an unlimited buffering requirement.
> >
> > > Likewise, a number of alternative infoset serializations use the
> > > "stream of SAX events" metaphor, that sounds a bit like what that
> > > document describes.
> >
> > Same basic idea.
> >
> > > But that doesn't sound like "string interning" to me (and
> > > "interning" is not mentioned in that document).
> >
> > Notice that all the meta data (ie: the string identifiers) are
> > stored in a set of string registers. Subsequent uses of a string
> > specify the index of the string to use. This results in each string
> > identifier being instantiated just once. The singleton instance is
> > the interned instance.
> >
> > > I thought "interning" was more of a technique for keeping compiled
> > > code small by referencing redundant strings via their hash values.
> >
> > It's more to do with fast lookup than memory savings. The hash only
> > gets computed once and equality checks are just pointer comparisons.
> > Same thinking is at work in the Doc code format.
>
> <snip/>
>
> One of my vague long term projects is to look at ways of building and
> utilizing a sort of PSVI database. (Binary XML that never leaves the
> building...) Essentially we have a large project that reuses (and
> reuses) various XML fragments from many different sources in many
> different combinations many times (controlled by some small number of
> parameters). Think of this as a cache of parsed XML that can
> subsequently be consumed by XSLT transforms. On occasion some portions
> of this cache will be invalidated and replaced.
>
> So the question becomes; do you think any of this work could form a
> basis for such a database? Would it be efficient to parse XML to this
> format, then feed (multiple chained) XSLT transforms from this format?
>
> I'd spend some time examining the code, but we're in the middle of a
> release and more than swamped at the moment... (For the Cocoon-dev
> lurkers on this list, yes, this is related to the discussion on long
> term caching models.)
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>

begin:vcard
n:Chiusano;Joseph
tel;work:(703) 902-6923
x-mozilla-html:FALSE
url:www.bah.com
org:Booz | Allen | Hamilton;IT Digital Strategies Team
adr:;;8283 Greensboro Drive;McLean;VA;22012;
version:2.1
email;internet:chiusano_joseph@bah.com
title:Senior Consultant
fn:Joseph M. Chiusano
end:vcard
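[Editorial sketch] The quoted description of string registers (each identifier is stored once, and later uses refer to it by index) can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual Waterken Doc code format: the names StringRegisters, encode_names and decode_names, and the "DEF"/"REF" event tags, are hypothetical and chosen only for illustration.

```python
class StringRegisters:
    """Hypothetical register table: each distinct string gets exactly one slot."""

    def __init__(self):
        self._index = {}     # string -> register number
        self._strings = []   # register number -> string

    def intern(self, s):
        """Return (register number, True if the string was newly registered)."""
        idx = self._index.get(s)
        if idx is not None:
            return idx, False
        idx = len(self._strings)
        self._index[s] = idx
        self._strings.append(s)
        return idx, True


def encode_names(names):
    """Emit ("DEF", text) on first use of a name and ("REF", index) afterwards."""
    regs = StringRegisters()
    events = []
    for name in names:
        idx, is_new = regs.intern(name)
        events.append(("DEF", name) if is_new else ("REF", idx))
    return events


def decode_names(events):
    """Rebuild the name sequence; repeated names resolve to one shared object."""
    regs = []
    names = []
    for kind, value in events:
        if kind == "DEF":
            regs.append(value)
            names.append(value)
        else:
            names.append(regs[value])
    return names
```

On the decoding side every repeated name comes back as the same object held in the register list, so a consumer could compare names by identity (`is`) instead of character by character, which is the "pointer comparison" benefit mentioned in the thread.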
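[Editorial sketch] The thread's remark that a chunked string representation gives "the speed benefits of a length prefix without creating an unlimited buffering requirement" can be sketched as below. The wire layout here is an assumption made for illustration only (a one-byte length before each chunk and a zero-length chunk as terminator); it is not taken from the Waterken Doc code specification, and the names MAX_CHUNK, write_chunked and read_chunked are hypothetical.

```python
import io

MAX_CHUNK = 255  # assumed limit: the largest length a one-byte prefix can carry


def write_chunked(out, pieces):
    """Write byte pieces as length-prefixed chunks, then a zero-length terminator.

    The writer never needs the total string length up front, so it buffers
    at most MAX_CHUNK bytes at a time.
    """
    for piece in pieces:
        for start in range(0, len(piece), MAX_CHUNK):
            chunk = piece[start:start + MAX_CHUNK]
            out.write(bytes([len(chunk)]))
            out.write(chunk)
    out.write(b"\x00")


def read_chunked(inp):
    """Read chunks until the terminator; each chunk's length is known in
    advance, so no byte-by-byte scan for a delimiter is needed."""
    parts = []
    while True:
        header = inp.read(1)
        if not header:
            raise EOFError("stream ended before terminator")
        length = header[0]
        if length == 0:
            return b"".join(parts)
        chunk = inp.read(length)
        if len(chunk) != length:
            raise EOFError("truncated chunk")
        parts.append(chunk)
```

For example, writing a 600-byte string supplied in two pieces to an `io.BytesIO()` and calling `read_chunked` on the rewound buffer returns the original bytes, while the writer only ever held one chunk at a time.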