Subject: [relax-ng] Proposal 2 for RNG regular expressions
This is a proposal for extending RNG to support regular expressions
within content. Essentially, a new type of pattern is introduced,
parallel to the value and data patterns: the regex pattern. Various
existing and novel elements can appear within a regex pattern in order
to specify exactly which characters (of an attribute value or of
character content) are allowed.

There are two main design principles. The first is that everything
within a regex pattern (after reduction of ref and externalRef elements)
can be compiled into a Perl 5 regular expression, since this type of
r.e. is supported by many different libraries. The second is that the
facilities provided are "in the spirit of RNG": that is, they provide
what is clearly necessary, but do not have much syntactic sugar, and
can be implemented using variants of the implementation for ordinary
patterns, if a Perl-compatible library is not being used.

A non-principle was to be equivalent to XSD regexes.

Here is the (compact) syntax for the additions. Anything not defined
here is defined in the compact syntax for RNG. This grammar could be
improved if the compact RNG grammar were factored better so that there
were rules for ref, externalRef, empty, value.

# Add regex as a new pattern type
pattern &= regex

# All regexes are packaged within a regex element
regex = element regex { common & re+ }

re = element group|choice|optional|zeroOrMore|oneOrMore { common & re+ }
                                # Perl (?:...), |, ?, *, +
   | element ref { nameNCName, common }
                                # Local regex rule
   | element empty { common }
                                # No op
   | element value { commonAttributes, xsd:string }
                                # Literal string: all non-alphabetics
                                # get escaped for Perl
   | element externalRef { href, common }
                                # Global regex rule
   | element begin|end { common, attribute type { "word"|"line"|"string" } }
                                # \b, \b, ^, $, \A, \Z
   | cset
                                # Character sets

cset = element cset { common &
         ((cs+ & csexcept?)
                                # cs expression
         | attribute name { xsd:token }
                                # named character class
         | (attribute type { "chars"|"ranges" }, xsd:string)) }
                                # chars: enumerate members of set
                                # ranges: pairs of characters define
                                # inclusive ranges

cs = element choice|concur { common & cs+ }
                                # union, intersection
   | element ref { nameNCName, common }
   | element empty { common }
   | element externalRef { href, common }
   | cset

csexcept = element except { common & cs+ }
                                # cset difference

The following paths are forbidden. Essentially these restrictions
force patterns imported by ref or externalRef to conform to the re
and cs rules.

    regex//data
    regex//list
    regex//attribute
    regex//ref
    regex//interleave
    cset//optional
    cset//oneOrMore
    cset//zeroOrMore
    cset//group
    cset//begin
    cset//end

The licit values for cset/@name have yet to be defined. Obvious
candidates are any, anyButNewline, lower, upper, alpha, digit, num,
alphanum, punct, graph, space, control. Also ICU character predicates,
possibly Unicode block names.

-- 
John Cowan <jcowan@reutershealth.com>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
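
To illustrate the first design principle (that a regex pattern can be
compiled into a Perl-style regular expression), here is a minimal sketch
in Python. It is not part of the proposal. The function names compile_re
and compile_cset are hypothetical, Python's re module stands in for a
Perl-compatible library, namespaces are ignored, and the sketch assumes
that ref and externalRef have already been reduced and that child
elements are concatenated in document order. Only cset elements with
type="chars" or type="ranges" are handled; named classes and cs
expressions (choice, concur, except) are left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed mapping of begin/end elements onto Perl-style anchors.
ANCHORS = {
    ("begin", "word"): r"\b", ("end", "word"): r"\b",
    ("begin", "line"): "^", ("end", "line"): "$",
    ("begin", "string"): r"\A", ("end", "string"): r"\Z",
}

# Repetition elements and the quantifier each one compiles to.
QUANTIFIERS = {"optional": "?", "zeroOrMore": "*", "oneOrMore": "+"}


def compile_cset(el):
    """Turn a cset with type=chars|ranges into a bracket expression."""
    kind = el.get("type")
    text = el.text or ""
    if not text:
        raise ValueError("empty cset is not handled in this sketch")
    if kind == "chars":
        # Every character is a member of the set.
        return "[" + "".join(re.escape(c) for c in text) + "]"
    if kind == "ranges":
        # Consecutive pairs of characters define inclusive ranges.
        if len(text) % 2:
            raise ValueError("ranges needs an even number of characters")
        pairs = (text[i:i + 2] for i in range(0, len(text), 2))
        body = "".join(re.escape(p[0]) + "-" + re.escape(p[1]) for p in pairs)
        return "[" + body + "]"
    raise NotImplementedError("only type=chars|ranges is sketched here")


def compile_re(el):
    """Recursively compile a regex element tree into a pattern string."""
    tag = el.tag
    if tag == "value":
        return re.escape(el.text or "")   # literal string, escaped
    if tag == "empty":
        return ""                         # no op
    if tag in ("begin", "end"):
        return ANCHORS[(tag, el.get("type"))]
    if tag == "cset":
        return compile_cset(el)
    parts = [compile_re(child) for child in el]
    if tag in ("regex", "group"):
        return "(?:" + "".join(parts) + ")"
    if tag == "choice":
        return "(?:" + "|".join(parts) + ")"
    if tag in QUANTIFIERS:
        return "(?:" + "".join(parts) + ")" + QUANTIFIERS[tag]
    raise NotImplementedError(tag)


doc = """
<regex>
  <begin type="string"/>
  <cset type="ranges">AZ</cset>
  <oneOrMore><cset type="chars">0123456789</cset></oneOrMore>
  <optional><value>.x</value></optional>
  <end type="string"/>
</regex>
"""
pattern = compile_re(ET.fromstring(doc))
print(pattern)
print(bool(re.search(pattern, "B42.x")), bool(re.search(pattern, "b42")))
```

Wrapping every group in (?:...) keeps the output free of capturing
groups, which matches the "# Perl (?:...)" comment in the grammar. The
type="line" anchors would additionally need multi-line matching to be
switched on in whatever library is used.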