<i18n dev> Suggested corrections to the Level 1 conformance statement

Sun Jan 23 18:55:10 PST 2011

In the JDK7 Pattern documentation, it says:

    This class is in conformance with Level 1 of Unicode 
    Technical Standard #18: Unicode Regular Expression 
    Guidelines, plus RL2.1 Canonical Equivalents.

But the very first thing in tr18's conformance section reads:

  C0. An implementation claiming conformance to this
      specification at any Level shall identify the version of 
      this specification and the version of the Unicode Standard.

What is therefore missing from the JDK7 j.u.r.Pattern
documentation is a mandatory pair of concrete version citations:
one about tr18 and one about which version of Unicode.

Full citation forms can be found at:

    http://www.unicode.org/versions/#References
    http://www.unicode.org/versions/components-6.0.0.html

The versions I myself have been using for these analyses are:

  UTS#18, "Unicode Regular Expressions", version 13 from August 29, 2008.
    http://www.unicode.org/reports/tr18/tr18-13.html

  The Unicode Standard, version 6.0.0 from October 11, 2010.
    http://www.unicode.org/versions/Unicode6.0.0/

I believe that JDK7 is in a functionality freeze.  

One thing that should still be possible even at this late stage in the JDK7
cycle is to recast the single conformance statement in the documentation
into a more fine-grained set of statements corresponding to each of RL1.1
through RL1.7.

This is the approach taken by Perl.  Instead of a broad brush, we list in
columnar format each of the RL numbers along with our current status toward
meeting that requirement, with footnotes giving any needed elaboration.
See the section "Unicode Regular Expression Support Level" in the
perlunicode manpage for how this looks (and preferably in a current
release, so Perl 5.12 or better).  After my signature I give an example
of this from our current release.

I think this is probably the best way to go anyway, but it is clearly the
only choice given the demands of sound and stable release engineering.
That's because although it may be possible to make one or two fixes
where there is a clear bug at variance with documented behavior, I do not
believe it possible to sneak in the non-trivial changes needed for things
like RL1.2.

I also suggest that some thought be given to how to go about
implementing full Level 1 conformance in as useful but painless a
manner as possible.  I have several ideas related to maintaining
backwards compatibility while still moving forward.  This necessarily
requires more deliberation, and is clearly beyond what is allowable
under a functionality freeze.

But updating the documentation should not be.

--tom

For comparison purposes only, here is Perl's conformance statement from
the perlunicode manpage.  The footnotes indicate how each requirement is
(or is not) met.  I include only the Level 1 matters; Levels 2 and 3 are
not well-supported at this time, being limited to \X and \N{}; tailoring
is available via the Unicode::Collate and Unicode::Collate::Locale classes,
and normalization via Unicode::Normalize, but these are not yet integrated
into the regular expression system proper.

=head2 Unicode Regular Expression Support Level

  The following list of Unicode support for regular expressions describes
  all the features currently supported.  The references to "Level N" and
  the section numbers refer to the Unicode Technical Standard #18, "Unicode
  Regular Expressions", version 11, in May 2005.

    Level 1 - Basic Unicode Support

    RL1.1   Hex Notation                     - done          [1]
    RL1.2   Properties                       - done          [2][3]
    RL1.2a  Compatibility Properties         - done          [4]
    RL1.3   Subtraction and Intersection     - MISSING       [5]
    RL1.4   Simple Word Boundaries           - done          [6]
    RL1.5   Simple Loose Matches             - done          [7]
    RL1.6   Line Boundaries                  - MISSING       [8]
    RL1.7   Supplementary Code Points        - done          [9]

{IMPLEMENTATION FOOTNOTES}

    [1]  \x{...}
    [2]  \p{...} \P{...}
    [3]  supports not only minimal list, but all Unicode character
	 properties (see L</Unicode Character Properties>)
    [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
    [5]  can use regular expression look-ahead [a] or user-defined 
	 character properties [b] to emulate set operations
    [6]  \b \B
    [7]  note that Perl does Full case-folding in matching (but with
         bugs), not Simple: for example U+1F88 is equivalent to U+1F00
         U+03B9, not to U+1F80.  This difference matters mainly for
         certain Greek capital letters with certain modifiers: the Full
         case-folding decomposes the letter, while the Simple 
	 case-folding would map it to a single character.
    [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR
	 (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS
	 (U+2029); should also affect <>, $., and script line
	 numbers; should not split lines within CRLF [c] (i.e. 
	 there is no empty line between \r and \n)
    [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 
	 to U+10FFFF but also beyond U+10FFFF [d]
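
An aside on footnote [8], for readers on the Java side: the line
terminators it lists can be recognized with one explicit alternation.
What follows is only a rough sketch, not how Perl or the JDK implements
anything; the class and method names (Rl16Lines, lines) are made up for
illustration.  The one design point that matters is that CRLF is tried
first, so that no empty line appears between \r and \n.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any JDK or Perl API.
final class Rl16Lines {
    // CRLF comes first in the alternation so the pair is consumed as one
    // terminator; the class then covers LF, VT, FF, CR, NEL, LS and PS.
    static final Pattern TERMINATOR =
        Pattern.compile("\\r\\n|[\\n\\x0B\\f\\r\\u0085\\u2028\\u2029]");

    // Splits text into lines on any terminator from the list above.
    static String[] lines(String text) {
        // A limit of -1 keeps empty lines, including trailing ones.
        return TERMINATOR.split(text, -1);
    }

    public static void main(String[] args) {
        // CRLF, bare CR, FF and LS each end exactly one line here.
        String text = "one\r\ntwo\rthree\ffour\u2028five";
        System.out.println(Arrays.toString(lines(text)));
    }
}
```

Note that the reversed sequence LF CR is two terminators, not one, so it
does produce an empty line between them.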

{LETTERED FOOTNOTES}

    [a] You can mimic class subtraction using lookahead. 
	For example, what UTS#18 might write as

	    [{Greek}-[{UNASSIGNED}]]

	in Perl can be written as:

	    (?!\p{Unassigned})\p{InGreekAndCoptic}
	    (?=\p{Assigned})\p{InGreekAndCoptic}

	But in this particular example, you probably really want

	    \p{GreekAndCoptic}

        which will match assigned characters known to be part of
        the Greek script.

        Also see the Unicode::Regex::Set module, it does
        implement the full UTS#18 grouping, intersection, union,
        and removal (subtraction) syntax.

    [b] '+' for union, '-' for removal (set-difference), '&' for
	intersection (see L</"User-Defined Character Properties">)

    [c] Try the C<:crlf> layer (see L<PerlIO>).

    [d] U+FFFF will currently generate a warning message if 'utf8'
	warnings are enabled
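
For what it is worth, the lookahead trick in footnote [a] carries over
to java.util.regex more or less unchanged, and Java's character classes
also accept the && intersection operator, which gives subtraction when
combined with a negated class.  Below is a minimal sketch of both forms.
It assumes that \p{InGreek} (the block) and \p{Cn} (unassigned code
points) are the right Java spellings for what the Perl example calls
\p{InGreekAndCoptic} and \p{Unassigned}; the class and method names are
hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class GreekSetOps {
    // Lookahead emulation in the style of footnote [a]: refuse any
    // unassigned code point (general category Cn), then require the block.
    static final Pattern VIA_LOOKAHEAD =
        Pattern.compile("(?!\\p{Cn})\\p{InGreek}");

    // Native set operation: intersect the block with the complement of Cn,
    // which amounts to subtracting the unassigned code points.
    static final Pattern VIA_INTERSECTION =
        Pattern.compile("[\\p{InGreek}&&[^\\p{Cn}]]");

    static boolean viaLookahead(String s) {
        return VIA_LOOKAHEAD.matcher(s).matches();
    }

    static boolean viaIntersection(String s) {
        return VIA_INTERSECTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        String alpha = "\u03B1";     // GREEK SMALL LETTER ALPHA, assigned
        String reserved = "\u03A2";  // inside the Greek block, unassigned
        System.out.println(viaLookahead(alpha) + " " + viaIntersection(alpha));
        System.out.println(viaLookahead(reserved) + " " + viaIntersection(reserved));
    }
}
```

Both patterns should accept an assigned Greek letter and reject a
reserved code point in the same block.  The intersection form is the one
that speaks to RL1.3 directly; the lookahead form is only a workaround.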