<i18n dev> Level 1 Unicode support for Java regexes: overview

Fri Jan 21 07:02:55 PST 2011

> Thanks for looking into the Unicode support issues in Java RegEx.

> Since you have been working on Unicode in the past decade, I'm sure
> you understand that most of the issues you are pointing out here
> belongs to the "Extended Unicode Support: Level 2" as documented in
> UTS#18 Unicode Regular Expressions [2].

I don't know that "most" of my issues pertain to Level 2, although 
I haven't actually counted up what falls in what category.

> Unfortunately the current Java RegEx implementation only
> supports the  "Basic Unicode Support: Level 1",

Quite possibly you've done more work to make that statement true,
but as far as I can tell, the current regex class does not provide
even the most basic "Level 1" Unicode support specified in UTS#18.
It does support some of the Level 1 features, but not all of them.
Several are omitted, and I will draw attention to those below.

> as specified in Java RegEx
> java.util.regex.Pattern API document [1].

>   [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html

Is the source for that available?  If it were, I'm sure I could
easily answer many of my questions myself.

>   [2] http://www.unicode.org/reports/tr18

Perhaps I'm misreading, but I do not believe that Java provides even basic
Level 1 support for regexes as specified in that document.  Sherman, you
may have already added in the necessary functionality for Level 1 support,
but I do not see that in the API you reference above.

It is quite easy to tell whether an implementation meets the Level 1
requirements because under each of those 7 subsections, there is a very
specific statement about what it takes to be considered to have met that
requirement.  These statements are of the form "RLX.Y: ..." where X is 1
for Level 1, 2 for Level 2, etc.; and where Y is the subsection.  I quote
from UTS#18:

    0.2 Conformance

    The following describes the possible ways that an implementation can
    claim conformance to this technical standard.

    All syntax and API presented in this document is only for the purpose
    of illustration; there is absolutely no requirement to follow such
    syntax or API. Regular expression syntax varies widely: the features
    discussed here would need to be adapted to the syntax of the particular
    implementation. In general, the syntax in examples is similar to that
    of Perl Regular Expressions, but it may not be exactly the same.

    While the API examples generally follow Java style, it is again only
    for illustration.

    C0.  An implementation claiming conformance to this specification at 
         any Level shall identify the version of this specification and
         the version of the Unicode Standard.

    C1.  An implementation claiming conformance to Level 1 of this 
         specification shall meet the requirements described in the
         following sections:

            RL1.1 Hex Notation
            RL1.2 Properties
            RL1.2a Compatibility Properties
            RL1.3 Subtraction and Intersection
            RL1.4 Simple Word Boundaries
            RL1.5 Simple Loose Matches
            RL1.6 Line Boundaries
            RL1.7 Supplementary Code Points
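
As a rough way to poke at a few of those sections from Java, here is a
minimal sketch.  The class name Level1Probe, the method names, and the
sample strings are my own illustrative choices, not anything from UTS#18
or the JDK.  Each method checks only one easy case, so a "true" result
says nothing about whether the full RL1.x requirement is actually met.

```java
import java.util.regex.Pattern;

// Hypothetical probe class: names and sample strings are illustrative
// assumptions. Each probe exercises a single easy case only.
class Level1Probe {

    // RL1.1 Hex Notation: the pattern spells U+00E9 with a four-digit
    // hex escape that the regex engine itself interprets.
    static boolean hexNotation() {
        return Pattern.matches("\\u00E9", "\u00E9");
    }

    // RL1.2 Properties: General_Category=Lu against U+00C9.
    static boolean properties() {
        return Pattern.matches("\\p{Lu}", "\u00C9");
    }

    // RL1.3 Subtraction and Intersection: letters that are not uppercase,
    // written with Java's && intersection syntax.
    static boolean subtraction() {
        Pattern lettersNotUpper = Pattern.compile("[\\p{L}&&[^\\p{Lu}]]");
        return lettersNotUpper.matcher("a").matches()
            && !lettersNotUpper.matcher("A").matches();
    }

    // RL1.5 Simple Loose Matches: case-insensitive match outside ASCII,
    // which needs UNICODE_CASE in addition to CASE_INSENSITIVE.
    static boolean looseMatch() {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile("\u00E9", flags).matcher("\u00C9").matches();
    }

    // RL1.7 Supplementary Code Points: U+1D504 is two chars in a Java
    // String, but a single dot should match it as one code point.
    static boolean supplementary() {
        String fraktur = new String(Character.toChars(0x1D504));
        return fraktur.length() == 2 && Pattern.matches(".", fraktur);
    }

    public static void main(String[] args) {
        System.out.println("RL1.1 probe: " + hexNotation());
        System.out.println("RL1.2 probe: " + properties());
        System.out.println("RL1.3 probe: " + subtraction());
        System.out.println("RL1.5 probe: " + looseMatch());
        System.out.println("RL1.7 probe: " + supplementary());
    }
}
```

The probes deliberately skip RL1.2a, RL1.4, and RL1.6, where a single
one-line check would be more misleading than helpful.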

I'll now go through each of those individually.  

--tom