<i18n dev> Level 1 Unicode support for Java regexes: overview
Tom Christiansen
tchrist at perl.com
Fri Jan 21 07:02:55 PST 2011
> Thanks for looking into the Unicode support issues in Java RegEx.
> Since you have been working on Unicode in the past decade, I'm sure
> you understand that most of the issues you are pointing out here
> belong to the "Extended Unicode Support: Level 2" as documented in
> UTS#18 Unicode Regular Expressions [2].
I don't know that "most" of my issues pertain to Level 2, although
I haven't actually counted up what falls in what category.
> Unfortunately the current Java RegEx implementation only
> supports the "Basic Unicode Support: Level 1",
Quite possibly you've done more work to make that statement true,
but as far as I can tell, the current regex class does not provide
even the most basic "Level 1" Unicode support specified in UTS#18.
It does support some of the Level 1 features, but not all of them.
Several are omitted, which I will draw attention to below.
> as specified in Java RegEx
> java.util.regex.Pattern API document [1].
> [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html
Is the source for that available? If it were, I'm sure I could
easily answer many of my own questions.
> [2] http://www.unicode.org/reports/tr18
Perhaps I'm misreading, but I do not believe that Java provides even basic
Level 1 support for regexes as specified in that document. Sherman, you
may have already added in the necessary functionality for Level 1 support,
but I do not see that in the API you reference above.
It is quite easy to tell whether an implementation meets the Level 1
requirements because under each of those 7 subsections, there is a very
specific statement about what it takes to be considered to have met that
requirement. These statements are of the form "RLX.Y: ..." where X is 1
for Level 1, 2 for Level 2, etc.; and where Y is the subsection. I quote
from UTS#18:
0.2 Conformance
The following describes the possible ways that an implementation can
claim conformance to this technical standard.
All syntax and API presented in this document is only for the purpose
of illustration; there is absolutely no requirement to follow such
syntax or API. Regular expression syntax varies widely: the features
discussed here would need to be adapted to the syntax of the particular
implementation. In general, the syntax in examples is similar to that
of Perl Regular Expressions, but it may not be exactly the same.
While the API examples generally follow Java style, it is again only
for illustration.
C0. An implementation claiming conformance to this specification at
any Level shall identify the version of this specification and
the version of the Unicode Standard.
C1. An implementation claiming conformance to Level 1 of this
specification shall meet the requirements described in the
following sections:
RL1.1 Hex Notation
RL1.2 Properties
RL1.2a Compatibility Properties
RL1.3 Subtraction and Intersection
RL1.4 Simple Word Boundaries
RL1.5 Simple Loose Matches
RL1.6 Line Boundaries
RL1.7 Supplementary Code Points
I'll now go through each of those individually.
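As a rough illustration of how one might probe a few of these
requirements against java.util.regex, here is a minimal sketch. The
class name Level1Probe and the helper matchesWhole are hypothetical
names made up for this example, the test characters are arbitrary
picks, and the results noted in the comments assume a pattern compiled
with default flags. It is a spot check of a handful of cases, not a
conformance test.

```java
import java.util.regex.Pattern;

// Hypothetical probe class; the names here are illustrative only and
// come from neither UTS#18 nor the JDK.
class Level1Probe {
    // Returns true if the entire input matches the regex, compiled
    // with default flags.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+00E9 LATIN SMALL LETTER E WITH ACUTE (a single BMP code point)
        String eAcute = new String(Character.toChars(0x00E9));
        // U+1D11E MUSICAL SYMBOL G CLEF (a supplementary code point)
        String gClef = new String(Character.toChars(0x1D11E));

        // RL1.1 Hex Notation: four-digit hex escape inside the regex itself.
        System.out.println(matchesWhole("\\u00E9", eAcute));               // true

        // RL1.2 Properties: General_Category "Letter".
        System.out.println(matchesWhole("\\p{L}", eAcute));                // true

        // RL1.2a Compatibility Properties: with default flags, \w covers
        // only [a-zA-Z_0-9], so a non-ASCII letter does not match.
        System.out.println(matchesWhole("\\w", eAcute));                   // false

        // RL1.3 Subtraction and Intersection: a letter that is not uppercase.
        System.out.println(matchesWhole("[\\p{L}&&[^\\p{Lu}]]", eAcute));  // true

        // RL1.7 Supplementary Code Points: "." consumes the whole
        // surrogate pair as one code point.
        System.out.println(matchesWhole(".", gClef));                      // true
    }
}
```

The \w line is the kind of gap at issue here: the property-based
pattern \p{L} recognizes the letter, while the shorthand class does not.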
--tom