<i18n dev> RL1.1 Hex Notation
Xueming Shen
xueming.shen at oracle.com
Wed Jan 26 12:47:27 PST 2011
On 01/26/2011 11:50 AM, Mark Davis ☕ wrote:
> > I guess you are asking for something like?
>
> I'm not asking for that. What I'm saying is that as far as I can tell,
> there is no way in Java to meet the terms of RL1.1, because there is
> not a way to use hex numbers in any syntax for values above FFFF to
> indicate literals. That is, if you supply "abc\\uD800\\uDC00def" then
> regex fails.
>
> The code was my attempt to try to get something to work even using
> separate surrogates (which was not the intent of RL1.1), but even that
> failed. Maybe there is another way to do it?
>
> Mark
Oh, I see the problem. Obviously I have been working on JDK 7 for too
long and forgot that the latest release is still 6 :-( There is indeed a
bug in the previous implementation, which I fixed in 7 a long time ago
(I mentioned this in one of the earlier emails but was not specific, my
apologies). It should probably be backported to a 6 update release ASAP.

The test case runs well on 7 (the "failures" in LITERALS are expected)
with the following output. I modified your test case "slightly": the
UnicodeSet class in our normalizer package does not appear to have a
size() method, so I replaced it with an ordinary hash set (a
LinkedHashSet).
-Sherman
------------------------------------------------------------------
LITERALS Failures: 18
set: [9, 10, 11, 12, 13, 32, 35, 36, 40, 41, 42, 43, 63, 91, 92, 94, 123, 124]
example1: a b
exampleN: a|b
INLINE Failures: 0
set: []
example1: null
exampleN: null
INRANGE Failures: 0
set: []
example1: null
exampleN: null
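
The 18 LITERALS failures above are the code points that are either
ignored under Pattern.COMMENTS (ASCII whitespace and '#') or have a
special meaning in regex syntax, so a raw, unescaped occurrence cannot
match itself. The following is a minimal sketch of that single check,
not part of the original test case; the class and method names
(LiteralFailureSketch, literalMatches, reason) are made up for
illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LiteralFailureSketch {
    // The code points reported in the LITERALS failure set above.
    static final int[] REPORTED = {
        9, 10, 11, 12, 13, 32, 35, 36, 40, 41, 42, 43, 63, 91, 92, 94, 123, 124
    };

    // Uses "a" + character + "b" both as the pattern and as the target,
    // compiled with COMMENTS, as the LITERALS check in the test case does.
    static boolean literalMatches(int codePoint) {
        String text = "a" + new String(Character.toChars(codePoint)) + "b";
        try {
            return Pattern.compile(text, Pattern.COMMENTS).matcher(text).matches();
        } catch (PatternSyntaxException e) {
            return false; // e.g. an unclosed '(' or '['
        }
    }

    // Rough classification of why an unescaped character fails.
    static String reason(int codePoint) {
        if (codePoint == '#') {
            return "starts a comment under COMMENTS";
        }
        if (codePoint <= 0x20 && Character.isWhitespace(codePoint)) {
            return "whitespace, ignored under COMMENTS";
        }
        return "regex metacharacter";
    }

    public static void main(String[] args) {
        for (int cp : REPORTED) {
            System.out.printf("U+%04X matches=%b (%s)%n", cp, literalMatches(cp), reason(cp));
        }
    }
}
```

An ordinary character such as 'c' passes the same check, which is why
only these 18 code points show up in the set.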
-----------------------------------------------------------------------
import java.util.regex.*;
import java.util.*;
import sun.text.normalizer.*;  // JDK-internal package; provides Utility.hex

public class TestRegex2 {
    public static void main(String[] args) {
        System.out.println("Check patterns for Unicodeset");
        for (int i = 0; i <= 0x10FFFF; ++i) {
            // The goal is to make a regex with hex digits, and have it
            // match the corresponding character.
            // We check two different environments: inline ("aXb") and
            // in a range ("a[X]b").
            String s = new StringBuilder().appendCodePoint(i).toString();
            // BMP code points use one hex escape; supplementary code
            // points use one escape per surrogate.
            String hexPattern = i <= 0xFFFF
                ? "\\u" + Utility.hex(i, 4)
                : "\\u" + Utility.hex(Character.toChars(i)[0], 4)
                    + "\\u" + Utility.hex(Character.toChars(i)[1], 4);
            String target = "a" + s + "b";
            Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
            Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
            Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
        }
        Failures.LITERALS.showFailures();
        Failures.INLINE.showFailures();
        Failures.INRANGE.showFailures();
    }

    enum Failures {
        LITERALS, INLINE, INRANGE;

        // Code points whose pattern did not match, in encounter order.
        Set<Integer> failureSet = new LinkedHashSet<Integer>();
        String firstSampleFailure;
        String lastSampleFailure;

        void checkMatch(int codePoint, String pattern, String target) {
            if (!matches(pattern, target)) {
                failureSet.add(codePoint);
                if (firstSampleFailure == null) {
                    firstSampleFailure = pattern;
                }
                lastSampleFailure = pattern;
            }
        }

        boolean matches(String hexPattern, String target) {
            try {
                // use COMMENTS to get the 'worst case'
                return Pattern.compile(hexPattern, Pattern.COMMENTS)
                    .matcher(target).matches();
            } catch (Exception e) {
                // a pattern that does not compile counts as a failure
                return false;
            }
        }

        void showFailures() {
            System.out.format(this
                + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                failureSet.size(), failureSet, firstSampleFailure,
                lastSampleFailure);
        }
    }
}
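
For a single supplementary code point, the INLINE check in the test
case reduces to the sketch below. It builds the two-escape
surrogate form (the kind of pattern Mark reported as failing on 6) and
matches it against the actual character. It is an illustration only:
the class and method names are hypothetical, and String.format stands
in for the internal Utility.hex helper so that the sketch has no
dependency on sun.text.normalizer. On a JDK that contains the fix
described above, the match is expected to succeed.

```java
import java.util.regex.Pattern;

public class SurrogateEscapeSketch {
    // Builds one backslash-u hex escape per UTF-16 code unit of the code point.
    static String hexEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Same shape as the INLINE check: pattern "a<escapes>b" against "a<char>b".
    static boolean matchesInline(int codePoint) {
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        String pattern = "a" + hexEscapes(codePoint) + "b";
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        int codePoint = 0x10000; // first supplementary code point
        System.out.println("pattern escapes: " + hexEscapes(codePoint));
        System.out.println("matches: " + matchesInline(codePoint));
    }
}
```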