<i18n dev> RL1.1 Hex Notation
Mark Davis ☕
mark at macchiato.com
Tue Jan 25 17:00:50 PST 2011
The goal of the clause is to have a mechanism for using hex values for
character literals. That is, you should be able to take a code point from 0
to 10FFFF, get a hex value for that, embed it in some syntax, and
concatenate it into a pattern, and have it work as a literal.
For example:
String pattern = first_part + "\\x{" + hex(myCodePoint) + "}" + second_part;
// for *some* hex notation
...
Matcher m = Pattern.compile(pattern, Pattern.COMMENTS).matcher(target);
...
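To make the concatenation above concrete, here is a minimal sketch using the \x{h...h} form. It assumes a runtime whose java.util.regex accepts that escape (the Pattern documentation lists it for Java 7 and later; earlier versions do not). The class name and both helpers are hypothetical, and Integer.toHexString stands in for hex(...).

```java
import java.util.regex.Pattern;

public class HexBraceLiteral {
    // Hypothetical helper: wraps a code point as a \x{...} regex escape.
    static String hexEscape(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint).toUpperCase() + "}";
    }

    // Builds "a" + escape + "b" and checks it against the literal text,
    // compiling with COMMENTS so that whitespace and '#' are the hard cases.
    static boolean matchesAsLiteral(int codePoint) {
        String s = new StringBuilder().appendCodePoint(codePoint).toString();
        String target = "a" + s + "b";
        String pattern = "a" + hexEscape(codePoint) + "b";
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        // A space, '#', '|', and the first and last supplementary code points.
        int[] samples = {0x20, 0x23, 0x7C, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.println(hexEscape(cp) + " -> " + matchesAsLiteral(cp));
        }
    }
}
```

The point of the sketch is that the caller never needs to know whether the code point is a syntax character or lies outside the BMP; the same concatenation covers both.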
As far as I can tell, Java really doesn't supply that capability for
non-BMP characters, because the \u notation doesn't work above FFFF. The
only exception is that the preprocessor maps a surrogate pair written in
hex to literal characters, which all happen to work because none of them
are syntax characters.
What you can do with Java is:
1. embed the character itself, not the hex representation, which works
some of the time (it fails for 18 characters: the syntax characters, as
expected).
2. in constant expressions only, utilize the Java preprocessor (with
\u.... or \u....\u....).
3. for BMP characters, use "\\u" + hex(myCodePoint,4)
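The three options can be sketched as follows. This is only an illustration: the class name and the matches and bmpEscape helpers are made up for the example, and String.format stands in for hex(myCodePoint,4).

```java
import java.util.regex.Pattern;

public class JavaHexOptions {
    // Returns false on a non-match or when the pattern fails to compile.
    static boolean matches(String pattern, String target) {
        try {
            return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Option 3 (hypothetical helper): backslash, 'u', four hex digits; BMP only.
    static String bmpEscape(int codePoint) {
        if (codePoint < 0 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a BMP code point");
        }
        return String.format("\\u%04X", codePoint);
    }

    public static void main(String[] args) {
        // Option 1: the raw character. '|' is read as alternation, so this fails.
        System.out.println(matches("a" + "|" + "b", "a|b"));

        // Option 2: the compiler converts the escapes in this constant into the
        // two surrogate chars of U+10000 before the regex engine sees them.
        String supplementary = new StringBuilder().appendCodePoint(0x10000).toString();
        System.out.println(matches("a\uD800\uDC00b", "a" + supplementary + "b"));

        // Option 3: the regex engine itself decodes the escape for '|'.
        System.out.println(matches("a" + bmpEscape('|') + "b", "a|b"));
    }
}
```

Only option 3 works for a code point that is known only at run time, and it stops at FFFF.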
Here is a quick and dirty test; let me know if I've missed something.
*Output:*
LITERALS Failures: 18
set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
example1: a b
exampleN: a|b
INLINE Failures: 1048576
set: [\U00010000-\U0010FFFF]
example1: a\uD800\uDC00b
exampleN: a\uDBFF\uDFFFb
INRANGE Failures: 1048576
set: [\U00010000-\U0010FFFF]
example1: a[\uD800\uDC00]b
exampleN: a[\uDBFF\uDFFF]b
*Code:*
public void TestRegex() {
    logln("Check patterns for Unicodeset");
    for (int i = 0; i <= 0x10FFFF; ++i) {
        // The goal is to make a regex with hex digits, and have it
        // match the corresponding character.
        // We check two different environments: inline ("aXb") and in a
        // range ("a[X]b").
        String s = new StringBuilder().appendCodePoint(i).toString();
        String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
                : "\\u" + Utility.hex(Character.toChars(i)[0],4) +
                  "\\u" + Utility.hex(Character.toChars(i)[1],4);
        String target = "a" + s + "b";
        Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
        Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
        Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
    }
    Failures.LITERALS.showFailures();
    Failures.INLINE.showFailures();
    Failures.INRANGE.showFailures();
}
enum Failures {
    LITERALS, INLINE, INRANGE;
    UnicodeSet failureSet = new UnicodeSet();
    String firstSampleFailure;
    String lastSampleFailure;

    void checkMatch(int codePoint, String pattern, String target) {
        if (!matches(pattern, target)) {
            failureSet.add(codePoint);
            if (firstSampleFailure == null) {
                firstSampleFailure = pattern;
            }
            lastSampleFailure = pattern;
        }
    }

    boolean matches(String hexPattern, String target) {
        try {
            // use COMMENTS to get the 'worst case'
            return Pattern.compile(hexPattern, Pattern.COMMENTS)
                    .matcher(target).matches();
        } catch (Exception e) {
            return false;
        }
    }

    void showFailures() {
        System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                failureSet.size(), failureSet, firstSampleFailure,
                lastSampleFailure);
    }
}
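For readers without ICU4J on hand, the LITERALS check can be approximated with the JDK alone. The sketch below is a stand-in under stated assumptions, not the test above: java.util.BitSet replaces UnicodeSet, the class and method names are invented for the example, and it scans only a caller-supplied range so a quick run stays short.

```java
import java.util.BitSet;
import java.util.regex.Pattern;

public class LiteralFailureScan {
    // Same worst-case setup: COMMENTS mode, and any exception counts as a failure.
    static boolean matches(String pattern, String target) {
        try {
            return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Collects every code point in [first, last] whose raw character does not
    // match itself when placed between 'a' and 'b'.
    static BitSet literalFailures(int first, int last) {
        BitSet failures = new BitSet();
        for (int cp = first; cp <= last; ++cp) {
            String s = new StringBuilder().appendCodePoint(cp).toString();
            String text = "a" + s + "b";
            if (!matches(text, text)) {
                failures.set(cp);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Scan ASCII only; widen the range to 0x10FFFF for the full check.
        BitSet failures = literalFailures(0, 0x7F);
        System.out.println("LITERALS failures in ASCII: " + failures.cardinality());
        for (int cp = failures.nextSetBit(0); cp >= 0; cp = failures.nextSetBit(cp + 1)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Results for the hex-escape cases may differ between Java versions, so this sketch deliberately covers only the raw-literal case.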