<i18n dev> RL1.1 Hex Notation
Xueming Shen
xueming.shen at oracle.com
Tue Jan 25 17:47:41 PST 2011
Hi Mark,
I guess you are asking for something like this?
char[] cc = Character.toChars(0x12345);
Matcher m = Pattern.compile("["
        + "\\u" + HEX(cc[0])
        + "\\u" + HEX(cc[1])
        + "]").matcher("");
System.out.println("find=" + m.reset("abc[" + new String(cc) + "]efg").find());
where HEX would be something like the following, so that each value comes
out as exactly four hex digits (nnnn):
static String HEX(char c) {
    // Zero-pad to four hex digits, as a Unicode escape requires.
    return String.format("%04x", (int) c);
}
It looks a little tedious, and you will probably also have to distinguish
BMP from supplementary code points to decide whether to feed in one UTF-16
hex value or a pair. It just shows that you can still use the Java Unicode
escape to embed the hex values of the UTF-16 code units instead of the
"character itself". Does it qualify for RL1.1?
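To make the BMP/supplementary distinction concrete, here is a minimal
sketch of a helper that emits one four-digit Unicode escape for a BMP code
point and a surrogate pair of escapes otherwise. The class and method names
(UnicodeEscapes, toEscapes, findsInClass) are made up for illustration. The
surrogate-pair case relies on the regex parser combining two consecutive
escapes into one code point, which recent JDKs do; older releases may behave
differently, as the test output quoted below suggests.

```java
import java.util.regex.Pattern;

public class UnicodeEscapes {
    // Hypothetical helper: one backslash-u escape per UTF-16 code unit,
    // so BMP code points yield one escape and supplementary ones yield two.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    // Puts the escaped form inside a character class and searches for
    // the real character in a sample string.
    static boolean findsInClass(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile("[" + toEscapes(codePoint) + "]")
                .matcher("abc[" + s + "]efg")
                .find();
    }

    public static void main(String[] args) {
        System.out.println(toEscapes(0x41));
        System.out.println(toEscapes(0x12345));
        System.out.println("find=" + findsInClass(0x12345));
    }
}
```

The caller no longer has to branch on BMP versus supplementary, because
Character.toChars already returns one or two code units as needed.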
Sure, \x{...} looks more straightforward and convenient. As I said in our
previous email exchange, I totally agree it would be a nice enhancement for
Java RegEx.
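For comparison, this is roughly what the \x{...} form looks like in use.
It is only a sketch: it assumes a java.util.regex.Pattern that accepts
\x{h...h} (to my knowledge that is Java 7 and later, so not the JDK under
discussion here), and the class and method names are invented for the
example.

```java
import java.util.regex.Pattern;

public class HexBraceNotation {
    // Hypothetical helper: wraps the code point's hex value in \x{...}.
    static String hexLiteral(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Builds "aXb" with X in hex notation and checks that it matches the
    // real character. COMMENTS is used as the 'worst case', as in the
    // quoted test.
    static boolean matchesAsLiteral(int codePoint) {
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        String pattern = "a" + hexLiteral(codePoint) + "b";
        return Pattern.compile(pattern, Pattern.COMMENTS)
                .matcher(target)
                .matches();
    }

    public static void main(String[] args) {
        int[] samples = { '|', '#', 0x12345, 0x10FFFF };
        for (int cp : samples) {
            System.out.println(hexLiteral(cp) + " matches: " + matchesAsLiteral(cp));
        }
    }
}
```

The same call covers syntax characters and supplementary code points, with
no surrogate arithmetic on the caller's side.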
-Sherman
On 1-25-2011 05:00 PM, Mark Davis ☕ wrote:
> The goal of the clause is to have a mechanism for using hex values for
> character literals. That is, you should be able to take a code point
> from 0 to 10FFFF, get a hex value for that, embed it in some syntax,
> and concatenate it into a pattern, and have it work as a literal.
>
> For example:
>
> String pattern = first_part + "\\x{" + hex(myCodePoint) + "}" +
> second_part; // for *some* hex notation
> ...
> Matcher m = Pattern.compile(pattern,
> Pattern.COMMENTS).matcher(target);
> ...
>
>
> As far as I can tell, Java really doesn't supply that capability for
> non-BMP, because the \u notation doesn't work above FFFF, except
> insofar as the preprocessor maps a surrogate pair in hex to literals,
> which happen all to work because they aren't syntax characters.
>
> What you can do with Java is:
>
> 1. embed the character itself, not the hex representation, which
> works some of the time (fails for 18 characters; syntax
> characters, as expected).
> 2. in constant expressions only, utilize the Java preprocessor (with
> \u.... or \u....\u....).
> 3. for BMP characters, use "\u" + hex(myCodePoint,4)
>
> Here is a quick and dirty test; let me know if I've missed something.
>
> *Output:*
>
> LITERALS Failures: 18
>     set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
>     example1: ab
>     exampleN: a|b
> INLINE Failures: 1048576
>     set: [\U00010000-\U0010FFFF]
>     example1: a\uD800\uDC00b
>     exampleN: a\uDBFF\uDFFFb
> INRANGE Failures: 1048576
>     set: [\U00010000-\U0010FFFF]
>     example1: a[\uD800\uDC00]b
>     exampleN: a[\uDBFF\uDFFF]b
>
>
> *Code:*
>
> public void TestRegex() {
>     logln("Check patterns for Unicodeset");
>
>     for (int i = 0; i <= 0x10FFFF; ++i) {
>         // The goal is to make a regex with hex digits, and have it match
>         // the corresponding character.
>         // We check two different environments: inline ("aXb") and in a
>         // range ("a[X]b")
>
>         String s = new StringBuilder().appendCodePoint(i).toString();
>
>         String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
>             : "\\u" + Utility.hex(Character.toChars(i)[0],4) +
>               "\\u" + Utility.hex(Character.toChars(i)[1],4);
>
>         String target = "a" + s + "b";
>
>         Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
>         Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
>         Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
>     }
>     Failures.LITERALS.showFailures();
>     Failures.INLINE.showFailures();
>     Failures.INRANGE.showFailures();
> }
>
>
> enum Failures {
>     LITERALS, INLINE, INRANGE;
>
>     UnicodeSet failureSet = new UnicodeSet();
>     String firstSampleFailure;
>     String lastSampleFailure;
>
>     void checkMatch(int codePoint, String pattern, String target) {
>         if (!matches(pattern, target)) {
>             failureSet.add(codePoint);
>             if (firstSampleFailure == null) {
>                 firstSampleFailure = pattern;
>             }
>             lastSampleFailure = pattern;
>         }
>     }
>
>     boolean matches(String hexPattern, String target) {
>         try {
>             // use COMMENTS to get the 'worst case'
>             return Pattern.compile(hexPattern,
>                 Pattern.COMMENTS).matcher(target).matches();
>         } catch (Exception e) {
>             return false;
>         }
>     }
>
>     void showFailures() {
>         System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
>             failureSet.size(), failureSet, firstSampleFailure,
>             lastSampleFailure);
>     }
> }
>