Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Ulf Zibis
Ulf.Zibis at gmx.de
Fri Jun 8 19:07:09 UTC 2012
Thanks Sherman!
On 08.06.2012 20:36, Xueming Shen wrote:
> On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>>
>>
>> Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or
>> Unicode code points?
>
> The regex spec says Pattern and Matcher work on a character sequence, with reference to the
> CharSequence interface, but the pattern itself does support Unicode characters via various
> regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence
of Unicode code points, right?
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??" // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
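
To make the difference between 16-bit chars and code points easier to check, here is a minimal sketch. The class SurrogateView and its helpers isWellFormedUtf16 and describe are hypothetical names chosen for illustration, not part of the JDK. It reports, for any result string, how many chars and code points it has and whether every surrogate is properly paired. The replaceAll results themselves are only printed, not asserted, since they may depend on the JDK in use.

```java
final class SurrogateView {
    // Hypothetical helper: true if every surrogate in s belongs to a
    // high+low pair, i.e. the sequence is well-formed UTF-16.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low one
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    // Hypothetical helper: summarizes the char view and the code point view.
    static String describe(String s) {
        return s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points, well-formed="
                + isWellFormedUtf16(s);
    }

    public static void main(String[] args) {
        String input = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        System.out.println(describe(input)); // 6 chars, 3 code points, well-formed=true
        // Results of the calls below are printed for inspection only.
        System.out.println(describe(input.replaceAll("[AB\uD840\uDC01C]", "?")));
        System.out.println(describe("12\uD840\uDC02".replaceAll("[^0-9]", "?")));
        System.out.println(describe(input.replaceAll("", "-")));
    }
}
```

A result reported as well-formed=false would be the kind of corrupted UTF-16 the subject line refers to.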
> An empty String pattern is really a corner case here; it does
> not say anything about "character".
So it should be specified in the javadoc, and I agree with Dawid that it should be implemented as in Python.
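
As a rough illustration of what a code-point-aware result for the empty pattern could look like, here is a minimal sketch. It assumes the desired behaviour is "one empty match at each code point boundary". The class EmptyMatchSketch and the method insertAtCodePointBoundaries are hypothetical, and the replacement is treated as a literal string, without the group-reference handling that replaceAll does.

```java
final class EmptyMatchSketch {
    // Hypothetical helper: places the replacement before each code point and
    // once at the end, so a surrogate pair is never split.
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder out = new StringBuilder(replacement);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);          // reads a full pair if present
            out.appendCodePoint(cp).append(replacement);
            i += Character.charCount(cp);           // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // For BMP-only input this agrees with "ab".replaceAll("", "-").
        System.out.println(insertAtCodePointBoundaries("ab", "-")); // -a-b-
        // The surrogate pair stays intact between the inserted dashes.
        System.out.println(insertAtCodePointBoundaries("\uD840\uDC00a", "-"));
    }
}
```

Stepping by Character.charCount instead of by one char is the only difference from a char-based loop, and it is what keeps the output well-formed.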
-Ulf