Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Ulf Zibis Ulf.Zibis at gmx.de
Fri Jun 8 19:07:09 UTC 2012


Thanks Sherman!

On 08.06.2012 20:36, Xueming Shen wrote:
> On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>>
>>
>> Is there any spec on whether the Java Regex API has a general contract with 16-bit chars or
>> Unicode code points?
>
> The regex spec says Pattern and Matcher work on a character sequence, with reference to the
> CharSequence interface, but the pattern itself does support Unicode characters via various
> regex constructs and flags.
In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as a sequence of
Unicode code points, right?
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
==> "\uD840\uDC00?\uD840\uDC02"         // only 1 replacement for \uD840\uDC01
"12\uD840\uDC02".replaceAll("[^0-9]", "?")
==> "12??"          // 2 replacements for \uD840\uDC02
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
==> "\uD840\uDC00\uD840\uDC01?"          // only 1 replacement for \uD840\uDC02
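
Cases like these can be checked by hand with a small Java sketch. The class name SurrogateCheck and its helper methods are made up for illustration; only String.length(), String.codePointCount() and the java.util.regex classes come from the JDK. The sketch makes no claim about what the spec should say, it only reports what a given JDK does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration, not part of the JDK.
final class SurrogateCheck {

    // Number of 16-bit char units in s.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in s (a valid surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of matches that Matcher.find() reports for regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Renders every char unit as four hex digits, so unpaired surrogates stay visible.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

For instance, SurrogateCheck.countMatches("[^0-9]", "12\uD840\uDC02") shows whether the matcher reports one or two matches on the JDK at hand, and hexUnits() applied to a replaceAll() result shows whether a surrogate pair was split.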


> An empty String pattern is really a corner case here; it does
> not say anything about "character".
So it should be specified in the Javadoc, and I agree with Dawid that it should be implemented as in Python.
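
For comparison, here is a sketch of what a code-point-aware result for an empty pattern could look like: the replacement is inserted before every code point and once at the end, never between the two halves of a valid surrogate pair. The class and method names are hypothetical, and this is neither the JDK implementation nor a proposed patch. That it corresponds to the Python behaviour mentioned above is an assumption.

```java
// Hypothetical sketch for illustration, not the JDK implementation.
final class EmptyPatternSketch {

    // Inserts replacement at every code point boundary of input.
    // The replacement is taken literally (no group references are expanded).
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            // codePointAt combines a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as is.
            int cp = input.codePointAt(i);
            sb.append(replacement).appendCodePoint(cp);
            // Advance by 2 char units for a supplementary code point, else by 1.
            i += Character.charCount(cp);
        }
        // One more insertion at the end of the input.
        return sb.append(replacement).toString();
    }
}
```

Stepping by Character.charCount() rather than by one char unit is what keeps surrogate pairs intact here.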

-Ulf



