Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Fri Jun 8 22:35:27 UTC 2012

On 06/08/2012 12:07 PM, Ulf Zibis wrote:
> Thanks Sherman!
>
> Am 08.06.2012 20:36, schrieb Xueming Shen:
>> On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>>>
>>>
>>> Is there any spec on whether the Java Regex API has a general contract 
>>> with 16-bit chars or Unicode code points?
>>
>> The regex spec says Pattern and Matcher work on a character sequence,
>> with reference to the CharSequence interface, but the pattern itself
>> does support Unicode characters via various regex constructs and flags.
> In other words, if there is a surrogate pair in the pattern, the 
> CharSequence is seen as a sequence of Unicode code points, right?

Not exactly what I meant.
The engine currently works as follows.

If the pattern is to match a "character" or a "slice of characters" that
has a supplementary character embedded, the engine will try to interpret
the target char sequence as a sequence of Unicode code points.

If the pattern is not to match a "character", or is to match a slice of
characters that does not have a supplementary character embedded, the
engine will try to interpret the char sequence as a sequence of char units.

For example:

Matcher m = Pattern.compile("[^a]").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
while (m.find()) {
     System.out.printf("<%d, %d>%n", m.start(), m.end());
}

The output is

<0, 2>
<2, 4>
<4, 6>

The target string is iterated code point by code point, but

Matcher m = Pattern.compile("(?=[^a])").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
while (m.find()) {
     System.out.printf("<%d, %d>%n", m.start(), m.end());
}

The output is

<0, 0>
<1, 1>
<2, 2>
<3, 3>
<4, 4>
<5, 5>

And the empty string pattern belongs to the latter case.
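
Until the engine itself changes, a caller could filter out the matches that land inside a pair. The following is only a rough sketch of that idea: the class SurrogateSafeReplace and its methods splitsPair and replaceAll are hypothetical names, not part of java.util.regex, and the replacement is treated as a literal string (no $1 group references).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical caller-side helper, not part of java.util.regex.
class SurrogateSafeReplace {

    // True if index lies between a high surrogate and the low surrogate after it.
    static boolean splitsPair(CharSequence s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    // Replaces each match with a literal string, ignoring any match whose
    // start or end would fall in the middle of a surrogate pair.
    static String replaceAll(String input, String regex, String literal) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this index is already in 'out'
        while (m.find()) {
            if (splitsPair(input, m.start()) || splitsPair(input, m.end())) {
                continue; // skip: accepting this match would corrupt the UTF-16
            }
            out.append(input, copied, m.start()).append(literal);
            copied = m.end();
        }
        return out.append(input, copied, input.length()).toString();
    }

    public static void main(String[] args) {
        String result = replaceAll("\uD840\uDC00\uD840\uDC01", "", "?");
        // Print the UTF-16 code units so that pairs are easy to inspect.
        for (char c : result.toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println();
    }
}
```

With the empty pattern, the "?" should then only appear before, between, and after the two pairs, never between a high and a low surrogate.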

No, I'm not saying that because the implementation works this way,
this is therefore not a bug :-)
Actually I'm starting to agree that we might not want to stop in the
middle of a pair of surrogates, even in the non-character case. But it
might have some performance impact somewhere (if you iterate the
CharSequence by code point).
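
To make "iterate by code point" concrete, here is a minimal sketch using only the public Character API. The class name CodePointWalk and the method codePointStarts are made up for illustration; this is not how the regex engine is written. The extra cost is that each step has to decode the code point and ask for its width instead of just doing i++.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: walks a CharSequence one code point at a time.
class CodePointWalk {

    // Returns the char index at which each code point of s starts.
    static List<Integer> codePointStarts(CharSequence s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i); // decodes a surrogate pair if present
            starts.add(i);
            i += Character.charCount(cp);         // advances by 1 or 2 char units
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three supplementary characters, six char units.
        System.out.println(codePointStarts("\uD840\uDC00\uD840\uDC01\uD840\uDC02"));
        // prints [0, 2, 4]
    }
}
```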

-Sherman

> "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
> ==> "\uD840\uDC00?\uD840\uDC02"         // only 1 replacement for \uD840\uDC01
> "12\uD840\uDC02".replaceAll("[^0-9]", "?")
> ==> "12??"          // 2 replacements for \uD840\uDC02
> "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
> ==> "\uD840\uDC00\uD840\uDC01?"          // only 1 replacement for \uD840\uDC02
>
>
>> An empty String pattern is really a corner case here, it does
>> not say anything about "character"
> So it should be specified in the javadoc, and I agree with Dawid that
> it should be implemented as in Python.
>
> -Ulf
>