Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Xueming Shen
xueming.shen at oracle.com
Fri Jun 8 22:35:27 UTC 2012
On 06/08/2012 12:07 PM, Ulf Zibis wrote:
> Thanks Sherman!
>
> Am 08.06.2012 20:36, schrieb Xueming Shen:
>> On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>>>
>>>
>>> Is there any spec whether the Java Regex API has a general contract
>>> with 16-bit chars or Unicode codepoints?
>>
>> The regex spec says Pattern and Matcher work ON a character sequence,
>> with reference to the CharSequence interface, but the pattern itself
>> does support Unicode characters via various regex constructs and flags.
> In other words, if there is a surrogate pair in the pattern, the
> CharSequence is seen as sequence of Unicode code points, right?
Not exactly what I meant.
The engine currently works as follows.
If the pattern is to match a "character", or a "slice of characters" that
has a supplementary character embedded, the engine will try to interpret
the target char sequence as a sequence of Unicode code points.
If the pattern is not to match a "character", or is to match a slice of
characters that does not have a supplementary character embedded, the
engine will try to interpret the char sequence as a sequence of char
units.
For example:
    Matcher m = Pattern.compile("[^a]")
                       .matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
    while (m.find()) {
        System.out.printf("<%d, %d>%n", m.start(), m.end());
    }
The output is
<0, 2>
<2, 4>
<4, 6>
The target string is iterated code point by code point, but
    Matcher m = Pattern.compile("(?=[^a])")
                       .matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02");
    while (m.find()) {
        System.out.printf("<%d, %d>%n", m.start(), m.end());
    }
The output is
<0, 0>
<1, 1>
<2, 2>
<3, 3>
<4, 4>
<5, 5>
And the empty string pattern belongs to the latter case.
No, I'm not saying that because the implementation works this way, this
is therefore not a bug :-)
Actually I'm starting to agree that we might not want to stop in the
middle of a pair of surrogates, even in the non-character case. But it
might have some performance impact somewhere (if you iterate the
CharSequence by code point).
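To make the extra work concrete, here is a minimal sketch of stepping through a CharSequence by code point rather than by char. It is not the engine's code; the class CodePointBoundaries and the helper boundaries() are hypothetical names used only for illustration. Each step has to look at the char under the cursor (and possibly the next one) to decide whether to advance by one or two, which is the cost mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointBoundaries {
    // Hypothetical helper: indices at which an empty match could be reported
    // if the search never stopped between the two halves of a surrogate pair.
    static List<Integer> boundaries(CharSequence seq) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < seq.length()) {
            result.add(i);
            // codePointAt inspects one or two chars; charCount tells us how far to step.
            i += Character.charCount(Character.codePointAt(seq, i));
        }
        result.add(seq.length()); // the end of input is a boundary as well
        return result;
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00\uD840\uDC01\uD840\uDC02";
        System.out.println(boundaries(s)); // prints [0, 2, 4, 6]
    }
}
```

Compared with the char-by-char positions 0 through 5 in the lookahead output above, the odd indices (the middle of each pair) are skipped.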
-Sherman
> "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?")
>     ==> "\uD840\uDC00?\uD840\uDC02"   // only 1 replacement for \uD840\uDC01
> "12\uD840\uDC02".replaceAll("[^0-9]", "?")
>     ==> "12??"                        // 2 replacements for \uD840\uDC02
> "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^\uD840\uDC00-\uD840\uDC01]", "?")
>     ==> "\uD840\uDC00\uD840\uDC01?"   // only 1 replacement for \uD840\uDC02
>
>
>> An empty String pattern is really a corner case here; it does
>> not say anything about "character".
> So it should be specified in the javadoc, and I'm with Dawid to
> implement it as in Python.
>
> -Ulf
>
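For reference, the behaviour being asked for here (a replacement for the empty pattern that lands only on code point boundaries, which as I understand it is what Python 3 does because its strings are sequences of code points) could be sketched as below. This is an illustration under that assumption, not a proposed patch to java.util.regex; the class EmptyPatternSketch and the method insertAtCodePointBoundaries are hypothetical names.

```java
class EmptyPatternSketch {
    // Hypothetical helper, not part of java.util.regex: inserts `replacement`
    // before every code point and once at the end, never inside a surrogate pair.
    static String insertAtCodePointBoundaries(String input, String replacement) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            sb.append(replacement).appendCodePoint(cp);
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return sb.append(replacement).toString();
    }

    public static void main(String[] args) {
        String out = insertAtCodePointBoundaries("\uD840\uDC00a", "?");
        // The pair \uD840\uDC00 stays intact in the result.
        System.out.println(out.equals("?\uD840\uDC00?a?")); // prints true
    }
}
```

An unpaired surrogate in the input is treated as a code point of its own and is copied through unchanged.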