Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen xueming.shen at oracle.com
Fri Jun 8 18:01:43 UTC 2012


If we can re-design everything (not the lib, but the language) allover 
again from the
very beginning , and if we all put an i18n engr's hat on:-)  it might be 
nature to have
a 32-bit char instead of the 16-bit (OK, it's totally a difference story 
if from
performance point of view), and then we would not have this kind of 
trouble at all.

The reality is that String is still UTF16 and char based, Unicode 
supplementary
code point support is an add-on, so the default index based iteration is 
char based.
You will have to use those codePointXYZ() to do code point based access.

If you consider the replaceAll() is to replace all targeted sub-string, 
and this sub-string
happens to be an empty string in this case, as long as 
String.substring(i, i+1) works
for the index "i" which is in between of two paired surrogates, it is 
hard to argue that
you should not replace that "sub-string" with the replacement.

Another "excuse" to back my "not a bug" argument is that as the API doc 
suggests
the String.replaceAll() is simply a wrapper of j.u.regex.Matcher, both 
Pattern and
Matcher class all explicitly documents that they work on "CharSequence", 
which is
pure char based and "does not" know code point (for now, with the new 
defender
method, it might be possible to evolute the CharSequence, if desirable, 
but that
would be a totally different discussion). When I say "work on 
CharSequence", I'm
not saying the Pattern does/should not work on code point, actually the
implementation "transfers" all the String based pattern into a code 
point based
pattern before interpreting it,  but all code/point (mainly for 
supplementary) support
works on the assumption that you have a "constructor". An empty string 
pattern
really does not have any indication on "character".

No, I don't think we have any concrete reference as to how an empty 
string regex
should be interpreted. By its nature, it should be the match of every 
possible
zero-width position in the target CharSequence, which is char based upon 
the API.
It might be desired to document this somewhere in the Pattern class.

-Sherman



On 06/08/2012 04:14 AM, Dawid Weiss wrote:
> I guess a lot depends on the point of view. From historical point of
> view (where a char[] and a String are basically unsigned values) that
> pattern should simply process every value (index) and work like you
> say. But from a practical point of view I think it is a bug -- it
> corrupts the string, transforming legal unicode into invalid values.
>
> I checked with Python (3) and the behavior there is the expected one
> (it work at the unicode codepoint level rather than surrogate level).
>
> Where is the behavior of "" that you mention defined? I admit I
> couldn't find any reference to this in the documentation:
>
>> Using an empty String "" as a regex for the replaceAll() takes the
>> advantage of the special meaning of "", in which it is interpreted as
>> it can match any possible zero-width position of the target String
> I'm not saying you're wrong (and that pattern is definitely not common
> so it's probably academic discussion) but I'd like some concrete
> reference as to how an empty pattern should behave. To me consistency
> with the rest of the Pattern specification would be that it operates
> at "zero width position between unicode characters" not between any
> char[] value, even an incorrect one or in the middle of a surrogate.
>
> Dawid
>
> On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<xueming.shen at oracle.com>  wrote:
>> Personally I don't think it is a bug. A j.l.String represents a sequence of
>> UTF-16 chars. While
>> a pair of surrogates represents a supplementary character, a single
>> surrogate itself is still
>> a "legal" independent entity inside a String object and length of a String
>> is still defined as
>> the total number of char unit and an index value between a high surrogate
>> and a low
>> surrogate is still a legal index value that can be used to access the char
>> at that particular
>> position. Using an empty String "" as a regex for the replaceAll() takes the
>> advantage of the
>> special meaning of "", in which it is interpreted as it can match any
>> possible zero-width
>> position of the target String, it does  not imply anything regarding
>> "character"  or
>> "characters" around it, so I would not interpret it as a zero-with character
>> boundary,
>> therefor a "position" in between a pair surrogates is still a good "found"
>> for replacing.
>>
>> -Sherman
>>
>>
>> On 6/7/2012 1:07 PM, Dawid Weiss wrote:
>>> Hi, I'm a committer to the Apache Lucene project. We have randomized
>>> tests and one seed hit the following (simplified) scenario:
>>>
>>>     String s1 = "AB\uD840\uDC00C";
>>>     String s2 = s1.replaceAll("", "X");
>>>
>>> the input contains an extended unicode character (any surrogate pair
>>> will do). The pattern is an empty string (in fact, it was randomized
>>> as "]|" but it's the same problem so I omit the details). The problem
>>> is that after applying this pattern, replaceAll inserts X in between
>>> the surrogate pair characters and this results in invalid UTF-16:
>>>
>>> AB��C
>>> XAXBX?X?XCX
>>>
>>> I believe this is a bug in the regexp implementation (sorry, don't
>>> have a patch for it) but I'd like to confirm it's not something known.
>>> Pointers appreciated.
>>>
>>> Dawid




More information about the core-libs-dev mailing list