Empty regexp replaceall and surrogate pairs results in corrupted utf16.
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario: String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X"); the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16: AB𠀀C XAXBX?X?XCX I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated. Dawid
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and length of a String is still defined as the total number of char unit and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String, it does not imply anything regarding "character" or "characters" around it, so I would not interpret it as a zero-with character boundary, therefor a "position" in between a pair surrogates is still a good "found" for replacing. -Sherman On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X");
the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:
AB𠀀C XAXBX?X?XCX
I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.
Dawid
I guess a lot depends on the point of view. From historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, transforming legal unicode into invalid values. I checked with Python (3) and the behavior there is the expected one (it work at the unicode codepoint level rather than surrogate level). Where is the behavior of "" that you mention defined? I admit I couldn't find any reference to this in the documentation:
Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String
I'm not saying you're wrong (and that pattern is definitely not common so it's probably academic discussion) but I'd like some concrete reference as to how an empty pattern should behave. To me consistency with the rest of the Pattern specification would be that it operates at "zero width position between unicode characters" not between any char[] value, even an incorrect one or in the middle of a surrogate. Dawid On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen <xueming.shen@oracle.com> wrote:
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and length of a String is still defined as the total number of char unit and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String, it does not imply anything regarding "character" or "characters" around it, so I would not interpret it as a zero-with character boundary, therefor a "position" in between a pair surrogates is still a good "found" for replacing.
-Sherman
On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X");
the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:
AB𠀀C XAXBX?X?XCX
I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.
Dawid
I tend to agree Dawid. Especially the comparison with Python behaviour is demonstrative. Is there any spec weather the Java Regex API has a general contract with 16-bit chars or Unicode codepoints? Thinking about the search pattern e.g. "[AB\uD840\uDC00C]"; what does it actually search for, the isolated occurence of each char, or the occurence of the codepoint "\uD840\uDC00" ? Last, but not least, we should think about, which would be the common use case, an which would be more easy to work around. (I think, the current view on isolated chars is more easy to work around: StringBuilder sb = new StringBuilder(s1.length + 1).append('X'); for (char c : s1.getChars()) sb.append(c).append('X'); String s2 = sb.toString(); ) Additionally I like to discuss: "any possible zero-width position of the target String" If String length is l, maybe it's arguable, that position l is no valid position in the String. From the use case point of view, I think "P e t e r" as result of "Peter".replaceAll("", " ") is the most useful. -Ulf Am 08.06.2012 13:14, schrieb Dawid Weiss:
I guess a lot depends on the point of view. From historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, transforming legal unicode into invalid values.
I checked with Python (3) and the behavior there is the expected one (it work at the unicode codepoint level rather than surrogate level).
Where is the behavior of "" that you mention defined? I admit I couldn't find any reference to this in the documentation:
Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String I'm not saying you're wrong (and that pattern is definitely not common so it's probably academic discussion) but I'd like some concrete reference as to how an empty pattern should behave. To me consistency with the rest of the Pattern specification would be that it operates at "zero width position between unicode characters" not between any char[] value, even an incorrect one or in the middle of a surrogate.
Dawid
On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<xueming.shen@oracle.com> wrote:
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and length of a String is still defined as the total number of char unit and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String, it does not imply anything regarding "character" or "characters" around it, so I would not interpret it as a zero-with character boundary, therefor a "position" in between a pair surrogates is still a good "found" for replacing.
-Sherman
On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X");
the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:
AB𠀀C XAXBX?X?XCX
I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.
Dawid
Oops, correction: StringBuilder sb = new StringBuilder(s1.length * 2 + 1); for (char c : s1.getChars()) sb.append('X').append(c); String s2 = sb.append('X').toString(); Am 08.06.2012 14:16, schrieb Ulf Zibis:
I tend to agree Dawid. Especially the comparison with Python behaviour is demonstrative.
Is there any spec weather the Java Regex API has a general contract with 16-bit chars or Unicode codepoints?
Thinking about the search pattern e.g. "[AB\uD840\uDC00C]"; what does it actually search for, the isolated occurence of each char, or the occurence of the codepoint "\uD840\uDC00" ?
Last, but not least, we should think about, which would be the common use case, an which would be more easy to work around. (I think, the current view on isolated chars is more easy to work around: StringBuilder sb = new StringBuilder(s1.length + 1).append('X'); for (char c : s1.getChars()) sb.append(c).append('X'); String s2 = sb.toString(); )
Additionally I like to discuss: "any possible zero-width position of the target String" If String length is l, maybe it's arguable, that position l is no valid position in the String.
From the use case point of view, I think "P e t e r" as result of "Peter".replaceAll("", " ") is the most useful.
-Ulf
Am 08.06.2012 13:14, schrieb Dawid Weiss:
I guess a lot depends on the point of view. From historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, transforming legal unicode into invalid values.
I checked with Python (3) and the behavior there is the expected one (it work at the unicode codepoint level rather than surrogate level).
Where is the behavior of "" that you mention defined? I admit I couldn't find any reference to this in the documentation:
Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String I'm not saying you're wrong (and that pattern is definitely not common so it's probably academic discussion) but I'd like some concrete reference as to how an empty pattern should behave. To me consistency with the rest of the Pattern specification would be that it operates at "zero width position between unicode characters" not between any char[] value, even an incorrect one or in the middle of a surrogate.
Dawid
On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<xueming.shen@oracle.com> wrote:
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and length of a String is still defined as the total number of char unit and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String, it does not imply anything regarding "character" or "characters" around it, so I would not interpret it as a zero-with character boundary, therefor a "position" in between a pair surrogates is still a good "found" for replacing.
-Sherman
On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X");
the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:
AB𠀀C XAXBX?X?XCX
I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.
Dawid
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec weather the Java Regex API has a general contract with 16-bit chars or Unicode codepoints?
The regex spec says Pattern and Matcher work ON character sequence with the reference to CharSequence interface, but the pattern itself does support Unicode character via various regex constructors and flags. An empty String pattern is really a corner case here, it does not say anything about "character", the current implementation interprets it as each, every stop when you iterate through the target CharSequence. It might not be desirable for some use scenario, but not not-reasonable.
Additionally I like to discuss: "any possible zero-width position of the target String" If String length is l, maybe it's arguable, that position l is no valid position in the String.
If you considering those "boundary matcher" regex constructs, it might be reasonable to consider this "invalid position" as a valid when using regex. I think must of other regex engines do the same thing, for example, the perl. $mystring="Peter"; $mystring =~ s// /g; printf "[%s]\n", $mystring; [ P e t e r ] But I have to say you might have a point here:-) -Sherman
From the use case point of view, I think "P e t e r" as result of "Peter".replaceAll("", " ") is the most useful.
Thanks Sherman! Am 08.06.2012 20:36, schrieb Xueming Shen:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec weather the Java Regex API has a general contract with 16-bit chars or Unicode codepoints?
The regex spec says Pattern and Matcher work ON character sequence with the reference to CharSequence interface, but the pattern itself does support Unicode character via various regex constructors and flags. In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as sequence of Unicode code points, right? "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?") ==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01 "12\uD840\uDC02".replaceAll("[^0-9]", "?") ==> "12??" // 2 replacements for \uD840\uDC02 "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^uD840\uDC00-\uD840\uDC01]", "?") ==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
An empty String pattern is really a corner case here, it does not say anything about "character" So it should be specified in the javadoc, and I'm with Dawid to implement it as in Python.
-Ulf
On 06/08/2012 12:07 PM, Ulf Zibis wrote:
Thanks Sherman!
Am 08.06.2012 20:36, schrieb Xueming Shen:
On 06/08/2012 05:16 AM, Ulf Zibis wrote:
Is there any spec weather the Java Regex API has a general contract with 16-bit chars or Unicode codepoints?
The regex spec says Pattern and Matcher work ON character sequence with the reference to CharSequence interface, but the pattern itself does support Unicode character via various regex constructors and flags. In other words, if there is a surrogate pair in the pattern, the CharSequence is seen as sequence of Unicode code points, right?
No exactly what I meant. The engine currently works as if the pattern is to match a "character" or "slice of characters" that has supplementary character embedded, engine will try to interpret the target char sequence as a sequence of Unicode code point. If the pattern is not to match a "character" or match a slice of characters that does not have supplementary character embedded, the engine will try to interpret the char sequence as a sequence of char unit. For example Matcher m = Pattern.compile("[^a]").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02"); while (m.find()) { System.out.printf("<%d, %d>%n", m.start(), m.end()); } The output is <0, 2> <2, 4> <4, 6> The target string is iterated code point by code point, but Matcher m = Pattern.compile("(?=[^a])").matcher("\uD840\uDC00\uD840\uDC01\uD840\uDC02"); while (m.find()) { System.out.printf("<%d, %d>%n", m.start(), m.end()); } The output is <0, 0> <1, 1> <2, 2> <3, 3> <4, 4> <5, 5> And the empty string pattern belongs to the latter case. No, I'm not saying because the implementation works this way, therefor this is not a bug:-) Actually I'm starting to agree that we might not want to stop in the middle of a pair of surrogates, even in non-character case. But it might have some performance impact somewhere (if you iterate the CharSequence by code point). -Sherman
"\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[AB\uD840\uDC01C]", "?") ==> "\uD840\uDC00?\uD840\uDC02" // only 1 replacement for \uD840\uDC01 "12\uD840\uDC02".replaceAll("[^0-9]", "?") ==> "12??" // 2 replacements for \uD840\uDC02 "\uD840\uDC00\uD840\uDC01\uD840\uDC02".replaceAll("[^uD840\uDC00-\uD840\uDC01]", "?") ==> "\uD840\uDC00\uD840\uDC01?" // only 1 replacement for \uD840\uDC02
An empty String pattern is really a corner case here, it does not say anything about "character" So it should be specified in the javadoc, and I'm with Dawid to implement it as in Python.
-Ulf
If we can re-design everything (not the lib, but the language) allover again from the very beginning , and if we all put an i18n engr's hat on:-) it might be nature to have a 32-bit char instead of the 16-bit (OK, it's totally a difference story if from performance point of view), and then we would not have this kind of trouble at all. The reality is that String is still UTF16 and char based, Unicode supplementary code point support is an add-on, so the default index based iteration is char based. You will have to use those codePointXYZ() to do code point based access. If you consider the replaceAll() is to replace all targeted sub-string, and this sub-string happens to be an empty string in this case, as long as String.substring(i, i+1) works for the index "i" which is in between of two paired surrogates, it is hard to argue that you should not replace that "sub-string" with the replacement. Another "excuse" to back my "not a bug" argument is that as the API doc suggests the String.replaceAll() is simply a wrapper of j.u.regex.Matcher, both Pattern and Matcher class all explicitly documents that they work on "CharSequence", which is pure char based and "does not" know code point (for now, with the new defender method, it might be possible to evolute the CharSequence, if desirable, but that would be a totally different discussion). When I say "work on CharSequence", I'm not saying the Pattern does/should not work on code point, actually the implementation "transfers" all the String based pattern into a code point based pattern before interpreting it, but all code/point (mainly for supplementary) support works on the assumption that you have a "constructor". An empty string pattern really does not have any indication on "character". No, I don't think we have any concrete reference as to how an empty string regex should be interpreted. By its nature, it should be the match of every possible zero-width position in the target CharSequence, which is char based upon the API. It might be desired to document this somewhere in the Pattern class. -Sherman On 06/08/2012 04:14 AM, Dawid Weiss wrote:
I guess a lot depends on the point of view. From historical point of view (where a char[] and a String are basically unsigned values) that pattern should simply process every value (index) and work like you say. But from a practical point of view I think it is a bug -- it corrupts the string, transforming legal unicode into invalid values.
I checked with Python (3) and the behavior there is the expected one (it work at the unicode codepoint level rather than surrogate level).
Where is the behavior of "" that you mention defined? I admit I couldn't find any reference to this in the documentation:
Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String I'm not saying you're wrong (and that pattern is definitely not common so it's probably academic discussion) but I'd like some concrete reference as to how an empty pattern should behave. To me consistency with the rest of the Pattern specification would be that it operates at "zero width position between unicode characters" not between any char[] value, even an incorrect one or in the middle of a surrogate.
Dawid
On Fri, Jun 8, 2012 at 12:46 AM, Xueming Shen<xueming.shen@oracle.com> wrote:
Personally I don't think it is a bug. A j.l.String represents a sequence of UTF-16 chars. While a pair of surrogates represents a supplementary character, a single surrogate itself is still a "legal" independent entity inside a String object and length of a String is still defined as the total number of char unit and an index value between a high surrogate and a low surrogate is still a legal index value that can be used to access the char at that particular position. Using an empty String "" as a regex for the replaceAll() takes the advantage of the special meaning of "", in which it is interpreted as it can match any possible zero-width position of the target String, it does not imply anything regarding "character" or "characters" around it, so I would not interpret it as a zero-with character boundary, therefor a "position" in between a pair surrogates is still a good "found" for replacing.
-Sherman
On 6/7/2012 1:07 PM, Dawid Weiss wrote:
Hi, I'm a committer to the Apache Lucene project. We have randomized tests and one seed hit the following (simplified) scenario:
String s1 = "AB\uD840\uDC00C"; String s2 = s1.replaceAll("", "X");
the input contains an extended unicode character (any surrogate pair will do). The pattern is an empty string (in fact, it was randomized as "]|" but it's the same problem so I omit the details). The problem is that after applying this pattern, replaceAll inserts X in between the surrogate pair characters and this results in invalid UTF-16:
AB𠀀C XAXBX?X?XCX
I believe this is a bug in the regexp implementation (sorry, don't have a patch for it) but I'd like to confirm it's not something known. Pointers appreciated.
Dawid
participants (3)
-
Dawid Weiss
-
Ulf Zibis
-
Xueming Shen