Empty regexp replaceall and surrogate pairs results in corrupted utf16.

Xueming Shen xueming.shen at oracle.com
Fri Jun 8 18:36:57 UTC 2012


On 06/08/2012 05:16 AM, Ulf Zibis wrote:
>
>
> Is there any spec on whether the Java Regex API has a general contract 
> with 16-bit chars or Unicode code points?

The regex spec says Pattern and Matcher work on a character sequence,
with reference to the CharSequence interface, but the pattern itself
does support Unicode characters via various regex constructs and flags.
An empty String pattern is really a corner case here: it does not say
anything about "character", and the current implementation interprets
it as matching at each and every stop when you iterate through the
target CharSequence. That might not be desirable for some use
scenarios, but it is not unreasonable.
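
To make "each and every stop" concrete, here is a minimal sketch that
lists where the empty pattern matches. The class and method names
(EmptyPatternStops, matchStarts) are made up for illustration and are
not part of any JDK API. For a string that contains a surrogate pair,
the printed indices let you check on your own JDK whether a stop falls
between the two halves of the pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyPatternStops {
    // Hypothetical helper: collects the start index of every match
    // of the empty pattern in the given input.
    static List<Integer> matchStarts(CharSequence input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // BMP-only input: one stop per char index, plus the end position.
        System.out.println(matchStarts("Peter"));

        // One supplementary character (two chars) between 'a' and 'b'.
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points");
        // Inspect whether index 2 (inside the surrogate pair) is listed.
        System.out.println(matchStarts(s));
    }
}
```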

>
> Additionally, I'd like to discuss: "any possible zero-width position of 
> the target String".
> If the String length is l, it's arguable that position l is not a 
> valid position in the String.
>
If you consider the "boundary matcher" regex constructs, it might be
reasonable to treat this "invalid position" as a valid one when using
regex. I think most other regex engines do the same thing, for example
Perl.

$mystring="Peter";
$mystring =~ s// /g;
printf "[%s]\n", $mystring;
[ P e t e r ]
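
For comparison, a rough Java counterpart of the Perl snippet, plus a
check that the boundary matcher "$" reports a match at index
length(). This is only a sketch; EndPositionDemo, spaceOut and
endAnchorStart are illustrative names, not existing APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndPositionDemo {
    // Same operation as the Perl s// /g above.
    static String spaceOut(String s) {
        return s.replaceAll("", " ");
    }

    // Index at which "$" first matches, or -1 if it never does.
    static int endAnchorStart(String s) {
        Matcher m = Pattern.compile("$").matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println("[" + spaceOut("Peter") + "]"); // [ P e t e r ]
        System.out.println(endAnchorStart("Peter"));       // 5, i.e. length()
    }
}
```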

But I have to say you might have a point here:-)

-Sherman

> From the use case point of view, I think "P e t e r" as result of 
> "Peter".replaceAll("", " ") is the most useful.
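
If the "P e t e r" form is what a caller wants, one possible way to get
it without going through an empty regex is to join the string's code
points directly. This is a sketch assuming Java 8 or later; the names
CodePointJoin and joinCodePoints are hypothetical. Because it walks
code points rather than chars, a surrogate pair is never split.

```java
import java.util.stream.Collectors;

public class CodePointJoin {
    // Hypothetical helper: puts sep between code points only,
    // not before the first one or after the last one.
    static String joinCodePoints(String s, String sep) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.joining(sep));
    }

    public static void main(String[] args) {
        System.out.println(joinCodePoints("Peter", " ")); // P e t e r

        // The supplementary character stays a single unit.
        String joined = joinCodePoints("a\uD83D\uDE00b", " ");
        System.out.println(joined.codePointCount(0, joined.length())); // 5
    }
}
```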





More information about the core-libs-dev mailing list