RFR: JDK-8027645: Pattern.split() with positive lookahead
Xueming Shen
xueming.shen at oracle.com
Fri Nov 8 21:56:35 UTC 2013
On 11/08/2013 01:19 AM, Paul Sandoz wrote:
> Hi Sherman.
>
> When you say:
>
> + * of the stream. A zero-width match at the beginning however never produces
> + * such empty leading substring.
>
> Is it possible to have a starting sequence of one or more zero-width matches?
matcher.find() always advances its "next find start position" by at least one
after a zero-width match, as shown in the Matcher.find() impl below ("first"
starts from -1), so matcher.find() keeps moving forward and never produces more
than one zero-length substring at the start.
Matcher:
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;
        ...
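
To make the point above concrete, here is a minimal sketch (the class and
method names are made up for illustration; they are not part of the webrev)
that records where find() reports each match. With a zero-width pattern the
reported start offsets strictly increase, so position 0 is matched at most once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroWidthFindDemo {
    // Hypothetical helper: collects the start offset of every match find() reports.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Zero-width lookahead before each uppercase letter.
        System.out.println(matchStarts("(?=\\p{Lu})", "AbcEfgHij")); // [0, 3, 6]
        // The empty pattern matches at every position, once each.
        System.out.println(matchStarts("", "ab"));                   // [0, 1, 2]
    }
}
```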
The webrev has been updated to use your optimized version in splitAsStream().
http://cr.openjdk.java.net/~sherman/8027645/webrev/
Thanks!
-Sherman
> It would be useful to add a comment on skipping for zero-width match.
>
> IIUC you could simplify the code in splitAsStream:
>
> while (matcher.find()) {
>     nextElement = input.subSequence(current, matcher.start()).toString();
>     current = matcher.end();
>     if (!nextElement.isEmpty()) {
>         return true;
>     } else if (current > 0) { // Ignore for zero-width match
>         emptyElementCount++;
>     }
> }
>
> That is less efficient for zero-width matching, but how common is that?
>
> Paul.
>
> On Nov 7, 2013, at 7:59 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
>
>> Hi,
>>
>> As suggested in the bug report [1], the spec of j.u.Pattern.split()
>> does not clearly specify the expected behavior when a zero-width match
>> is found at the beginning of the input string (such as whether or not an
>> empty leading string should be included in the resulting array). Worse,
>> the implementation is not consistent either (it differs across input
>> cases, such as "Abc".split(...) vs "AbcEfg".split(...)).
>>
>> The spec is also unclear about the expected behavior when the length
>> of the input string is 0 [2].
>>
>> As a reference, Perl's split() function has a clear, explicit spec for
>> the above scenarios [3].
>>
>> So the proposed change here is to update the spec & impl of Pattern.split() to
>> specify the above scenarios clearly, as Perl does:
>>
>> (1) A zero-length input sequence always results in a zero-length array
>> (instead of a String[] that contains only an empty string).
>> (2) An empty leading substring is included at the beginning of the resulting
>> array when there is a positive-width match at the beginning of the input
>> sequence. A zero-width match at the beginning, however, never produces such
>> an empty leading substring.
>>
>> webrev:
>> http://cr.openjdk.java.net/~sherman/8027645/webrev/
>>
>> Thanks!
>> -Sherman
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8027645
>> [2] https://bugs.openjdk.java.net/browse/JDK-6559590
>> [3] http://perldoc.perl.org/functions/split.html
>>
>> btw: the following Perl script was used to verify the Perl behavior
>> ------------------
>> $str = "AbcEfgHij";
>> @substr = split(/(?=\p{Uppercase})/, $str);
>> #$str = "abc efg hij";
>> #@substr = split(/ /, $str);
>> print "split[sz=", scalar @substr, "]=[", join(",", @substr), "]\n";
>> ------------------
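
For comparison, a rough Java counterpart of the Perl check above. This is only
a sketch: the class and method names are hypothetical, \p{Lu} stands in for
Perl's \p{Uppercase}, and the results noted in the comments assume a JDK that
includes this change (JDK 8 or later). The third call shows the
positive-width leading match from (2), which still yields an empty leading
substring.

```java
import java.util.Arrays;

public class SplitCheck {
    // Hypothetical helper: formats a split result like the Perl script's print line.
    static String describe(String input, String regex) {
        String[] parts = input.split(regex);
        return "split[sz=" + parts.length + "]=" + Arrays.toString(parts);
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: no empty leading substring.
        System.out.println(describe("AbcEfgHij", "(?=\\p{Lu})")); // split[sz=3]=[Abc, Efg, Hij]
        // Plain separator, no match at the beginning.
        System.out.println(describe("abc efg hij", " "));         // split[sz=3]=[abc, efg, hij]
        // Positive-width match at index 0: empty leading substring is kept.
        System.out.println(describe(",a,b", ","));                // split[sz=3]=[, a, b]
    }
}
```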