RFR: JDK-8027645: Pattern.split() with positive lookahead

Xueming Shen xueming.shen at oracle.com
Fri Nov 8 21:56:35 UTC 2013


On 11/08/2013 01:19 AM, Paul Sandoz wrote:
> Hi Sherman.
>
> When you say:
>
> +     * of the stream. A zero-width match at the beginning however never produces
> +     * such empty leading substring.
>
> Is it possible to have a starting sequence of one or more zero-width matches?

matcher.find() always advances its "next find start position" by at least one,
as shown in the Matcher.find() impl ("first" starts from -1), so matcher.find()
keeps moving forward and never produces more than one zero-length substring.

Matcher:
     public boolean find() {
         int nextSearchIndex = last;
         if (nextSearchIndex == first)
             nextSearchIndex++;
         ...
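
To illustrate, here is a minimal sketch (not part of the webrev; the class and
helper names are made up for this example) that records where successive find()
calls match for a zero-width pattern. It uses \p{Lu} as a stand-in for the
uppercase property used in the Perl script quoted below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK or the webrev.
class FindAdvanceDemo {
    // Collects the start index of every match found by repeated find() calls.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Zero-width lookahead: each find() resumes past the previous empty
        // match, so index 0 is reported once and the search moves on.
        System.out.println(matchStarts("(?=\\p{Lu})", "AbcEfgHij"));
        // The empty pattern matches at every position, but never twice at
        // the same one.
        System.out.println(matchStarts("", "ab"));
    }
}
```

The start indices come out strictly increasing, which is why at most one
zero-length match can occur at index 0.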

The webrev has been updated to use your optimized version in splitAsStream().

http://cr.openjdk.java.net/~sherman/8027645/webrev/

Thanks!
-Sherman

> It would be useful to add a comment on skipping for zero-width match.
>
> IIUC you could simplify the code in splitAsStream:
>
>                  while (matcher.find()) {
>                      nextElement = input.subSequence(current, matcher.start()).toString();
>                      current = matcher.end();
>                      if (!nextElement.isEmpty()) {
>                          return true;
>                      } else if (current > 0) { // Ignore for zero-width match
>                          emptyElementCount++;
>                      }
>                  }
>
> That is less efficient for zero-width matching, but how common is that?
>
> Paul.
>
> On Nov 7, 2013, at 7:59 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
>
>> Hi,
>>
>> As suggested in the bug report [1], the spec of j.u.Pattern.split()
>> does not clearly specify the expected behavior when a zero-width match
>> is found at the beginning of the input string (such as whether or not an
>> empty leading string should be included in the resulting array). Worse,
>> the implementation is not consistent either (for different input cases,
>> such as "Abc".split(...) vs "AbcEfg".split(...)).
>>
>> The spec also is not clear regarding what the expected behavior should be
>> if the size of the input string is 0 [2].
>>
>> As a reference, Perl's split() function has a clear, explicit spec for
>> the above scenarios [3].
>>
>> So the proposed change here is to update the spec & impl of Pattern.split() to
>> have a clear specification for the above scenarios, as Perl does:
>>
>> (1) A zero-length input sequence always results in a zero-length array
>>     (instead of returning a String[] that contains only an empty string)
>> (2) An empty leading substring is included at the beginning of the resulting
>>     array, when there is a positive-width match at the beginning of the input
>>     sequence. A zero-width match at the beginning however never produces such
>>     empty leading substring.
>>
>> webrev:
>> http://cr.openjdk.java.net/~sherman/8027645/webrev/
>>
>> Thanks!
>> -Sherman
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8027645
>> [2] https://bugs.openjdk.java.net/browse/JDK-6559590
>> [3] http://perldoc.perl.org/functions/split.html
>>
>> btw: the following Perl script was used to verify the Perl behavior
>> ------------------
>> $str = "AbcEfgHij";
>> @substr = split(/(?=\p{Uppercase})/, $str);
>> #$str = "abc efg  hij";
>> #@substr = split(/ /, $str);
>> print "split[sz=", scalar @substr, "]=[", join(",", @substr), "]\n";
>> ------------------
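
A rough Java counterpart to the Perl script above, as a minimal sketch rather
than anything taken from the webrev. It exercises only rule (2) and assumes a
JDK whose Pattern.split() already follows that rule; the class and helper
names are hypothetical, and \p{Lu} stands in for Perl's \p{Uppercase}.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK or the webrev.
class SplitLeadingDemo {
    // Thin wrapper so both cases go through Pattern.split(CharSequence).
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    static void show(String[] parts) {
        System.out.println("split[sz=" + parts.length + "]=" + Arrays.toString(parts));
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: under rule (2) no empty leading
        // substring is produced.
        show(split("(?=\\p{Lu})", "AbcEfgHij"));
        // Positive-width match at index 0: under rule (2) an empty leading
        // substring is kept.
        show(split(" ", " abc efg"));
    }
}
```

Under rule (2) the first call should yield three elements (Abc, Efg, Hij) and
the second should yield an empty string followed by abc and efg.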



