RFR: JDK-8027645: Pattern.split() with positive lookahead

Paul Sandoz paul.sandoz at oracle.com
Fri Nov 8 09:19:35 UTC 2013


Hi Sherman.

When you say:

+     * of the stream. A zero-width match at the beginning however never produces
+     * such empty leading substring.

Is it possible to have a starting sequence of one or more zero-width matches?

It would be useful to add a comment explaining why a zero-width match at the beginning is skipped.

IIUC you could simplify the code in splitAsStream:

                while (matcher.find()) {
                    nextElement = input.subSequence(current, matcher.start()).toString();
                    current = matcher.end();
                    if (!nextElement.isEmpty()) {
                        return true;
                    } else if (current > 0) { // Ignore for zero-width match
                        emptyElementCount++;
                    }
                }

That is less efficient for zero-width matching, but how common is that?
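
To make the suggestion easier to check, here is a minimal standalone sketch of that loop. It assumes the same bookkeeping as the snippet above (current, emptyElementCount); the class SplitSketch and its split() helper are hypothetical names for illustration, not the code in the webrev. It collects into a List instead of driving an iterator, and drops trailing empty strings the way Pattern.split() does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not JDK code).
class SplitSketch {

    static List<String> split(Pattern pattern, CharSequence input) {
        List<String> out = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        int current = 0;            // end of the previous match
        int emptyElementCount = 0;  // empty pieces seen but not yet emitted

        while (matcher.find()) {
            String next = input.subSequence(current, matcher.start()).toString();
            current = matcher.end();
            if (!next.isEmpty()) {
                // Emit the pending empty pieces first, then the non-empty one.
                for (; emptyElementCount > 0; emptyElementCount--) {
                    out.add("");
                }
                out.add(next);
            } else if (current > 0) {
                // An empty piece after a match that ends past index 0 is kept;
                // a zero-width match at index 0 leaves current == 0 and is ignored.
                emptyElementCount++;
            }
        }

        // Remaining input after the last match; pending empties are only
        // emitted if something non-empty follows them (trailing ones are dropped).
        String tail = input.subSequence(current, input.length()).toString();
        if (!tail.isEmpty()) {
            for (; emptyElementCount > 0; emptyElementCount--) {
                out.add("");
            }
            out.add(tail);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split(Pattern.compile("(?=\\p{Upper})"), "AbcEfgHij"));
        System.out.println(split(Pattern.compile(","), ",a,,b"));
    }
}
```

With the lookahead pattern the zero-width match at index 0 is ignored, while the positive-width match on the leading "," still yields an empty leading element.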

Paul.

On Nov 7, 2013, at 7:59 PM, Xueming Shen <xueming.shen at oracle.com> wrote:

> Hi,
> 
> As suggested in the bug report [1], the spec of j.u.regex.Pattern.split()
> does not clearly specify the expected behavior for scenarios such as
> a zero-width match being found at the beginning of the input string
> (for example, whether or not an empty leading string should be included in
> the resulting array). Worse, the implementation is not consistent either
> (for different input cases, such as "Abc".split(...) vs "AbcEfg".split(...)).
> 
> The spec also is not clear regarding what the expected behavior should be
> if the size of the input string is 0 [2].
> 
> For reference, Perl's split() function has a clear, explicit spec for
> the scenarios above [3].
> 
> So the proposed change here is to update the spec & impl of Pattern.split() to have
> a clear specification for the scenarios above, as Perl does:
> 
> (1) A zero-length input sequence always results in a zero-length array
>    (instead of a String[] that contains only an empty string)
> (2) An empty leading substring is included at the beginning of the resulting
>    array, when there is a positive-width match at the beginning of the input
>    sequence. A zero-width match at the beginning however never produces such
>    empty leading substring.
> 
> webrev:
> http://cr.openjdk.java.net/~sherman/8027645/webrev/
> 
> Thanks!
> -Sherman
> 
> [1] https://bugs.openjdk.java.net/browse/JDK-8027645
> [2] https://bugs.openjdk.java.net/browse/JDK-6559590
> [3] http://perldoc.perl.org/functions/split.html
> 
> btw: the following Perl script is used to verify the Perl behavior
> ------------------
> $str = "AbcEfgHij";
> @substr = split(/(?=\p{Uppercase})/, $str);
> #$str = "abc efg  hij";
> #@substr = split(/ /, $str);
> print "split[sz=", scalar @substr, "]=[", join(",", @substr), "]\n";
> ------------------
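
For comparison, a rough Java counterpart of the Perl script above. This is only a sketch and assumes a JDK that already includes this change (Java 8 or later); the class SplitCheck and its describe() helper are made-up names for illustration. It uses \p{Upper} in place of Perl's \p{Uppercase}, which is enough for the ASCII input here.

```java
// Hypothetical helper class mirroring the Perl print statement, for illustration only.
class SplitCheck {

    // Formats the result as the Perl script does: split[sz=N]=[a,b,c]
    static String describe(String[] parts) {
        return "split[sz=" + parts.length + "]=[" + String.join(",", parts) + "]";
    }

    public static void main(String[] args) {
        // Zero-width match (lookahead) at the beginning: no empty leading substring.
        System.out.println(describe("AbcEfgHij".split("(?=\\p{Upper})")));

        // Positive-width matches in the middle: the empty piece between
        // the two adjacent spaces is kept.
        System.out.println(describe("abc efg  hij".split(" ")));

        // Positive-width match at the beginning: empty leading substring is kept.
        System.out.println(describe(",a".split(",")));
    }
}
```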


