RFR: JDK-8027645: Pattern.split() with positive lookahead
Paul Sandoz
paul.sandoz at oracle.com
Fri Nov 8 09:19:35 UTC 2013
Hi Sherman.
When you say:
+ * of the stream. A zero-width match at the beginning however never produces
+ * such empty leading substring.
Is it possible to have a starting sequence of one or more zero-width matches?
It would be useful to add a comment on skipping for zero-width match.
IIUC you could simplify the code in splitAsStream:
    while (matcher.find()) {
        nextElement = input.subSequence(current, matcher.start()).toString();
        current = matcher.end();
        if (!nextElement.isEmpty()) {
            return true;
        } else if (current > 0) { // Ignore for zero-width match
            emptyElementCount++;
        }
    }
That is less efficient for zero-width matching, but how common is that?
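
For illustration only, here is a self-contained sketch of how that loop could sit inside an iterator. It is not the webrev code: the class names SplitIteratorSketch and SplitIterator and the helper split(regex, input) are hypothetical, and the handling of trailing empty elements is an assumption modelled on what Pattern.split() does with a limit of 0.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitIteratorSketch {

    // Hypothetical stand-in for the iterator behind splitAsStream; not the JDK source.
    static final class SplitIterator implements Iterator<String> {
        private final CharSequence input;
        private final Matcher matcher;
        private int current;            // end index of the previous match
        private String nextElement;     // next non-empty element, once found
        private int emptyElementCount;  // empty elements to emit before nextElement

        SplitIterator(Pattern pattern, CharSequence input) {
            this.input = input;
            this.matcher = pattern.matcher(input);
        }

        @Override
        public boolean hasNext() {
            if (nextElement != null || emptyElementCount > 0) {
                return true;
            }
            if (current == input.length()) {
                return false;
            }
            // The simplified loop from the review comment.
            while (matcher.find()) {
                nextElement = input.subSequence(current, matcher.start()).toString();
                current = matcher.end();
                if (!nextElement.isEmpty()) {
                    return true;
                } else if (current > 0) { // Ignore for zero-width match at index 0
                    emptyElementCount++;
                }
            }
            // No more matches: the remainder of the input is the last element.
            nextElement = input.subSequence(current, input.length()).toString();
            current = input.length();
            if (!nextElement.isEmpty()) {
                return true;
            }
            // Assumption: trailing empty elements are dropped, as split() does.
            emptyElementCount = 0;
            nextElement = null;
            return false;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            if (emptyElementCount > 0) {
                emptyElementCount--;
                return "";
            }
            String n = nextElement;
            nextElement = null;
            return n;
        }
    }

    // Hypothetical helper that drains the iterator into a list.
    static List<String> split(String regex, String input) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new SplitIterator(Pattern.compile(regex), input);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("(?=\\p{Upper})", "AbcEfgHij"));
        System.out.println(split(",", ",,a"));
    }
}
```

With this sketch a zero-width match at index 0 leaves current at 0 and is skipped, while a positive-width match at index 0 moves current past 0 and is counted as an empty leading element.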
Paul.
On Nov 7, 2013, at 7:59 PM, Xueming Shen <xueming.shen at oracle.com> wrote:
> Hi,
>
> As suggested in the bug report [1], the spec of j.u.Pattern.split()
> does not clearly specify the expected behavior for scenarios such as
> a zero-width match found at the beginning of the input string
> (for example, whether an empty leading string should be included in
> the resulting array). Worse, the implementation is not consistent either
> (for different input cases, such as "Abc".split(...) vs "AbcEfg".split(...)).
>
> The spec also is not clear regarding what the expected behavior should be
> if the size of the input string is 0 [2].
>
> For reference, Perl's split() function has a clear, explicit spec for
> the above scenarios [3].
>
> So the proposed change here is to update the spec & impl of Pattern.split() to have
> a clear specification for the above scenarios, as Perl does:
>
> (1) A zero-length input sequence always results in a zero-length array
> (instead of a String[] that contains only an empty string)
> (2) An empty leading substring is included at the beginning of the resulting
> array, when there is a positive-width match at the beginning of the input
> sequence. A zero-width match at the beginning however never produces such
> empty leading substring.
>
> webrev:
> http://cr.openjdk.java.net/~sherman/8027645/webrev/
>
> Thanks!
> -Sherman
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8027645
> [2] https://bugs.openjdk.java.net/browse/JDK-6559590
> [3] http://perldoc.perl.org/functions/split.html
>
> btw: the following Perl script is used to verify the Perl behavior
> ------------------
> $str = "AbcEfgHij";
> @substr = split(/(?=\p{Uppercase})/, $str);
> #$str = "abc efg hij";
> #@substr = split(/ /, $str);
> print "split[sz=", scalar @substr, "]=[", join(",", @substr), "]\n";
> ------------------
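
A rough Java counterpart of the Perl script above, as a sketch for comparing the two cases in point (2). The class name SplitLeadingMatchDemo and its split helper are hypothetical. \p{Upper} (ASCII uppercase) is used here in place of Perl's \p{Uppercase}, which is enough for this input. The output noted in the comments assumes a JDK that implements the proposed rule (JDK 8 or later); older releases may differ, which is the inconsistency described above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLeadingMatchDemo {

    // Hypothetical helper: compile the regex and split the input with limit 0.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: no empty leading substring.
        // Assuming JDK 8 or later, prints [Abc, Efg, Hij]
        System.out.println(Arrays.toString(split("(?=\\p{Upper})", "AbcEfgHij")));

        // Positive-width match at index 0: an empty leading substring is kept.
        // Prints [, abc, efg]
        System.out.println(Arrays.toString(split(" ", " abc efg")));
    }
}
```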