RFR 8024341: j.u.regex.Pattern.splitAsStream() doesn't correspond to split() method if using an example from the spec

Paul Sandoz paul.sandoz at oracle.com
Wed Sep 18 17:16:13 UTC 2013


On Sep 18, 2013, at 8:20 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:

> On 15/09/2013 17:27, Paul Sandoz wrote:
>> Hi,
>> 
>> http://cr.openjdk.java.net/~psandoz/tl/JDK-8024341-pattern-splitAsStream/webrev/
>> 
>> This fixes an issue with Pattern.splitAsStream reporting empty trailing elements and aligns with the functionality of Pattern.split(CharSequence input).
>> 
>> The matching iterator passed to the stream was updated to aggressively consume and keep a count of a sequence of empty matching elements such that those elements can either be reported if not trailing, or discarded if trailing.
>> 
>> Paul.
> It makes sense to adjust the spec so that it is consistent with split(CharSequence).
> 
> On the implementation, I had to read it a few times to understand how emptyElementCount is used. I wonder if it could be done in a simpler way, say by just setting a flag when current reaches input.length? Maybe you have tried this already.
> 

The problem is that when an empty matching element is encountered we don't know whether it is trailing or not. This can only be determined later, when a non-empty matching element is encountered or there are no further matches. Thus we need to aggressively consume empty matching elements and retain how many have been encountered in case we need to report them. For example, here is a particular test exercising this:

        description = "Many repeated separators before last match";
        input = "fooooo:";
        pattern = Pattern.compile("o");
        expected = new ArrayList<>();
        expected.add("f");
        expected.add("");
        expected.add("");
        expected.add("");
        expected.add("");
        expected.add(":");  // At this point we know the previously encountered matching empty elements need to be reported and not discarded 
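
The expectation in that test can be checked against Pattern.split(CharSequence) directly. The following is only a minimal sketch for illustration: the class name SplitDemo and its two helper methods are made up here and are not part of the webrev or the actual test. It assumes a JDK in which splitAsStream already contains this fix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
public class SplitDemo {

    // Elements as reported by Pattern.split(CharSequence).
    static List<String> viaSplit(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    // Elements as reported by Pattern.splitAsStream(CharSequence).
    static List<String> viaStream(String regex, String input) {
        return Pattern.compile(regex)
                      .splitAsStream(input)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Empty elements followed by ":" are not trailing, so they are kept.
        System.out.println(viaSplit("o", "fooooo:"));
        System.out.println(viaStream("o", "fooooo:"));

        // Without the final ":" the same empty elements are trailing
        // and are discarded.
        System.out.println(viaSplit("o", "fooooo"));
        System.out.println(viaStream("o", "fooooo"));
    }
}
```

With "fooooo:" both methods should report the four empty elements between "f" and ":", while with "fooooo" both should report only "f".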

I don't think it is practically possible in general to derive the number of empty elements from a start and end index, since we don't easily know the lengths of the strings matched by the pattern in the input.
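
To make the counting idea concrete, here is a rough sketch of a lazy iterator that defers empty elements until it knows whether they are trailing. It is not the code in the webrev: the class name DeferredEmptySplitter and its field names are invented for illustration, and it deliberately ignores two cases the real implementation has to handle (an empty input, and patterns that can match the empty string).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only. Assumes a non-empty input and a pattern
// that never matches the empty string.
public class DeferredEmptySplitter implements Iterator<String> {
    private final CharSequence input;
    private final Matcher matcher;
    private int current;          // index just past the previous separator
    private int pendingEmpties;   // empty elements seen but not yet reported
    private String nextNonEmpty;  // non-empty element that follows them
    private boolean exhausted;    // true once the matcher has no more matches

    public DeferredEmptySplitter(Pattern pattern, CharSequence input) {
        this.input = input;
        this.matcher = pattern.matcher(input);
    }

    @Override
    public boolean hasNext() {
        if (nextNonEmpty != null) return true;
        if (exhausted) return false;
        // Consume matches eagerly, counting empty elements, until a
        // non-empty element shows they are not trailing.
        while (matcher.find()) {
            String element = input.subSequence(current, matcher.start()).toString();
            current = matcher.end();
            if (!element.isEmpty()) {
                nextNonEmpty = element;
                return true;
            }
            pendingEmpties++;
        }
        exhausted = true;
        String tail = input.subSequence(current, input.length()).toString();
        if (!tail.isEmpty()) {
            nextNonEmpty = tail;
            return true;
        }
        pendingEmpties = 0;  // the counted empty elements were trailing: discard
        return false;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        if (pendingEmpties > 0) {
            pendingEmpties--;  // report deferred empty elements first
            return "";
        }
        String element = nextNonEmpty;
        nextNonEmpty = null;
        return element;
    }

    // Convenience method: drain the iterator into a list.
    static List<String> toList(Pattern pattern, CharSequence input) {
        List<String> result = new ArrayList<>();
        new DeferredEmptySplitter(pattern, input).forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toList(Pattern.compile("o"), "fooooo:"));
    }
}
```

Only a count is kept, because every deferred element is the empty string. A single flag set when current reaches input.length would not be enough, since the number of empty elements to report depends on how many separators were matched along the way.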


> On the test, you should probably include 8016846 in the @bug tag, as otherwise it looks like it was added specifically for 8024341.
> 

Thanks, updated. I wish there was a way to automate this by adding bug ids to the meta-data of files in the repository. Any commit with tests would then automatically update the test meta-data with the corresponding bug id.

Paul.

