RFR 8071479: Stream and lamdification improvements to j.u.regex.Matcher

Stuart Marks stuart.marks at oracle.com
Wed Feb 11 19:23:33 UTC 2015


On 2/11/15 2:25 AM, Paul Sandoz wrote:
> Hi Stuart,
>
> Thanks for the detailed review.
>
> Here is a possible way forward:
>
> 1) Add the methods to Matcher, as proposed in the initial webrev.

Yes, I think this is the way to go, and it seems that Sherman concurs.

> 1.1) Change the specification of Matcher.results to reset the stream before matching, making it consistent with the replace* methods.

I'm not sure about this. The current replaceAll/replaceFirst methods reset the 
matcher before doing any matching, so the lambda-based overloads should do the same.

However, the model for

     Stream<MatchResult> results()

seems to me to be a stream of matches that would be returned by successive calls 
to find(). (Indeed, that's how it's implemented.) The no-arg find() call doesn't 
reset the Matcher, and it respects the existing region of the Matcher. I think 
results() should do the same.

Now there's also a find(int start) overload that does reset the matcher 
(discarding the region) but starts at the given position. This suggests another 
overload,

     Stream<MatchResult> results(int start)

which also resets the matcher. I don't think this is necessary, though, since I 
believe the equivalent effect can be achieved by setting the region before 
calling the no-arg results() -- as long as it doesn't reset the matcher.

No matter what, I think results() will have to check for concurrent 
modification. It seems to be in the nature of this API, sigh.

By the way, I think Matcher.results() is a fine name. The overload of 
Pattern.matches() is what bothers me.

> 2) Add convenience methods for all replace*() and matches() on Pattern that defer to those on Matcher.

I'm not sure convenience methods are necessary. After all, there are the 
existing replaceFirst/replaceAll methods on Matcher that aren't on Pattern.

In addition, if you don't need to hang onto the Pattern or Matcher instances, my 
example from earlier can be compressed to:

     String result = Pattern.compile("a*b")
         .matcher(input).replaceAll(mr -> mr.group().toUpperCase());

which ain't too bad.


> We can do that in two stages, focusing on 1) in this review.

Yes.

> I was not too concerned about the static method and the stream returning method having the same name as the context is quite different. For stream returning methods there is a de-facto pattern of using a plural of the stream element kind, so i prefer that to findAll. What about the name "Pattern.matchResults"? which chains well with "Pattern.match(...).results()".
>
> --
>
> Regarding the disparity between MatchResult and Matcher. I think that would require a new sub-interface of MatchResult from which Matcher extends from and returns. If we think it important we should do that for 9 otherwise we will be stuck for the stream-based methods.

Why do you think we need a new sub-interface? Wouldn't it be sufficient to add 
default methods?

There is of course the possibility of existing 3rd party classes that implement 
MatchResult having conflicting methods. But I don't think MatchResult is 
implemented quite as often as the collection interfaces, where we've been 
reticent to add default methods because of so many implementations.

I did a quick web search and I did find a few implementations/extensions of 
MatchResult, but there were no obvious conflicts. This isn't definitive, of course.

The default implementations of group(String), start(String), and end(String) 
could throw IllegalArgumentException, since that's what the ones on Matcher do 
if there's no such named capture group.

I admit this is a little weird, since it's only safe to use these new methods if 
you know where the MatchResult instance came from. So logically perhaps it's a 
different type. My hunch, though, is that MatchResults aren't passed around, 
they're created from Patterns/Matchers and processed within the same code, so in 
practice this won't be a problem.

s'marks


> Paul.
>
> On Feb 11, 2015, at 2:02 AM, Stuart Marks <stuart.marks at oracle.com> wrote:
>
>> Hi Paul,
>>
>> I spent some time looking at this API. Overall it seems to me that things work a bit more nicely when these methods are added to Pattern instead of Matcher. Unfortunately there are some odd things with the existing API that make this tradeoff not so obvious.
>>
>> First, here's what a simple replacement operation looks like when replaceAll() is added to Matcher:
>>
>>         String input = "fooaaaabbfooabbbfoo";
>>         Pattern p = Pattern.compile("a*b");
>>         Matcher m = p.matcher(input);
>>         String result = m.replaceAll(mr -> mr.group().toUpperCase());
>>
>> But if replaceAll() is on Pattern, we can skip a step:
>>
>>         String input = "fooaaaabbfooabbbfoo";
>>         Pattern p = Pattern.compile("a*b");
>>         String result = p.replaceAll(input, mr -> mr.group().toUpperCase());
>>
>> Getting stream of match results is similar. So yes, I agree that it simplifies things to have these be on Pattern instead of Matcher.
>>
>> An advantage of having these on Pattern is that the matcher that gets created is encapsulated, and its state isn't exposed to being mucked about by the application. Thus you can avoid the additional concurrent modification checks that you have to do if replaceAll et. al. are on Matcher.
>>
>> Unfortunately, putting these on Pattern now creates some difficulties meshing with the existing API.
>>
>> One issue is that Matcher already has replaceAll(String) and replaceFirst(String). It would be strange to have these here and to have replaceAll(replacer) and replaceFirst(replacer) on Pattern.
>>
>> Another issue is that Matcher supports matching on region (subrange) of its input. For example, today you can do this:
>>
>>     pattern.matcher(input).region(start, end)
>>
>> The region will constrain the matching for certain operations, such as find() (but not replaceAll or replaceFirst). If something like results() were added to Matcher, I'd expect that it would respect the Matcher's region, but if results() (or matches() as you called it) were on Pattern, the region constraint would be lacking.
>>
>> Also note that Pattern already has this:
>>
>>     static boolean matches(regex, input)
>>
>> so I don't think an overload of matches() that returns a Stream would be a good idea. (Maybe findAll()?)
>>
>> Another issue, not directly related to where the new lambda/streams methods get added, is that MatchResult allows references only numbered capturing groups. Matcher, which implements MatchResult, also supports named capturing groups, with the new overloaded methods group(String), start(String), and end(String). These were added in Java 7. Logically these also belong on MatchResult, but they probably weren't added because of the compatibility issue of adding methods to interfaces. Maybe we should add these as default methods to MatchResult.
>>
>> (But what would the supported implementation be? Just throw UnsupportedOperationException?)
>>
>> --
>>
>> I'm not entirely sure where this leaves things. It certainly seems more convenient to have the new methods on Pattern. But given the way the existing API is set up, it seems like it's a better fit to add them to Matcher.
>>
>> s'marks
>>
>>
>>
>> On 2/9/15 6:18 AM, Paul Sandoz wrote:
>>> Here is an alternative that pushes the methods on to Pattern instead:
>>>
>>>    http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8071479--Matcher-stream-results/on-Pattern/webrev/
>>>
>>> (Whe webrev reports some files as empty, please ingore those, i have this webrev stacked on the previous one.)
>>>
>>> I have also included replaceFirst.
>>>
>>> This simplifies things for streaming on the match results and also for replacing.
>>>
>>> Note that the existing replace* methods on Matcher reset the matcher before matching and indicate that the matcher should be reset afterwards for reuse. Thus there is no loss in functionality moving such lambda accepting methods from Matcher to Pattern. It comes down to the performance of reusing a matcher, which does not seems so compelling to me.
>>>
>>> Paul.
>>>
>>> On Feb 5, 2015, at 11:59 AM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>>>
>>>> Hi.
>>>>
>>>> Please review these stream/lambda enhancements on Matcher:
>>>>
>>>>   http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8071479--Matcher-stream-results/webrev/
>>>>
>>>> Two new methods are added to Matcher:
>>>>
>>>> 1) replaceAll(Function<MatchResult, String> ) that is more flexible than the existing replaceAll that accepts a single value.
>>>>
>>>> 2) Stream<MatchResult> results() that returns a stream of MatchResult for all matches.
>>>>
>>>> The former does introduce a minor source incompatibility for a null argument, but then so did the new append methods accepting StringBuilder that were recently added (see JDK-8039124).
>>>>
>>>> For the latter i opted to place the method on Matcher rather than Pattern as i think that is a better fit with current usages of Matcher and operating on a MatchResult. That marginally increases the complexity since co-modification checking is required.
>>>>
>>>> I update the test PatternStreamTest to derive the expected result.
>>>>
>>>>
>>>> I suppose i could add another method replaceFirst(Function<MatchResult, String> ) if anyone feels strongly about that. Consistency-wise it seems the right thing to do.
>>>>
>>>> Paul.
>>>
>



More information about the core-libs-dev mailing list