JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regarding possible infinite streams
Stuart Marks
stuart.marks at oracle.com
Tue Apr 4 19:52:44 UTC 2017
Anything further on this?
I'd at least like to add the API note I proposed in order to document this
issue. I'm reluctant to start tinkering with the behavior of this method at this
late stage in the release.
BTW I used Scanner.findAll() in a little programming exercise I worked on the
other day. It worked perfectly. :-)
s'marks
On 3/30/17 2:19 PM, Stuart Marks wrote:
> Hi Timo, Sherman,
>
> Thanks for looking at this.
>
> Sherman wrote:
>
>> This might practically put the api itself almost useless? it might be an easy
>> task to spot
>> whether or not it's a 0-width-match-possible regex when the regex is simple,
>> but it gets
>> harder and harder, if not impossible when the regex gets complicated,
>> especially consider
>> the possible use scenario that the use site is embedded deeply inside a
>> library implementation.
>
> Well, not "useless", but perhaps less useful than one might like. :-)
>
> I think this is potentially surprising behavior, which is why I at least wanted
> to add the note. It's not clear to me whether we should try to fix this by
> changing Scanner though.
>
> Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
> if one were to write a loop like so:
>
>     String str;
>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>         System.out.println(str);
>     }
>
> then this loop would have the same problem if pattern were to match zero
> characters.
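
A minimal sketch of one way to guard such a loop, assuming the behavior described above (the scanner returns the same empty match again and again). The class name BoundedHorizonLoop, the helper firstMatches, and the cap maxMatches are hypothetical names used only for illustration; they are not part of Scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class BoundedHorizonLoop {
    // Hypothetical helper: collects at most maxMatches results, so the loop
    // always ends even if the scanner keeps returning the same empty match.
    static List<String> firstMatches(String input, String regex, int maxMatches) {
        Pattern pattern = Pattern.compile(regex);
        List<String> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            String str;
            while (found.size() < maxMatches
                    && (str = scanner.findWithinHorizon(pattern, 0)) != null) {
                found.add(str);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "a*" can match zero characters; the cap of 3 bounds the loop.
        System.out.println(firstMatches("bbb", "a*", 3));
    }
}
```

The cap is a blunt guard. It bounds the loop but does not make the scanner advance, so it is only a workaround for patterns that cannot be rewritten to match at least one character.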
>
>> The alternative is to "fix" it, maybe as what Matcher.find() does, if the
>> previous match is
>> zero-width-match (the first==last), we step one to the next cursor before next
>> try. I know
>
> Interesting, I didn't know Matcher.find() advances the cursor like this. But
> Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
> can occur.
>
>> Scanner.findPatternInBuffer() is setting new region set every time it is
>> invoked which makes
>> it complicated, but I would assume it might be still worth a trying? for
>> example, utilize the
>> "hasNextResult"/matcher.end(). I'm not sure without looking into the code, does
>>
>>     while (hasNext(pattern)) {
>>         next(pattern);
>>     }
>>
>> have the same issue, when pattern matches 0-width?
>
> No, this doesn't have the problem, because hasNext(pat) and next(pat) match
> delimited tokens. Each call to next() implicitly advances past the next
> delimiter to reach the subsequent token, if any.
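
To illustrate the token-based loop, here is a small sketch using a pattern that could match zero characters. The class name TokenLoopDemo and the helper tokens are hypothetical, and the example assumes the default whitespace delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class TokenLoopDemo {
    // Hypothetical helper: consumes tokens for as long as the next complete
    // token matches the pattern. Each next(pattern) call consumes one token.
    static List<String> tokens(String input, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext(pattern)) {
                out.add(scanner.next(pattern));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "\\w*" could match the empty string, but it is applied to whole
        // whitespace-delimited tokens, so the loop ends when tokens run out.
        System.out.println(tokens("a b c", "\\w*"));
    }
}
```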
>
>
> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>> I guess this somewhat contrived example also wouldn’t work?
>>
>> String s = "\\b\\w+|\\G|\\B";
>> String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>> Matcher m = Pattern.compile(s).matcher(t);
>> while (m.find()) {
>>     System.out.println("'" + m.group() + "'");
>> }
>
> Right, so if you rewrote this loop to use Scanner.findWithinHorizon() instead of
> Matcher,
>
>     Scanner sc = new Scanner(t);
>     String str;
>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>         System.out.println("'" + str + "'");
>     }
>
> you'd get an infinite loop with str being continually assigned the empty string.
> As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
> zero-width match, avoiding this problem.
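
The Matcher.find() behavior mentioned above can be seen in a short sketch. The class name MatcherAdvanceDemo and the helper allMatches are hypothetical; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherAdvanceDemo {
    // Hypothetical helper: gathers every match that find() reports.
    static List<String> allMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "x*" never matches a character of "ab", so every match is empty.
        // find() steps forward after each empty match: one match at each of
        // positions 0, 1 and 2 (end of input), then the loop ends.
        System.out.println(allMatches("x*", "ab").size()); // prints 3
    }
}
```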
>
> * * *
>
> This didn't come up in the code review thread, which was mostly about concurrent
> modification and late-binding of the spliterator:
>
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>
> I remember noting this phenomenon a while back, which is why I had filed the bug
> to add a note. I seem to remember discussing it, though, but it might have been
> in a meeting or in a hallway conversation.
>
> This bug (JDK-8150488) does note that an infinite stream might be unexpected or
> surprising, but it's not a fatal problem. It can be terminated with limit(). It
> can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
> mention these in the API note.
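
A sketch of the limit() approach, assuming findAll() behaves as described in this thread; whether a particular JDK build still repeats the empty match is not something this example checks. The class name FindAllLimitDemo and the helper firstGroups are hypothetical.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.stream.Collectors;

class FindAllLimitDemo {
    // Hypothetical helper: caps the stream from findAll() with limit(), so the
    // pipeline ends even if the stream would otherwise never end.
    static List<String> firstGroups(String input, String regex, long max) {
        try (Scanner scanner = new Scanner(input)) {
            // The terminal collect() runs before the scanner is closed.
            return scanner.findAll(regex)
                          .limit(max)
                          .map(MatchResult::group)
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // "a*" can match zero characters; limit(3) bounds the result.
        System.out.println(firstGroups("bbb", "a*", 3));
    }
}
```

takeWhile() could be used in the same position when there is a condition on the match itself that signals where to stop, rather than a fixed count.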
>
> I guess we could also consider changing the implicit findWithinHorizon() loop
> that findAll() does, perhaps by having it terminate on a zero-width match. Or we
> could even change findWithinHorizon's behavior if it gets a zero-width match,
> similar to what Matcher.find() does. But I'm quite reluctant to start making such
> changes at this point.
>
> s'marks
>
>
>
>> // Outputs:
>> // 'Matcher'
>> // ''
>> // 'm'
>> // ''
>> // ''
>> // ''
>> // 'Pattern'
>> // ''
>> // 'compile'
>> // ''
>> // 's'
>> // ''
>> // ''
>> // 'matcher'
>> // ''
>> // 't'
>> // ''
>> // ''
>> // ''
>> // ''
>>
>>
>>
>>
>> From: Xueming Shen
>> Sent: Thursday, March 30, 2017 05:41
>> To: core-libs-dev at openjdk.java.net
>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>> regarding possible infinite streams
>>
>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>> Hi all,
>>>
>>> Please review these non-normative textual additions to the
>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>> 9; there's a small pitfall if the regex can match zero characters.
>>>
>> Stuart,
>>
>> This might practically put the api itself almost useless? it might be an
>> easy task to spot
>> whether or not it's a 0-width-match-possible regex when the regex is
>> simple, but it gets
>> harder and harder, if not impossible when the regex gets complicated,
>> especially consider
>> the possible use scenario that the use site is embedded deeply inside a
>> library implementation.
>>
>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>> the previous match is
>> zero-width-match (the first==last), we step one to the next cursor before
>> next try. I know
>> Scanner.findPatternInBuffer() is setting new region set every time it is
>> invoked which makes
>> it complicated, but I would assume it might be still worth a trying? for
>> example, utilize the
>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>> code, does
>>
>>     while (hasNext(pattern)) {
>>         next(pattern);
>>     }
>>
>> have the same issue, when pattern matches 0-width?
>>
>> Thanks!
>> -Sherman
>>
>>
>>
>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>> # HG changeset patch
>>> # User smarks
>>> # Date 1490749958 25200
>>> # Tue Mar 28 18:12:38 2017 -0700
>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>> # Parent fb54b256d751ae3191e9cef42ff9f5630931f047
>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>> streams
>>> Reviewed-by: XXX
>>>
>>> diff -r fb54b256d751 -r 6b43c4698752
>>> src/java.base/share/classes/java/util/Scanner.java
>>> --- a/src/java.base/share/classes/java/util/Scanner.java Mon Mar 27
>>> 15:12:01 2017 -0700
>>> +++ b/src/java.base/share/classes/java/util/Scanner.java Tue Mar 28
>>> 18:12:38 2017 -0700
>>> @@ -2808,6 +2808,10 @@
>>> * }
>>> * }</pre>
>>> *
>>> + * <p>The pattern must always match at least one character. If
>>> the pattern
>>> + * can match zero characters, the result will be an infinite stream
>>> + * of empty matches.
>>> + *
>>> * @param pattern the pattern to be matched
>>> * @return a sequential stream of match results
>>> * @throws NullPointerException if pattern is null
>>> @@ -2829,6 +2833,11 @@
>>> * scanner.findAll(Pattern.compile(patString))
>>> * }</pre>
>>> *
>>> + * @apiNote
>>> + * The pattern must always match at least one character. If the
>>> pattern
>>> + * can match zero characters, the result will be an infinite stream
>>> + * of empty matches.
>>> + *
>>> * @param patString the pattern string
>>> * @return a sequential stream of match results
>>> * @throws NullPointerException if patString is null
>>
>>