JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams
Stuart Marks
stuart.marks at oracle.com
Wed Apr 5 05:50:41 UTC 2017
On 4/4/17 5:33 PM, Timo Kinnunen wrote:
>
> Hi,
>
> I agree, the behavior should change to work like Matcher.find() which is
> probably the better known API. E.g.
> http://www.regular-expressions.info/continue.html describes Java’s Matcher’s
> behavior on empty matches but doesn’t mention any other APIs beside that.
>
I don't think Sherman was suggesting that.
In any case this would be a change to findWithinHorizon(), which isn't in scope.
s'marks
>
> If we say a regex engine’s task is to find non-overlapping matches in a given
> input then it repeatedly returning the same empty match at the same location
> could even be considered a bug.
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> *From: *Xueming Shen <mailto:xueming.shen at oracle.com>
> *Sent: *Wednesday, April 5, 2017 01:11
> *To: *Stuart Marks <mailto:stuart.marks at oracle.com>
> *Cc: *Timo Kinnunen <mailto:timo.kinnunen at gmail.com>;
> core-libs-dev at openjdk.java.net <mailto:core-libs-dev at openjdk.java.net>
> *Subject: *Re: JDK 9 RFR(s): 8150488: add note to
> Scanner.findAll()regardingpossible infinite streams
>
> On 4/4/17, 12:52 PM, Stuart Marks wrote:
>
> > Anything further on this?
>
> >
>
> > I'd at least like to add the API note I proposed in order to document
>
> > this issue. I'm reluctant to start tinkering with the behavior of this
>
> > method at this late stage in the release.
>
> >
>
> > BTW I used Scanner.findAll() in a little programming exercise I worked
>
> > on the other day. It worked perfectly. :-)
>
> H Stuart,
>
> The word "useless" was definitely exaggerate :-) Really meant to say the
>
> note might make the api less
>
> useful/popular.
>
> Personally I think the use scenario and the expected resulting behavior
>
> of Stream<MR>finaAll(ptn)
>
> should be more equivalent/similar to the use case of while
>
> (s.hasNext(p)) { s.next(p); }, or while
>
> (m.find()) { }, therefor it is probably desired it can be used without
>
> worrying the possibility of getting
>
> into an infinite loop. That said, I agree it might be hard to argue to
>
> fix it in jdk9. The question is if
>
> we are going to fix it in 9u sometime in the further, is it still worth
>
> putting down the note in the
>
> API now. I'm fine if you believe a note will help at least make the
>
> issue a known/documented usage
>
> limitation, for JDK9.
>
> Yes, I think we did chat about this one at the hallway some time when
>
> either you or something
>
> ran into the loop by using the method ...
>
> Thanks,
>
> Sherman
>
> >
>
> > s'marks
>
> >
>
> > On 3/30/17 2:19 PM, Stuart Marks wrote:
>
> >> Hi Timo, Sherman,
>
> >>
>
> >> Thanks for looking at this.
>
> >>
>
> >> Sherman wrote:
>
> >>
>
> >>> This might practically put the api itself almost useless? it might
>
> >>> be an easy
>
> >>> task to spot
>
> >>> whether or not it's a 0-width-match-possible regex when the regex is
>
> >>> simple,
>
> >>> but it gets
>
> >>> harder and harder, if not impossible when the regex gets complicated,
>
> >>> especially consider
>
> >>> the possible use scenario that the use site is embedded deeply inside a
>
> >>> library implementation.
>
> >>
>
> >> Well, not "useless", but perhaps less useful than one might like. :-)
>
> >>
>
> >> I think this is potentially surprising behavior, which is why I at
>
> >> least wanted
>
> >> to add the note. It's not clear to me whether we should try to fix
>
> >> this by
>
> >> changing Scanner though.
>
> >>
>
> >> Essentially, findAll() is defined in terms of
>
> >> findWithinHorizon(pattern, 0). So
>
> >> if one were to write a loop like so:
>
> >>
>
> >> String str;
>
> >> while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>
> >> System.out.println(str);
>
> >> }
>
> >>
>
> >> then this loop would have the same problem if pattern were to match zero
>
> >> characters.
>
> >>
>
> >>> The alternative is to "fix" it, maybe as what Matcher.find() does,
>
> >>> if the
>
> >>> previous match is
>
> >>> zero-width-match (the fist==last), we step one to the next cursor
>
> >>> before next
>
> >>> try. I know
>
> >>
>
> >> Interesting, I didn't know Matcher.find() advances the cursor like
>
> >> this. But
>
> >> Scanner.findWithinHorizon() apparently doesn't, so that's why an
>
> >> infinite loop
>
> >> can occur.
>
> >>
>
> >>> Scanner.findPatternInBuffer() is setting new region set every time
>
> >>> it is
>
> >>> invoked which makes
>
> >>> it complicated, but I would assume it might be still worth a trying?
>
> >>> for
>
> >>> example, utilize the
>
> >>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>
> >>> code, does
>
> >>>
>
> >>> while (hasNext(pattern)) {
>
> >>> next(pattern);
>
> >>> }
>
> >>>
>
> >>> have the same issue, when pattern matches 0-width?
>
> >>
>
> >> No, this doesn't have the problem, because hasNext(pat) and next(pat)
>
> >> match
>
> >> delimited tokens. Each call to next() implicitly advances past the next
>
> >> delimiter to reach the subsequent token, if any.
>
> >>
>
> >>
>
> >> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>
> >>> I guess this somewhat contrived example also wouldn’t work?
>
> >>>
>
> >>> String s = "\\b\\w+|\\G|\\B";
>
> >>> String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>
> >>> Matcher m = Pattern.compile(s).matcher(t);
>
> >>> while(m.find()) {
>
> >>> System.out.println("'" + m.group() + "'");
>
> >>> }
>
> >>
>
> >> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>
> >> instead of
>
> >> Matcher,
>
> >>
>
> >> Scanner sc = new Scanner(t);
>
> >> String str;
>
> >> while ((str = sc.findWithinHorizon(s, 0)) != null) {
>
> >> System.out.println("'" + str + "'");
>
> >> }
>
> >>
>
> >> you'd get an infinite loop with str being continually assigned the
>
> >> empty string.
>
> >> As Sherman mentioned, the Matcher.find() will advance the cursor if
>
> >> it gets a
>
> >> zero-width match, avoiding this problem.
>
> >>
>
> >> * * *
>
> >>
>
> >> This didn't come up in the code review thread, which was mostly about
>
> >> concurrent
>
> >> modification and late-binding of the spliterator:
>
> >>
>
> >>
>
> >> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>
> >>
>
> >> I remember noting this phenomenon a while back, which is why I had
>
> >> filed the bug
>
> >> to add a note. I seem to remember discussing it, though, but it might
>
> >> have been
>
> >> in a meeting or in a hallway conversation.
>
> >>
>
> >> This bug (JDK-8150488) does note that an infinite stream might be
>
> >> unexpected or
>
> >> surprising, but it's not a fatal problem. It can be terminated with
>
> >> limit(). It
>
> >> can also be terminated with takeWhile(), also added in JDK 9. Maybe I
>
> >> could
>
> >> mention these in the API note.
>
> >>
>
> >> I guess we could also consider changing the implicit
>
> >> findWithinHorizon() loop
>
> >> that findAll() does, perhaps by having it terminate on a zero-width
>
> >> match. Or we
>
> >> could even change findWithinHorizon's behavior if it gets a
>
> >> zero-width match,
>
> >> siilar to what Matcher.find() does. But I'm quite reluctant to start
>
> >> making such
>
> >> changes at this point.
>
> >>
>
> >> s'marks
>
> >>
>
> >>
>
> >>
>
> >>> // Outputs:
>
> >>> // 'Matcher'
>
> >>> // ''
>
> >>> // 'm'
>
> >>> // ''
>
> >>> // ''
>
> >>> // ''
>
> >>> // 'Pattern'
>
> >>> // ''
>
> >>> // 'compile'
>
> >>> // ''
>
> >>> // 's'
>
> >>> // ''
>
> >>> // ''
>
> >>> // 'matcher'
>
> >>> // ''
>
> >>> // 't'
>
> >>> // ''
>
> >>> // ''
>
> >>> // ''
>
> >>> // ''
>
> >>>
>
> >>>
>
> >>>
>
> >>> Sent from Mail for Windows 10
>
> >>>
>
> >>> From: Xueming Shen
>
> >>> Sent: Thursday, March 30, 2017 05:41
>
> >>> To: core-libs-dev at openjdk.java.net
>
> >>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>
> >>> regardingpossible infinite streams
>
> >>>
>
> >>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>
> >>>> Hi all,
>
> >>>>
>
> >>>> Please review these non-normative textual additions to the
>
> >>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>
> >>>> 9; there's a small pitfall if the regex can match zero characters.
>
> >>>>
>
> >>> Stuart,
>
> >>>
>
> >>> This might practically put the api itself almost useless? it might
>
> >>> be an
>
> >>> easy task to spot
>
> >>> whether or not it's a 0-width-match-possible regex when the regex is
>
> >>> simple, but it gets
>
> >>> harder and harder, if not impossible when the regex gets complicated,
>
> >>> especially consider
>
> >>> the possible use scenario that the use site is embedded deeply inside a
>
> >>> library implementation.
>
> >>>
>
> >>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>
> >>> the previous match is
>
> >>> zero-width-match (the fist==last), we step one to the next cursor
>
> >>> before
>
> >>> next try. I know
>
> >>> Scanner.findPatternInBuffer() is setting new region set every time
>
> >>> it is
>
> >>> invoked which makes
>
> >>> it complicated, but I would assume it might be still worth a trying?
>
> >>> for
>
> >>> example, utilize the
>
> >>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>
> >>> code, does
>
> >>>
>
> >>> while (hasNext(pattern)) {
>
> >>> next(pattern);
>
> >>> }
>
> >>>
>
> >>> have the same issue, when pattern matches 0-width?
>
> >>>
>
> >>> Thanks!
>
> >>> -Sherman
>
> >>>
>
> >>>
>
> >>>
>
> >>>
>
> >>>> Thanks,
>
> >>>>
>
> >>>> s'marks
>
> >>>>
>
> >>>>
>
> >>>> # HG changeset patch
>
> >>>> # User smarks
>
> >>>> # Date 1490749958 25200
>
> >>>> # Tue Mar 28 18:12:38 2017 -0700
>
> >>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>
> >>>> # Parent fb54b256d751ae3191e9cef42ff9f5630931f047
>
> >>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>
> >>>> streams
>
> >>>> Reviewed-by: XXX
>
> >>>>
>
> >>>> diff -r fb54b256d751 -r 6b43c4698752
>
> >>>> src/java.base/share/classes/java/util/Scanner.java
>
> >>>> --- a/src/java.base/share/classes/java/util/Scanner.java Mon Mar 27
>
> >>>> 15:12:01 2017 -0700
>
> >>>> +++ b/src/java.base/share/classes/java/util/Scanner.java Tue Mar 28
>
> >>>> 18:12:38 2017 -0700
>
> >>>> @@ -2808,6 +2808,10 @@
>
> >>>> * }
>
> >>>> * }</pre>
>
> >>>> *
>
> >>>> + * <p>The pattern must always match at least one character. If
>
> >>>> the pattern
>
> >>>> + * can match zero characters, the result will be an infinite
>
> >>>> stream
>
> >>>> + * of empty matches.
>
> >>>> + *
>
> >>>> * @param pattern the pattern to be matched
>
> >>>> * @return a sequential stream of match results
>
> >>>> * @throws NullPointerException if pattern is null
>
> >>>> @@ -2829,6 +2833,11 @@
>
> >>>> * scanner.findAll(Pattern.compile(patString))
>
> >>>> * }</pre>
>
> >>>> *
>
> >>>> + * @apiNote
>
> >>>> + * The pattern must always match at least one character. If the
>
> >>>> pattern
>
> >>>> + * can match zero characters, the result will be an infinite
>
> >>>> stream
>
> >>>> + * of empty matches.
>
> >>>> + *
>
> >>>> * @param patString the pattern string
>
> >>>> * @return a sequential stream of match results
>
> >>>> * @throws NullPointerException if patString is null
>
> >>>
>
> >>>
>
More information about the core-libs-dev
mailing list