RFR 8230365 : Pattern for a control-char matches non-control characters
Ivan Gerasimov
ivan.gerasimov at oracle.com
Tue Sep 10 03:20:12 UTC 2019
Thank you Stuart for the analysis!
Please see my comments inline.
On 9/9/19 4:39 PM, Stuart Marks wrote:
>
>
> On 9/5/19 1:43 PM, Ivan Gerasimov wrote:
>> Personally, I don't have a strong preference here.
>>
>> The compatibility properties are meant to be temporary anyway.
>>
>> Maybe if we have a single option that will control several different
>> aspects of behavior, it will be harder to get rid of it?
>>
>> Partially, because it will be tempting to reuse it for other similar
>> changes, should they be needed.
>
> OK, let's take an inventory of what behavior changes are being
> contemplated for regexes:
>
> JDK-8230675 restrict IDs for control chars
> JDK-xxxxxxx allow case-insensitive IDs for control chars *NOTE*
> JDK-8225021 Treat ambiguous embedded flags as parse syntax errors
> JDK-8214245 Case insensitive matching doesn't work correctly for some
> character classes
>
I quickly searched JBS and found several more bug reports and
enhancement requests that, if implemented, may result in behavior
changes.
Here's a (presumably incomplete) list:
JDK-8218146 $ matches before end of line, even without MULTILINE mode
JDK-8217977 Matcher matching trailing high surrogate reports false for
requireEnd()
JDK-8217501 Matcher.hitEnd returns false for incomplete surrogate pairs
JDK-8217496 Matcher.group() can return null after usePattern
JDK-8216332 Grapheme regex does not work with emoji sequences
JDK-8199594 Regex Pattern class improperly ignores spaces in character
classes
JDK-8187083 Regex: Capturing groups inside a lookahead aren't backtracked
JDK-8187082 Regex: Nested capturing groups under lazy repetition aren't
backtracked
JDK-8183391 Regex: End of line found more than once for non-multiline regex
JDK-8179668 Valid regex patterns match the latter half of complete
surrogate pairs
JDK-8029966 Broken supplementary character support in regex
JDK-6919621 Matcher find returns wrong result in java 1.6 for certain
patterns
All of them are low priority, so I don't anticipate active work on
these bugs in the near future.
Still, at least some of them, if fixed, would make the Java regex
engine better, so it probably wouldn't make sense to just abandon these
requests for compatibility reasons.
> *NOTE* this was part of the original JDK-8230675 proposal, but you
> removed it after discussion. I don't know if we decided never to do
> this, or whether we're merely considering it separately. It seemed to
> me that there was a possibility that we'd do this in the future.
>
I was thinking of filing an enhancement request with the fix version
set to TBD, so we can return to this proposal in some future release, if
desirable.
> Is this all the behavior changes being contemplated, or is this simply
> the set that we happened to have stumbled across recently? Put another
> way, if we decided to do some further analysis of regexes, would we
> run across other issues where we might say, "Yeah, we ought to fix
> that, but that would be a potentially incompatible behavior change, so
> we need to add another property." ?
>
> In practice, such properties are only removed after a very long time,
> or perhaps even "never." It's not like this change would be added in
> this release (JDK 14), with backward compatibility support removed in
> a year (say, JDK 16) along with the property. The property, and the
> backward compatibility mode, would stick around in the code for many
> years.
>
> What I want to avoid doing is to introduce behavior changes -- and
> properties to control them -- in a piecemeal fashion. It looks like we
> might have three or four candidates already. Would we want to live
> with three or four properties? If we did this and continued with
> additional changes, we might end up with six or eight or ten
> properties over time.
>
> I'd like to see us look ahead a bit and take stock of what changes
> we're contemplating, and then decide how to deal with compatibility
> and migration based on that. I'd like to avoid making individual
> changes (and adding properties) one at a time, with decisions made in
> isolation, because that will lead to a proliferation of properties.
>
So, there are two alternatives on the table at this time:
1) A single compatibility property to revert to the old behavior. The
property is reused for each of the bugs listed above, so with each fix a
portion of revert logic is added to the property.
PROS: Easy to implement and maintain.
CONS: Over time, it can become hard to track what exactly the property
controls, so it may be hard to use. For example, a user who turns on
this property to revert a single aspect of behavior will get all the
other old behavior oddities as well.
2) Individual properties for every change of behavior.
PROS: If needed, the behavior can be controlled at a fine grain. It is
easier to understand what the expected behavior would be for any given
set of properties.
CONS: Complex to maintain. For the majority of cases it would be
overkill. Also, it can greatly increase the number of configurations to
test (naively up to 2^{# of properties}).
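To make the naive count concrete, here is a small sketch that enumerates every on/off combination for a set of properties. The first two property names come from the example settings in this message; the other two are hypothetical and only stand in for possible future changes.

```java
import java.util.ArrayList;
import java.util.List;

class TestMatrix {
    // Illustrative names only; the last two are hypothetical placeholders.
    static final List<String> PROPS = List.of(
        "restrictCntrlCharIds",
        "rejectAmbiguousEmbeddedFlags",
        "caseInsensitiveCntrlCharIds",
        "fixCaseInsensitiveClasses");

    // Enumerates every yes/no assignment; bit i of the mask is the value of property i.
    static List<String> allCombinations(List<String> props) {
        List<String> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << props.size()); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < props.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(props.get(i))
                  .append('=')
                  .append((mask & (1 << i)) != 0 ? "yes" : "no");
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> combos = allCombinations(PROPS);
        System.out.println(combos.size() + " configurations for "
            + PROPS.size() + " properties");
        combos.forEach(System.out::println);
    }
}
```

With four properties this yields 16 configurations; with ten it would already be 1024.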
One possible compromise might be to introduce one umbrella property plus
a set of individual properties as desired. All of this can be plugged
into one string property, of course:
jdk.util.regex.mode=strict          # default
jdk.util.regex.mode=compatibility   # turns on all compatibility properties at once
jdk.util.regex.mode=restrictCntrlCharIds=yes,rejectAmbiguousEmbeddedFlags=no   # fine-grained settings
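A minimal sketch of how such a string might be parsed, for illustration only. The class name and the parsing rules are assumptions, not an existing JDK API: here "yes" means the new strict behavior, "compatibility" sets every flag to "no", and any flag not mentioned in a fine-grained string keeps its strict default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RegexModeSketch {
    // Flag names taken from the example settings; the set is hypothetical.
    static final String[] FLAGS = {
        "restrictCntrlCharIds", "rejectAmbiguousEmbeddedFlags"
    };

    // Parses a mode string into per-flag booleans (true = new, strict behavior).
    static Map<String, Boolean> parse(String mode) {
        boolean compat = "compatibility".equals(mode);
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String f : FLAGS) {
            flags.put(f, !compat);   // strict defaults unless fully reverted
        }
        if (mode == null || mode.isEmpty() || compat || mode.equals("strict")) {
            return flags;
        }
        // Fine-grained form: comma-separated name=yes|no overrides (assumed syntax).
        for (String part : mode.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2 || !flags.containsKey(kv[0])) {
                throw new IllegalArgumentException("Unknown setting: " + part);
            }
            switch (kv[1]) {
                case "yes": flags.put(kv[0], true); break;
                case "no":  flags.put(kv[0], false); break;
                default:
                    throw new IllegalArgumentException("Expected yes or no: " + part);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        // The property name is the one proposed in this message; it is not an existing JDK property.
        String mode = System.getProperty("jdk.util.regex.mode", "strict");
        System.out.println(parse(mode));
    }
}
```

Rejecting unknown names and values outright is one possible design choice; silently ignoring them would hide typos in the property string.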
If the changes are implemented carefully, so that the individual
properties are "orthogonal", then we wouldn't need to test all possible
combinations, but only the two opposite modes: strict and compatibility.
Do you think it's a viable approach?
--
With kind regards,
Ivan Gerasimov