RFR 8230365 : Pattern for a control-char matches non-control characters

Ivan Gerasimov ivan.gerasimov at oracle.com
Tue Sep 10 03:20:12 UTC 2019


Thank you, Stuart, for the analysis!

Please see my comments inline.

On 9/9/19 4:39 PM, Stuart Marks wrote:
>
>
> On 9/5/19 1:43 PM, Ivan Gerasimov wrote:
>> Personally, I don't have a strong preference here.
>>
>> The compatibility properties are meant to be temporary anyway.
>>
>> Maybe if we have a single option that will control several different 
>> aspects of behavior, it will be harder to get rid of it?
>>
>> Partially, because it will be tempting to reuse it for other similar 
>> changes, should they be needed.
>
> OK, let's take an inventory of what behavior changes are being 
> contemplated for regexes:
>
> JDK-8230675 restrict IDs for control chars
> JDK-xxxxxxx allow case-insensitive IDs for control chars *NOTE*
> JDK-8225021 Treat ambiguous embedded flags as parse syntax errors
> JDK-8214245 Case insensitive matching doesn't work correctly for some 
> character classes
>
I quickly searched JBS and found several more bug reports and
enhancement requests that, if implemented, may result in behavior changes.

Here's a (presumably incomplete) list:

JDK-8218146  $ matches before end of line, even without MULTILINE mode
JDK-8217977  Matcher matching trailing high surrogate reports false for requireEnd()
JDK-8217501  Matcher.hitEnd returns false for incomplete surrogate pairs
JDK-8217496  Matcher.group() can return null after usePattern
JDK-8216332  Grapheme regex does not work with emoji sequences
JDK-8199594  Regex Pattern class improperly ignores spaces in character classes
JDK-8187083  Regex: Capturing groups inside a lookahead aren't backtracked
JDK-8187082  Regex: Nested capturing groups under lazy repetition aren't backtracked
JDK-8183391  Regex: End of line found more than once for non-multiline regex
JDK-8179668  Valid regex patterns match the latter half of complete surrogate pairs
JDK-8029966  Broken supplementary character support in regex
JDK-6919621  Matcher find returns wrong result in java 1.6 for certain patterns

All of them are low priority, so I don't anticipate active work on
these bugs in the near future.
Still, at least some of them, if fixed, would make the Java regex
engine better, so it probably wouldn't make sense to abandon these
requests just for compatibility reasons.

> *NOTE* this was part of the original JDK-8230675 proposal, but you 
> removed it after discussion. I don't know if we decided never to do 
> this, or whether we're merely considering it separately. It seemed to 
> me that there was a possibility that we'd do this in the future.
>
I was thinking of filing an enhancement request with the fix version 
set to TBD, so we can return to this proposal in some future release, if 
desirable.


> Is this all the behavior changes being contemplated, or is this simply 
> the set that we happened to have stumbled across recently? Put another 
> way, if we decided to do some further analysis of regexes, would we 
> run across other issues where we might say, "Yeah, we ought to fix 
> that, but that would be a potentially incompatible behavior change, so 
> we need to add another property." ?
>
> In practice, such properties are only removed after a very long time, 
> or perhaps even "never." It's not like this change would be added in 
> this release (JDK 14), with backward compatibility support removed in 
> a year (say, JDK 16) along with the property. The property, and the 
> backward compatibility mode, would stick around in the code for many 
> years.
>
> What I want to avoid doing is to introduce behavior changes -- and 
> properties to control them -- in a piecemeal fashion. It looks like we 
> might have three or four candidates already. Would we want to live 
> with three or four properties? If we did this and continued with 
> additional changes, we might end up with six or eight or ten 
> properties over time.
>
> I'd like to see us look ahead a bit and take stock of what changes 
> we're contemplating, and then decide how to deal with compatibility 
> and migration based on that. I'd like to avoid making individual 
> changes (and adding properties) one at a time, with decisions made in 
> isolation, because that will lead to a proliferation of properties.
>
So, there are two alternatives on the table at this time:
1) A single compatibility property to revert to the old behavior.  The
property is reused for each of the bugs listed above, so with each fix a
portion of the revert logic is added to the property.

PROS:  Easy to implement and maintain.
CONS:  Over time, it can become hard to track what exactly the property
controls, so it may be hard to use.  For example, a user who turns on
this property to revert a single aspect of behavior will also get all
the other old behavior oddities back.

2) Individual properties for every change of behavior.

PROS:  If needed, the behavior can be controlled at a fine-grained
level.  It is easier to understand what the expected behavior would be
for any given set of properties.
CONS:  Complex to maintain.  For the majority of cases it would be
overkill.  Also, it can greatly increase the number of configurations
to test (naively up to 2^{# of properties}).
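
To illustrate the naive growth, here is a small sketch that enumerates
every on/off assignment for a set of boolean properties.  The class
name, the property names passed in, and the name=yes/no rendering are
placeholders chosen for illustration only; nothing here is part of an
actual patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class PropertyMatrixSketch {

    /**
     * Returns every on/off assignment for the given property names.
     * For n names the result has 2^n entries.
     */
    static List<String> allSettings(List<String> names) {
        int n = names.size();
        if (n > 20) {
            throw new IllegalArgumentException("too many properties to enumerate");
        }
        List<String> result = new ArrayList<>();
        // Each bit of the mask decides whether one property is on or off.
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                boolean on = (mask & (1 << i)) != 0;
                sb.append(names.get(i)).append('=').append(on ? "yes" : "no");
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

For two properties this yields four settings.  For the ten properties
Stuart mentions as a possible end state, it would yield 1024.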

One possible compromise might be to introduce one umbrella property plus
a set of individual properties as desired.  All of this can be folded
into a single string property, of course:

jdk.util.regex.mode=strict  # default
jdk.util.regex.mode=compatibility  # turns on all compatibility properties at once
jdk.util.regex.mode=restrictCntrlCharIds=yes,rejectAmbiguousEmbeddedFlags=no  # fine-grained settings
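
To make the format above a bit more concrete, here is a minimal sketch
of how such a value could be parsed.  Everything in it is an assumption
for illustration: the class RegexModeSketch, the Check enum, and the
rule that settings not listed keep their strict default are
hypothetical.  Only the property name and the two setting names are
taken from the example above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these types exist in the JDK. */
final class RegexModeSketch {

    /** Strict behaviors that could be switched individually (assumed names). */
    enum Check {
        RESTRICT_CNTRL_CHAR_IDS("restrictCntrlCharIds"),
        REJECT_AMBIGUOUS_EMBEDDED_FLAGS("rejectAmbiguousEmbeddedFlags");

        final String key;

        Check(String key) {
            this.key = key;
        }

        static Check byKey(String key) {
            for (Check c : values()) {
                if (c.key.equals(key)) {
                    return c;
                }
            }
            throw new IllegalArgumentException("Unknown setting: " + key);
        }
    }

    /** Returns the set of strict checks that are enabled for the given mode string. */
    static Set<Check> parse(String mode) {
        if (mode == null || mode.trim().isEmpty() || mode.trim().equals("strict")) {
            return EnumSet.allOf(Check.class);      // default: everything strict
        }
        if (mode.trim().equals("compatibility")) {
            return EnumSet.noneOf(Check.class);     // umbrella: old behavior everywhere
        }
        // Fine-grained form: name=yes|no pairs; unlisted checks stay strict (assumption).
        Set<Check> enabled = EnumSet.allOf(Check.class);
        for (String item : mode.split(",")) {
            String[] kv = item.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Malformed setting: " + item);
            }
            Check check = Check.byKey(kv[0].trim());
            String value = kv[1].trim();
            if (value.equals("yes")) {
                enabled.add(check);
            } else if (value.equals("no")) {
                enabled.remove(check);
            } else {
                throw new IllegalArgumentException("Expected yes or no: " + item);
            }
        }
        return enabled;
    }

    /** Reads the proposed (not existing) system property. */
    static Set<Check> fromSystemProperty() {
        return parse(System.getProperty("jdk.util.regex.mode"));
    }
}
```

With this sketch, parse("compatibility") returns an empty set, and the
fine-grained value from the example leaves only the control-char
restriction enabled.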

If the changes are implemented carefully, so that the individual
properties are "orthogonal", then we wouldn't need to test all possible
combinations, but only the two opposite modes: strict and compatibility.

Do you think it's a viable approach?

-- 
With kind regards,
Ivan Gerasimov


