RFR 8230365 : Pattern for a control-char matches non-control characters

Fri Aug 30 03:13:41 UTC 2019

Hi Stuart!

Thank you for your comments on the proposal!
I totally agree with you that the fixes that result in changing the 
behavior have to be carefully planned and well thought.
Please see my answers inline.

On 8/29/19 6:15 PM, Stuart Marks wrote:
> Hi Ivan,
>
> This change certainly makes regex patterns more rigorous, but I'm 
> concerned about the compatibility. This is a spec change and also a 
> behavior change. While the current behavior might not strictly be 
> correct, it does have some characteristics that applications might be 
> depending on -- perhaps even by accident. If this change is made, it 
> might cause subtle issues in applications that would be quite 
> difficult to diagnose.
>
There are two types of changes in the proposal:
First, make \cx construct more restrictive w.r.t. possible values of x.
Second, make it case-insensitive, so \cz will mean the same as \cZ.

I agree that the later type of changes can potentially cause hard to 
diagnose failures in the existing applications.
The reason for this later change was an attempt to make Java regexp more 
Perl-compatible (and Perl does exactly this:  Treats \cx with a 
lower-case x as \cX with its upper-case counterpart).

The former type of change, on the other hand, might actually be useful 
for the existing applications, as it may allow to see (otherwise hard to 
find) bugs in the code.

For example, if some existing application has a regexp, which contains 
"\\c\t", this is most likely a programming error (or a nasty hack to 
code a char 'I'), so it would be beneficial to report it via throwing a 
PatternSyntaxException.

> Examples of changes I'm concerned about are:
>
> pattern \ca currently matches '!' would now match \u0001
> pattern \cÀ currently matches \u0080 would now throw exception
> pattern \c0 currently matches 'p' would now throw exception
>
> and so forth. That is, using \c with characters in the range [a-z] 
> would now match different characters from before, and using \c with 
> characters outside the set that correspond to C0 control characters 
> would now throw an exception whereas before they would matching 
> something that was predictable, if in some sense incorrect.
>
> There are some ways to mitigate the incompatibility, for example, by 
> adding a system property, or by adding a Pattern flag that explicitly 
> enables this behavior, though I'm not sure that either is worthwhile. 
> Maybe there are less intrusive ways that I haven't thought of.
>
I think it would make sense to first separate the changes mentioned 
above, so it will be easier to reason about them.

While I think that making lower-case \\cx a synonym for the upper-case 
\\cX is a good thing to have (mainly for additional compatibility with 
Perl-style regexps), it's supposed be deferred to a later time.

>
> The current behavior seems to be have been established around 1999 
> (JDK 1.3?) so it's been around a long time, plenty of time for 
> applications to have formed inadvertent dependencies on this behavior. 
> An alternative would be simply to document the current behavior, even 
> though it's arguably incorrect.
>
> Is there some benefit to this change, for example, does it enable one 
> to write an application that wasn't possible before because of this bug?
>
In my opinion, the benefits are:
1) Fail-early approach, which will let the developers catch the bugs 
earlier,
2) Better compatibility with Perl-style regexps, so that a wider class 
of regular expressions can be used across different platforms,
3) Well defined strict rules will allow IDEs to implement additional 
edit-time checks of regexp, which will again help developers.

Please find the updated webrev here:
http://cr.openjdk.java.net/~igerasim/8230365/01/webrev/

With kind regards,
Ivan

> s'marks
>
>
> On 8/29/19 4:39 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> In a regular expression pattern a sequence of the form \\cx is 
>> allowed to specify a control character that corresponds to the name 
>> char x.
>>
>> Current implementation has a few issues with that:
>> 1)  It allows x to be just any character, including non-printable ones;
>> 2)  The produced regexp may correspond to a non-control characters;
>> 3)  The expression is case-sensitive, so, for example \\cA differs 
>> from \\ca, while they both have to produce ctrl-A.
>>
>> It is proposed to make parsing more strict and reject invalid values 
>> of x, and also clarify the documentation to explicitly list valid 
>> values of x.
>>
>> If we agree on this proposal, then a CSR will probably need to be 
>> filed to capture the changes in the regexp parsing.
>>
>> Would you please help review the fix?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8230365
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8230365/00/webrev/
>>
>
-- 
With kind regards,
Ivan Gerasimov