RFR 8230365 : Pattern for a control-char matches non-control characters

Fri Aug 30 01:15:27 UTC 2019

Hi Ivan,

This change certainly makes regex patterns more rigorous, but I'm concerned 
about the compatibility. This is a spec change and also a behavior change. While 
the current behavior might not strictly be correct, it does have some 
characteristics that applications might be depending on -- perhaps even by 
accident. If this change is made, it might cause subtle issues in applications 
that would be quite difficult to diagnose.

Examples of changes I'm concerned about are:

pattern \ca currently matches '!'; would now match \u0001
pattern \cÀ currently matches \u0080; would now throw an exception
pattern \c0 currently matches 'p'; would now throw an exception

and so forth. That is, using \c with characters in the range [a-z] would now 
match different characters from before, and using \c with characters outside the 
set that corresponds to C0 control characters would now throw an exception, 
whereas before they would match something that was predictable, if in some 
sense incorrect.
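
To make the arithmetic concrete, here is a rough sketch of the two mappings. 
It assumes the current behavior amounts to XOR-ing x with 64 (0x40), which is 
consistent with the three examples above. The class and method names are 
made up for illustration, and strictControl is only one possible reading of 
the proposal (fold ASCII letters to upper case, accept only '@' through '_'), 
not necessarily what the webrev does.

```java
// Sketch only: models the mapping applied to x in \cx, not the real Pattern code.
class ControlEscapeSketch {

    // Assumed legacy rule: flip bit 6 of x, whatever x is.
    static int legacyControl(int x) {
        return x ^ 0x40;
    }

    // Hypothetical strict rule: fold ASCII letters to upper case, then accept
    // only '@'..'_' (0x40..0x5F), which map onto the C0 controls 0x00..0x1F.
    static int strictControl(int x) {
        if (x >= 'a' && x <= 'z') {
            x -= 0x20;
        }
        if (x < 0x40 || x > 0x5F) {
            throw new IllegalArgumentException(
                    String.format("Illegal control escape: \\c U+%04X", x));
        }
        return x ^ 0x40;
    }

    public static void main(String[] args) {
        int[] samples = {'A', 'a', '0', 0x00C0};
        for (int x : samples) {
            String strict;
            try {
                strict = String.format("U+%04X", strictControl(x));
            } catch (IllegalArgumentException e) {
                strict = "rejected";
            }
            System.out.printf("x=U+%04X legacy=U+%04X strict=%s%n",
                    x, legacyControl(x), strict);
        }
    }
}
```

Under these assumptions only the upper-case letters and the few punctuation 
characters between '@' and '_' give the same result under both rules; every 
other x either maps to a different character or is rejected.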

There are some ways to mitigate the incompatibility, for example, by adding a 
system property, or by adding a Pattern flag that explicitly enables this 
behavior, though I'm not sure that either is worthwhile. Maybe there are less 
intrusive ways that I haven't thought of.

The current behavior seems to have been established around 1999 (JDK 1.3?) so 
it's been around a long time, plenty of time for applications to have formed 
inadvertent dependencies on this behavior. An alternative would be simply to 
document the current behavior, even though it's arguably incorrect.

Is there some benefit to this change, for example, does it enable one to write 
an application that wasn't possible before because of this bug?

s'marks

On 8/29/19 4:39 PM, Ivan Gerasimov wrote:
> Hello!
> 
> In a regular expression pattern a sequence of the form \\cx is allowed to 
> specify a control character that corresponds to the name char x.
> 
> Current implementation has a few issues with that:
> 1)  It allows x to be just any character, including non-printable ones;
> 2)  The produced regexp may correspond to non-control characters;
> 3)  The expression is case-sensitive, so, for example \\cA differs from \\ca, 
> while they both have to produce ctrl-A.
> 
> It is proposed to make parsing more strict and reject invalid values of x, and 
> also clarify the documentation to explicitly list valid values of x.
> 
> If we agree on this proposal, then a CSR will probably need to be filed to 
> capture the changes in the regexp parsing.
> 
> Would you please help review the fix?
> 
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8230365
> WEBREV: http://cr.openjdk.java.net/~igerasim/8230365/00/webrev/
>