RFR 8230365 : Pattern for a control-char matches non-control characters

Ivan Gerasimov ivan.gerasimov at oracle.com
Thu Sep 5 01:49:41 UTC 2019


Thank you Martin!

On 8/30/19 6:19 PM, Martin Buchholz wrote:
> There's a strong expectation that ctrl-A and ctrl-a both map to 
> \u0001, so I support Ivan's initiative.
> I'm surprised java regex gets this wrong.
> Might need a transitional system property.
>
Right.  I think it would be best to introduce two system properties:

The first turns the restrictions on control-char names on or off.
It will be enabled by default and will permit only names from a limited
list: capital letters and a few other special characters.

The second enables mapping of lower-case control-char names to
their upper-case counterparts.  This option should be turned off by
default for the current release of the JDK, and then turned on by
default for some subsequent release (when, presumably, most applications
that use this kind of regexp have been fixed).

This all feels like a little bit too much for such a rarely used 
feature, but probably is a proper thing to do anyway :-)

If we have an agreement on these system properties, I can create a 
separate test to verify all possible combinations.
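
To make the four combinations concrete, here is a rough sketch of how the
two switches might interact.  It is only an illustration, not the code in
the webrev: the class name, method name and boolean parameters are
hypothetical stand-ins for the (not yet named) system properties, and
rejecting a disallowed name with an exception is an assumption.

```java
// Hypothetical sketch; names and the exception type are assumptions.
final class ControlCharPolicySketch {

    /**
     * Resolves the character following "\c" to the control character it names.
     *
     * @param ch            the character that follows "\c" in the pattern
     * @param restrictNames stand-in for the first proposed property
     * @param mapLowerCase  stand-in for the second proposed property
     */
    static int resolve(int ch, boolean restrictNames, boolean mapLowerCase) {
        if (mapLowerCase && ch >= 'a' && ch <= 'z') {
            ch -= 0x20;                      // fold 'a'..'z' to 'A'..'Z'
        }
        // Permitted names are assumed to be '?' (0x3f) through '_' (0x5f).
        if (restrictNames && (((ch - 0x3f) | (0x5f - ch)) < 0)) {
            throw new IllegalArgumentException(
                    "Illegal control-char name: " + (char) ch);
        }
        return ch ^ 0x40;                    // the existing XOR-with-64 mapping
    }

    public static void main(String[] args) {
        // restrict off, mapping off: 'a' resolves to 0x21 ('!'),
        // which is not a control character.
        System.out.println(Integer.toHexString(resolve('a', false, false)));
        // restrict on, mapping on: 'a' is folded to 'A' and resolves to 0x1.
        System.out.println(Integer.toHexString(resolve('a', true, true)));
    }
}
```

With both switches off, `a` resolves to 0x21, i.e. a non-control
character, which is the behavior this bug is about.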


> What's the best bit-twiddle?  Untested:
> if ((c ^= 0x40) < 0x20) return c;
> if ((c ^= 0x20) <= 26 && c > 0) return c;
>
> 0x40 is more readable than 64.
>
`((ch-0x3f)|(0x5f-ch)) >= 0` does the trick for regular (non-lower-case)
ids: it is true exactly when ch is in the range 0x3f ('?') to 0x5f ('_').
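
For readers who want to try the range check in isolation, a minimal
sketch follows.  The class and method names are made up for the example,
and the XOR with 0x40 is taken from the quoted snippet above rather than
from the webrev itself.

```java
// Illustrative only; class and method names are hypothetical.
final class ControlCharRangeSketch {

    /** Returns true when ch is in ['?', '_'], i.e. 0x3f..0x5f. */
    static boolean isRegularControlId(int ch) {
        // ch < 0x3f makes the first term negative, ch > 0x5f the second.
        // OR-ing the two keeps the sign bit if either is negative, so a
        // single comparison checks both bounds.
        return ((ch - 0x3f) | (0x5f - ch)) >= 0;
    }

    /** Maps a regular control-char id to its control character. */
    static int toControlChar(int ch) {
        if (!isRegularControlId(ch)) {
            throw new IllegalArgumentException(
                    "Not a regular control-char id: " + (char) ch);
        }
        return ch ^ 0x40;   // '@' -> 0x00, 'A' -> 0x01, '_' -> 0x1f, '?' -> 0x7f
    }

    public static void main(String[] args) {
        System.out.println(isRegularControlId('A'));                  // true
        System.out.println(isRegularControlId('a'));                  // false
        System.out.println(Integer.toHexString(toControlChar('?')));  // 7f
    }
}
```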

> Does ctrl-? get mapped to 0x7f?
>
Yes. I've got it in the test at the end of line 4997.
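
A quick way to check this through the public API is shown below.  This
is a standalone sketch with a made-up class name, not the test from the
webrev; it only uses `java.util.regex.Pattern.matches`.

```java
import java.util.regex.Pattern;

// Standalone illustration; not part of the webrev's regression test.
final class ControlCharRegexSketch {

    /** Returns true when the regex "\cX" matches exactly the character c. */
    static boolean controlMatches(char name, char c) {
        String regex = "\\c" + name;            // e.g. the two-step escape \cA
        return Pattern.matches(regex, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(controlMatches('A', (char) 0x01));  // ctrl-A
        System.out.println(controlMatches('?', (char) 0x7f));  // ctrl-? (DEL)
    }
}
```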

Would you please help review the updated webrev:

http://cr.openjdk.java.net/~igerasim/8230365/02/webrev/

With kind regards,

Ivan



