RFR 8230365 : Pattern for a control-char matches non-control characters

Martin Buchholz martinrb at google.com
Thu Sep 5 04:00:44 UTC 2019


Thanks, Ivan.  We're mostly in agreement.

+     * If {@code true} then lower-case control-character ids are mapped to the
+     * their upper-case counterparts.

Extra "the".

After all these decades I only now realize that c ^= 0x40 moves '?' to the
end of the ASCII range and all the other controls to the start!
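
To make that concrete, here is a minimal sketch. The class and the helper
name toControl are made up for illustration and are not from the webrev.
It also shows the bug under review: a lower-case id lands on a printable
character instead of a control.

```java
class ControlXorDemo {
    /** Hypothetical helper: flips bit 0x40 of a control-character id. */
    static int toControl(int id) {
        return id ^ 0x40;
    }

    public static void main(String[] args) {
        // '@'..'_' (0x40..0x5f) land on 0x00..0x1f, the start of ASCII.
        System.out.printf("'@' -> 0x%02x%n", toControl('@')); // 0x00
        System.out.printf("'A' -> 0x%02x%n", toControl('A')); // 0x01
        System.out.printf("'_' -> 0x%02x%n", toControl('_')); // 0x1f
        // '?' (0x3f) lands on DEL, the end of the ASCII range.
        System.out.printf("'?' -> 0x%02x%n", toControl('?')); // 0x7f
        // 'a' (0x61) lands on '!' (0x21), which is not a control character.
        System.out.printf("'a' -> 0x%02x%n", toControl('a')); // 0x21
    }
}
```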

Should we support lower-case controls?  Compatibility with Perl regex still
matters, but a lot less than in 2003.  The key point is that we previously
got the WRONG ANSWER, so when we restrict the control ids, let's just make
lower-case controls syntax errors.  Silently changing behavior is bad
for users.  So let's abandon ALLOW_LOWERCASE_CONTROL_CHAR_IDS.

An alternative:
int ch = read() ^ 0x40;
if (!RESTRICTED_CONTROL_CHAR_IDS || ch < 0x20 || ch == 0x7f) return ch;
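
Spelled out as a self-contained sketch, untested against the real Pattern
code: the class name, the method parseControlId, and the boolean parameter
standing in for RESTRICTED_CONTROL_CHAR_IDS are all hypothetical, and
IllegalArgumentException is only a placeholder for whatever syntax error
the parser would actually raise.

```java
class ControlIdSketch {
    /**
     * Hypothetical stand-in for the control-id branch of the parser.
     * 'id' is the character following \c; 'restricted' plays the role
     * of RESTRICTED_CONTROL_CHAR_IDS.
     */
    static int parseControlId(int id, boolean restricted) {
        int ch = id ^ 0x40;
        // Accept only results that are real ASCII controls (0x00..0x1f or DEL),
        // unless the restriction is switched off.
        if (!restricted || ch < 0x20 || ch == 0x7f) {
            return ch;
        }
        // Placeholder for a pattern syntax error (assumption, see above).
        throw new IllegalArgumentException(
                "Illegal control escape id: 0x" + Integer.toHexString(id));
    }

    public static void main(String[] args) {
        System.out.println(parseControlId('A', true));  // 1
        System.out.println(parseControlId('?', true));  // 127
        System.out.println(parseControlId('a', false)); // 33, the old behavior
        try {
            parseControlId('a', true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the restriction on, the accepted ids are exactly 0x3f through 0x5f,
so lower-case ids become errors rather than silently changing meaning.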



On Wed, Sep 4, 2019 at 6:49 PM Ivan Gerasimov <ivan.gerasimov at oracle.com>
wrote:

> Thank you Martin!
>
> On 8/30/19 6:19 PM, Martin Buchholz wrote:
> > There's a strong expectation that ctrl-A and ctrl-a both map to
> > \u0001, so I support Ivan's initiative.
> > I'm surprised java regex gets this wrong.
> > Might need a transitional system property.
> >
> Right.  I think it would be best to introduce two system properties:
>
> The first, to turn on/off the restrictions on the control-char names.
> This will be enabled by default, and will permit names from the limited
> list: capital letters and a few other special characters.
>
> The second one, to enable mapping of lower-case control-char names to
> their upper-case counterpart.  This option should be turned off by
> default for the current release of JDK, and then turned on by default
> for some subsequent release (when, presumably, most applications that
> use this kind of regexp are fixed).
>
> This all feels like a little bit too much for such a rarely used
> feature, but probably is a proper thing to do anyway :-)
>
> If we have an agreement on these system properties, I can create a
> separate test to verify all possible combinations.
>
>
> > What's the best bit-twiddle?  Untested:
> > if ((c ^= 0x40) < 0x20) return c;
> > if ((c ^= 0x20) <= 26 && c > 0) return c;
> >
> > 0x40 is more readable than 64.
> >
> `((ch-0x3f)|(0x5f-ch)) >= 0` does the trick for regular (non-lower-case)
> ids.
>
> > Does ctrl-? get mapped to 0x7f?
> >
> Yes. I've got it in the test at the end of line 4997.
>
> Would you please help review the updated webrev:
>
> http://cr.openjdk.java.net/~igerasim/8230365/02/webrev/
>
> With kind regards,
>
> Ivan
>
>
>

