RFR 8230365 : Pattern for a control-char matches non-control characters
Bernd Eckenfels
ecki at zusammenkunft.net
Thu Sep 5 05:54:58 UTC 2019
Hello,
Since not all combinations make sense (e.g. exception plus conversion), a single multi-valued property might be better:
jdk.regex.control=WARN|EXCEPTION|STANDARD|LEGACY
With EXCEPTION generating an error, STANDARD being the planned new default (treating upper and lower case the same and raising an error on all undefined chars), LEGACY being the manual fallback to the current behavior, and WARN being the same fallback but with logging.
I guess some form of early feedback like EXCEPTION or WARN is needed, even if we are between a rock and a hard place. Maybe have at least one iteration where it defaults to LEGACY (plus a Release Notes announcement), then WARN, and then finally STANDARD?
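A minimal sketch of how such a multi-valued property might be read, assuming the proposed name jdk.regex.control (it is only a suggestion in this thread, not an existing JDK property). The class ControlCharMode and its methods are hypothetical names for illustration:

```java
import java.util.Locale;

// Hypothetical helper: parses the proposed (not existing) jdk.regex.control property.
class ControlCharMode {
    enum Mode { WARN, EXCEPTION, STANDARD, LEGACY }

    // Returns the mode named by value, or fallback when value is missing or unknown.
    static Mode parse(String value, Mode fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    // First iteration of the suggested rollout: default to LEGACY.
    static Mode current() {
        return parse(System.getProperty("jdk.regex.control"), Mode.LEGACY);
    }
}
```

Changing the default in later releases would then only mean changing the fallback passed in current().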
Regards,
Bernd
--
http://bernd.eckenfels.net
________________________________
From: core-libs-dev <core-libs-dev-bounces at openjdk.java.net> on behalf of Ivan Gerasimov <ivan.gerasimov at oracle.com>
Sent: Thursday, September 5, 2019 4:00 AM
To: Martin Buchholz; Stuart Marks
Cc: core-libs-dev
Subject: Re: RFR 8230365 : Pattern for a control-char matches non-control characters
Thank you Martin!
On 8/30/19 6:19 PM, Martin Buchholz wrote:
> There's a strong expectation that ctrl-A and ctrl-a both map to
> \u0001, so I support Ivan's initiative.
> I'm surprised java regex gets this wrong.
> Might need a transitional system property.
>
Right. I think it would be best to introduce two system properties:
The first, to turn on/off the restrictions on the control-char names.
This will be enabled by default, and will permit names from the limited
list: capital letters and a few other special characters.
The second one, to enable mapping of lower-case control-char names to
their upper-case counterpart. This option should be turned off by
default for the current release of JDK, and then turned on by default
for some subsequent release (when, presumably, most applications that
use this kind of regexp are fixed).
This all feels like a little bit too much for such a rarely used
feature, but it is probably the proper thing to do anyway :-)
If we have an agreement on these system properties, I can create a
separate test to verify all possible combinations.
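To make the combinations concrete, here is a rough sketch of one reading of the two switches. The class and method names are hypothetical, the flags are passed as plain booleans rather than read from system properties (no property names were proposed), and the fallback `ch ^ 0x40` is assumed to be the current behavior:

```java
// Hypothetical sketch of the two proposed switches; not the actual Pattern code.
class ControlCharOptions {
    // Limited list: '?', '@', 'A'..'Z', '[', '\\', ']', '^', '_' (0x3f..0x5f).
    static boolean isRegularName(int ch) {
        return ch >= 0x3f && ch <= 0x5f;
    }

    // restrict: reject names outside the limited list.
    // mapLower: treat 'a'..'z' like 'A'..'Z'.
    static int resolve(int ch, boolean restrict, boolean mapLower) {
        if (mapLower && ch >= 'a' && ch <= 'z') {
            ch -= 0x20; // fold to the upper-case counterpart
        }
        if (restrict && !isRegularName(ch)) {
            throw new IllegalArgumentException(
                "Illegal control-char name: " + (char) ch);
        }
        return ch ^ 0x40; // assumed legacy mapping
    }
}
```

Under these assumptions, `resolve('a', false, false)` gives 0x21 (the non-control character '!'), `resolve('a', true, true)` gives 0x01, and `resolve('a', true, false)` throws.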
> What's the best bit-twiddle? Untested:
> if ((c ^= 0x40) < 0x20) return c;
> if ((c ^=0x20) <= 26 && c > 0) return c;
>
> 0x40 is more readable than 64.
>
`((ch-0x3f)|(0x5f-ch)) >= 0` does the trick for regular (non-lower-case)
ids.
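For readers unfamiliar with the idiom: the OR of the two differences is negative exactly when either difference is negative, so the expression is a branch-free test for 0x3f <= ch <= 0x5f. A small sketch (class and method names are hypothetical) that checks this against the plain comparison:

```java
// Hypothetical demo: the bit trick is equivalent to a plain range check.
class RangeCheck {
    static boolean bitTrick(int ch) {
        // The sign bit of (a | b) is set if a or b is negative.
        return ((ch - 0x3f) | (0x5f - ch)) >= 0;
    }

    static boolean plain(int ch) {
        return ch >= 0x3f && ch <= 0x5f;
    }

    // Compares both forms over all valid code points (no int overflow in this range).
    static boolean agree() {
        for (int ch = 0; ch <= Character.MAX_CODE_POINT; ch++) {
            if (bitTrick(ch) != plain(ch)) {
                return false;
            }
        }
        return true;
    }
}
```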
> Does ctrl-? get mapped to 0x7f?
>
Yes. I've got it in the test, at the end of line 4997.
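For illustration, Martin's two untested lines can be wrapped into a runnable method. On their own they do not accept '?', so this sketch adds an explicit case for it; the class name, the -1 return for rejected names, and the '?' branch are assumptions, not code from the webrev:

```java
// Hypothetical wrapper around the quoted bit-twiddle; not the webrev code.
class ControlCharTwiddle {
    // Returns the control character for name c (a non-negative code point),
    // or -1 if c is not an accepted control-char name.
    static int map(int c) {
        if (c == '?') {
            return 0x7f;                  // ctrl-? is DEL (0x3f ^ 0x40)
        }
        if ((c ^= 0x40) < 0x20) {
            return c;                     // '@', 'A'..'Z', '[', '\\', ']', '^', '_'
        }
        if ((c ^= 0x20) <= 26 && c > 0) {
            return c;                     // 'a'..'z' map like 'A'..'Z'
        }
        return -1;
    }
}
```

With this sketch, `map('A')` and `map('a')` both return 0x01, `map('[')` returns 0x1b, and names such as '1' or '{' return -1.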
Would you please help review the updated webrev:
http://cr.openjdk.java.net/~igerasim/8230365/02/webrev/
With kind regards,
Ivan