RFR 8230365 : Pattern for a control-char matches non-control characters

Martin Buchholz martinrb at google.com
Sat Aug 31 01:19:25 UTC 2019


There's a strong expectation that ctrl-A and ctrl-a both map to \u0001, so
I support Ivan's initiative.
I'm surprised java regex gets this wrong.
Might need a transitional system property.

What's the best bit-twiddle?  Untested:
if ((c ^= 0x40) < 0x20) return c;
if ((c ^= 0x20) <= 26 && c > 0) return c;

0x40 is more readable than 64.

Does ctrl-? get mapped to 0x7f?
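
To make the twiddle and the ctrl-? question concrete, here is a minimal sketch that wraps the two untested lines in a method. The class name, the method name, and the -1 "not accepted" return value are assumptions for illustration only; none of this is JDK code, and the input is assumed to be a non-negative code point.

```java
final class CtrlTwiddleSketch {
    // Hypothetical helper, not JDK code: maps the name char x of \cx to a
    // control-character code, or returns -1 when x is not accepted.
    // Assumes c is a non-negative code point.
    static int toControl(int c) {
        if ((c ^= 0x40) < 0x20) return c;          // '@'..'_' -> 0x00..0x1f
        if ((c ^= 0x20) <= 26 && c > 0) return c;  // 'a'..'z' -> 0x01..0x1a
        return -1;                                 // anything else is rejected
    }

    public static void main(String[] args) {
        for (char x : new char[] {'A', 'a', 'Z', 'z', '@', '?', '0'}) {
            System.out.printf("\\c%c -> %d%n", x, toControl(x));
        }
    }
}
```

As written, '?' falls through to -1. Since '?' ^ 0x40 is 0x7f, accepting ctrl-? as DEL would need one more branch.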


On Thu, Aug 29, 2019 at 8:15 PM Ivan Gerasimov <ivan.gerasimov at oracle.com>
wrote:

> Hi Stuart!
>
> Thank you for your comments on the proposal!
> I totally agree with you that fixes that result in changed behavior
> have to be carefully planned and well thought out.
> Please see my answers inline.
>
> On 8/29/19 6:15 PM, Stuart Marks wrote:
> > Hi Ivan,
> >
> > This change certainly makes regex patterns more rigorous, but I'm
> > concerned about the compatibility. This is a spec change and also a
> > behavior change. While the current behavior might not strictly be
> > correct, it does have some characteristics that applications might be
> > depending on -- perhaps even by accident. If this change is made, it
> > might cause subtle issues in applications that would be quite
> > difficult to diagnose.
> >
> There are two types of changes in the proposal:
> First, make \cx construct more restrictive w.r.t. possible values of x.
> Second, make it case-insensitive, so \cz will mean the same as \cZ.
>
> I agree that the latter type of change can potentially cause
> hard-to-diagnose failures in existing applications.
> The reason for this latter change was an attempt to make Java regexp more
> Perl-compatible (and Perl does exactly this: it treats \cx with a
> lower-case x as \cX with its upper-case counterpart).
>
> The former type of change, on the other hand, might actually be useful
> for existing applications, as it may expose otherwise hard-to-find
> bugs in the code.
>
> For example, if some existing application has a regexp, which contains
> "\\c\t", this is most likely a programming error (or a nasty hack to
> code a char 'I'), so it would be beneficial to report it via throwing a
> PatternSyntaxException.
>
>
> > Examples of changes I'm concerned about are:
> >
> > pattern \ca currently matches '!', would now match \u0001
> > pattern \cÀ currently matches \u0080, would now throw exception
> > pattern \c0 currently matches 'p', would now throw exception
> >
> > and so forth. That is, using \c with characters in the range [a-z]
> > would now match different characters from before, and using \c with
> > characters outside the set that correspond to C0 control characters
> > would now throw an exception, whereas before they would match
> > something that was predictable, if in some sense incorrect.
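
The examples above are consistent with a rule that simply XORs the name character with 0x40 and does no validation. The sketch below models that rule as an assumption about the existing behavior. It does not call java.util.regex, and the class and method names are made up for illustration.

```java
final class CurrentCtrlModel {
    // Assumed model of the existing behavior: the name char of \cx is
    // XORed with 0x40, with no check that the result is a control char.
    static int currentMapping(int x) {
        return x ^ 0x40;
    }

    public static void main(String[] args) {
        // 'A', 'a', U+00C0 (A with grave), '0', and a literal tab
        int[] names = {'A', 'a', 0xC0, '0', '\t'};
        for (int x : names) {
            System.out.printf("\\c U+%04X -> U+%04X%n", x, currentMapping(x));
        }
    }
}
```

Under this model 'a' gives '!' (0x21), U+00C0 gives U+0080, '0' gives 'p' (0x70), and a tab gives 'I' (0x49), which agrees with the cases quoted in this thread.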
> >
> > There are some ways to mitigate the incompatibility, for example, by
> > adding a system property, or by adding a Pattern flag that explicitly
> > enables this behavior, though I'm not sure that either is worthwhile.
> > Maybe there are less intrusive ways that I haven't thought of.
> >
> I think it would make sense to first separate the changes mentioned
> above, so it will be easier to reason about them.
>
> While I think that making lower-case \\cx a synonym for the upper-case
> \\cX is a good thing to have (mainly for additional compatibility with
> Perl-style regexps), it's supposed to be deferred to a later time.
>
> >
> > The current behavior seems to have been established around 1999
> > (JDK 1.3?) so it's been around a long time, plenty of time for
> > applications to have formed inadvertent dependencies on this behavior.
> > An alternative would be simply to document the current behavior, even
> > though it's arguably incorrect.
> >
> > Is there some benefit to this change, for example, does it enable one
> > to write an application that wasn't possible before because of this bug?
> >
> In my opinion, the benefits are:
> 1) Fail-early approach, which will let the developers catch the bugs
> earlier,
> 2) Better compatibility with Perl-style regexps, so that a wider class
> of regular expressions can be used across different platforms,
> 3) Well defined strict rules will allow IDEs to implement additional
> edit-time checks of regexp, which will again help developers.
>
> Please find the updated webrev here:
> http://cr.openjdk.java.net/~igerasim/8230365/01/webrev/
>
> With kind regards,
> Ivan
>
>
> > s'marks
> >
> >
> > On 8/29/19 4:39 PM, Ivan Gerasimov wrote:
> >> Hello!
> >>
> >> In a regular expression pattern a sequence of the form \\cx is
> >> allowed to specify a control character that corresponds to the name
> >> char x.
> >>
> >> The current implementation has a few issues with that:
> >> 1)  It allows x to be just any character, including non-printable ones;
> >> 2)  The produced regexp may correspond to non-control characters;
> >> 3)  The expression is case-sensitive, so, for example \\cA differs
> >> from \\ca, while they both have to produce ctrl-A.
> >>
> >> It is proposed to make parsing more strict and reject invalid values
> >> of x, and also clarify the documentation to explicitly list valid
> >> values of x.
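
A minimal sketch of what stricter parsing could look like, assuming the accepted name characters are exactly '@' through '_' (the ones that XOR to the C0 range 0x00..0x1f). The actual set in the webrev may differ, and lower-case names are deliberately left out here. The class name, method name, and exception message are hypothetical.

```java
final class StrictCtrlSketch {
    // Hypothetical strict variant: accept only '@'..'_' as the name char
    // of \cx and reject everything else up front.
    static int strictControl(int x) {
        if (x >= '@' && x <= '_') {
            return x ^ 0x40;   // 0x00..0x1f
        }
        // A real parser would presumably throw PatternSyntaxException, which
        // is a subclass of IllegalArgumentException.
        throw new IllegalArgumentException(
                String.format("Illegal control escape name: U+%04X", x));
    }

    public static void main(String[] args) {
        System.out.println(strictControl('A'));   // accepted
        try {
            strictControl('\t');                  // rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a pattern containing "\\c\t" would fail at compile time instead of silently matching 'I'.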
> >>
> >> If we agree on this proposal, then a CSR will probably need to be
> >> filed to capture the changes in the regexp parsing.
> >>
> >> Would you please help review the fix?
> >>
> >> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8230365
> >> WEBREV: http://cr.openjdk.java.net/~igerasim/8230365/00/webrev/
> >>
> >
> --
> With kind regards,
> Ivan Gerasimov
>
>

