RFR 8197462 : Inconsistent exception messages for invalid capturing group names
Hello!

A capturing group name can be used in a regular expression in two contexts: when introducing a group, (?<name>...), or when referring to it, \k<name>. If the name is invalid (i.e. it does not start with a Latin letter, or it contains invalid characters), we may see different error messages, some of which look confusing.

Here are examples of the messages produced by the current JDK:

Unknown look-behind group near index 3
(?<>)
   ^
named capturing group is missing trailing '>' near index 4
\\k<>
    ^
Unknown look-behind group near index 4
(?<.>)
    ^
(named capturing group <.> does not exit near index 4
\\k<.>
    ^
named capturing group is missing trailing '>' near index 4
(?<a.>)
    ^
named capturing group is missing trailing '>' near index 4
\\k<a.>
    ^

In particular, this inconsistency arises because the internal Pattern.groupname() function lacks a check for the very first character of the name, so when \k<name> is parsed, the first char is always accepted, no matter what it is.

Some cleanup was also done along the way.

Would you please help review the fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8197462
WEBREV: http://cr.openjdk.java.net/~igerasim/8197462/00/webrev/

Thanks in advance!

--
With kind regards,
Ivan Gerasimov
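For anyone who wants to see the messages on their own JDK, here is a minimal sketch that compiles each of the patterns above and prints the description of the resulting PatternSyntaxException. The class name GroupNameMessages and the helper describe() are made up for this illustration; the exact wording printed depends on the JDK version in use.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, not part of the JDK or of the webrev.
class GroupNameMessages {

    // Returns the error description for an invalid pattern,
    // or null if the pattern compiles successfully.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Java string literals: "\\k<>" is the regex \k<>
        String[] patterns = {
            "(?<>)", "\\k<>", "(?<.>)", "\\k<.>", "(?<a.>)", "\\k<a.>"
        };
        for (String p : patterns) {
            // The description text differs between JDK versions.
            System.out.println(p + " -> " + describe(p));
        }
    }
}
```

All six patterns should be rejected regardless of the JDK version; only the wording of the description is expected to differ.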
Hi Ivan,

The code that handles group names was added later. So "historically" those cases trigger "Unknown look-behind group" when the first character after "<" is not "=" or "!". With the addition of group name support, it's actually hard to say which is more accurate: an incorrect group name or an incorrect look-behind. Sure, with a trailing ">" it might be preferable to lean toward the group name.

It's definitely a bug not to check whether the first char is alphabetic for \\k<.

I'm fine with the proposed change.

Thanks,
Sherman

On 2/8/18, 8:32 PM, Ivan Gerasimov wrote:
Hello!
A capturing group name can be used in a regular expression in two contexts: when introducing a group, (?<name>...), or when referring to it, \k<name>. If the name is invalid (i.e. it does not start with a Latin letter, or it contains invalid characters), we may see different error messages, some of which look confusing.
Here are examples of the messages produced by the current JDK:

Unknown look-behind group near index 3
(?<>)
   ^
named capturing group is missing trailing '>' near index 4
\\k<>
    ^
Unknown look-behind group near index 4
(?<.>)
    ^
(named capturing group <.> does not exit near index 4
\\k<.>
    ^
named capturing group is missing trailing '>' near index 4
(?<a.>)
    ^
named capturing group is missing trailing '>' near index 4
\\k<a.>
    ^

In particular, this inconsistency arises because the internal Pattern.groupname() function lacks a check for the very first character of the name, so when \k<name> is parsed, the first char is always accepted, no matter what it is.
Some cleanup was also done along the way.
Would you please help review the fix?
BUGURL: https://bugs.openjdk.java.net/browse/JDK-8197462
WEBREV: http://cr.openjdk.java.net/~igerasim/8197462/00/webrev/
Thanks in advance!
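
As a rough illustration of the check discussed in this thread, the sketch below validates a group name by testing the first character separately from the rest. It is not the code from the webrev and not the JDK's Pattern.groupname(); the class GroupNameCheck and its methods are hypothetical. It assumes the rule given in the java.util.regex.Pattern documentation: the first character must be a letter in A-Z or a-z, and the remaining characters may be those letters or the digits 0-9.

```java
// Hypothetical standalone validator, for illustration only.
class GroupNameCheck {

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // True if the name is non-empty, starts with an ASCII letter,
    // and continues with ASCII letters or digits only.
    static boolean isValidGroupName(String name) {
        if (name.isEmpty() || !isAsciiLetter(name.charAt(0))) {
            return false; // the first-character check that \k<name> lacked
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!isAsciiLetter(c) && !isAsciiDigit(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String n : new String[] {"name", "a1", "", ".", "a.", "1a"}) {
            System.out.println("<" + n + "> valid: " + isValidGroupName(n));
        }
    }
}
```

Applying the same predicate in both contexts, (?<name>...) and \k<name>, is one way to make an invalid name fail the same way in either place.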
participants (2)
- Ivan Gerasimov
- Xueming Shen