Regex named-group and backreference syntax
Xueming Shen
Xueming.Shen at Sun.COM
Wed Sep 2 07:15:13 UTC 2009
Hi Alan,
It would be an "ambiguity" (and a source of confusion) only if \k<n>
and $<n> were legally supported as group reference syntax :-) That said,
I have to admit that there is no value-add in allowing a group name to
begin with a digit character. So if we have a consensus, I would be
happy to change the spec/implementation to disallow group names that
start with a digit.
I kinda disagree that the "rest of the named-group syntax" is copied
from .NET. Actually it is the syntax of Perl 5.10.0's named capture
buffers, in which the naming syntax is (?<NAME>...) and the
backreference is \k<NAME>. I did not find a "reference to a named
capture buffer in the replacement" there. I did consider using the .NET
syntax, but finally decided to go with $<name> because it is more
consistent with the (?<name>...) and \k<name> syntax.
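For readers trying this on a released JDK, here is a minimal sketch of the
(?<name>...) and \k<name> forms discussed above. The class and method names
are hypothetical, chosen only for the example. Note one assumption about
versions: the java.util.regex that shipped in Java 7 and later accepts
${name} in the replacement string, so the sketch uses that form rather than
the $<name> form under discussion in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the proposal itself.
class NamedGroupSketch {
    // (?<word>...) names the group; \k<word> refers back to it inside the regex.
    static final Pattern DOUBLED =
            Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b");

    // Collapses an immediately repeated word into a single occurrence.
    static String collapse(String input) {
        // ${word} is the replacement-string syntax in released Java (7+).
        return DOUBLED.matcher(input).replaceAll("${word}");
    }

    // Returns the text captured by the named group, or null if nothing matches.
    static String firstDoubled(String input) {
        Matcher m = DOUBLED.matcher(input);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(collapse("this is is a test"));     // this is a test
        System.out.println(firstDoubled("this is is a test")); // is
    }
}
```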
Allowing \k<n> and $<n> is a fine idea; at the least it looks less
"complicated" in the replacement case.
Sherman
Alan Moore wrote:
> Looking at the new named-capture feature, two things jump out at me.
> The first is that the rules governing group names make "0", "1", "2",
> etc. valid names. That's bound to cause confusion, as programmers use
> \k<1> in the regex, or $<1> in the replacement string, meaning them as
> ordinal backreferences. It will be even worse if they actually have a
> group named "1", which may or may not be the first (numbered) group.
>
> Does this ambiguity add any value to offset the potential confusion?
> Because it seems to me we could add even more value by disallowing
> names that start with digits. We could still allow \k<1> and $<1> and
> such as backreferences, but they would be aliases for \1 and $1
> respectively. The advantage is that a backreference in one of those
> forms could be followed by another digit and there would be no danger
> of forming a different capture-group reference.
>
> For example, $10 could mean group(1) followed by zero, or group(10) if
> the regex has that many groups. If it's group(1) you want, you can
> escape the zero with a backslash to make that clear. But what if you
> really mean group(10) but there's no such group? You won't be
> notified of your error, because the Matcher assumes you meant group(1)
> plus "0". But with \k<1> and $<1> there's no ambiguity and no need to
> escape anything.
>
> My other concern is the syntax of backreferences in the replacement
> string: $<name>. Surveying the other major players (i.e.,
> named-capture-enabled regex flavors associated with popular
> programming languages), ${name} seems to be the most common
> syntax--though there aren't a whole lot of data points yet, I admit.
> Most significantly, .NET does it that way, and we're copying them on
> the rest of the named-group syntax already, so why not on this? Also,
> I don't know of any other flavor that uses the $<name> syntax.
>
> To summarize, I want to:
>
> - change the replacement-string backreference syntax from $<name> to ${name}
>
> - disallow group names starting with digits
>
> - allow backreferences of the form \k<n> and ${n} where 'n' is one or
> more digits, but interpret them as ordinal instead of named references
> (and throw an exception if there's no such group).
>
> Thoughts?
>
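The $10 case described in the quoted message can be reproduced with a small
sketch against released Java, using String.replaceAll. The class and method
names are made up for the illustration, and the square brackets in the
replacement strings are only there to make the substituted text easy to see.

```java
// Hypothetical demo class illustrating how "$10" is resolved in a replacement.
class GroupTenSketch {
    // Only one group exists, so "$10" is read as group 1 followed by a literal '0'.
    static String oneGroup() {
        return "abc".replaceAll("(a)", "[$10]");
    }

    // Escaping the zero with a backslash states the same intent explicitly.
    static String escapedZero() {
        return "abc".replaceAll("(a)", "[$1\\0]");
    }

    // With ten groups present, the same "$10" now refers to group 10.
    static String tenGroups() {
        return "abcdefghij".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)", "[$10]");
    }

    public static void main(String[] args) {
        System.out.println(oneGroup());    // [a0]bc
        System.out.println(escapedZero()); // [a0]bc
        System.out.println(tenGroups());   // [j]
    }
}
```

The first call raises no error even if the author meant group 10, which is
the silent fallback the quoted message is concerned about.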
More information about the core-libs-dev
mailing list