Regex named-group and backreference syntax

Alan Moore uncle.alice at gmail.com
Wed Sep 2 05:39:13 UTC 2009


Looking at the new named-capture feature, two things jump out at me.
The first is that the rules governing group names make "0", "1", "2",
etc. valid names.  That's bound to cause confusion, because programmers
will write \k<1> in the regex, or $<1> in the replacement string, and
expect them to work as ordinal backreferences.  It will be even worse
if they actually have a group named "1", which may or may not be the
first (numbered) group.

Does this ambiguity add any value to offset the potential confusion?
Because it seems to me we could add even more value by disallowing
names that start with digits.  We could still allow \k<1> and $<1> and
such as backreferences, but they would be aliases for \1 and $1
respectively.  The advantage is that a backreference in one of those
forms could be followed by another digit and there would be no danger
of forming a different capture-group reference.

For example, $10 could mean group(1) followed by zero, or group(10) if
the regex has that many groups.  If it's group(1) you want, you can
escape the zero with a backslash to make that clear.  But what if you
really mean group(10) but there's no such group?  You won't be
notified of your error, because the Matcher assumes you meant group(1)
plus "0".  But with \k<1> and $<1> there's no ambiguity and no need to
escape anything.
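
Here is a minimal sketch of the $10 behavior described above, using
only the existing numbered-group syntax.  The class name, method names
and sample patterns are made up for illustration.

```java
import java.util.regex.Pattern;

public class DollarTenDemo {
    // Regex with a single capturing group: there is no group 10,
    // so "$10" is read as group(1) followed by a literal "0".
    static String oneGroup(String replacement) {
        return Pattern.compile("(a)").matcher("a").replaceAll(replacement);
    }

    // Regex with ten capturing groups: "$10" now means group(10).
    static String tenGroups(String replacement) {
        return Pattern.compile("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")
                .matcher("abcdefghij").replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(oneGroup("$10"));     // a0  (no error reported)
        System.out.println(tenGroups("$10"));    // j
        System.out.println(tenGroups("$1\\0"));  // a0  (zero escaped)
    }
}
```

The first call is the silent-failure case: if group(10) was intended,
nothing tells you the regex has only one group.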

My other concern is the syntax of backreferences in the replacement
string: $<name>.  Surveying the other major players (i.e.,
named-capture-enabled regex flavors associated with popular
programming languages), ${name} seems to be the most common
syntax--though there aren't a whole lot of data points yet, I admit.
Most significantly, .NET does it that way, and we're copying them on
the rest of the named-group syntax already, so why not on this?  Also,
I don't know of any other flavor that uses the $<name> syntax.
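
For comparison, this is roughly how the ${name} form would read in a
replacement string.  It is only a sketch: it assumes a java.util.regex
build that accepts ${name} in replacements (the draft under discussion
uses $<name>), and the class name, group names and input are
hypothetical.

```java
import java.util.regex.Pattern;

public class NamedReplacementDemo {
    // Assumption: the regex implementation accepts ${name} in the
    // replacement string, as proposed in this message.
    static String swap(String input) {
        return Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})")
                .matcher(input)
                .replaceAll("${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(swap("2009-09"));
    }
}
```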

To summarize, I want to:

- change the replacement-string backreference syntax from $<name> to ${name}

- disallow group names starting with digits

- allow backreferences of the form \k<n> and ${n} where 'n' is one or
more digits, but interpret them as ordinal instead of named references
(and throw an exception if there's no such group).
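
The resolution rule in the last item could be sketched as follows.
This is not the actual Pattern/Matcher code; GroupRefResolver, its
resolve() method, and the name-to-index map are hypothetical stand-ins
for whatever the implementation keeps internally.

```java
import java.util.Map;

public class GroupRefResolver {
    // Hypothetical helper: turns the text between the delimiters of
    // \k<...> or ${...} into a group index.
    //   ref         - the reference text, e.g. "1" or "year"
    //   namedGroups - group name -> group index
    //   groupCount  - number of capturing groups in the regex
    static int resolve(String ref, Map<String, Integer> namedGroups,
                       int groupCount) {
        if (ref.isEmpty()) {
            throw new IllegalArgumentException("empty group reference");
        }
        char first = ref.charAt(0);
        if (first >= '0' && first <= '9') {
            // Names may not start with a digit, so this must be an
            // ordinal reference made up entirely of digits.
            int n;
            try {
                n = Integer.parseInt(ref);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                        "Illegal group reference: " + ref);
            }
            if (n > groupCount) {
                throw new IndexOutOfBoundsException("No group " + n);
            }
            return n;
        }
        // Otherwise treat it as a named reference.
        Integer index = namedGroups.get(ref);
        if (index == null) {
            throw new IllegalArgumentException(
                    "No group with name <" + ref + ">");
        }
        return index;
    }
}
```

With this rule, a reference to group 10 in a regex that has only one
group fails loudly instead of silently becoming group(1) plus "0".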

Thoughts?


