<i18n dev> Suggested Perl-related updates for Pattern doc
Tom Christiansen
tchrist at perl.com
Sat Apr 23 13:04:33 PDT 2011
Sherman,
The comparison to Perl 5 in the Java Pattern class documentation needs
to be corrected. However, I would not recommend as long a laundry list
of missing features from either side as the following email might imply.
I'm just trying to be complete, but the result is a list that
I think is too unruly for inclusion. Part of that, however, may be
because I have included a lot of auxiliary information and examples to
show you what I mean. Those of course don't need to go in the javadoc.
My minimal suggested change would be to bring it into alignment with the
current production release of Perl instead of one from the
previous millennium -- and in some cases, from much older still.
Whether you choose 5.12 or 5.14, you should clearly state *which*
version of Perl you're comparing yourself with: it is the lack
of a reference version number that caused this to become so false.
Sherman, you do a much better job than I do of patching javadoc in a way
consistent in tone and texture, so I am comfortable leaving this
to your discretion.
I hope this helps. If there's anything more I can do to help,
please do not hesitate to ask. Thank you for all your work;
I am quite enthusiastic about all of this.
--tom
> Comparison to Perl 5
This was applicable to 2000's Perl 5.6 release, and also to a
much older version of the Java Pattern class. Both have advanced
beyond what the comparison claims.
> The Pattern engine performs traditional NFA-based matching with
> ordered alternation as occurs in Perl 5.
Although I agree that Perl and Java use the same sort of matcher,
I'm not sure it is accurate to call it a traditional NFA matcher.
Both are recursive backtracking matchers, necessitated by the
backref support. The difference between these two algorithms
is well explained in Russ Cox's paper on
"Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)"
http://swtch.com/~rsc/regexp/regexp1.html
The Cox paper shows how pathological patterns cause a recursive
backtracking algorithm to degrade exponentially with respect to
input length, and how that does not occur under a traditional
NFA. It is easy to demonstrate this issue from the command line:
$ time perl -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
2.803u 0.000s 0:02.80
$ time perl -le 'print(("a" x 20) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
4.077u 0.002s 0:04.08
$ time perl -le 'print(("a" x 21) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
6.039u 0.003s 0:06.04
$ time perl -le 'print(("a" x 22) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
8.756u 0.000s 0:08.76
In contrast, if you swap in Cox's RE2 library (via the re::engine::RE2
CPAN module) in place of Perl's default regex engine, that all disappears:
$ time perl -Mre::engine::RE2 -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.003s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 50) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.002u 0.000s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 500) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.002s 0:00.00
$ time perl -Mre::engine::RE2 -le 'print(("a" x 5000) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
0.001u 0.000s 0:00.00
That's because Cox is using a traditional NFA, but Perl (by default)
and Java (always) are both using a recursive backtracker variant
of the same. Read Cox; he explains it more clearly than I have.
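The same experiment can be tried against java.util.regex. What follows is only a minimal sketch: the class name BacktrackDemo and the input sizes are my own choices for illustration, and I make no claim about the timings you will see, only that the pattern is the one used in the Perl one-liners above.

```java
import java.util.regex.Pattern;

class BacktrackDemo {
    // Same pathological pattern as in the Perl one-liners above.
    static final Pattern P = Pattern.compile("a*a*a*a*a*a*a*a*a*a*[Bb]");

    // Returns true if the pattern is found anywhere in a string of n 'a's.
    // Perl's =~ is unanchored, so find() is the closest equivalent.
    static boolean findIn(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return P.matcher(sb).find();
    }

    public static void main(String[] args) {
        // Sizes kept small on purpose; raise them gradually to watch the growth.
        for (int n = 10; n <= 16; n += 2) {
            long start = System.nanoTime();
            boolean found = findIn(n);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " a's: found=" + found + ", " + ms + " ms");
        }
    }
}
```

The match can never succeed because the input contains no B or b, so all of the time spent goes into exploring the ways of splitting the run of a's among the ten a* terms.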
> Perl constructs not supported by this class:
> The conditional constructs (?{X}) and (?(condition)X|Y),
> The embedded code constructs (?{code}) and (??{code}),
> The embedded comment syntax (?#comment), and
> The preprocessing operations \l, \u, \L, and \U.
Well, yes, but those are string-interpolation things: they
don't happen in the regex compiler; likewise \Q. If you
pass a string with \Q or \U in it to the regex compiler
but not through the double-quote interpolation, such as
if you read it from a file, then those do not happen.
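On the Java side the split falls a little differently, which may be worth keeping in mind when rewording this item: Pattern itself understands \Q...\E quoting, while any case mapping has to be applied to the string before it is compiled. A small sketch, in which the class and method names are purely illustrative:

```java
import java.util.Locale;
import java.util.regex.Pattern;

class QuoteDemo {
    // \Q...\E is handled by the Pattern compiler: the dot is literal here.
    static boolean quotedMatches(String input) {
        return Pattern.compile("\\Qa.b\\E").matcher(input).matches();
    }

    // Roughly what Perl's "\U\Q$word\E" does at interpolation time:
    // upper-case and quote the text first, then hand it to the compiler.
    static boolean upperCasedMatches(String word, String input) {
        String regex = Pattern.quote(word.toUpperCase(Locale.ROOT));
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(quotedMatches("a.b"));
        System.out.println(quotedMatches("axb"));
        System.out.println(upperCasedMatches("a.b", "A.B"));
    }
}
```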
Here are other things that are missing. Perl release
numbers follow the convention that odd numbers are
developer releases and even numbers are production releases.
I shall therefore only mention even-numbered releases.
== Since the Perl 5.6 release of 2000, Perl also supports
these constructs not supported by the Java Pattern class:
* Unicode grapheme clusters via \X.
* Unicode named characters (the Name property) using
the \N{NAME} escape via the charnames pragma.
This includes those from NameAliases.txt.
* ALL Unicode properties supported by whatever version
of the UCD is current at the time of release, not just
those from UnicodeData.txt; see
http://unicode.org/reports/tr44/#Property_Index for
the current list, or the perluniprops manpage on
perl 5.12 or better.
* Loose matching of property names and values, including
the full names plus all those defined by The Unicode
Standard as valid aliases/shortcuts for the same;
see also PropertyAliases.txt and PropValueAliases.txt.
* User-defined \p{PROP} properties: you get to make
up your own property names and definitions for use
in regexes. This tailoring is quite useful.
* Full Unicode casefolding (multichar folds),
not just simple casefolding where all folds are
to a single code point alone.
== Since the Perl 5.8 release of 2002, Perl also supports
these constructs not supported by the Java Pattern class:
* Custom user-defined named characters via \N{NAME}.
== Since the Perl 5.10 release of 2007, Perl also supports
these constructs not supported by the Java Pattern class:
* Horizontal Unicode whitespace via \h and \H.
* Vertical Unicode whitespace via \v and \V.
* Any Unicode linebreak sequence via \R.
* The \K "keep this" escape to not include anything
to its left in what gets matched; works like a
variable-width lookbehind, which is otherwise
disallowed.
* The \g{GROUP} notation for backrefs, including
normal \g{1}, relative \g{-1}, and named \g{NAME}.
This allows you to avoid octal ambiguity and makes
for more robustly embeddable patterns.
* The branch-reset operator, (?| (.)(.) | (.)(.) | (.)(.) ),
which causes group numbering to restart at each | branch.
* Multiple named groups by the same name:
(?<NAME>...) ... (?<NAME>...)
After the match, both those are accessible.
* Recursive patterns through buffer recursion.
For example, to match for nested parens:
\((?:[^()]*+|(?0))*\)
Yes, Perl patterns are now equivalent to recursive-
descent parsing, a quantum leap forward. See also
the DEFINE block two items below.
* Backtracking control verbs like (*SKIP) and (*MARK)
* Definition-only groups via (?(DEFINE)...) for later
execution via (?&NAME), like a regex subroutine:
    (?x)
    (?<NAME>(?&NAME_PAT))
    (?<ADDR>(?&ADDRESS_PAT))
    (?(DEFINE)
        (?<NAME_PAT>....)
        (?<ADDRESS_PAT>....)
    )
This lets you separate declaration from execution,
reuse named abstractions, etc. It is extremely
powerful and extremely useful.
Note that it was this release in which Perl gained:
* Named groups via (?<NAME>...) and \k<NAME>.
* Possessive matches via ++, *+, etc.
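For comparison, here is how the named-group syntax looks from the Java side. This is a sketch that assumes a Pattern implementation with named-group support (Java 7 or later); the class name, group names and sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Returns the text captured by the named group "year", or null if no match.
    static String year(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.find() ? m.group("year") : null;
    }

    // \k<ch> refers back to whatever the named group "ch" captured.
    static boolean hasDoubledLetter(String input) {
        return Pattern.compile("(?<ch>[a-z])\\k<ch>").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(year("sent on 2011-04-23"));
        System.out.println(hasDoubledLetter("letter"));
        System.out.println(hasDoubledLetter("later"));
    }
}
```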
== Since the Perl 5.12 release of 2010, Perl also supports
these constructs not supported by the Java Pattern class:
* The new \N escape to always mean [^\n], even under
(?s) matching. This is without braces; with braces
it is of course a Unicode named character or sequence.
* The \X escape, supported since 5.6, has tracked
the Unicode standard and therefore with this release
now matches an extended grapheme cluster per UAX#29.
== Since the Perl 5.14 release of 2011, Perl also supports
these constructs not supported by the Java Pattern class:
* The new-to-Unicode-6.0 "named sequences" via \N{NAME}.
See NamedSequences.txt.
* The \o{...} octal escape to guarantee that you not only never
have any \1-style ambiguities with backref \g{10} vs octal
\o{10}, but also so you can abut an octally specified code point
number against other unrelated digits without mistakenly
incorporating them into the octal.
BTW, here is which Perl release tracked which Unicode release:

    Perl version    Unicode version
    5.6             3.0.0
    5.8             3.2.0
    5.8.1           4.0.0
    5.8.9           5.1.0
    5.12            5.2.0
    5.14            6.0.0

(I've obviously omitted lots of intermediate releases)
> Constructs supported by this class but not by Perl:
> Possessive quantifiers, which greedily match as much as they can
> and do not back off, even when doing so would allow the overall
> match to succeed.
Perl has been able to do that for some years now.
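To make concrete what "do not back off" means, here is a minimal Java sketch contrasting a greedy and a possessive quantifier. The class and method names are mine, chosen only for illustration.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then gives one back for the final 'a'.
    static boolean greedy(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive: a*+ takes every 'a' and never gives any back,
    // so the trailing 'a' in the pattern has nothing left to match.
    static boolean possessive(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println("greedy:     " + greedy("aaa"));
        System.out.println("possessive: " + possessive("aaa"));
    }
}
```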
> Character-class union and intersection as described above.
This is kinda true and kinda not; in the core regex library, we implement
this not by using the Unicode syntax, but rather with either lookaheads or
user-defined character properties. To get the full Unicode syntax requires
the Unicode::Regex::Set module, which is not part of the core regex engine.
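The lookahead approach is not Perl-specific, so it can be checked against Java's own intersection syntax. A small sketch, assuming lowercase ASCII consonants as the target set; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ClassIntersectionDemo {
    // Java's character-class intersection syntax.
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]");

    // The same set expressed with a lookahead, as one would in core Perl.
    static final Pattern LOOKAHEAD = Pattern.compile("(?=[a-z])[^aeiou]");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "e", "B", "1"}) {
            System.out.println(s + ": " + viaIntersection(s) + " " + viaLookahead(s));
        }
    }
}
```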
Speaking of which, Perl has quite a few modules that implement various
portions of The Unicode Standard, especially the annexes:
Unicode::Casing - Perl extension to override system case changing functions
Unicode::Collate - Unicode Collation Algorithm
Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate
Unicode::GCString - String as Sequence of UAX #29 Grapheme Clusters
Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm
Unicode::Normalize - Unicode Normalization Forms
Unicode::Regex::Set - Subtraction and Intersection of Character Sets in Unicode Regular Expressions
Unicode::Stringprep - Preparation of Internationalized Strings (RFC 3454)
Unicode::UCD - Unicode character database
Unicode::Unihan - The Unihan Data Base
Many of those I use daily. Some of these could arguably be incorporated
into the core regex engine. But as even today there are still issues
involving canonical matching, it's perhaps good that they are decoupled.
> Notable differences from Perl:
> In Perl, \1 through \9 are always interpreted as back references; a
> backslash-escaped number greater than 9 is treated as a back
> reference if at least that many subexpressions exist, otherwise it is
> interpreted, if possible, as an octal escape. In this class octal
> escapes must always begin with a zero. In this class, \1 through \9
> are always interpreted as back references, and a larger number is
> accepted as a back reference if at least that many subexpressions
> exist at that point in the regular expression, otherwise the parser
> will drop digits until the number is smaller or equal to the existing
> number of groups or it is one digit.
This is still true for reasons of backwards compatibility, but new code
should always use constructs like \g{10} for the numbered group and
\o{10} for the octal code point number to remove all doubt.
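The Java half of the quoted paragraph is easy to check for oneself. The sketch below only exercises behaviour the javadoc text describes (digit dropping for backreferences, and octal escapes that begin with a zero); the class and method names are my own.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // Only one group exists, so \10 is read as \1 followed by a literal '0'.
    static boolean oneGroup(String s) {
        return Pattern.matches("(a)\\10", s);
    }

    // Ten groups exist, so \10 is a backreference to group 10.
    static boolean tenGroups(String s) {
        return Pattern.matches("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\\10", s);
    }

    // An octal escape must start with a zero: \0101 is octal 101, that is 'A'.
    static boolean octal(String s) {
        return Pattern.matches("\\0101", s);
    }

    public static void main(String[] args) {
        System.out.println(oneGroup("aa0"));
        System.out.println(tenGroups("abcdefghijj"));
        System.out.println(octal("A"));
    }
}
```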
> Perl uses the g flag to request a match that resumes where the last
> match left off. This functionality is provided implicitly by the
> Matcher class: Repeated invocations of the find method will resume
> where the last match left off, unless the matcher is reset.
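For readers coming from Perl's /g, the Java idiom described in that paragraph looks roughly like this. It is a minimal sketch; the helper name findAll and the sample input are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Collects every match; each find() resumes where the previous one ended.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));
    }
}
```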
> In Perl, embedded flags at the top level of an expression affect the
> whole expression. In this class, embedded flags always take effect at
> the point at which they appear, whether they are at the top level or
> within a group; in the latter case, flags are restored at the end of
> the group just as in Perl.
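The Java behaviour that paragraph describes can be seen with a tiny test. This sketch shows only the Java side; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class FlagScopeDemo {
    // (?i) takes effect from where it appears: 'a' stays case-sensitive.
    static boolean midPattern(String s) {
        return Pattern.matches("a(?i)b", s);
    }

    // Inside a group, the flag is switched back off at the closing paren,
    // so the 'c' after the group is case-sensitive again.
    static boolean insideGroup(String s) {
        return Pattern.matches("(a(?i)b)c", s);
    }

    public static void main(String[] args) {
        System.out.println(midPattern("aB") + " " + midPattern("AB"));
        System.out.println(insideGroup("aBc") + " " + insideGroup("aBC"));
    }
}
```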
> Perl is forgiving about malformed matching constructs, as in the
> expression *a, as well as dangling brackets, as in the expression
> abc], and treats them as literals. This class also accepts dangling
> brackets but is strict about dangling metacharacters like +, ? and *,
> and will throw a PatternSyntaxException if it encounters them.
While there are indeed regex languages that work that way,
Perl is thankfully not one of them:
$ perl -le 'print if /*a/'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.
$ perl -le 'print if /?/'
Quantifier follows nothing in regex; marked by <-- HERE in m/? <-- HERE / at -e line 1.
$ perl -le 'print if /+/'
Quantifier follows nothing in regex; marked by <-- HERE in m/+ <-- HERE / at -e line 1.
$ perl -le 'print if /[abc/'
Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE abc/ at -e line 1.
The only release that I can find where something like *a was
ever accepted by Perl is 1987's initial Perl 1.0 release:
$ perl1 -e 'print "match\n" if "*a" =~ /*a/;'
match
That is going on a quarter-century out of date!
I don't believe there has been a release of Perl since Java
has even existed that accepted such things. Please don't
cite things from more than 20 years ago. :(
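For completeness, here is the Java side of the same check, which matches what the quoted javadoc paragraph says about this class. It is a minimal sketch; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DanglingDemo {
    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"*a", "?", "+", "[abc", "abc]"}) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
    }
}
```

Of the five patterns tried, only the one with the dangling closing bracket, abc], is accepted.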