<i18n dev> Suggested Perl-related updates for Pattern doc

Tom Christiansen tchrist at perl.com
Sat Apr 23 13:04:33 PDT 2011


Sherman, 

The comparison to Perl 5 in the Java Pattern class documentation needs
to be corrected.  However, I would not recommend as long a laundry list
of missing features from either side as the following email might imply.
I'm just trying to be complete, but doing so produces a list that
I think is too unruly for inclusion.  Part of that, however, may be
because I have included a lot of auxiliary information and examples to
show you what I mean.  Those of course don't need to go in the javadoc.

My minimal suggested change would be to bring it into alignment with the
current production release of Perl instead of one from the 
previous millennium -- and in some cases, from releases much older still. 
Whether you choose 5.12 or 5.14, you should clearly state *which*
version of Perl you're comparing yourself with: it is the lack
of a reference version number that caused this to become so inaccurate.

Sherman, you do a much better job than I do of patching javadoc in a way
consistent in tone and texture, so I am comfortable leaving this 
to your discretion.

I hope this helps.  If there's anything more I can do to help,
please do not hesitate to ask.  Thank you for all your work; 
I am quite enthusiastic about all of this.

--tom

> Comparison to Perl 5 

This was applicable to 2000's Perl 5.6 release, and also to a
much older version of the Java Pattern class.  Both have advanced
beyond what the comparison claims.

> The Pattern engine performs traditional NFA-based matching with
> ordered alternation as occurs in Perl 5.

Although I agree that Perl and Java use the same sort of matcher, 
I'm not sure it is accurate to call it a traditional NFA matcher.  
Both are recursive backtracking matchers, necessitated by the 
backref support.  The difference between these two algorithms 
is well explained in Russ Cox's paper on

    "Regular Expression Matching Can Be Simple And Fast 
     (but is slow in Java, Perl, PHP, Python, Ruby, ...)"

    http://swtch.com/~rsc/regexp/regexp1.html

The Cox paper shows how pathological patterns cause a recursive
backtracking algorithm to degrade exponentially with respect to
input length, and how that does not occur under a traditional
NFA.  It is easy to demonstrate this issue from the command line:

    $ time perl -le 'print(("a" x 19) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    2.803u 0.000s 0:02.80 
    $ time perl -le 'print(("a" x 20) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    4.077u 0.002s 0:04.08
    $ time perl -le 'print(("a" x 21) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    6.039u 0.003s 0:06.04 
    $ time perl -le 'print(("a" x 22) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    8.756u 0.000s 0:08.76 

In contrast, if you swap in Cox's RE2 library (this is a CPAN module) in
place of Perl's default regex engine, that all disappears:

    $ time perl -Mre::engine::RE2 -le 'print(("a" x 19)   =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.003s 0:00.00 
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 50)   =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.002u 0.000s 0:00.00
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 500)  =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.002s 0:00.00
    $ time perl -Mre::engine::RE2 -le 'print(("a" x 5000) =~ /a*a*a*a*a*a*a*a*a*a*[Bb]/ || 0)' > /dev/null
    0.001u 0.000s 0:00.00

That's because Cox is using a traditional NFA, but Perl (by default) 
and Java (always) are both using a recursive backtracker variant
of the same.  Read Cox; he explains it more clearly than I have.
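
A rough Java sketch of the same experiment, for anyone who wants to try it
against java.util.regex.  The class and method names are made up for
illustration, and no timings are claimed here: they will depend on the JVM
and the machine.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BacktrackProbe {
    // Same pathological pattern as in the Perl timings above.
    static final Pattern SLOW = Pattern.compile("a*a*a*a*a*a*a*a*a*a*[Bb]");

    // Build a run of n 'a' characters and search it.  There is no 'b' or 'B'
    // in the input, so the search can only fail, after the matcher has tried
    // the many ways of dividing the run among the ten a* terms.
    static boolean probe(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return SLOW.matcher(sb).find();
    }

    public static void main(String[] args) {
        // Print wall-clock time for a few input lengths.
        for (int n = 10; n <= 16; n += 2) {
            long start = System.nanoTime();
            boolean found = probe(n);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " chars: found=" + found + ", " + ms + " ms");
        }
    }
}
```

Increase the upper bound in main() with care; each extra character makes
the failing search noticeably more expensive.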

> Perl constructs not supported by this class:
>      The conditional constructs (?{X}) and (?(condition)X|Y),
>      The embedded code constructs (?{code}) and (??{code}),
>      The embedded comment syntax (?#comment), and
>      The preprocessing operations \l, \u, \L, and \U.

Well, yes, but those are string-interpolation features: they 
are not handled by the regex compiler; likewise \Q.  If you
pass a string with \Q or \U in it to the regex compiler
without going through double-quote interpolation, such as 
when you read it from a file, then those escapes are not processed.

Here are other things that are missing.  Perl release
numbers follow the convention that odd numbers are 
developer releases and even numbers are production releases.
I shall therefore only mention even-numbered releases.

 == Since the Perl 5.6 release of 2000, Perl also supports
    these constructs not supported by the Java Pattern class:

      *  Unicode grapheme clusters via \X.
      *  Unicode named characters (the Name property) using
	 the \N{NAME} escape via the charnames pragma.
	 This includes those from NameAliases.txt.
      *  ALL Unicode properties supported by whatever version 
	 of the UCD is current at the time of release, not just
	 those from UnicodeData.txt;  see 
	 http://unicode.org/reports/tr44/#Property_Index for
	 the current list, or the perluniprops manpage on 
	 perl 5.12 or better.
      *  Loose matching of property names and values, including
         the full names plus all those defined by The Unicode
         Standard as valid aliases/shortcuts for the same;
	 see also PropertyAliases.txt and PropValueAliases.txt.
      *  User-defined \p{PROP} properties: you get to make
	 up your own property names and definitions for use
	 in regexes.  This tailoring is quite useful.
      *  Full Unicode casefolding (multichar folds), 
	 not just simple casefolding where all folds are
	 to a single code point alone.

 == Since the Perl 5.8 release of 2002, Perl also supports
    these constructs not supported by the Java Pattern class:

      *  Custom user-defined named characters via \N{NAME}.

 == Since the Perl 5.10 release of 2007, Perl also supports
    these constructs not supported by the Java Pattern class:

     *  Horizontal Unicode whitespace via \h and \H.
     *  Vertical   Unicode whitespace via \v and \V.
     *  Any Unicode linebreak sequence via \R.
     *  The \K "keep this" escape to not include anything
	to its left in what gets matched; works like a 
	variable-width lookbehind, which is otherwise
	disallowed.
     *  The \g{GROUP} notation for backrefs, including
	normal \g{1}, relative \g{-1}, and named \g{NAME}.
	This allows you to avoid octal ambiguity and makes
	for more robustly embeddable patterns.
     *  The branch-reset operator, (?| (.)(.) | (.)(.) | (.)(.) ),
	which causes group numbering to restart at each | branch.
     *  Multiple named groups by the same name: 
	    (?<NAME>...)  ...  (?<NAME>...)  
	After the match, both those are accessible.
     *  Recursive patterns through buffer recursion. 
	For example, to match for nested parens:
	    \((?:[^()]*+|(?0))*\)
	Yes, Perl patterns are now equivalent to recursive-
	descent parsing, a quantum leap forward.  See also
	the DEFINE block two items below.
     *  Backtracking control verbs like (*SKIP) and (*MARK)
     *  Definition-only groups via (?(DEFINE)...) for later
	execution via (?&NAME), like a regex subroutine:
	   (?x)
	   (?<NAME>(?&NAME_PAT))
	   (?<ADDR>(?&ADDRESS_PAT))
	   (?(DEFINE)
	      (?<NAME_PAT>....)
	      (?<ADDRESS_PAT>....)
	   )
	This lets you separate declaration from execution,
	reuse named abstractions, etc.  It is extremely
	powerful and extremely useful.

    Note that it was this release in which Perl gained:

     *  Named groups via (?<NAME>...) and \k<NAME>.
     *  Possessive matches via ++, *+, etc.
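
For comparison, here is a small sketch of the same named-group syntax on
the Java side.  It assumes a Java release whose Pattern class accepts
(?<NAME>...) and \k<NAME> (Java 7 or later); the class and method names
are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Java 7+ named-group support.
class NamedGroupDemo {
    // Finds a doubled word such as "the the" and returns the repeated word,
    // or null when the input contains no doubled word.
    static String doubledWord(String input) {
        Pattern p = Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b");
        Matcher m = p.matcher(input);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(doubledWord("Paris in the the spring")); // prints "the"
        System.out.println(doubledWord("no repeats here"));         // prints "null"
    }
}
```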

 == Since the Perl 5.12 release of 2010, Perl also supports
    these constructs not supported by the Java Pattern class:

     *  The new \N escape to always mean [^\n], even under
	(?s) matching.  This is without braces; with braces
	it is of course a Unicode named character or sequence.
     *  The \X escape, supported since 5.6, has tracked 
	the Unicode standard and therefore with this release 
	now matches an extended grapheme cluster per UAX#29.

 == Since the Perl 5.14 release of 2011, Perl also supports
    these constructs not supported by the Java Pattern class:

     *  The new-to-Unicode-6.0 "named sequences" via \N{NAME}.
	See NamedSequences.txt.
     *  The \o{...} octal escape to guarantee that you not only never
        have any \1-style ambiguities with backref \g{10} vs octal
        \o{10}, but also so you can abut an octally specified code point
        number against other unrelated digits without mistakenly
        incorporating them into the octal.

BTW, here is which Perl release tracked which Unicode release:

	    Perl    Unicode
	  version   version

	  5.6       3.0.0
	  5.8       3.2.0     
	  5.8.1     4.0.0     
	  5.8.9     5.1.0     
	  5.12      5.2.0     
	  5.14      6.0.0     

    (I've obviously omitted lots of intermediate releases)

> Constructs supported by this class but not by Perl:

>      Possessive quantifiers, which greedily match as much as they can
>      and do not back off, even when doing so would allow the overall
>      match to succeed.

Perl has supported possessive quantifiers since the 5.10 release of 2007.
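
To make the quoted definition concrete, this is a minimal Java sketch of
the "do not back off" behaviour.  The class and method names are
illustrative only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting greedy and possessive quantifiers.
class PossessiveDemo {
    // Greedy a* first takes every 'a', then gives one back so the final
    // literal 'a' can match.
    static boolean greedy(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive a*+ takes every 'a' and never gives any back, so the
    // final literal 'a' has nothing left to match.
    static boolean possessive(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));     // prints "true"
        System.out.println(possessive("aaa")); // prints "false"
    }
}
```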

>      Character-class union and intersection as described above.

This is kinda true and kinda not; in the core regex library, we implement
this not by using the Unicode syntax, but rather with either lookaheads or
user-defined character properties.  To get the full Unicode syntax requires
the Unicode::Regex::Set module, which is not part of the core regex engine.
Speaking of which, Perl has quite a few modules that implement various
portions of The Unicode Standard, especially the annexes:

    Unicode::Casing       - Perl extension to override system case changing functions
    Unicode::Collate      - Unicode Collation Algorithm
    Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate
    Unicode::GCString     - String as Sequence of UAX #29 Grapheme Clusters
    Unicode::LineBreak    - UAX #14 Unicode Line Breaking Algorithm
    Unicode::Normalize    - Unicode Normalization Forms
    Unicode::Regex::Set   - Subtraction and Intersection of Character Sets in Unicode Regular Expressions
    Unicode::Stringprep   - Preparation of Internationalized Strings (RFC 3454)
    Unicode::UCD          - Unicode character database
    Unicode::Unihan       - The Unihan Data Base 

Many of those I use daily.  Some of these could arguably be incorporated
into the core regex engine.  But as even today there are still issues
involving canonical matching, it's perhaps good that they are decoupled.
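
The lookahead workaround for character-class intersection mentioned above
can be sketched in Java, where both spellings are available side by side.
This is only an illustration of the idea for one ASCII example (lowercase
consonants); the class and method names are invented.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: intersection syntax versus a negative lookahead.
class ClassIntersectionDemo {
    // Java's intersection syntax: a lowercase ASCII letter that is not a vowel.
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]");

    // The same set written with a negative lookahead, the style that also
    // works in engines without intersection syntax.
    static final Pattern LOOKAHEAD = Pattern.compile("(?![aeiou])[a-z]");

    static boolean viaIntersection(char c) {
        return INTERSECTION.matcher(String.valueOf(c)).matches();
    }

    static boolean viaLookahead(char c) {
        return LOOKAHEAD.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaIntersection('b') + " " + viaLookahead('b')); // prints "true true"
        System.out.println(viaIntersection('e') + " " + viaLookahead('e')); // prints "false false"
    }
}
```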

> Notable differences from Perl:

>      In Perl, \1 through \9 are always interpreted as back references; a
>      backslash-escaped number greater than 9 is treated as a back
>      reference if at least that many subexpressions exist, otherwise it is
>      interpreted, if possible, as an octal escape. In this class octal
>      escapes must always begin with a zero. In this class, \1 through \9
>      are always interpreted as back references, and a larger number is
>      accepted as a back reference if at least that many subexpressions
>      exist at that point in the regular expression, otherwise the parser
>      will drop digits until the number is smaller or equal to the existing
>      number of groups or it is one digit.

This is still true for reasons of backwards compatibility, but new code
should always use constructs like \g{10} for the numbered group and
\o{10} for the octal code point number to remove all doubt.
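
The Java half of the quoted rule can be checked with a short sketch like
the following.  It exercises only the behaviour the javadoc text describes;
the class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Java's backreference and octal-escape rules.
class BackrefOctalDemo {
    // \1 is a back reference to group 1.
    static boolean plainBackref() {
        return Pattern.matches("(a)\\1", "aa");
    }

    // Only one group exists, so the parser drops the trailing digit:
    // \11 is read as back reference \1 followed by a literal '1'.
    static boolean droppedDigit() {
        return Pattern.matches("(a)\\11", "aa1");
    }

    // Octal escapes must begin with a zero: \0101 is octal 101, i.e. 'A'.
    static boolean octalEscape() {
        return Pattern.matches("\\0101", "A");
    }

    public static void main(String[] args) {
        System.out.println(plainBackref()); // prints "true"
        System.out.println(droppedDigit()); // prints "true"
        System.out.println(octalEscape());  // prints "true"
    }
}
```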

>      Perl uses the g flag to request a match that resumes where the last
>      match left off. This functionality is provided implicitly by the
>      Matcher class: Repeated invocations of the find method will resume
>      where the last match left off, unless the matcher is reset.
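
A minimal sketch of the find() loop that paragraph describes, which plays
the role of a Perl /g match in list or loop form.  The helper name
allMatches is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: repeated find() calls resume after the last match.
class FindLoopDemo {
    // Collect every non-overlapping match of regex in input, left to right.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {       // each call starts where the previous one ended
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("\\d+", "a1b22c333")); // prints "[1, 22, 333]"
    }
}
```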

>      In Perl, embedded flags at the top level of an expression affect the
>      whole expression. In this class, embedded flags always take effect at
>      the point at which they appear, whether they are at the top level or
>      within a group; in the latter case, flags are restored at the end of
>      the group just as in Perl.
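
The Java behaviour described in that paragraph can be seen in a few lines.
This sketch shows only the Java side; the class and method names are
illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the scope of embedded flags in Java.
class EmbeddedFlagDemo {
    // (?i) appears after the 'a', so only the 'b' is case-insensitive.
    static boolean midPattern(String s) {
        return Pattern.matches("a(?i)b", s);
    }

    // (?i) inside a group is restored at the end of the group,
    // so only the 'a' is case-insensitive.
    static boolean insideGroup(String s) {
        return Pattern.matches("(?:(?i)a)b", s);
    }

    public static void main(String[] args) {
        System.out.println(midPattern("aB"));  // prints "true"
        System.out.println(midPattern("AB"));  // prints "false"
        System.out.println(insideGroup("Ab")); // prints "true"
        System.out.println(insideGroup("AB")); // prints "false"
    }
}
```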

>      Perl is forgiving about malformed matching constructs, as in the
>      expression *a, as well as dangling brackets, as in the expression
>      abc], and treats them as literals. This class also accepts dangling
>      brackets but is strict about dangling metacharacters like +, ? and *,
>      and will throw a PatternSyntaxException if it encounters them.

While there are indeed regex languages that work that way, 
Perl is thankfully not one of them:

    $ perl -le 'print if /*a/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.

    $ perl -le 'print if /?/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/? <-- HERE / at -e line 1.

    $ perl -le 'print if /+/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/+ <-- HERE / at -e line 1.

    $ perl -le 'print if /[abc/'
    Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE abc/ at -e line 1.

The only release that I can find where something like *a was
ever accepted by Perl is 1987's initial Perl 1.0 release:

    $ perl1 -e 'print "match\n" if "*a" =~ /*a/;'
    match

That is going on a quarter-century out of date!

I don't believe there has been a release of Perl since Java 
has even existed that accepted such things.  Please don't
cite things from more than 20 years ago. :(
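
For the Java half of that javadoc paragraph, a small sketch that checks
which of these patterns the Pattern class accepts.  The helper name
compiles is hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class: which malformed patterns does Java reject?
class SyntaxCheckDemo {
    // Returns true if Pattern.compile accepts the regex, false if it
    // throws PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("*a"));   // prints "false": dangling *
        System.out.println(compiles("abc]")); // prints "true": ] taken as a literal
        System.out.println(compiles("[abc")); // prints "false": unclosed class
    }
}
```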

