<i18n dev> j.u.r.Pattern documentation errors

Tom Christiansen tchrist at perl.com
Sun Jan 23 14:14:12 PST 2011


In this message I cover only those errors made in the final
section ("Comparison to Perl 5") of:

    http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

I really hope no one is offended by this.  I don't mean to be
a nitpicker.  Technical errors in the documentation should be
very very easy to correct, since no code change is required.

========================================

>     Comparison to Perl 5

>     The Pattern engine performs traditional NFA-based matching
>     with ordered alternation as occurs in Perl 5.

>     Perl constructs not supported by this class:

>     * The conditional constructs (?{X}) and (?(condition)X|Y),

That should instead read:

    The conditional constructs (?(condition)X) and (?(condition)X|Y),

>     * The embedded code constructs (?{code}) and (??{code}),

>     * The embedded comment syntax (?#comment), and

>     * The preprocessing operations \l \u, \L, and \U.

That is no longer true, as Java supports those now.

There is quite a bit missing from the list of Perl constructs 
unsupported by this class.

    * Perl regex escapes: \x{...}, \R, \h, \H, \v, \V,
	\X, \N, \N{...}, \K, and recently \o{...}. 
	[NB: My rewrite library covers the top row.]

    * Relative buffers like \g{-2} for $-2, or the \g{NAME}
      alias for a named backref \k<NAME>.

    * The branch-reset operator: (?|...)

    * Buffer recursion (?0) (?1) (?&NAME) etc. to allow
      recursive regexes; e.g. \((?:[^()]*+|(?0))*\) matches
      nested parens.

    * Non-executing definition-only blocks via (?(DEFINE)...)
      to allow the separation of execution from declaration.
      See post-sig example.

    * Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)
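
For anyone who wants to check a few of these against their own JDK, here
is a minimal sketch. The class name UnsupportedPerlSyntax and the helper
compiles() are mine, purely for illustration. I deliberately left out the
escapes from the first bullet, since support for those is the most likely
thing to change between releases. On the JDKs I know of, I would expect
every pattern in the list to be rejected with a PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class UnsupportedPerlSyntax {
    // Hypothetical helper: true if java.util.regex accepts the pattern,
    // false if Pattern.compile() throws PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] perlOnly = {
            "(?|(a)|(b))",                // branch reset
            "\\((?:[^()]*+|(?0))*\\)",    // recursion into the whole pattern
            "(?(DEFINE)(?<d>\\d))(?&d)",  // definition-only block
            "(a)(b)\\g{-2}",              // relative backreference
            "foo\\Kbar",                  // \K
            "a(*FAIL)",                   // backtracking control verb
        };
        for (String regex : perlOnly) {
            System.out.println(regex + " -> "
                + (compiles(regex) ? "accepted" : "rejected"));
        }
    }
}
```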

========================================

>     Constructs supported by this class but not by Perl:

>     * Possessive quantifiers, which greedily match as much as
>       they can and do not back off, even when doing so would
>       allow the overall match to succeed.

This is not true.  Perl has understood the same possessive
quantifiers that Java does since the 5.10 release.
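
To make the quoted definition concrete, here is a small sketch of the
Java side. The class and method names are made up for the example. The
greedy a* gives back one character so the final a can match, while the
possessive a*+ keeps everything it took and the overall match fails.

```java
final class PossessiveDemo {
    // Greedy: a* first takes every 'a', then backs off one so the
    // trailing 'a' in the pattern can still match.
    static boolean greedy(String s) {
        return s.matches("a*a");
    }

    // Possessive: a*+ takes every 'a' and never gives any back,
    // so nothing is left for the trailing 'a'.
    static boolean possessive(String s) {
        return s.matches("a*+a");
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));      // true
        System.out.println(possessive("aaa"));  // false
    }
}
```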

>     * Character-class union and intersection as described above.

True.  In Perl you have to use lookahead assertions to effect
the same end.
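
As a sketch of what I mean, here are the two spellings side by side.
Both happen to be legal in Java, so the comparison can be run in one
place; the lookahead form is the one that also works in Perl. The class
and method names are only illustrative.

```java
import java.util.regex.Pattern;

final class ClassIntersectionDemo {
    // Java-only spelling: lowercase ASCII letters that are not vowels.
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]");

    // Portable spelling: the lookahead requires a lowercase ASCII letter,
    // then the negated class consumes it unless it is a vowel.
    static final Pattern LOOKAHEAD = Pattern.compile("(?=[a-z])[^aeiou]");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "e", "B", "7"}) {
            System.out.println(s + ": " + viaIntersection(s)
                + " " + viaLookahead(s));
        }
    }
}
```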

>     Notable differences from Perl:

I would certainly put these two in the very front of this section:

    * Perl's charclass shortcuts all work **VERY DIFFERENTLY** from 
      Java's, including \w \W \s \S \d \D \b \B.  [NOTE: my rewrite
      library fixes this.]

    * Perl supports all official Unicode properties, and follows
      all strong recommendations in tr18, whereas Java does neither.
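
A short sketch of the first point, for the doubtful. By default Java's
\w and \d are ASCII-only, so an accented letter is not a word character
and an Arabic-Indic digit is not a digit. This assumes a JDK that has
the Pattern.UNICODE_CHARACTER_CLASS flag (JDK 7 and later, as far as I
know), which switches the shortcuts to Unicode-aware definitions. The
class and method names are mine.

```java
import java.util.regex.Pattern;

final class ShortcutDemo {
    static final Pattern ASCII_WORD = Pattern.compile("\\w");
    static final Pattern UNICODE_WORD =
        Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern ASCII_DIGIT = Pattern.compile("\\d");
    static final Pattern UNICODE_DIGIT =
        Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isWord(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";       // LATIN SMALL LETTER E WITH ACUTE
        String arabicThree = "\u0663";  // ARABIC-INDIC DIGIT THREE

        System.out.println(isWord(ASCII_WORD, eAcute));          // false
        System.out.println(isWord(UNICODE_WORD, eAcute));        // true
        System.out.println(isWord(ASCII_DIGIT, arabicThree));    // false
        System.out.println(isWord(UNICODE_DIGIT, arabicThree));  // true
    }
}
```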

>     * In Perl, \1 through \9 are always interpreted as back
>       references; a backslash-escaped number greater than 9 is
>       treated as a back reference if at least that many
>       subexpressions exist, otherwise it is interpreted, if
>       possible, as an octal escape. In this class octal escapes
>       must always begin with a zero. In this class, \1 through \9
>       are always interpreted as back references, and a larger
>       number is accepted as a back reference if at least that
>       many subexpressions exist at that point in the regular
>       expression, otherwise the parser will drop digits until the
>       number is smaller or equal to the existing number of groups
>       or it is one digit.

I think it is more important to state that Perl does not require a
leading 0, so \377 is an octal escape for 0xFF.  BTW, the new \o{...}
is unambiguously an octal escape, just as \g{...} is unambiguously a
backref group.
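
For comparison, here is a sketch of the Java side only, with made-up
class and method names. In Java the octal escape for U+00FF has to be
written with the leading zero, as \0377; the hex form \xFF names the
same character. I have left the zero-less \377 out of the example on
purpose, since what Java does with it depends on the backreference
rules quoted above.

```java
final class OctalDemo {
    // Java regex octal escape: a leading zero is mandatory.
    static boolean matchesOctal(String s) {
        return s.matches("\\0377");
    }

    // The same code point written as a hex escape.
    static boolean matchesHex(String s) {
        return s.matches("\\xFF");
    }

    public static void main(String[] args) {
        String yDiaeresis = "\u00FF";  // LATIN SMALL LETTER Y WITH DIAERESIS
        System.out.println(matchesOctal(yDiaeresis));  // true
        System.out.println(matchesHex(yDiaeresis));    // true
    }
}
```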

>     * Perl uses the g flag to request a match that resumes where
>       the last match left off. This functionality is provided
>       implicitly by the Matcher class: Repeated invocations of
>       the find method will resume where the last match left off,
>       unless the matcher is reset.

I wish there were a mention that the Matcher.matches() method
implicitly anchors the pattern at both ends of the input, while a
Perl match does not.

Russ Cox's strategy for RE (re)names that method matches_exactly(),
to better express what it does and clear up confusion.
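
The difference is easy to see in a small sketch (class and method names
are mine). The same one-letter pattern fails under matches() and
lookingAt() but succeeds under find(), which is the call that behaves
like an unanchored Perl match.

```java
import java.util.regex.Pattern;

final class AnchoringDemo {
    // matches(): the whole input must match the pattern.
    static boolean wholeInput(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the match must start at the beginning of the input.
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // find(): the match may occur anywhere, like Perl's $input =~ /regex/.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("b", "abc"));  // false
        System.out.println(prefix("b", "abc"));      // false
        System.out.println(anywhere("b", "abc"));    // true
    }
}
```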

>     * In Perl, embedded flags at the top level of an expression
>       affect the whole expression. In this class, embedded flags
>       always take effect at the point at which they appear,
>       whether they are at the top level or within a group; in the
>       latter case, flags are restored at the end of the group
>       just as in Perl.
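
A quick sketch of the Java half of that paragraph, with illustrative
names of my own. A mid-pattern (?i) affects only what follows it, and a
(?i) inside a group stops applying when the group closes.

```java
final class EmbeddedFlagDemo {
    // (?i) switches on case-insensitivity from this point on,
    // so 'a' stays case-sensitive and 'b' does not.
    static boolean flagMidPattern(String s) {
        return s.matches("a(?i)b");
    }

    // (?i) inside the group applies to 'a' only; the flag is restored
    // at the closing parenthesis, so 'b' is case-sensitive again.
    static boolean flagInsideGroup(String s) {
        return s.matches("((?i)a)b");
    }

    public static void main(String[] args) {
        System.out.println(flagMidPattern("aB"));   // true
        System.out.println(flagMidPattern("AB"));   // false
        System.out.println(flagInsideGroup("Ab"));  // true
        System.out.println(flagInsideGroup("AB"));  // false
    }
}
```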

>     * Perl is forgiving about malformed matching constructs, as
>       in the expression *a, as well as dangling brackets, as in
>       the expression abc], and treats them as literals. This
>       class also accepts dangling brackets but is strict about
>       dangling metacharacters like +, ? and *, and will throw a
>       PatternSyntaxException if it encounters them.

This is incorrect; Perl is not forgiving about malformed matching
constructs like the one cited above:

    % perl -e '/*a/'
    Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.
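
The Java half of the quoted claim does hold, for what it's worth. Here
is a minimal sketch; the class name and the accepts() helper are only
for illustration. The dangling quantifier is rejected and the dangling
bracket is taken as a literal.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class DanglingDemo {
    // Hypothetical helper: true if Pattern.compile() accepts the regex.
    static boolean accepts(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("*a"));            // false: dangling *
        System.out.println(accepts("abc]"));          // true: ] is a literal
        System.out.println("abc]".matches("abc]"));   // true
    }
}
```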

Perl also supports user-defined character name aliases for
\N{...} and user-defined character properties for \p{...} and
\P{...}, but Java supports neither.  Java doesn't even support
character names at all that I can see, and Java definitely
doesn't support the full complement of character properties as
defined by the Unicode Character Database; Perl does.

I believe that Java does not support named character sequences,
which are part of what it takes to support Unicode 6.0 as they
are new to that release.

There may be more than this, but it's what came immediately
to mind.

Hope this helps!!

--tom

PS: Here's an example of using (?(DEFINE)...) to completely parse
    an RFC 5322 email address, including nested comments. Notice
    how much like a BNF grammar this now becomes. It's a Perl 5
    thing that we backported from Perl 6: very clean, even beautiful.

    $rfc5322 = qr{
       (?(DEFINE)
	 (?<address>         (?&mailbox) | (?&group))
	 (?<mailbox>         (?&name_addr) | (?&addr_spec))
	 (?<name_addr>       (?&display_name)? (?&angle_addr))
	 (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
	 (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
	 (?<display_name>    (?&phrase))
	 (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

	 (?<addr_spec>       (?&local_part) \@ (?&domain))
	 (?<local_part>      (?&dot_atom) | (?&quoted_string))
	 (?<domain>          (?&dot_atom) | (?&domain_literal))
	 (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
				       \] (?&CFWS)?)
	 (?<dcontent>        (?&dtext) | (?&quoted_pair))
	 (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

	 (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
	 (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
	 (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
	 (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

	 (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
	 (?<quoted_pair>     \\ (?&text))

	 (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
	 (?<qcontent>        (?&qtext) | (?&quoted_pair))
	 (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
			      (?&FWS)? (?&DQUOTE) (?&CFWS)?)

	 (?<word>            (?&atom) | (?&quoted_string))
	 (?<phrase>          (?&word)+)

	 # Folding white space
	 (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
	 (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
	 (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
	 (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
	 (?<CFWS>            (?: (?&FWS)? (?&comment))*
			     (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

	 # No whitespace control
	 (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

	 (?<ALPHA>           [A-Za-z])
	 (?<DIGIT>           [0-9])
	 (?<CRLF>            \x0d \x0a)
	 (?<DQUOTE>          ")
	 (?<WSP>             [\x20\x09])
       )

       (?&address)

    }x;

