<i18n dev> j.u.r.Pattern documentation errors

Xueming Shen xueming.shen at oracle.com
Sun Jan 23 23:22:38 PST 2011


Thanks Tom.

That part of the doc definitely needs revisiting. It was written before 2002
(probably against Perl 5.6) and has not been touched since; much of it is no
longer true given the latest Perl 5.12.

-Sherman

On 1-23-2011 14:14 02:14 PM, Tom Christiansen wrote:
> In this message I cover only those errors made in the final
> section ("Comparison to Perl 5") of:
>
>      http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
>
> I really hope no one is offended by this.  I don't mean to be
> a nitpicker.  Technical errors in the documentation should be
> very very easy to correct, since no code change is required.
>
> ========================================
>
>>      Comparison to Perl 5
>>      The Pattern engine performs traditional NFA-based matching
>>      with ordered alternation as occurs in Perl 5.
>>      Perl constructs not supported by this class:
>>      * The conditional constructs (?{X}) and (?(condition)X|Y),
> That should instead read:
>
>      The conditional constructs (?(condition)X) and (?(condition)X|Y),
>
>>      * The embedded code constructs (?{code}) and (??{code}),
>>      * The embedded comment syntax (?#comment), and
>>      * The preprocessing operations \l \u, \L, and \U.
> That is no longer true, as Java supports those now.
>
> There is quite a bit missing from the list of Perl constructs
> unsupported by this class.
>
>      * Perl regex escapes: \x{...}, \R, \h, \H, \v, \V,
> 	\X, \N, \N{...}, \K, and recently \o{...}.
> 	[NB: My rewrite library covers the top row.]
>
>      * Relative buffers like \g{-2} for $-2, or the \g{NAME}
>        alias for a named backref \k<NAME>.
>
>      * The branch-reset operator: (?|...)
>
>      * Buffer recursion (?0) (?1) (?&NAME) etc to allow
>        recursive regexes; e.g. \((?:[^()]*+|(?0))*\) matches
>        nested parens.
>
>      * Non-executing definition-only blocks via (?(DEFINE)...)
>        to allow the separation of execution from declaration.
>        See post-sig example.
>
>      * Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)
>
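
Since Pattern has no recursion construct like (?0), a balanced-parens check
on the Java side has to be written by hand. The following is only a minimal
sketch of one way to do that; the class and method names are hypothetical and
not part of any JDK API. It plays roughly the role of the Perl pattern
\((?:[^()]*+|(?0))*\) anchored at a given start index.

```java
class BalancedParens {
    // Returns the index just past the ')' that closes the '(' found at
    // position start, or -1 if there is no '(' at start or the group
    // is never closed.
    static int matchBalanced(String s, int start) {
        if (start < 0 || start >= s.length() || s.charAt(start) != '(') {
            return -1;
        }
        int depth = 0;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i + 1;
            }
        }
        return -1; // ran off the end with parens still open
    }

    public static void main(String[] args) {
        System.out.println(matchBalanced("(a(b)c)d", 0)); // 7
        System.out.println(matchBalanced("(a(b)", 0));    // -1
    }
}
```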
> ========================================
>
>>      Constructs supported by this class but not by Perl:
>>      * Possessive quantifiers, which greedily match as much as
>>        they can and do not back off, even when doing so would
>>        allow the overall match to succeed.
> This is not true.  Perl understands the same possessive
> quantifiers that Java does.
>
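
For reference, the "do not back off" behaviour described in the quoted
Javadoc can be seen from the Java side with a small sketch like the one
below. The class name and helper are made up for illustration.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: a* first takes "aaa", then gives one 'a' back
        // so that the trailing 'a' can match.
        System.out.println(fullMatch("a*a", "aaa"));   // true
        // Possessive: a*+ takes "aaa" and never gives anything back,
        // so the trailing 'a' has nothing left to match.
        System.out.println(fullMatch("a*+a", "aaa"));  // false
    }
}
```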
>>      * Character-class union and intersection as described above.
> True.  In Perl you have to use lookahead assertions to effect
> the same end.
>
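
To make the intersection point concrete, here is a small sketch comparing
Java's && syntax with a lookahead formulation of the kind mentioned above.
The lookahead pattern is written in Java here only so the two can be checked
side by side; the class and field names are hypothetical.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // Java-only syntax: lowercase ASCII letters intersected with "not a vowel".
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]");
    // Lookahead form: assert [a-z] at this position, then consume a non-vowel.
    static final Pattern LOOKAHEAD = Pattern.compile("(?=[a-z])[^aeiou]");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "e", "B", "1"}) {
            System.out.println(s + ": " + viaIntersection(s) + " / " + viaLookahead(s));
        }
    }
}
```

Both forms accept "b" and reject "e", "B" and "1".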
>>      Notable differences from Perl:
> I would certainly put these two in the very front of this section:
>
>      * Perl's charclass shortcuts all work **VERY DIFFERENTLY** from
>        Java's, including \w \W \s \S \d \D \b \B.  [NOTE: my rewrite
>        library fixes this.]
>
>      * Perl supports all official Unicode properties, and follows
>        all strong recommendations in tr18, whereas Java does neither.
>
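
One way to see the shortcut difference from the Java side: by default \d in
Pattern matches only ASCII 0-9. The sketch below assumes a JDK that has the
Pattern.UNICODE_CHARACTER_CLASS flag (added in Java 7, after the Javadoc text
quoted here was written); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class DigitDemo {
    // U+0663 ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
    static final String ARABIC_INDIC_THREE = "\u0663";

    static boolean asciiDigit(String s) {
        return Pattern.compile("\\d").matcher(s).matches();
    }

    static boolean unicodeDigit(String s) {
        return Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiDigit("3"));                  // true
        System.out.println(asciiDigit(ARABIC_INDIC_THREE));   // false
        System.out.println(unicodeDigit(ARABIC_INDIC_THREE)); // true
    }
}
```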
>>      * In Perl, \1 through \9 are always interpreted as back
>>        references; a backslash-escaped number greater than 9 is
>>        treated as a back reference if at least that many
>>        subexpressions exist, otherwise it is interpreted, if
>>        possible, as an octal escape. In this class octal escapes
>>        must always begin with a zero. In this class, \1 through \9
>>        are always interpreted as back references, and a larger
>>        number is accepted as a back reference if at least that
>>        many subexpressions exist at that point in the regular
>>        expression, otherwise the parser will drop digits until the
>>        number is smaller or equal to the existing number of groups
>>        or it is one digit.
> I think it more important to state that Perl does not require a 0,
> and so \377 is an octal 0xFF.  BTW, the new \o{...} is unambiguously
> an octal escape just as \g{...} is unambiguously a backref group.
>
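
The Java half of the octal rule can be illustrated with a short sketch: in
Pattern an octal escape needs the leading zero, while a bare digit after the
backslash is read as a back reference. Names here are illustrative only.

```java
import java.util.regex.Pattern;

class OctalEscapeDemo {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The leading zero marks an octal escape: 0377 octal == 0xFF.
        System.out.println(fullMatch("\\0377", "\u00FF")); // true
        // 0101 octal == 65 == 'A'.
        System.out.println(fullMatch("\\0101", "A"));      // true
        // Without the zero, the digit is a back reference to group 1.
        System.out.println(fullMatch("(a)\\1", "aa"));     // true
    }
}
```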
>>      * Perl uses the g flag to request a match that resumes where
>>        the last match left off. This functionality is provided
>>        implicitly by the Matcher class: Repeated invocations of
>>        the find method will resume where the last match left off,
>>        unless the matcher is reset.
> I wish there were mention that the Matcher.matches() method adds
> implicit boundaries, while Perl does not.
>
> Russ Cox's strategy for RE (re)names that method matches_exactly(),
> to better express what it does and clear up confusion.
>
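
A short sketch of both points, the implicit anchoring of matches() and the
way repeated find() calls resume where the previous match ended. The class
and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every match; each find() resumes after the previous one,
    // which is roughly what Perl's /g flag gives you in a loop.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // matches() requires the entire input to match the pattern.
        System.out.println(DIGITS.matcher("abc123def").matches()); // false
        // find() looks for the pattern anywhere in the input.
        System.out.println(DIGITS.matcher("abc123def").find());    // true
        System.out.println(findAll("a1b22c333"));                  // [1, 22, 333]
    }
}
```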
>>      * In Perl, embedded flags at the top level of an expression
>>        affect the whole expression. In this class, embedded flags
>>        always take effect at the point at which they appear,
>>        whether they are at the top level or within a group; in the
>>        latter case, flags are restored at the end of the group
>>        just as in Perl.
>>      * Perl is forgiving about malformed matching constructs, as
>>        in the expression *a, as well as dangling brackets, as in
>>        the expression abc], and treats them as literals. This
>>        class also accepts dangling brackets but is strict about
>>        dangling metacharacters like +, ? and *, and will throw a
>>        PatternSyntaxException if it encounters them.
> This is incorrect; Perl is not forgiving about malformed matching
> constructs like the one cited above:
>
>      % perl -e '/*a/'
>      Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.
>
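
The Java side of that comparison, as a minimal sketch: Pattern.compile
rejects a leading quantifier with a PatternSyntaxException but accepts a
dangling closing bracket as a literal. The helper name is hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DanglingDemo {
    // True if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("*a"));                 // false
        System.out.println(compiles("abc]"));               // true
        // The dangling bracket is matched as a literal ']'.
        System.out.println(Pattern.matches("abc]", "abc]")); // true
    }
}
```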
> Perl also supports user-defined character name aliases for
> \N{...} and user-defined character properties for \p{...} and
> \P{...}, but Java supports neither.  Java doesn't even support
> character names at all that I can see, and Java definitely
> doesn't support the full complement of character properties as
> defined by the Unicode Character Database; Perl does.
>
> I believe that Java does not support named character sequences,
> which are part of what it takes to support Unicode 6.0 as they
> are new to that release.
>
> There may be more than this, but it's what came immediately
> to mind.
>
> Hope this helps!!
>
> --tom
>
> PS: Here's an example of using (?(DEFINE)...) to completely parse
>      an RFC 5322 email address, including nested comments. Notice
>      how much like a BNF grammar this now becomes. It's a Perl 5
>      thing that we backported from Perl 6: very clean, even beautiful.
>
>      $rfc5322 = qr{
>         (?(DEFINE)
> 	 (?<address>          (?&mailbox) | (?&group))
> 	 (?<mailbox>          (?&name_addr) | (?&addr_spec))
> 	 (?<name_addr>        (?&display_name)? (?&angle_addr))
> 	 (?<angle_addr>       (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
> 	 (?<group>            (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
> 	 (?<display_name>     (?&phrase))
> 	 (?<mailbox_list>     (?&mailbox) (?: , (?&mailbox))*)
>
> 	 (?<addr_spec>        (?&local_part) \@ (?&domain))
> 	 (?<local_part>       (?&dot_atom) | (?&quoted_string))
> 	 (?<domain>           (?&dot_atom) | (?&domain_literal))
> 	 (?<domain_literal>   (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
> 				       \] (?&CFWS)?)
> 	 (?<dcontent>         (?&dtext) | (?&quoted_pair))
> 	 (?<dtext>            (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
>
> 	 (?<atext>            (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
> 	 (?<atom>             (?&CFWS)? (?&atext)+ (?&CFWS)?)
> 	 (?<dot_atom>         (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
> 	 (?<dot_atom_text>    (?&atext)+ (?: \. (?&atext)+)*)
>
> 	 (?<text>             [\x01-\x09\x0b\x0c\x0e-\x7f])
> 	 (?<quoted_pair>      \\ (?&text))
>
> 	 (?<qtext>            (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
> 	 (?<qcontent>         (?&qtext) | (?&quoted_pair))
> 	 (?<quoted_string>    (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
> 			      (?&FWS)? (?&DQUOTE) (?&CFWS)?)
>
> 	 (?<word>             (?&atom) | (?&quoted_string))
> 	 (?<phrase>           (?&word)+)
>
> 	 # Folding white space
> 	 (?<FWS>              (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
> 	 (?<ctext>            (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
> 	 (?<ccontent>         (?&ctext) | (?&quoted_pair) | (?&comment))
> 	 (?<comment>          \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
> 	 (?<CFWS>             (?: (?&FWS)? (?&comment))*
> 			     (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
>
> 	 # No whitespace control
> 	 (?<NO_WS_CTL>        [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
>
> 	 (?<ALPHA>            [A-Za-z])
> 	 (?<DIGIT>            [0-9])
> 	 (?<CRLF>             \x0d \x0a)
> 	 (?<DQUOTE>           ")
> 	 (?<WSP>              [\x20\x09])
>         )
>
>         (?&address)
>
>      }x;


