<i18n dev> j.u.r.Pattern documentation errors
Xueming Shen
xueming.shen at oracle.com
Sun Jan 23 23:22:38 PST 2011
Thanks Tom.
That part of the doc definitely needs a revisit. It was written before 2002
(probably against Perl 5.6) and has not been touched since; much of it is no
longer true given the latest 5.12.
-Sherman
On 1-23-2011 02:14 PM, Tom Christiansen wrote:
> In this message I cover only those errors made in the final
> section ("Comparison to Perl 5") of:
>
> http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
>
> I really hope no one is offended by this. I don't mean to be
> a nitpicker. Technical errors in the documentation should be
> very very easy to correct, since no code change is required.
>
> ========================================
>
>> Comparison to Perl 5
>> The Pattern engine performs traditional NFA-based matching
>> with ordered alternation as occurs in Perl 5.
>> Perl constructs not supported by this class:
>> * The conditional constructs (?{X}) and (?(condition)X|Y),
> That should instead read:
>
> The conditional constructs (?(condition)X) and (?(condition)X|Y),
>
>> * The embedded code constructs (?{code}) and (??{code}),
>> * The embedded comment syntax (?#comment), and
>> * The preprocessing operations \l \u, \L, and \U.
> That is no longer true, as Java supports those now.
>
> There is quite a bit missing from the list of Perl constructs
> unsupported by this class.
>
> * Perl regex escapes: \x{...}, \R, \h, \H, \v, \V,
> \X, \N, \N{...}, \K, and recently \o{...}.
> [NB: My rewrite library covers the top row.]
>
> * Relative buffers like \g{-2} for $-2, or the \g{NAME}
> alias for a named backref \k<NAME>.
>
> * The branch-reset operator: (?|...)
>
> * Buffer recursion (?0) (?1) (?&NAME) etc to allow
> recursive regexes; e.g. \((?:[^()]*+|(?0))*\) matches
> nested parens.
>
> * Non-executing definition-only blocks via (?(DEFINE)...)
> to allow the separation of execution from declaration.
> See post-sig example.
>
> * Backtracking control verbs like (*MARK:NAME), (*FAIL), (*SKIP)
>
> ========================================
>
>> Constructs supported by this class but not by Perl:
>> * Possessive quantifiers, which greedily match as much as
>> they can and do not back off, even when doing so would
>> allow the overall match to succeed.
> This is not true. Perl understands the same possessive
> quantifiers that Java does.
>
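A minimal Java sketch of what "do not back off" means in practice. The class
and method names are made up for illustration; only java.util.regex.Pattern
is real. Per the note above, Perl accepts the same quantifier syntax.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class PossessiveDemo {
    // Greedy a* gives back one 'a' so the trailing 'a' can still match.
    static boolean greedyMatches(String s) {
        return Pattern.matches("a*a", s);
    }

    // Possessive a*+ keeps every 'a' it consumed, so the trailing 'a' fails.
    static boolean possessiveMatches(String s) {
        return Pattern.matches("a*+a", s);
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));     // true
        System.out.println(possessiveMatches("aaa")); // false
    }
}
```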
>> * Character-class union and intersection as described above.
> True. In Perl you have to use lookahead assertions to effect
> the same end.
>
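As a rough illustration of the lookahead workaround, here is a Java sketch
that writes "lowercase letters except vowels" both ways. The class and method
names are hypothetical. The second pattern uses only a negative lookahead, so
the same idea carries over to engines without class intersection.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class ClassIntersectionDemo {
    // Java-specific: intersect [a-z] with "anything but a vowel".
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[^aeiou]]+");

    // Portable form: reject a vowel with a lookahead, then take one [a-z].
    static final Pattern LOOKAHEAD = Pattern.compile("(?:(?![aeiou])[a-z])+");

    static boolean viaIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaIntersection("rhythm")); // true
        System.out.println(viaLookahead("rhythm"));    // true
        System.out.println(viaIntersection("rhyme"));  // false ('e' is a vowel)
        System.out.println(viaLookahead("rhyme"));     // false
    }
}
```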
>> Notable differences from Perl:
> I would certainly put these two in the very front of this section:
>
> * Perl's charclass shortcuts all work **VERY DIFFERENTLY** from
> Java's, including \w \W \s \S \d \D \b \B. [NOTE: my rewrite
> library fixes this.]
>
> * Perl supports all official Unicode properties, and follows
> all strong recommendations in tr18, whereas Java does neither.
>
>> * In Perl, \1 through \9 are always interpreted as back
>> references; a backslash-escaped number greater than 9 is
>> treated as a back reference if at least that many
>> subexpressions exist, otherwise it is interpreted, if
>> possible, as an octal escape. In this class octal escapes
>> must always begin with a zero. In this class, \1 through \9
>> are always interpreted as back references, and a larger
>> number is accepted as a back reference if at least that
>> many subexpressions exist at that point in the regular
>> expression, otherwise the parser will drop digits until the
>> number is smaller or equal to the existing number of groups
>> or it is one digit.
> I think it more important to state that Perl does not require a 0,
> and so \377 is an octal 0xFF. BTW, the new \o{...} is unambiguously
> an octal escape just as \g{...} is unambiguously a backref group.
>
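A small Java sketch of the leading-zero rule on the Java side, where the
octal escape for 0xFF has to be spelled \0377. The class and method names are
hypothetical. It deliberately does not exercise \377 in Java, since that is
parsed as a back reference there.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class OctalEscapeDemo {
    // In java.util.regex an octal escape must start with a zero: \0377 is U+00FF.
    static boolean matchesOctal377(String s) {
        return Pattern.matches("\\0377", s);
    }

    // \0101 is octal 101, decimal 65, the letter 'A'.
    static boolean matchesOctal101(String s) {
        return Pattern.matches("\\0101", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesOctal377("\u00FF")); // true
        System.out.println(matchesOctal101("A"));      // true
    }
}
```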
>> * Perl uses the g flag to request a match that resumes where
>> the last match left off. This functionality is provided
>> implicitly by the Matcher class: Repeated invocations of
>> the find method will resume where the last match left off,
>> unless the matcher is reset.
> I wish there were mention that the Matcher.matches() method adds
> implicit boundaries, while Perl does not.
>
> Russ Cox's strategy for RE (re)names that method matches_exactly(),
> to better express what it does and clear up confusion.
>
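The implicit boundaries can be seen in a short Java sketch. The class and
method names are hypothetical; matches() and find() are the real Matcher
methods. matches() succeeds only if the whole input matches, while find()
behaves like an unanchored Perl match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class MatchesVsFindDemo {
    // matches() behaves as if the pattern were wrapped in \A(?:...)\z.
    static boolean wholeInput(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() looks for the pattern anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("b", "abc"));   // false
        System.out.println(anywhere("b", "abc"));     // true
        System.out.println(wholeInput("abc", "abc")); // true
    }
}
```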
>> * In Perl, embedded flags at the top level of an expression
>> affect the whole expression. In this class, embedded flags
>> always take effect at the point at which they appear,
>> whether they are at the top level or within a group; in the
>> latter case, flags are restored at the end of the group
>> just as in Perl.
>> * Perl is forgiving about malformed matching constructs, as
>> in the expression *a, as well as dangling brackets, as in
>> the expression abc], and treats them as literals. This
>> class also accepts dangling brackets but is strict about
>> dangling metacharacters like +, ? and *, and will throw a
>> PatternSyntaxException if it encounters them.
> This is incorrect; Perl is not forgiving about malformed matching
> constructs like the one cited above:
>
> % perl -e '/*a/'
> Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE a/ at -e line 1.
>
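For comparison, a Java sketch of the behaviour on the Java side, which
matches what the quoted documentation says about this class: a leading
quantifier is rejected at compile time, while a dangling ] is taken as a
literal. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of any library.
class DanglingMetaDemo {
    // Returns true if java.util.regex accepts the pattern, false if it throws.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("*a"));   // false: dangling metacharacter
        System.out.println(compiles("abc]")); // true: ']' is treated as a literal
    }
}
```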
> Perl also supports user-defined character name aliases for
> \N{...} and user-defined character properties for \p{...} and
> \P{...}, but Java supports neither. Java doesn't even support
> character names at all that I can see, and Java definitely
> doesn't support the full complement of character properties as
> defined by the Unicode Character Database; Perl does.
>
> I believe that Java does not support named character sequences,
> which are part of what it takes to support Unicode 6.0 as they
> are new to that release.
>
> There may be more than this, but it's what came immediately
> to mind.
>
> Hope this helps!!
>
> --tom
>
> PS: Here's an example of using (?(DEFINE)...) to completely parse
> an RFC 5322 email address, including nested comments. Notice
> how much like a BNF grammar this now becomes. It's a Perl 5
> thing that we backported from Perl 6: very clean, even beautiful.
>
> $rfc5322 = qr{
> (?(DEFINE)
> (?<address> (?&mailbox) | (?&group))
> (?<mailbox> (?&name_addr) | (?&addr_spec))
> (?<name_addr> (?&display_name)? (?&angle_addr))
> (?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
> (?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
> (?<display_name> (?&phrase))
> (?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
>
> (?<addr_spec> (?&local_part) \@ (?&domain))
> (?<local_part> (?&dot_atom) | (?&quoted_string))
> (?<domain> (?&dot_atom) | (?&domain_literal))
> (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
> \] (?&CFWS)?)
> (?<dcontent> (?&dtext) | (?&quoted_pair))
> (?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
>
> (?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
> (?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
> (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
> (?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
>
> (?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
> (?<quoted_pair> \\ (?&text))
>
> (?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
> (?<qcontent> (?&qtext) | (?&quoted_pair))
> (?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
> (?&FWS)? (?&DQUOTE) (?&CFWS)?)
>
> (?<word> (?&atom) | (?&quoted_string))
> (?<phrase> (?&word)+)
>
> # Folding white space
> (?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
> (?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
> (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
> (?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
> (?<CFWS> (?: (?&FWS)? (?&comment))*
> (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
>
> # No whitespace control
> (?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
>
> (?<ALPHA> [A-Za-z])
> (?<DIGIT> [0-9])
> (?<CRLF> \x0d \x0a)
> (?<DQUOTE> ")
> (?<WSP> [\x20\x09])
> )
>
> (?&address)
>
> }x;