<i18n dev> RL1.6 Line Boundaries
Tom Christiansen
tchrist at perl.com
Sun Jan 23 12:13:42 PST 2011
Java meets this requirement, but only just barely.
    RL1.6 Line Boundaries

    To meet this requirement, if an implementation provides for
    line-boundary testing, it shall recognize not only CRLF, LF,
    CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).
I say "barely" because immediately below that, tr18 reads:
    Formfeed (U+000C) also normally indicates an end-of-line. For
    more information, see Chapter 3 of [Unicode].

    [...]

    A newline sequence is defined to be any of the following:

    \u000A | \u000B | \u000C | \u000D | \u0085 | \u2028 | \u2029 | \u000D\u000A
The code in j.u.r.Pattern does *not* take "\f" U+000C FORM FEED
or "\v" U+000B VERTICAL TAB into account. Both of those are included
in the newline sequence definition given immediately above.
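
One way to check this for yourself, as a rough sketch: without DOTALL, "." refuses to match anything Pattern treats as a line terminator, so probing each candidate character with "." shows which ones it recognizes. The class and method names below are made up for illustration; only java.util.regex.Pattern is the real API.

```java
import java.util.regex.Pattern;

class LineTerminatorCheck {
    // Reports whether java.util.regex treats the given one-character string
    // as a line terminator, judged by "." (no DOTALL) refusing to match it.
    static boolean isLineTerminator(String ch) {
        return !Pattern.matches(".", ch);
    }

    public static void main(String[] args) {
        String[] names = {"LF", "VT", "FF", "CR", "NEL", "LS", "PS"};
        String[] chars = {"\n", "\u000B", "\f", "\r", "\u0085", "\u2028", "\u2029"};
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " treated as line terminator: "
                    + isLineTerminator(chars[i]));
        }
    }
}
```

VT and FF are the two that come back false; the other five come back true.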
Below that is this strong recommendation, which Java also neglects:
    It is strongly recommended that there be a regular expression
    meta-character, such as "\R", for matching all line ending
    characters and sequences listed above (e.g. in #1). It would
    thus be shorthand for:

    ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
Perl has supported \R for some years now, and I have implemented the
strongly recommended \R metacharacter in my Java regex rewriting library
using that definition. This is much cleaner than having to deal with the
entire UNIX_LINES thing, and is probably why they strongly recommended it.
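
For illustration only, here is a minimal sketch of how the tr18 expansion can be dropped into a Java pattern string. This is not the code from my library; the class, the constant, and both helpers are hypothetical, and the substitution is deliberately naive (it does not cope with an escaped backslash in front of the R, or with \R inside a character class).

```java
import java.util.regex.Pattern;

class LinebreakShorthand {
    // Expansion of the recommended shorthand, taken from the tr18 definition.
    // CRLF is listed first so it is consumed as one unit, not as CR then LF.
    static final String LINEBREAK =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Hypothetical helper: literal textual substitution of \R before compiling.
    static Pattern compile(String regex) {
        return Pattern.compile(regex.replace("\\R", LINEBREAK));
    }

    // Hypothetical helper: split text on any of the tr18 newline sequences.
    static String[] splitLines(String text) {
        return Pattern.compile(LINEBREAK).split(text);
    }

    public static void main(String[] args) {
        for (String line : splitLines("a\r\nb\fc\u000Bd\u2028e")) {
            System.out.println(line);
        }
    }
}
```

Because the alternation tries CRLF before the single characters, "a\r\nb" splits into two fields with no empty one in between.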
--tom