<i18n dev> RL1.6 Line Boundaries

Tom Christiansen tchrist at perl.com
Sun Jan 23 12:13:42 PST 2011


Java meets this requirement, but only barely.

    RL1.6 Line Boundaries

        To meet this requirement, if an implementation provides for
        line-boundary testing, it shall recognize not only CRLF, LF,
        CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).

I say "barely" because immediately below that, tr18 reads:

    Formfeed (U+000C) also normally indicates an end-of-line. For
    more information, see Chapter 3 of [Unicode].

    [...]

    A newline sequence is defined to be any of the following:

	\u000A | \u000B | \u000C | \u000D | \u0085 | \u2028 | \u2029 | \u000D\u000A

The code in java.util.regex.Pattern does *not* treat "\f" U+000C FORM FEED
or "\v" U+000B VERTICAL TAB as line terminators, even though both are
included in the newline sequence definition given immediately above.
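
Here is a small probe for checking this yourself. It is only a sketch:
the class and method names (LineTerminatorProbe, isLineTerminator) are
made up for illustration, and it infers what Pattern considers a line
terminator from where "^" matches in MULTILINE mode, rather than by
reading the Pattern source.

```java
import java.util.regex.Pattern;

public class LineTerminatorProbe {
    // In MULTILINE mode "^" matches after a line terminator, so "^b"
    // is found in "a" + sep + "b" only if Pattern treats sep as one.
    static boolean isLineTerminator(String sep) {
        return Pattern.compile("^b", Pattern.MULTILINE)
                      .matcher("a" + sep + "b")
                      .find();
    }

    public static void main(String[] args) {
        String[] names = {"LF", "VT", "FF", "CR", "NEL", "LS", "PS", "CRLF"};
        String[] seps  = {"\n", "\u000B", "\f", "\r",
                          "\u0085", "\u2028", "\u2029", "\r\n"};
        for (int i = 0; i < seps.length; i++) {
            System.out.println(names[i] + ": " + isLineTerminator(seps[i]));
        }
    }
}
```

Going by the line terminators documented for Pattern, VT and FF should
report false and the other six true.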

Below that is this strong recommendation, which Java also neglects:

    It is strongly recommended that there be a regular expression
    meta-character, such as "\R", for matching all line ending
    characters and sequences listed above (e.g. in #1). It would
    thus be shorthand for:

	( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )

Perl has supported \R for some years now, and I have implemented the
strongly recommended \R metacharacter in my Java regex rewriting library
using that definition.  This is much cleaner than having to deal with the
entire UNIX_LINES thing, which is probably why they strongly recommended it.
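
To show what that shorthand buys you, here is a minimal sketch that
spells out the tr18 expansion by hand as an ordinary Pattern string.
This is *not* the code from my rewriting library; the names
(LinebreakSketch, LINEBREAK, splitLines) are hypothetical. The only
subtlety is that the CRLF alternative has to come first, so that a CR
followed by LF is consumed as one line ending and not two.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LinebreakSketch {
    // Hand-written expansion of the recommended shorthand, taken from
    // the tr18 definition. CRLF is tried before the single characters.
    static final String LINEBREAK =
        "(?:\\u000D\\u000A"
        + "|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    private static final Pattern LINEBREAK_PATTERN = Pattern.compile(LINEBREAK);

    // Split on any tr18 newline sequence, keeping trailing empty lines.
    static String[] splitLines(String text) {
        return LINEBREAK_PATTERN.split(text, -1);
    }

    public static void main(String[] args) {
        String text = "a\r\nb\fc\u000Bd\u2028e";
        System.out.println(Arrays.toString(splitLines(text)));
    }
}
```

Unlike "^" and "$" under MULTILINE, this splits on form feed and
vertical tab as well, and it does not depend on the UNIX_LINES flag.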

--tom

