Raw String Literal Library Support
Stuart Marks
stuart.marks at oracle.com
Wed Mar 14 23:55:01 UTC 2018
Hi Jim,
Some comments (really, mainly just quibbles) about string trimming. First,
* String.trim trims characters <= \u0020 from each end of a string. I agree that
String.trim should be preserved unchanged for compatibility purposes.
* The trimLeft, trimRight, and trimWhitespace (which trims both ends) methods
make sense. These three should all use the same definition of whitespace.
* My issue concerns what definition of whitespace they use.
What you outlined in the quoted section below doesn't line up with the
definitions in the API spec.
The existing methods Character.isSpaceChar(codepoint) and
Character.isWhitespace(codepoint) are both well defined, but they implement
somewhat different notions of whitespace.
**
The Character.isSpaceChar method returns true if the code point is a member of
any of these categories:
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
In JDK 10, which conforms to Unicode 8.0.0, the SPACE_SEPARATOR category
includes the following characters:
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
The LINE_SEPARATOR category contains this one character:
U+2028 LINE SEPARATOR
And the PARAGRAPH_SEPARATOR category contains just this one character:
U+2029 PARAGRAPH SEPARATOR
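To make the category membership concrete, here is a small sketch that probes Character.isSpaceChar with a handful of code points. The class and method names are only for illustration; the samples are picked from the lists above plus two ASCII control characters for contrast.

```java
public class SpaceCharDemo {
    // Thin wrapper so the check can be reused; the name is illustrative only.
    static boolean isSpace(int codePoint) {
        return Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR,
        // then HORIZONTAL TABULATION and LINE FEED for contrast.
        int[] samples = { 0x0020, 0x00A0, 0x2028, 0x2029, 0x0009, 0x000A };
        for (int cp : samples) {
            System.out.printf("U+%04X isSpaceChar=%b%n", cp, isSpace(cp));
        }
    }
}
```

The first four samples report true. Tab and line feed report false, since they are control characters rather than members of any separator category.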
**
Meanwhile, the Character.isWhitespace method returns true if the code point is
in one of these categories:
SPACE_SEPARATOR, excluding
    U+00A0 NO-BREAK SPACE
    U+2007 FIGURE SPACE
    U+202F NARROW NO-BREAK SPACE
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
or if it is one of these characters:
U+0009 HORIZONTAL TABULATION.
U+000A LINE FEED.
U+000B VERTICAL TABULATION.
U+000C FORM FEED.
U+000D CARRIAGE RETURN.
U+001C FILE SEPARATOR.
U+001D GROUP SEPARATOR.
U+001E RECORD SEPARATOR.
U+001F UNIT SEPARATOR.
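The difference between the two methods is easiest to see side by side. The following is a rough sketch (the class and the describe helper are made up for this example) that prints both answers for a few of the code points listed above.

```java
public class WhitespaceDemo {
    // Formats both predicates for one code point; helper name is illustrative.
    static String describe(int cp) {
        return String.format("U+%04X isSpaceChar=%b isWhitespace=%b",
                cp, Character.isSpaceChar(cp), Character.isWhitespace(cp));
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0009, // HORIZONTAL TABULATION: whitespace only
            0x001C, // FILE SEPARATOR: whitespace only
            0x0020, // SPACE: both
            0x00A0, // NO-BREAK SPACE: space char only
            0x2007, // FIGURE SPACE: space char only
            0x202F, // NARROW NO-BREAK SPACE: space char only
            0x2028, // LINE SEPARATOR: both
            0x3000  // IDEOGRAPHIC SPACE: both
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

The three no-break spaces are the only samples where isSpaceChar is true and isWhitespace is false. The ASCII control characters go the other way.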
**
You mentioned several different definitions of whitespace:
- trim's whitespace (TWS): chars <= U+0020
- Character's whitespace (CWS): I'm not sure what you meant by this
- union whitespace (UWS): union of TWS and CWS
I don't think we should be creating a new definition of whitespace, such as UWS,
if at all possible. TWS is strange in that it contains a bunch of control
characters that aren't necessarily whitespace, and it omits Unicode whitespace.
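As a quick illustration of both halves of that complaint, here is a sketch using the existing String.trim. The visible helper is hypothetical and only exists to make non-printing characters show up in the output.

```java
public class TrimDemo {
    // Renders anything outside printable ASCII as <U+XXXX> so it is visible.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 0x20 && cp < 0x7F) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("<U+%04X>", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0001 is a control character, not whitespace, yet trim removes it.
        System.out.println(visible("\u0001 abc \u0001".trim()));
        // prints: abc

        // LINE SEPARATOR and IDEOGRAPHIC SPACE are above U+0020, so trim keeps them.
        System.out.println(visible("\u2028abc\u3000".trim()));
        // prints: <U+2028>abc<U+3000>
    }
}
```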
Character.isSpaceChar includes various no-break spaces, which I don't think
should be trimmed away, and it also omits various ASCII white space characters,
which I think most programmers would find surprising.
Finally, Character.isWhitespace includes the ASCII whitespace characters and
Unicode space separators, but excludes no-break spaces. This makes the most
sense to me. So, how about we define trimLeft, trimRight, and trimWhitespace all
in terms of Character.isWhitespace?
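To show what that would mean in practice, here is a minimal sketch written as static helpers rather than String methods. It is not a proposed implementation, just an illustration assuming the three methods test each code point with Character.isWhitespace; the class name is made up.

```java
public class IsWhitespaceTrim {
    // Drops leading code points for which Character.isWhitespace is true.
    static String trimLeft(String s) {
        int start = 0;
        while (start < s.length()) {
            int cp = s.codePointAt(start);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        return s.substring(start);
    }

    // Drops trailing code points for which Character.isWhitespace is true.
    static String trimRight(String s) {
        int end = s.length();
        while (end > 0) {
            int cp = s.codePointBefore(end);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(0, end);
    }

    // Trims both ends using the same definition of whitespace.
    static String trimWhitespace(String s) {
        return trimRight(trimLeft(s));
    }

    public static void main(String[] args) {
        System.out.println("[" + trimWhitespace(" \u2028abc ") + "]"); // [abc]
        System.out.println("[" + trimLeft(" abc ") + "]");             // [abc ]
        System.out.println("[" + trimRight(" abc ") + "]");            // [ abc]
        // No-break spaces are not whitespace under this definition, so they stay.
        System.out.println(trimWhitespace("\u00A0abc\u00A0").length()); // 5
    }
}
```

With this definition the ASCII cases match the examples in the quoted proposal, U+2028 is trimmed, and no-break spaces are left alone.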
s'marks
On 3/13/18 6:47 AM, Jim Laskey wrote:
> B. Additions to basic trim methods. In addition to margin methods trimIndent and trimMarkers described below in Section C, it would be worth introducing trimLeft and trimRight to augment the longstanding trim method. A key question is how trimLeft and trimRight should detect whitespace, because different definitions of whitespace exist in the library.
>
> trim itself uses the simple test less than or equal to the space character, a fast test but not Unicode friendly.
>
> Character.isWhitespace(codepoint) returns true if codepoint is one of the following:
>
> SPACE_SEPARATOR.
> LINE_SEPARATOR.
> PARAGRAPH_SEPARATOR.
> '\t', U+0009 HORIZONTAL TABULATION.
> '\n', U+000A LINE FEED.
> '\u000B', U+000B VERTICAL TABULATION.
> '\f', U+000C FORM FEED.
> '\r', U+000D CARRIAGE RETURN.
> '\u001C', U+001C FILE SEPARATOR.
> '\u001D', U+001D GROUP SEPARATOR.
> '\u001E', U+001E RECORD SEPARATOR.
> '\u001F', U+001F UNIT SEPARATOR.
> ' ', U+0020 SPACE.
> (Note that non-breaking space (\u00A0) is excluded.)
>
> Character.isSpaceChar(codepoint) returns true if codepoint is one of the following:
>
> SPACE_SEPARATOR.
> LINE_SEPARATOR.
> PARAGRAPH_SEPARATOR.
> ' ', U+0020 SPACE.
> '\u00A0', U+00A0 NON-BREAKING SPACE.
>
> That sets up several kinds of whitespace: trim's whitespace (TWS), Character whitespace (CWS), and the union of the two (UWS). TWS is a fast test. CWS is a slow test. UWS is fast for Latin1 and slow-ish for UTF-16.
>
> We are recommending that trimLeft and trimRight use UWS, leave trim alone to avoid breaking the world and then possibly introduce trimWhitespace that uses UWS.
>
> public String trim()
> Removes characters less than or equal to space from the beginning and end of the string. No change except spec clarification and links to the new trim methods.
> Examples:
> "".trim(); // ""
> " ".trim(); // ""
> " abc ".trim(); // "abc"
> " \u2028abc ".trim(); // "\u2028abc"
> public String trimWhitespace()
> Removes whitespace from the beginning and end of the string.
> Examples:
>
> "".trimWhitespace(); // ""
> " ".trimWhitespace(); // ""
> " abc ".trimWhitespace(); // "abc"
> " \u2028abc ".trimWhitespace(); // "abc"
> public String trimLeft()
> Removes whitespace from the beginning of the string.
> Examples:
>
> "".trimLeft(); // ""
> " ".trimLeft(); // ""
> " abc ".trimLeft(); // "abc "
> public String trimRight()
> Removes whitespace from the end of the string.
> Examples:
>
> "".trimRight(); // ""
> " ".trimRight(); // ""
> " abc ".trimRight(); // " abc"