Raw String Literal Library Support

Wed Mar 14 23:55:01 UTC 2018

Hi Jim,

Some comments (really, mainly just quibbles) about string trimming. First,

* String.trim trims characters <= \u0020 from each end of a string. I agree that 
String.trim should be preserved unchanged for compatibility purposes.

* The trimLeft, trimRight, and trimWhitespace (which trims both ends) methods 
make sense. These three should all use the same definition of whitespace.

* My issue concerns what definition of whitespace they use.

What you outlined in the quoted section below doesn't line up with the 
definitions in the API spec.

The existing methods Character.isSpaceChar(codepoint) and 
Character.isWhitespace(codepoint) are both well-defined, but they embody 
somewhat different notions of whitespace.

**

The Character.isSpaceChar method returns true if the code point is a member of 
any of these categories:

     SPACE_SEPARATOR
     LINE_SEPARATOR
     PARAGRAPH_SEPARATOR

In JDK 10, which conforms to Unicode 8.0.0, the SPACE_SEPARATOR category 
includes the following characters:

     U+0020 SPACE
     U+00A0 NO-BREAK SPACE
     U+1680 OGHAM SPACE MARK
     U+2000 EN QUAD
     U+2001 EM QUAD
     U+2002 EN SPACE
     U+2003 EM SPACE
     U+2004 THREE-PER-EM SPACE
     U+2005 FOUR-PER-EM SPACE
     U+2006 SIX-PER-EM SPACE
     U+2007 FIGURE SPACE
     U+2008 PUNCTUATION SPACE
     U+2009 THIN SPACE
     U+200A HAIR SPACE
     U+202F NARROW NO-BREAK SPACE
     U+205F MEDIUM MATHEMATICAL SPACE
     U+3000 IDEOGRAPHIC SPACE

The LINE_SEPARATOR category contains this one character:

     U+2028 LINE SEPARATOR

And the PARAGRAPH_SEPARATOR category contains just this one character:

     U+2029 PARAGRAPH SEPARATOR
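
To make the isSpaceChar categories above concrete, here is a small sketch that 
probes a few code points. The class name SpaceCharDemo and its wrapper method 
are my own, purely for illustration; the only library call is 
Character.isSpaceChar(int).

```java
// Illustrative only: SpaceCharDemo is a hypothetical class name.
final class SpaceCharDemo {
    // Thin wrapper so every sample code point goes through the same check.
    static boolean isSpaceChar(int codePoint) {
        return Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, FIGURE SPACE, LINE SEPARATOR,
        // PARAGRAPH SEPARATOR, then TAB and LINE FEED (written as hex ints).
        int[] samples = { 0x0020, 0x00A0, 0x2007, 0x2028, 0x2029, 0x0009, 0x000A };
        for (int cp : samples) {
            System.out.printf("U+%04X isSpaceChar=%b%n", cp, isSpaceChar(cp));
        }
    }
}
```

The separators, including the no-break spaces, should report true, while TAB 
and LINE FEED should report false because they are control characters, not 
members of any separator category.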

**

Meanwhile, the Character.isWhitespace method returns true if the code point is 
in one of these categories:

     SPACE_SEPARATOR, excluding
         U+00A0 NO-BREAK SPACE
         U+2007 FIGURE SPACE
         U+202F NARROW NO-BREAK SPACE
     LINE_SEPARATOR
     PARAGRAPH_SEPARATOR

or if it is one of these characters:

     U+0009 HORIZONTAL TABULATION
     U+000A LINE FEED
     U+000B VERTICAL TABULATION
     U+000C FORM FEED
     U+000D CARRIAGE RETURN
     U+001C FILE SEPARATOR
     U+001D GROUP SEPARATOR
     U+001E RECORD SEPARATOR
     U+001F UNIT SEPARATOR
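
The same kind of probe for isWhitespace, again as a sketch: WhitespaceDemo and 
its wrapper are hypothetical names of mine, and the only library call is 
Character.isWhitespace(int). It checks the three excluded no-break spaces 
alongside a few characters that are included.

```java
// Illustrative only: WhitespaceDemo is a hypothetical class name.
final class WhitespaceDemo {
    // Thin wrapper so every sample code point goes through the same check.
    static boolean isWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    public static void main(String[] args) {
        // Included: TAB, FILE SEPARATOR, SPACE, EM SPACE, LINE SEPARATOR.
        int[] included = { 0x0009, 0x001C, 0x0020, 0x2003, 0x2028 };
        // Excluded: NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE.
        int[] excluded = { 0x00A0, 0x2007, 0x202F };
        for (int cp : included) {
            System.out.printf("U+%04X isWhitespace=%b%n", cp, isWhitespace(cp));
        }
        for (int cp : excluded) {
            System.out.printf("U+%04X isWhitespace=%b%n", cp, isWhitespace(cp));
        }
    }
}
```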

**

You mentioned several different definitions of whitespace:

  - trim's whitespace (TWS): chars <= U+0020
  - Character's whitespace (CWS): I'm not sure what you meant by this
  - union whitespace (UWS): union of TWS and CWS

I'd rather we not create a new definition of whitespace, such as UWS, if we can 
avoid it. TWS is strange in that it contains a bunch of control characters that 
aren't necessarily whitespace, and it omits Unicode whitespace. 
Character.isSpaceChar includes various no-break spaces, which I don't think 
should be trimmed away, and it also omits various ASCII whitespace characters 
(tab and newline, for example), which I think most programmers would find 
surprising.
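
A quick way to see the oddity of TWS is to trim a string that starts with a 
control character and ends with a Unicode space. This is only a sketch; 
TrimComparison and trimmed are names I made up, and the behavior shown is just 
that of the existing String.trim.

```java
// Illustrative only: TrimComparison is a hypothetical class name.
final class TrimComparison {
    // String.trim strips chars <= U+0020 from both ends and nothing else.
    static String trimmed(String s) {
        return s.trim();
    }

    public static void main(String[] args) {
        // U+0001 is a control character, not whitespace, yet trim removes it.
        // U+2003 EM SPACE is Unicode whitespace, yet trim leaves it in place.
        String s = "\u0001abc\u2003";
        String t = trimmed(s);
        System.out.println(t.length());                     // 4
        System.out.println(Character.isWhitespace(0x0001)); // false
        System.out.println(Character.isWhitespace(0x2003)); // true
    }
}
```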

Finally, Character.isWhitespace includes the ASCII whitespace characters and 
Unicode space separators, but excludes no-break spaces. This makes the most 
sense to me. So, how about we define trimLeft, trimRight, and trimWhitespace all 
in terms of Character.isWhitespace?
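
Roughly what I have in mind, as a minimal sketch and not a proposed 
implementation: the static helpers below are hypothetical stand-ins for the 
trimLeft, trimRight, and trimWhitespace methods under discussion. TrimSketch 
and the two private index helpers are names of my own. The scan works on code 
points so that supplementary characters are handled as a unit.

```java
// A sketch only: static stand-ins for the proposed String methods.
final class TrimSketch {
    // Index of the first code point for which Character.isWhitespace is false.
    private static int firstNonWhitespace(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    // Index just past the last non-whitespace code point, never below start.
    private static int lastNonWhitespaceEnd(String s, int start) {
        int end = s.length();
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return end;
    }

    static String trimLeft(String s) {
        return s.substring(firstNonWhitespace(s));
    }

    static String trimRight(String s) {
        return s.substring(0, lastNonWhitespaceEnd(s, 0));
    }

    static String trimWhitespace(String s) {
        int start = firstNonWhitespace(s);
        return s.substring(start, lastNonWhitespaceEnd(s, start));
    }

    public static void main(String[] args) {
        System.out.println("[" + trimLeft("  abc  ") + "]");             // [abc  ]
        System.out.println("[" + trimRight("  abc  ") + "]");            // [  abc]
        System.out.println("[" + trimWhitespace("  \u2028abc  ") + "]"); // [abc]
    }
}
```

With this definition a leading or trailing U+00A0 NO-BREAK SPACE is left in 
place, which is the behavior I'm arguing for.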

s'marks

On 3/13/18 6:47 AM, Jim Laskey wrote:
> B. Additions to basic trim methods. In addition to margin methods trimIndent and trimMarkers described below in Section C, it would be worth introducing trimLeft and trimRight to augment the longstanding trim method. A key question is how trimLeft and trimRight should detect whitespace, because different definitions of whitespace exist in the library.
> 
> trim itself uses the simple test less than or equal to the space character, a fast test but not Unicode friendly.
> 
> Character.isWhitespace(codepoint) returns true if codepoint is one of the following:
> 
>     SPACE_SEPARATOR.
>     LINE_SEPARATOR.
>     PARAGRAPH_SEPARATOR.
>     '\t',     U+0009 HORIZONTAL TABULATION.
>     '\n',     U+000A LINE FEED.
>     '\u000B', U+000B VERTICAL TABULATION.
>     '\f',     U+000C FORM FEED.
>     '\r',     U+000D CARRIAGE RETURN.
>     '\u001C', U+001C FILE SEPARATOR.
>     '\u001D', U+001D GROUP SEPARATOR.
>     '\u001E', U+001E RECORD SEPARATOR.
>     '\u001F', U+001F UNIT SEPARATOR.
>     ' ',      U+0020 SPACE.
> (Note that non-breaking space (\u00A0) is excluded.)
> 
> Character.isSpaceChar(codepoint) returns true if codepoint is one of the following:
> 
>     SPACE_SEPARATOR.
>     LINE_SEPARATOR.
>     PARAGRAPH_SEPARATOR.
>     ' ',      U+0020 SPACE.
>     '\u00A0', U+00A0 NON-BREAKING SPACE.
> That sets up several kinds of whitespace; trim's whitespace (TWS), Character whitespace (CWS) and the union of the two (UWS). TWS is a fast test. CWS is a slow test. UWS is fast for Latin1 and slow-ish for UTF-16.
> 
> We are recommending that trimLeft and trimRight use UWS, leave trim alone to avoid breaking the world and then possibly introduce trimWhitespace that uses UWS.
> 
> public String trim()
> Removes characters less than or equal to space from the beginning and end of the string. No change except spec clarification and links to the new trim methods.
>      Examples:
>          "".trim();              // ""
>          "   ".trim();           // ""
>          "  abc  ".trim();       // "abc"
>          "  \u2028abc  ".trim(); // "\u2028abc"
> public String trimWhitespace()
> Removes whitespace from the beginning and end of the string.
>       Examples:
> 
>          "".trimWhitespace();              // ""
>          "   ".trimWhitespace();           // ""
>          "  abc  ".trimWhitespace();       // "abc"
>          "  \u2028abc  ".trimWhitespace(); // "abc"
> public String trimLeft()
> Removes whitespace from the beginning of the string.
>       Examples:
> 
>          "".trimLeft();        // ""
>          "   ".trimLeft();     // ""
>          "  abc  ".trimLeft(); // "abc  "
> public String trimRight()
> Removes whitespace from the end of the string.
>       Examples:
> 
>          "".trimRight();        // ""
>          "   ".trimRight();     // ""
>          "  abc  ".trimRight(); // "  abc"