RFR - JDK-8223775 String::stripIndent (Preview)

Tue May 21 23:19:13 UTC 2019

On 5/21/2019 2:10 PM, Jim Laskey wrote:
> Updated version http://cr.openjdk.java.net/~jlaskey/8223775/webrev-02

This webrev substantially updates the API spec, which is really a topic 
for amber-spec-experts (keep reading to see why). Cross-posting between 
-dev and -spec-experts lists is not good, so maybe we can wrap this up 
here without prolonged discussion.

API spec looks good, but it was a surprise to learn that stripIndent 
performs normalization of line terminators:

"@return string with margins removed and line terminators normalized"

The processing steps in the JEP (and the JLS) are clear that 
normalization happens before incidental white space removal. I realize 
that stripIndent performs separation and joining in such a way as to 
produce a string that looks like it was normalized prior to stripIndent, 
so the @return isn't wrong, but it's still confusing to make a big deal 
of normalization-first only for stripIndent to suggest normalization-last.
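
To make the point about separation and joining concrete, here is a minimal 
sketch. It is not the stripIndent implementation from the webrev, and the 
class and method names are made up for illustration. It uses String::lines 
(available since 11) to split on any line terminator and rejoin with LF, 
and compares the result with translating CR and CRLF up front:

```java
import java.util.stream.Collectors;

// Hypothetical illustration; not taken from the webrev.
class NormalizeOrder {
    // Normalization-first: translate CRLF, then any remaining CR, to LF.
    static String normalizeFirst(String s) {
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Normalization as a side effect: split on LF, CR, or CRLF, rejoin with LF.
    // Note: lines() drops a trailing line terminator, so this sketch only
    // matches normalizeFirst for input that does not end with one.
    static String splitAndJoin(String s) {
        return s.lines().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String mixed = "one\r\ntwo\rthree\nfour";
        // Both approaches yield the same LF-only string for this input.
        System.out.println(normalizeFirst(mixed).equals(splitAndJoin(mixed)));
    }
}
```

For this input the two results are indistinguishable, which is why the 
@return wording is defensible even though it suggests a different order.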

I think we should leave the JEP alone, since it interleaves behavior 
with motivation and examples in order to aid the reader, but we should 
align the JLS with the API:

-----
The string represented by a text block is not the literal sequence of 
characters in the content. Instead, the string represented by a text 
block is the result of applying the following transformations to the 
content, in order:

1. _Incidental white space_ is removed and line terminators are 
_normalized_, as if by execution of String::stripIndent on the 
characters in the content. [The emphasized terms are a hint that the API 
spec should define each term, which it does not currently do for the 
second term.]

2. Escape sequences are interpreted, as in a string literal.
-----
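
A small sketch of why the order of the two steps matters, assuming a JDK 
where String::stripIndent and String::translateEscapes are available 
(both are standard from JDK 15). The class name, method names, and sample 
content are hypothetical:

```java
// Hypothetical illustration of the two-step ordering; assumes a JDK that
// provides String::stripIndent and String::translateEscapes.
class TextBlockOrder {
    // Order in the proposed wording: strip incidental white space first,
    // then interpret escape sequences.
    static String stripThenEscape(String content) {
        return content.stripIndent().translateEscapes();
    }

    // Reverse order, shown for contrast only.
    static String escapeThenStrip(String content) {
        return content.translateEscapes().stripIndent();
    }

    public static void main(String[] args) {
        // Two lines of content. The first line contains the two-character
        // escape sequence backslash-n, not an actual line feed.
        String content = "    one\\n  two\n    three";

        // The escape is interpreted after stripping, so it does not
        // influence how much indentation is treated as incidental.
        System.out.println(stripThenEscape(content).equals("one\n  two\nthree"));

        // Interpreting the escape first creates a third line with less
        // indentation, which changes what gets stripped.
        System.out.println(escapeThenStrip(content).equals("  one\ntwo\n  three"));
    }
}
```

The sample content deliberately has no trailing line terminator, since a 
trailing terminator affects how much indentation stripIndent removes.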

String::indent also says "normalizes line termination characters" 
without defining the phrase. Separately, String::stripIndent is not at 
all like the strip, stripLeading, and stripTrailing methods, whose names 
sound related to it -- they would pointlessly strip the first row of 
white space (the dots in the JEP's diagrams) from a multi-line string 
and leave the other rows.
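
A quick sketch of the difference, assuming a JDK where String::stripIndent 
is available (standard from JDK 15; String::strip has been there since 
11). The class name and sample string are made up:

```java
// Hypothetical illustration; assumes String::stripIndent is available.
class StripVsStripIndent {
    // Three lines, each indented by three spaces, no trailing terminator.
    static final String INPUT = "   a\n   b\n   c";

    public static void main(String[] args) {
        // strip() treats the string as a whole: only the white space before
        // 'a' (and any after 'c') goes; the other lines stay indented.
        System.out.println(INPUT.strip().equals("a\n   b\n   c"));

        // stripIndent() works line by line and removes the common indentation.
        System.out.println(INPUT.stripIndent().equals("a\nb\nc"));
    }
}
```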

Taking all this together, I think it's time to upgrade the class-level 
spec of String: to advertise the new methods added in 11+, to show text 
blocks, and to define some terms for the benefit of multiple methods. I 
realize this wasn't on your radar, but it's inevitable -- the same thing 
happened for the class-level spec of Class when nestmates were 
introduced. So, here goes:

-----
The String class represents character strings. ~~All~~ String literals 
**and text blocks** in Java programs ~~, such as "abc",~~ are 
implemented as instances of this class.

The strings represented by this class are constant; their values cannot be 
changed after they are created. (For mutable strings, see StringBuffer 
and StringBuilder.) Because instances of `String` are immutable, they 
can be shared. For example: ...

[The example with a char[] is quite subtle for a beginner, but I'm 
skipping over it right now.]

The class String includes methods for examining individual characters of 
the sequence, for comparing strings, for searching strings, for 
extracting substrings, and for creating a copy of a string with all 
characters translated to uppercase or to lowercase. Case mapping is 
based on the Unicode Standard version specified by the Character class.

Here are some examples of how strings can be used:

          System.out.println("abc");
          String cde = "cde";
          String c = "abc".substring(2, 3);
          String d = cde.substring(1, 2);

Unless otherwise noted, methods for comparing Strings do not take locale 
into account. The Collator class provides methods for finer-grained, 
locale-sensitive String comparison.

Unless otherwise noted, passing a null argument to a constructor or 
method in this class will cause a NullPointerException to be thrown. 
[This doesn't fit anywhere. j.l.Character doesn't bother with it, even 
though its methods throw NPEs too. Maybe time to drop. We have lots more 
important stuff to say.]

### String concatenation

The Java language provides special support for the string concatenation 
operator (`+`), and for conversion of other objects to strings. For 
additional information on string concatenation and conversion, see The 
Java™ Language Specification.  [Somehow this manages to skip the valueOf 
methods, which in conjunction with things like Integer::parseInt are 
worthy of a paragraph by themselves. Future work.]

Here are some examples of string concatenation:

      String cde = "cde";
      System.out.println("abc" + cde);

[These examples are dull, and don't describe their output, and need to 
show text blocks. Future work.]

### String processing

The strings represented by this class may span multiple lines by 
including _line terminators_ among their characters. A line terminator 
is one of the following: a line feed character LF (U+000A), a carriage 
return character CR (U+000D), or a carriage return followed immediately 
by a line feed CRLF (U+000D U+000A).  [Don't want to show escape 
sequences like \n yet.]

A string has _normalized_ line terminators if LF is the only line 
terminator which appears in the string. Many methods of this class 
_normalize_ the strings they return by ensuring that CR and CRLF are 
translated to LF.

The class String also includes methods for manipulating non-alphanumeric 
characters in strings, such as converting escape sequences into 
non-graphic characters, and stripping white space.  [This paragraph is a 
jumping off point for describing stripIndent's special relationship with 
text blocks.]

### Unicode

A String represents a string in the UTF-16 format in which supplementary 
characters are represented by surrogate pairs (see the section Unicode 
Character Representations in the Character class for more information). 
Index values refer to char code units, so a supplementary character uses 
two positions in a String.

The String class provides methods for dealing with Unicode code points 
(i.e., characters), in addition to those for dealing with Unicode code 
units (i.e., char values).
-----
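
Two of the terms above can be checked mechanically. The following is only 
a sketch: hasNormalizedLineTerminators is a hypothetical helper, not a 
proposed API, and it relies on the observation that both CR and CRLF 
contain U+000D. The second half uses existing String and Character 
methods to show the code unit versus code point distinction:

```java
// Hypothetical illustration of the proposed definitions.
class SpecTerms {
    // LF is the only line terminator present exactly when the string
    // contains no CR, because both CR and CRLF include U+000D.
    static boolean hasNormalizedLineTerminators(String s) {
        return s.indexOf('\r') < 0;
    }

    // Builds a one-code-point string from a supplementary code point.
    static String supplementary() {
        return new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        System.out.println(hasNormalizedLineTerminators("a\nb\n"));   // true
        System.out.println(hasNormalizedLineTerminators("a\r\nb"));   // false
        System.out.println(hasNormalizedLineTerminators("a\rb"));     // false

        // A supplementary character occupies two char positions
        // (a surrogate pair) but counts as one code point.
        String s = supplementary();
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
    }
}
```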

Alex