Raw String Literal Library Support

Tue Mar 13 13:47:29 UTC 2018

With the announcement of JEP 326 Raw String Literals, we would like to open up a discussion with regard to RSL library support. Below are several implemented String methods that we believe are appropriate. Please comment on them, including recommending alternate names or signatures. Additional methods can be considered if warranted, but as always, the bar for inclusion in String is high.

You should keep a few things in mind when reviewing these methods.

Methods should be applicable to all strings, not just Raw String Literals.

The number of additional methods should be minimized, not adding every possible method.

Don't put any emphasis on performance. That is a separate discussion.

Cheers,

-- Jim

A. Line support.

public Stream<String> lines()
Returns a stream of substrings extracted from this string partitioned by line terminators. Internally, the stream is implemented using a Spliterator that extracts one line at a time. The line terminators recognized are \n, \r\n and \r. This method provides versatility for the developer working with multi-line strings.
     Example:

        String string = "abc\ndef\nghi";
        Stream<String> stream = string.lines();
        List<String> list = stream.collect(Collectors.toList());

     Result:

     [abc, def, ghi]

     Example:

        String string = "abc\ndef\nghi";
        String[] array = string.lines().toArray(String[]::new);

     Result:

     [Ljava.lang.String;@33e5ccce // [abc, def, ghi]

     Example:

        String string = "abc\ndef\r\nghi\rjkl";
        String platformString =
            string.lines().collect(joining(System.lineSeparator()));

     Result:

     abc
     def
     ghi
     jkl

     Example:

        String string = " abc  \n   def  \n ghi   ";
        String trimmedString =
             string.lines().map(s -> s.trim()).collect(joining("\n"));

     Result:

     abc
     def
     ghi

     Example:

        String table = `First Name      Surname        Phone
                        Al              Albert         555-1111
                        Bob             Roberts        555-2222
                        Cal             Calvin         555-3333
                       `;

        // Extract headers
        String firstLine = table.lines().findFirst().orElse("");
        List<String> headings = List.of(firstLine.trim().split(`\s{2,}`));

        // Build stream of maps
        Stream<Map<String, String>> stream =
            table.lines().skip(1)
                 .map(line -> line.trim())
                 .filter(line -> !line.isEmpty())
                 .map(line -> line.split(`\s{2,}`))
                 .map(columns -> {
                     List<String> values = List.of(columns);
                     return IntStream.range(0, headings.size()).boxed()
                                     .collect(toMap(headings::get, values::get));
                 });

        // print all "First Name"
        stream.map(row -> row.get("First Name"))
              .forEach(name -> System.out.println(name));

     Result:

     Al
     Bob
     Cal
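
For anyone who wants to experiment with the splitting rule on a JDK that does not have lines() yet, here is a minimal sketch using only existing APIs. LineSplitter is a hypothetical helper, not part of the proposal, and it returns a List rather than a lazy Stream. It also assumes that a terminator at the very end of the string does not produce a trailing empty line, which the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

final class LineSplitter {
    // Hypothetical helper: splits on \n, \r\n and \r, as described for lines().
    static List<String> lines(String s) {
        List<String> result = new ArrayList<>();
        int length = s.length();
        int start = 0;
        int i = 0;
        while (i < length) {
            char ch = s.charAt(i);
            if (ch == '\n' || ch == '\r') {
                result.add(s.substring(start, i));
                // Treat a carriage return followed by a line feed as one terminator.
                if (ch == '\r' && i + 1 < length && s.charAt(i + 1) == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        // A final fragment without a terminator still counts as a line.
        if (start < length) {
            result.add(s.substring(start));
        }
        return result;
    }
}
```

With this sketch, LineSplitter.lines("abc\ndef\r\nghi\rjkl") yields the four lines abc, def, ghi and jkl.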

B. Additions to basic trim methods.

In addition to the margin methods trimIndent and trimMarkers described below in Section C, it would be worth introducing trimLeft and trimRight to augment the longstanding trim method. A key question is how trimLeft and trimRight should detect whitespace, because different definitions of whitespace exist in the library.

trim itself uses a simple test: any character less than or equal to the space character (U+0020) counts as whitespace. This is a fast test but not Unicode friendly.

Character.isWhitespace(codepoint) returns true if codepoint is one of the following:

   SPACE_SEPARATOR.
   LINE_SEPARATOR.
   PARAGRAPH_SEPARATOR.
   '\t',     U+0009 HORIZONTAL TABULATION.
   '\n',     U+000A LINE FEED.
   '\u000B', U+000B VERTICAL TABULATION.
   '\f',     U+000C FORM FEED.
   '\r',     U+000D CARRIAGE RETURN.
   '\u001C', U+001C FILE SEPARATOR.
   '\u001D', U+001D GROUP SEPARATOR.
   '\u001E', U+001E RECORD SEPARATOR.
   '\u001F', U+001F UNIT SEPARATOR.
   ' ',      U+0020 SPACE.
(Note that non-breaking space (\u00A0) is excluded.)

Character.isSpaceChar(codepoint) returns true if codepoint is one of the following:

   SPACE_SEPARATOR.
   LINE_SEPARATOR.
   PARAGRAPH_SEPARATOR.
   ' ',      U+0020 SPACE.
   '\u00A0', U+00A0 NON-BREAKING SPACE.

That sets up several kinds of whitespace: trim's whitespace (TWS), Character whitespace (CWS) and the union of the two (UWS). TWS is a fast test. CWS is a slow test. UWS is fast for Latin1 and slow-ish for UTF-16.

We recommend that trimLeft and trimRight use UWS, that trim be left alone to avoid breaking the world, and that we then possibly introduce trimWhitespace, which also uses UWS.
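
To make the three definitions concrete, here is a small sketch of them as predicates. The class and method names are hypothetical, and the sketch assumes that CWS means Character.isWhitespace, which is how the union is read here.

```java
final class WhitespaceKinds {
    // TWS: the test used by trim, anything at or below U+0020.
    static boolean isTws(int codePoint) {
        return codePoint <= ' ';
    }

    // CWS: assumed here to be Character.isWhitespace.
    static boolean isCws(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    // UWS: the union of the two tests above.
    static boolean isUws(int codePoint) {
        return isTws(codePoint) || isCws(codePoint);
    }
}
```

Under these definitions U+0000 is TWS but not CWS, U+2028 (LINE SEPARATOR) is CWS but not TWS, and U+00A0 (non-breaking space) is in neither, so it is not UWS.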

public String trim() 
Removes characters less than or equal to space from the beginning and end of the string. No change except spec clarification and links to the new trim methods.
    Examples:
        "".trim();              // ""
        "   ".trim();           // ""
        "  abc  ".trim();       // "abc"
        "  \u2028abc  ".trim(); // "\u2028abc"
public String trimWhitespace() 
Removes whitespace from the beginning and end of the string.
     Examples:

        "".trimWhitespace();              // ""
        "   ".trimWhitespace();           // ""
        "  abc  ".trimWhitespace();       // "abc"
        "  \u2028abc  ".trimWhitespace(); // "abc"
public String trimLeft()
Removes whitespace from the beginning of the string.
     Examples:

        "".trimLeft();        // ""
        "   ".trimLeft();     // ""
        "  abc  ".trimLeft(); // "abc  "
public String trimRight()
Removes whitespace from the end of the string.
     Examples:

        "".trimRight();        // ""
        "   ".trimRight();     // ""
        "  abc  ".trimRight(); // "  abc"
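
The behavior of the three new methods can be approximated today with static helpers. This is only a sketch under the assumption that whitespace means UWS as defined above; TrimSketch and its methods are hypothetical names, and the real methods would be instance methods on String.

```java
final class TrimSketch {
    // UWS: trim's test (at or below space) or Character.isWhitespace.
    private static boolean isUws(int codePoint) {
        return codePoint <= ' ' || Character.isWhitespace(codePoint);
    }

    static String trimLeft(String s) {
        int start = 0;
        while (start < s.length()) {
            int codePoint = s.codePointAt(start);
            if (!isUws(codePoint)) {
                break;
            }
            start += Character.charCount(codePoint);
        }
        return s.substring(start);
    }

    static String trimRight(String s) {
        int end = s.length();
        while (end > 0) {
            int codePoint = s.codePointBefore(end);
            if (!isUws(codePoint)) {
                break;
            }
            end -= Character.charCount(codePoint);
        }
        return s.substring(0, end);
    }

    static String trimWhitespace(String s) {
        return trimRight(trimLeft(s));
    }
}
```

TrimSketch.trimWhitespace("  \u2028abc  ") returns "abc", whereas the existing trim leaves the U+2028 in place.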

C. Margin management.

With the introduction of multi-line Raw String Literals, developers will have to deal with the extraneous spacing introduced by indenting and formatting string bodies.

Note that for all the methods in this group, if the first line is empty then it is removed and if the last line is empty then it is removed. This removal provides a means for developers that use delimiters on separate lines to bracket string bodies. Also note that all line separators are replaced with \n.

public String trimIndent()
This method determines a representative line in the string body that has a non-whitespace character closest to the left margin. Once that line has been determined, the number of leading whitespace characters is tallied to produce a minimal indent amount. Consequently, the result of the method is a string with the minimal indent amount removed from each line. The first line is unaffected since it is preceded by the open delimiter. The type of whitespace used (spaces or tabs) does not affect the result as long as the developer is consistent with the whitespace used.
     Example:

        String x = `
                   This is a line
                      This is a line
                          This is a line
                      This is a line
                   This is a line
                   `.trimIndent();

     Result:

     This is a line
         This is a line
             This is a line
         This is a line
     This is a line
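
The steps above can be sketched with existing APIs. IndentSketch is a hypothetical helper, not the proposed implementation, and it makes several assumptions that the description leaves open: "empty" first and last lines are taken to mean whitespace-only, blank lines are ignored when computing the minimal indent, and a non-empty first line gets no special treatment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndentSketch {
    // Counts leading characters that are whitespace under either definition.
    static int indentOf(String line) {
        int i = 0;
        while (i < line.length()
                && (line.charAt(i) <= ' ' || Character.isWhitespace(line.charAt(i)))) {
            i++;
        }
        return i;
    }

    static boolean isBlank(String line) {
        return indentOf(line) == line.length();
    }

    static String trimIndent(String s) {
        // Split on any line terminator, keeping trailing empty lines.
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\r\n|\r|\n", -1)));
        // Drop a blank first line and a blank last line (assumption: blank = empty).
        if (!lines.isEmpty() && isBlank(lines.get(0))) {
            lines.remove(0);
        }
        if (!lines.isEmpty() && isBlank(lines.get(lines.size() - 1))) {
            lines.remove(lines.size() - 1);
        }
        // The minimal indent is taken over the non-blank lines only.
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!isBlank(line)) {
                min = Math.min(min, indentOf(line));
            }
        }
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.length() >= min ? line.substring(min) : "");
        }
        // All line separators are replaced with \n.
        return String.join("\n", out);
    }
}
```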
public String trimMarkers(String leftMarker, String rightMarker)
Each line of the multi-line string is first trimmed. If the trimmed line contains the leftMarker at the beginning of the string then it is removed. Finally, if the line contains the rightMarker at the end of line, it is removed.
     Example:

         String x = `|This is a line|
                     |This is a line|
                     |This is a line|`.trimMarkers("|", "|");
     Result:

     This is a line
     This is a line
     This is a line

     Example:

         String x = `>> This is a line
                     >> This is a line
                     >> This is a line`.trimMarkers(">> ", "");
     Result:

     This is a line
     This is a line
     This is a line
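
Here is a rough sketch of the marker rule as a static helper. MarkerSketch is a hypothetical name; the sketch assumes each line is trimmed with the existing trim, and that only a first or last line that is empty after trimming is dropped, per the note at the top of this section.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkerSketch {
    static String trimMarkers(String s, String leftMarker, String rightMarker) {
        String[] lines = s.split("\r\n|\r|\n", -1);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            // Each line is trimmed first (assumption: using the existing trim).
            String line = lines[i].trim();
            boolean firstOrLast = i == 0 || i == lines.length - 1;
            if (firstOrLast && line.isEmpty()) {
                continue;
            }
            if (line.startsWith(leftMarker)) {
                line = line.substring(leftMarker.length());
            }
            if (line.endsWith(rightMarker)) {
                line = line.substring(0, line.length() - rightMarker.length());
            }
            out.add(line);
        }
        // All line separators are replaced with \n.
        return String.join("\n", out);
    }
}
```

Passing an empty string as either marker leaves that side of the line as it is after trimming, which matches the second example above.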

D. Escape management.

Since Raw String Literals do not interpret Unicode escapes (\unnnn) or escape sequences (\n, \b, etc.), we need to provide a scheme for developers who just want multi-line strings but still have escape sequences interpreted.

public String unescape() throws MalformedEscapeException
Translates each Unicode escape or escape sequence in the string into the character represented by the escape. @jls 3.3, 3.10.6
     Example:

         `abc\u2022def\nghi`.unescape();

     Result:

     abc•def
     ghi
public String unescape(EscapeType... escape) throws MalformedEscapeException
Selectively translates Unicode escapes or escape sequences based on the escape type flags provided.
        public enum EscapeType {
            /**
             * Backslash escape sequences based on section 3.10.6 of
             * <cite>The Java™ Language Specification</cite>.
             * This includes sequences for backspace, horizontal tab,
             * line feed, form feed, carriage return, double quote,
             * single quote, backslash and octal escape sequences.
             */
            BACKSLASH,

            /**
             * Unicode sequences based on section 3.3 of
             * <cite>The Java™ Language Specification</cite>.
             * This includes sequences in the form {@code \u005Cunnnn}.
             */
            UNICODE
        }

     Example:

         `abc\u2022def\nghi`.unescape(EscapeType.BACKSLASH);

     Result:

     abc\u2022def
     ghi

     Example:

         `abc\u2022def\nghi`.unescape(EscapeType.UNICODE);

     Result:

     abc•def\nghi
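
For discussion purposes, here is a sketch of how the selective translation could behave, written as static helpers over ordinary strings. Everything in it is an assumption rather than the proposed implementation: UnescapeSketch is a hypothetical class, it declares its own copy of EscapeType, and it throws IllegalArgumentException as a stand-in for MalformedEscapeException. The input is scanned left to right so that an escaped backslash followed by a u is not mistaken for a Unicode escape.

```java
import java.util.EnumSet;

final class UnescapeSketch {
    enum EscapeType { BACKSLASH, UNICODE }

    static String unescape(String s) {
        return unescape(s, EscapeType.BACKSLASH, EscapeType.UNICODE);
    }

    static String unescape(String s, EscapeType first, EscapeType... rest) {
        EnumSet<EscapeType> types = EnumSet.of(first, rest);
        boolean backslash = types.contains(EscapeType.BACKSLASH);
        boolean unicode = types.contains(EscapeType.UNICODE);
        StringBuilder sb = new StringBuilder(s.length());
        int n = s.length();
        int i = 0;
        while (i < n) {
            char ch = s.charAt(i++);
            if (ch != '\\') {
                sb.append(ch);
            } else if (i == n) {
                if (backslash) throw new IllegalArgumentException("dangling backslash");
                sb.append(ch);
            } else if (s.charAt(i) == 'u') {
                if (unicode) {
                    i = appendUnicode(s, i, sb);
                } else {
                    sb.append(ch); // leave the sequence as written
                }
            } else if (backslash) {
                i = appendBackslash(s, i, sb);
            } else {
                sb.append(ch).append(s.charAt(i++)); // copy the pair untouched
            }
        }
        return sb.toString();
    }

    // i points at the first 'u'; returns the index after the four hex digits.
    private static int appendUnicode(String s, int i, StringBuilder sb) {
        while (i < s.length() && s.charAt(i) == 'u') i++;
        if (i + 4 > s.length()) throw new IllegalArgumentException("truncated Unicode escape");
        int value = 0;
        for (int k = 0; k < 4; k++) {
            int digit = Character.digit(s.charAt(i + k), 16);
            if (digit < 0) throw new IllegalArgumentException("bad hex digit");
            value = value * 16 + digit;
        }
        sb.append((char) value);
        return i + 4;
    }

    // i points at the character after the backslash; returns the next index.
    private static int appendBackslash(String s, int i, StringBuilder sb) {
        char c = s.charAt(i++);
        switch (c) {
            case 'b': sb.append('\b'); break;
            case 't': sb.append('\t'); break;
            case 'n': sb.append('\n'); break;
            case 'f': sb.append('\f'); break;
            case 'r': sb.append('\r'); break;
            case '"': case '\'': case '\\': sb.append(c); break;
            default:
                if (c < '0' || c > '7') {
                    throw new IllegalArgumentException("unknown escape: \\" + c);
                }
                // Octal escape: up to three digits when the first is 0-3, else two.
                int value = c - '0';
                int maxDigits = c <= '3' ? 3 : 2;
                for (int d = 1; d < maxDigits && i < s.length()
                        && s.charAt(i) >= '0' && s.charAt(i) <= '7'; d++) {
                    value = value * 8 + (s.charAt(i++) - '0');
                }
                sb.append((char) value);
        }
        return i;
    }
}
```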

Conversely, there are circumstances where the inverse is required.

public String escape()
Translates each quote, backslash, non-graphic character or non-ASCII character into a Unicode escape or escape sequence. The method is equivalent to escape(BACKSLASH, UNICODE).
     Example:

         `abc•def
         ghi`.escape();

     Result:

     abc\u2022def\nghi
public String escape(EscapeType... escape)
Selectively translates each quote, backslash, non-graphic character or non-ASCII character into a Unicode escape or escape sequence based on the escape type flags provided.
     Example:

         `abc•def
         ghi`.escape(EscapeType.BACKSLASH);

     Result:

     abc•def\nghi

     Example:

         `abc•def
         ghi`.escape(EscapeType.UNICODE);

     Result:

     abc\u2022def
     ghi
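
A matching sketch for the escape direction follows. As before, EscapeSketch and its local EscapeType are hypothetical names. The split of responsibilities is an assumption inferred from the examples above: BACKSLASH handles quotes, backslashes and ASCII control characters (using three-digit octal where no named sequence exists), while UNICODE handles characters above U+007F.

```java
import java.util.EnumSet;

final class EscapeSketch {
    enum EscapeType { BACKSLASH, UNICODE }

    static String escape(String s) {
        return escape(s, EscapeType.BACKSLASH, EscapeType.UNICODE);
    }

    static String escape(String s, EscapeType first, EscapeType... rest) {
        EnumSet<EscapeType> types = EnumSet.of(first, rest);
        boolean backslash = types.contains(EscapeType.BACKSLASH);
        boolean unicode = types.contains(EscapeType.UNICODE);
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch > 0x7F) {
                // Non-ASCII: four hex digits per UTF-16 code unit.
                if (unicode) {
                    sb.append(String.format("\\u%04x", (int) ch));
                } else {
                    sb.append(ch);
                }
            } else if (backslash) {
                appendAscii(ch, sb);
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    private static void appendAscii(char ch, StringBuilder sb) {
        switch (ch) {
            case '\b': sb.append("\\b"); break;
            case '\t': sb.append("\\t"); break;
            case '\n': sb.append("\\n"); break;
            case '\f': sb.append("\\f"); break;
            case '\r': sb.append("\\r"); break;
            case '"':  sb.append("\\\""); break;
            case '\'': sb.append("\\'"); break;
            case '\\': sb.append("\\\\"); break;
            default:
                if (ch < 0x20 || ch == 0x7F) {
                    // Assumption: other control characters become octal escapes.
                    sb.append(String.format("\\%03o", (int) ch));
                } else {
                    sb.append(ch);
                }
        }
    }
}
```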