Enhancing Java String Literals Round 2

Jim Laskey james.laskey at oracle.com
Wed Jan 2 19:00:05 UTC 2019


> 
> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2/index.html
> http://cr.openjdk.java.net/~jlaskey/Strings/RTL2.pdf
> First of all, I would like to apologize for leading us down the garden path re Java Raw String Literals. I jumped into this feature fully enamoured with the JavaScript equivalent and asking, "Why can't we have this in Java?" As the proposal evolved, it became clear that what we came up with was not a good Java solution. I underestimated the concern that the original proposal was too left field and did not fit into Java very well. It's somewhat ironic that the backtick looks like a thorn.
> 
> So, let's start the new year with a structured approach to the enhanced string literal design. Brian gave a summary of why the old design fails. Starting with this summary, Brian and I talked out a series of critical decision points that should be given thought, if not answers, before we propose a new design. As an exercise, I supplemented these points and created a series of small decision trees (a full-on decision tree would be complex and not very helpful). I found these trees good intuition pumps for getting the design at least 80% there. Hopefully, this exercise will help you in the same way.
> 
> 
> 
> 
> Even the label Raw String Literal put the emphasis on the wrong part of the feature. What developers really want is multi-line strings. They want to be able to paste alien source into their Java programs with as little fuss as possible.
> 
> String raw-ness (not translating escapes) is a tangential aspect that may or may not be needed to implement multi-line strings. Yes, the regex and Windows file path arguments in JEP 326 are still valid, but this aspect needs to be separated from the main part of the design. Further in the discussion, we'll see that raw-ness is really a many-headed hydra, best slain one head at a time.
> 
> 
> 
> 
> We have to be honest. We know Java's primary market. Sure, we want to embed Java in Java for writing tests. Sure, there is JavaScript and CSS in web pages. Nevertheless, most uses of multi-line strings will be for non-complex grammars. Specifically, grammars that don't require special handling of multi-character delimiter sequences. If you can accept this, then the solution set is much smaller.
> 
> 
> 
> 
> This is an easy one. Familiarity is key to feature education. Radical wandering off with new syntax is not helpful to anyone but bloggers and authors.
> 
> 
> 
> 
> If you buy into the familiarity argument, then double quote is really the only choice for a delimiter. Double quote already indicates a string literal. Single quote indicates a character. We don’t want to gratuitously burn unused symbols like backtick. Backslash works for regex but maybe not for others. Combinations and nonces just introduce new noise when our original goal was to reduce noise and complexity.
> 
> 
> 
> 
> Other languages avoid delimiter escape sequences by doubling up. For example, "abc""def" -> abc"def. This concept is unfamiliar to Java developers, so why change now? Escape sequences are what we know.
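To make the contrast concrete, here is a minimal sketch of how a doubled-delimiter body would be decoded, next to the escape-sequence form Java already uses. The class and method names (`DoubledQuoteSketch`, `decodeDoubled`) are hypothetical and exist only for this illustration; they are not part of any proposal.

```java
// Hypothetical illustration only: decodes a literal body written in the
// doubled-delimiter style, where "" inside the body stands for one ".
final class DoubledQuoteSketch {

    static String decodeDoubled(String body) {
        // String.replace scans left to right, so """" decodes to "".
        return body.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        String doubledBody = "abc\"\"def"; // the body abc""def as another language would spell it
        String javaStyle = "abc\"def";     // the same content using Java's existing \" escape
        System.out.println(decodeDoubled(doubledBody).equals(javaStyle)); // prints true
    }
}
```

Both spellings denote the same five letters and one quote; the difference is only which convention the reader has to learn.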
> 
> 
> 
> 
> Language designers got very nervous when I suggested infinite delimiter sequences in the original proposal; lexically sacrilegious. I felt strongly that it was easy to explain and only 1 in 1M developers would ever use more than 4-5 character delimiter sequences. In round two, I have come to agree. This was taking on more complexity than is really warranted, for a use case that doesn’t come along very often. I suggest we only need single and triple double quotes. A single double quote works today, so no argument there. Double double quotes means empty string, no problem. Triple double quotes are only necessary to avoid having to escape quotes in alien source.
> 
> String json = """
>                 {
>                   "name": "Jean Smith",
>                   "age": 32,
>                   "location": "San Jose"
>                 }
>               """;
> 
> versus
> 
> String json = "
>                 {
>                   \"name\": \"Jean Smith\",
>                   \"age\": 32,
>                   \"location\": \"San Jose\"
>                 }
>               ";
> 
> This second case is where we wandered off the tracks with raw-ness. We assumed raw-ness is necessary to avoid all the backslashes. Most cases can be handled with triple double quotes.
> 
> Okay, so why not more combinations? Simply because, most of the time they are not needed. On the rare occasion we do have nested triple double quotes, we can then use escape sequences.
> 
> String nestedJSON = """
>                         \"\"\"
>                         {
>                           "name": "Jean Smith",
>                           "age": 32,
>                           "location": "San Jose"
>                         }
>                         \"\"\";
>                     """;
> 
> or better yet, you only have to escape every third double quote
> 
> String nestedJSON = """
>                         \"""
>                         {
>                           "name": "Jean Smith",
>                           "age": 32,
>                           "location": "San Jose"
>                         }
>                         \""";
>                     """;
> 
> Not so evil and it's familiar.
> 
> 
> 
> 
> Meaning, you can only use single double quotes for simple strings and triple double quotes for multi-line strings. I don't have a strong opinion other than it seems like an unneeded restriction. The only argument I've heard has been for better error recovery when missing a close delimiter during parsing. My counter for that argument is that if you are processing multi-line strings then you can easily track the first newline after the opening delimiter and recover from there. I implemented that recovery in javac and it worked out well.
> 
> 
> 
> 
> 
> Cooked (translated escape sequences) should be the default. Why should a multi-line string be different from a simple string? We have a solution for embedding double quote. Single quotes don't require escaping. Tabs and newlines can exist as is. Unicode characters can be either an escape sequence or the Unicode character. So the only problem case is backslash. I would argue that the rare backslash can be escaped. If not, then the developer can use the raw-ness solution.
> 
> 
> 
> 
> If we don't translate newlines, then source is not transferable across platforms. That is, a source from one platform may not execute the same way on another platform. Translating consistently guarantees execution consistency. As a note, programming languages that didn't translate newlines in multi-line string literals typically regretted it later (Python.)
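As a rough sketch of what consistent translation could mean, the snippet below maps the common platform line terminators (CR LF and lone CR) to a single LF. The name `normalizeNewlines` is hypothetical, and the choice of LF as the target is an assumption made for illustration; the message above does not specify which terminator a translated literal would use.

```java
// Hypothetical illustration only: normalizes line terminators so that a
// multi-line literal has the same content regardless of the platform the
// source file was saved on. LF as the target is an assumption.
final class NewlineSketch {

    static String normalizeNewlines(String s) {
        // Handle CR LF first so it becomes one LF, then any remaining lone CR.
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsStyle = "line1\r\nline2\r\n";
        String unixStyle = "line1\nline2\n";
        System.out.println(normalizeNewlines(windowsStyle).equals(unixStyle)); // prints true
    }
}
```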
> 
> 
> 
> 
> With the original Raw String Literal proposal, there was concern about leading and trailing nested delimiters. If we default to cooked strings, then we can use \".
> 
> 
> 
> 
> These questions have been answered numerous times and fall into the realm of library support. Same arguments as before, same outcome.
> 
> 
> To summarize the bold paths at this point:
> 	- multi-line strings are an extension of traditional simple strings
> 	- newlines in a string are no longer an error and the string can extend across several lines
> 	- error recovery can pick up at the first newline after the opening delimiter 
> 	- multi-line strings process escape sequences (including Unicode) in the same way as simple strings
> 	- multiple double quotes are handled with escape sequences
> 	- triple double quote delimiter is introduced to avoid escaping simple double quote sequences
> 	
> Generally, I think this is very much in the traditional Java spirit.
> 
> 
> Now, let's move on to the lesser but more interesting issue. As I stated above, raw-ness is a multi-headed beast. Raw-ness involves turning off the translation of
> 	- escape sequences
> 	- Unicode escapes
> 	- delimiter sequences
> 	- escape sequence prefix (backslash)
> 	- tabs and newlines (control characters in general)
> 
> Sometimes we need all of the translations, sometimes few and sometimes none. In the multi-line discussion above, we see we don't need raw as much as we might have expected. Maybe for occasional backslashes, as in regex and Windows path strings.
> 
> 
> 
> 
> 
> The original Raw String Literal proposal suggested that raw-ness was a property of the whole string literal and thus we proposed an alternate delimiter syntax just to emphasize that fact. If we accept the bold path of the multi-line discussion above, then an alternate delimiter is out. This leaves prefixing as the best option to bless a string literal with raw-ness.
> 
> At this point, I would like to suggest an alternate, maybe progressive way to think of raw-ness. Since the original proposal, I have been thinking of raw-ness as a state of processing the literal. State is certainly obvious in the scanner implementation, so why not raise that to the language level? If it is a state then we should be able to enter and leave that state in some way. Escape sequences are an obvious way of transitioning translation in the string. \- and \+ are available and not currently recognized as valid escape sequences, so why not use \- and \+ to toggle escape processing?
> 
>     String a = "cooked \-raw\+ cooked";  // cooked raw cooked - a little odd but not so much so
>     String b = "abc\-\\\\\+def";         // abc\\\\def - struggling
>     String c = "\-abc\\\\def";           // abc\\\\def - more readable as an inner prefix
>     String d = "abc\-\-def\+\+ghi";      // abc\-def\+ghi - raw on "\-" is "\" and "-", raw off "\+" is "\" and "+"
>     String e = """\-"abc"\+""";          // "abc" - \- and \+ act as no-ops of sorts
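The state idea can be sketched as a small two-state translator over the body of a literal. This is an illustration under stated assumptions, not the javac scanner: the names `RawToggleSketch` and `translate` are hypothetical, only a handful of cooked escapes (`\n`, `\t`, `\"`, `\\`) are handled, and unrecognized escapes in the cooked state (including `\+`) are kept verbatim, which is what example d above appears to imply.

```java
// Hypothetical illustration only: a two-state (cooked/raw) translator for
// the body of a string literal, with \- entering raw and \+ leaving it.
final class RawToggleSketch {

    static String translate(String body) {
        StringBuilder out = new StringBuilder();
        boolean raw = false; // literals start in the cooked state
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            // Ordinary characters (and a trailing backslash) are copied as is.
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = body.charAt(i + 1);
            if (raw) {
                if (next == '+') {
                    raw = false;   // \+ leaves the raw state and emits nothing
                    i += 2;
                } else {
                    out.append(c); // any other backslash is literal while raw
                    i++;
                }
            } else {
                switch (next) {
                    case '-':  raw = true;        break; // \- enters the raw state
                    case 'n':  out.append('\n');  break;
                    case 't':  out.append('\t');  break;
                    case '"':  out.append('"');   break;
                    case '\\': out.append('\\');  break;
                    // Assumption: unknown escapes, including \+ while cooked,
                    // are kept verbatim as two characters.
                    default:   out.append(c).append(next);
                }
                i += 2;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The body of example a, written with today's Java escapes.
        System.out.println(translate("cooked \\-raw\\+ cooked")); // prints cooked raw cooked
    }
}
```

Fed the bodies of examples a, b, d and e, this sketch produces the results given in their comments, which suggests the toggle can be modelled as a single boolean in the scanner.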
> 
> Comparing property vs state:
>     
>     Runtime.getRuntime().exec(R""" "C:\Program Files\foo" bar""".strip());
>     Runtime.getRuntime().exec("""\-"C:\Program Files\foo" bar""");
>     
>     System.out.println("this".matches(R"\w\w\w\w"));
>     System.out.println("this".matches("\-\w\w\w\w"));
> 
>     String html = R"""
>                        <html>
>                            <body>
>                                <p>Hello World.</p>
>                            </body>
>                        </html>
>                   """.align();
>     String html = """\-
>                        <html>
>                            <body>
>                                <p>Hello World.</p>
>                            </body>
>                        </html>
>                   """.align();
> 
> 
>     String nested = """
>                         String EXAMPLE_TEST = "This is my small example "
>                                 + "string which I'm going to "
>                                 + "use for pattern matching.";
>                     """ +
>                     R"""
>                         System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>                     """;
>     String nested = """
>                         String EXAMPLE_TEST = "This is my small example "
>                                 + "string which I'm going to "
>                                 + "use for pattern matching.";
>                         \-
>                         System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
>                         \+
>                     """;
> 
> Hopefully, this is a good starting point for discussion. As before, I'm pragmatic about which direction we go, so feel free to comment.
> 
> Cheers,
> 
> -- Jim
> 


More information about the amber-spec-experts mailing list