Enhanced Java String Literals round 2

Reinier Zwitserloot reinier at zwitserloot.com
Fri Jan 4 15:28:27 UTC 2019


This is feedback on James Laskey's ideas[1], which have been posted to
amber-spec-experts[2].

> If we accept the bold path of multi-line discussion above, then
> alternate delimiter is out. This leaves prefixing as the best option to
> bless a string literal with raw-ness.

It's an interesting solution. At first I was enamoured by its elegance,
but, thinking on it some more, perhaps this is a turn towards the same
mistake as the first attempt at raw string literals: trying to cover every
use case instead of focussing on the ones that are actually relevant. To
illustrate, I bring an anecdote.

In python, R"foo" is a raw-string. And yet, at least half the time someone
explains it to me, they call it a 'regex string'. I generally do not ask if
they are merely filling in the most common use case for raw strings to make
it easier to explain to me, or if they are unaware that the R actually
stands for 'raw' and not 'regex'.

Here are the current reasons why java programmers desire raw strings, in
order of importance, as informed by common sense, feedback received when I
talk about these proposals to others, looking at my own code base, and in
no small part, that python anecdote:

1. As Brian said[3], we know our audience: Multi-line strings to make
inclusion of structured XML, JSON, etc a lot easier than it is now.

2. Regular expressions.

3. Avoid the need to escape a single double-quote symbol, generally in
combination with #1, because XML, JSON, etc, tend to contain lots of these
(but rarely 3 in a row, fortunately).

4. Windows file paths.

5. Other situations where backslashes come up and it'd be annoying if java
treated them as escapes.

6. The need to be able to paste just about _ANYTHING_ with the guarantee
that the sequence as pasted into the java source file shows up,
byte-for-byte, in the resulting string at runtime.

With a very big gap between #3 and #4. I'm not sure about how to order 1-3,
but I am quite sure that 1-3 cover the vast majority of use cases, and 4-6
are very distant also-ran arguments.
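
To make use cases 1 to 4 concrete, here is a minimal sketch of what they
cost in java as it stands today, with ordinary cooked string literals. The
class name, method names and sample values are made up for illustration.

```java
import java.util.regex.Pattern;

public class EscapeBurdenDemo {
    // Use cases 1 and 3: a multi-line JSON snippet needs explicit newlines,
    // concatenation, and an escape for every double quote.
    static String json() {
        return "{\n"
             + "  \"name\": \"amber\",\n"
             + "  \"raw\": true\n"
             + "}";
    }

    // Use case 2: the regexp \d+\.\d+ needs every backslash doubled.
    static Pattern decimal() {
        return Pattern.compile("\\d+\\.\\d+");
    }

    // Use case 4: a windows path needs every separator doubled.
    static String windowsPath() {
        return "C:\\Users\\joe\\notes.txt";
    }

    public static void main(String[] args) {
        System.out.println(json());
        System.out.println(decimal().pattern());
        System.out.println(windowsPath());
    }
}
```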

Let's go through them:

Use cases 1 and 3 do not require raw strings; the triple-quote aspect of the
proposal fully caters to these. Use case 6 is _NOT_ going to be covered by
_ANY_ raw string proposal: The first attempt at raw strings tried and got
mired down with variable-length delimiters and such. Also, James Laskey
pronounced a strong preference for having the compiler cook the newlines in
multiline strings, and once you go down that path (which, to be clear, I
agree with!) you have to accept that you just can't fully cater to this
use-case; you're going to have to require the java programmer to be aware
that they need to do a tiny bit of massaging to their string literal if
they care about exact byte-for-byte copying.

That leaves 2, 4, and 5. The point of my python anecdote is that once we're
down to just those, 95 out of 100 times it's a regex, 4 out of 100 times
it's a windows path, and that leaves perhaps one time in a hundred where
it's something else.

There is absolutely no reason to want to turn cooked mode back on halfway
through a string literal for regular expressions; these just about never
occur as part of a larger string. Windows paths, in the vast majority of
cases, are meant to be used as paths by code, so they do not show up as
part of a larger string either. For that to happen, the path would have to
be part of a larger block of text, say, the content of a tooltip for a file
entry user interface widget which includes a windows-style path for
explanatory purposes. Honestly, how often does that come up? It doesn't even
make sense unless it's an app written in java and targeted only at windows
users. If it were targeted at multiple OSes, that tooltip string would be
constructed from parts, and the example path would be obtained via a
j.n.f.Path instance.

And it gets worse: If you are making regexp literals, the actual sequence
of [backslash, plus] comes up! That's a real thing: If you are trying to
write a regexp that matches a literal plus, that exact sequence of
backslash-plus is precisely what you need to put in your regexp string.
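
For reference, this is how that looks in java today, where the literal is
cooked and the backslash itself has to be escaped. The class and method
names are only illustrative. The point is that the regexp engine must
receive backslash-plus, which is exactly the sequence the \+ toggle would
claim for itself.

```java
import java.util.regex.Pattern;

public class LiteralPlusDemo {
    // The regexp we want the engine to see is: \d+\+\d+
    // In a cooked literal, each backslash has to be written twice.
    static final Pattern SUM = Pattern.compile("\\d+\\+\\d+");

    static boolean isSum(String input) {
        return SUM.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(SUM.pattern());  // the regexp as the engine receives it
        System.out.println(isSum("12+30"));
        System.out.println(isSum("1230"));
    }
}
```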

Given all that, if the \- \+ proposal is what we go with, I would bet you a
significant sum that we end up in a future where a majority of java
programmers, if they are aware of the notion of string rawness at all, are
aware that you should start any string containing a regular expression with
backslash-dash, and that's the full extent of their knowledge. No awareness
of \+, nor of the notion that this is toggling the rawness state of the
parser. Brian Goetz's argument that any syntax we choose here will be
familiar in the future merely by virtue of the fact that it'll be
official java, and java is very popular, does _not_ work here: That
argument requires the language feature to come up more than once a
decade for Average Joe Programmer, and the ability to switch rawness
parsing is not going to come up more than once a decade for poor Joe. To
back this up with an example, this is legal java:

public int returnsAnIntArray() [] {return new int[0];}

If I take that snippet and walk the floor at javaone or devoxx showing it
to people, maybe 1 in 10 of the people I speak to will know that. It's not
familiar, even though it is legal java.
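
For anyone who wants to check this, the method compiles as-is once wrapped
in a class. A minimal sketch, with a made-up class name and the method made
static so it can be called from main:

```java
public class ArrayReturnDemo {
    // The brackets after the parameter list are part of the return type:
    // this method returns int[], not int.
    public static int returnsAnIntArray() [] { return new int[0]; }

    public static void main(String[] args) {
        int[] result = returnsAnIntArray();
        System.out.println(result.length);
    }
}
```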

The vast majority of the time that \+ has any effect at all, it is as an
annoyance or a bug: Someone is writing a regular expression in java,
therefore they start the string with \-, and then they type their regexp as
normal. And every time they wish to match an actual literal plus character,
they just type \+ because that's what you do with regular expressions,
except that won't work, as that'll turn raw-ness back off. Either they know
that and escape it (which is an annoyance) or they don't and they spend
some minutes bughunting their regex.

I'm not actually proposing the following, but I present it as a way string
literals could work, and I would not be at all surprised if it ended up
being more pragmatically useful than this rawness switching concept:

1. triple-quotes as per brian/james's proposals.
2. raw strings as a concept don't exist at all.
3. The syntactic structure R"stuff here" is officially known as the 'regexp
literal syntax'... and the type of such an expression is j.u.r.Pattern and
not j.l.String. If the content in between the quotes is not a valid regexp
(which, given the cartoon-swearing nature of regular expressions is
actually difficult to pull off, but you can do it if you have a mismatched
number of unescaped parentheses for example), it's a compiler error. In
this future world, IDEs paint the literal knowing that the content is a
regexp and will even offer helpful popups to let you test regexes.
Confusion about "".replaceAll vs. "".replace disappears (they both replace
all occurrences! Most java programmers don't know that and given those
names who can blame them), as the replaceAll(String, String) method will be
obsoleted in favour of a newly created replaceAll(Pattern, String).
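
The replace vs. replaceAll confusion, and the fact that a malformed regexp
is only caught at runtime today, can be seen with current java. This is
just a sketch of the existing behaviour; the class and method names are
made up, and the hypothetical R"..." syntax itself is not shown because it
is not valid java.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ReplaceDemo {
    // replace: the first argument is plain text; every occurrence is replaced.
    static String viaReplace(String s) {
        return s.replace(".", "-");
    }

    // replaceAll: the first argument is a regexp, where '.' matches any character.
    static String viaReplaceAll(String s) {
        return s.replaceAll(".", "-");
    }

    // Today a malformed regexp is only detected when Pattern.compile runs.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(viaReplace("a.b.c"));
        System.out.println(viaReplaceAll("a.b.c"));
        System.out.println(compiles("(a|b"));  // unclosed group
    }
}
```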

 --Reinier Zwitserloot

[1] http://cr.openjdk.java.net/~jlaskey/Strings/RTL2/index.html
[2] https://mail.openjdk.java.net/pipermail/amber-spec-experts/2019-January/000933.html
[3] https://mail.openjdk.java.net/pipermail/amber-spec-experts/2019-January/000931.html


More information about the amber-dev mailing list