Raw String Literals (RSL) - indent stripping of multi-line strings

Kevin Bourrillion kevinb at google.com
Fri Apr 27 00:35:44 UTC 2018


I've actually been thinking about this, digging through more code examples,
discussing with a widening circle of colleagues, etc. all week.



On Mon, Apr 23, 2018 at 10:04 AM, Jim Laskey <james.laskey at oracle.com>
wrote:

> Let me try and summarize the discussion related to RSL "indent stripping"
> of multi-line strings.
>
> - There are at least two distinct use case groups for RSL; single line raw
> strings and multi-line strings (raw or otherwise).
>
> - A multi-line RSL is indicated by the presence of at least one new line
> in the body of the RSL.
>
> - There is an assumption that uses of multi-line strings will be dominated
> by code snippets.
>

Can confirm that in our codebase it is mostly snippets of
xml/html/json/shell/proto/etc.; but there is also appreciable occurrence
of: console output, and "golden" data for a unit test. (These latter are a
somewhat smaller case, but crucial use cases because they are the ones that
really must be formatted in some very exact way.)


> - There may be some circumstances where here-document style (bodies aligned
> to left margin) is needed/chosen.
>
> - Most developers will likely choose to indent/format the body of their
> RSLs to align with neighbouring Java code.
>

I am not sure the first style would ever be *needed*, but, chosen, sure. It
seems like a decision between *trivially* easy paste-in-and-out vs.
still-pretty-easy paste-in-and-out, weighed against the readability impact
of punching a hole through the file's current indentation level.

I believe you are correct that most developers will prefer to indent their
multi-line RSLs, and it is at least *likely* that we will opt to require
this in our style guide.


> - This incidental indentation may add whitespace that the developer does
> not want including in the body of the string.
>
> - Incidental indentation may consist of spaces and tabs, and, not all tabs
> are treated equally when displayed.
>

I think we're approaching a design that will work perfectly for the two
"hygienic" usage modes (1. tab-free, and 2. leading spaces are
*consistently* maximally tabified). For the case where spaces and tabs can
freely mix, I believe a proper solution is impossible - more discussion on
this case separately.


> Samples of multi-line RSL styles (periods represent incidental indentation):
>
>     String a = `line one
> ................line two`;
>
>     String b = `
> ...............line one
> ...............line two
> ...............`;
>
>     String c = `
> ....    line one
> ....    line two
> ....`;
>
>     String d = `
> ........line one
> ........line two
> ....`;
>
>     String e = `
> ........line one
> ........line two
> ........`;
>
>     String f = `
> line one
> line two
> `;
>

One more, similar to b (not that it throws any wrench in the works):

String g =
    `
....line one
....line two
....`;

I want to point out that all of these examples obey what we might call the
"rectangle rule": there is always *some* rectangle that can be drawn such
that the actual data at runtime is exactly what you see in that region.
(Actually, (a) somewhat violates that because the closing delimiter and
following code might be in that rectangle. I very much want to remove the
need for (a) by eating a trailing newline, but if not that, well, at least
you can visualize a rectangle with a strip sliced out of the last line.)

I think supporting the rectangle rule would be widely appreciated, and this
argues why we must take at *least* one further step away from "raw means
raw"; we must eat a leading newline if present, always, automatically.
Otherwise, since users rarely *intend* a leading newline in their string
constant, "raw means raw" leads to this:

    void displayHeader() {
      System.out.println(
`+--------------+----------+
| Name         | Date     |
+--------------+----------+`);
    }

That misalignment between lines 1 and 2 is what a violation of the
rectangle rule looks like, and it gets worse as more backticks are
required. The only remedy would be to *always* use some amount of
indentation, and stripIndent(). That would seem a sad outcome for the
feature.


> To avoid imposing a style on developers by way of the JLS, we opted to
> define RSLs as raw, allow the developer to tailor their own incidental
> indentation stripping technique and presupply best technique guesses via
> String instance methods.
>
> As an example, the String.stripIndent method was defined to remove the
> incidental indentation using the following rules;
>
> - a determining line is any non-blank line that is not the first line
> - the last line is also a determining line
> - calculate the least amount of leading whitespace used on determining
> lines
> - remove that least amount of leading whitespace from each determining line
> - if the first line is blank remove it
> - if the last line is blank remove it
>
> Two additional rules will be added, based on the e-mail discussion;
>
> - trailing whitespace is not removed (was a side effect of detecting blank
> lines)
> - only remove leading whitespace of a determining line if the line's
> leading whitespace is the same sequence of spaces and tabs used on the
> representative line deemed to have the least amount of leading whitespace
>

Holy moly, I had completely overlooked bullet #2. This whole time I have
been under the impression that stripIndent() cannot yield a result where
every nonblank line retains nonzero indentation. We were studying the 5% of
our use cases that need this behavior and fretting... but they are actually
going to be fine, it appears. Nice.
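
To make the quoted rules concrete, here is a minimal sketch of them in Java.
It is only my reading of the bullets above, not the actual String method: the
class and helper names are hypothetical, only '\n' is treated as a line
terminator, and the "same sequence of spaces and tabs" rule is approximated
by stripping the shortest determining prefix only from lines that start with
exactly that prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the quoted stripIndent rules; not the JDK method.
final class StripIndentSketch {

    // Returns the run of spaces and tabs at the start of the line.
    static String leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
            i++;
        }
        return line.substring(0, i);
    }

    // A line is blank if it consists only of spaces and tabs (or is empty).
    static boolean isBlank(String line) {
        return leadingWhitespace(line).length() == line.length();
    }

    static String stripIndent(String raw) {
        String[] lines = raw.split("\n", -1); // -1 keeps a trailing empty line
        int last = lines.length - 1;

        // Determining lines: non-blank lines other than the first, plus the last line.
        // The representative prefix is the shortest leading whitespace among them.
        String prefix = null;
        for (int i = 1; i <= last; i++) {
            boolean determining = i == last || !isBlank(lines[i]);
            if (!determining) {
                continue;
            }
            String ws = leadingWhitespace(lines[i]);
            if (prefix == null || ws.length() < prefix.length()) {
                prefix = ws;
            }
        }

        List<String> out = new ArrayList<>();
        for (int i = 0; i <= last; i++) {
            String line = lines[i];
            boolean determining = i > 0 && (i == last || !isBlank(line));
            // Strip only when the line begins with the representative prefix.
            if (determining && line.startsWith(prefix)) {
                line = line.substring(prefix.length());
            }
            // Drop a blank first or last line of a multi-line body.
            boolean dropped = last > 0 && (i == 0 || i == last) && isBlank(line);
            if (!dropped) {
                out.add(line); // trailing whitespace is left untouched
            }
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // Sample c: body indented 8 columns, closing delimiter indented 4.
        String c = "\n        line one\n        line two\n    ";
        // Sample e: closing delimiter aligned with the body.
        String e = "\n        line one\n        line two\n        ";
        System.out.println("[" + stripIndent(c) + "]");
        System.out.println("[" + stripIndent(e) + "]");
    }
}
```

Under this reading, sample c keeps four columns of indentation on every line
because the closing delimiter's line sets the minimum, while sample e comes
out flush left. Sample d has the same characters as c, which is why it cannot
also be made to come out flush left.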



> This works for all samples except d. For d we would have to drop the "the
> last line is also a determining line" rule, but then that would break c.
>

(d) is a fairly pleasing style, and frankly, because I misunderstood the
stripIndent behavior, the style I was myself using was a variant of d. I'm
a bit sad to throw it under the bus, but it's nice to rescue the support
for all-indented strings.

It's also nice because (referring back to the "rectangle rule") I think
it's a little easier to visualize exactly where that rectangle is. That is,
at least so long as the user does include the leading and trailing
newlines, and if we outlaw John's "case z". In that event, I think you can
always picture the leftmost edge of the rectangle resting on top of the
closing delimiter, and the line containing the opening delimiter resting on
top of the rectangle. Being able to easily picture that rectangle is a good
thing, I think.



> The possibility of varying the composition of leading whitespace also leads
> to a complication. Hence, the need for something like String.stripMarkers
> where the body of the RSL is framed via marker sequences and leading
> whitespace matters not.
>

AFAIK the mixed-tabs-and-spaces case is the only thing that stripMarkers is
really needed for. And, users who rely on it need to make sure their marker
sequence ends at a tab stop (or that there are never tabs after it) or they
can be surprised anyway.

(But the marker approach is unpleasant enough (for the damage it does to
paste-in, paste-out, and rewrap) that I'm really not sure why they would
choose it over just fixing their issues. Or, if they refuse to fix their
issues, maybe they're motivated enough to write stripMarkers themselves.
Does it pay for itself as a String method? Okay, tangent.)
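
For what it's worth, a user-written version could be quite small. The sketch
below is an assumption on my part, since the thread has not pinned down a
signature for stripMarkers: it handles a single leading marker only (similar
in spirit to Scala's stripMargin), and the class and method names are
hypothetical. Everything up to and including the marker is dropped from each
line that has one, so the composition of the leading whitespace no longer
matters.

```java
// Hypothetical user-written marker stripping; not a proposed JDK signature.
final class StripMarkersSketch {

    // For each line, skips leading spaces and tabs; if the marker follows,
    // removes the whitespace and the marker. Lines without the marker are kept as-is.
    static String stripLeadingMarker(String body, String marker) {
        String[] lines = body.split("\n", -1); // only '\n' is treated as a terminator
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int start = 0;
            while (start < line.length()
                    && (line.charAt(start) == ' ' || line.charAt(start) == '\t')) {
                start++;
            }
            if (line.startsWith(marker, start)) {
                line = line.substring(start + marker.length());
            }
            if (i > 0) {
                out.append('\n');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Leading whitespace mixes tabs and spaces; the marker frames the real content.
        String body = "|line one\n\t  |  line two\n    |line three";
        System.out.println(stripLeadingMarker(body, "|"));
    }
}
```

Note that anything after the marker is preserved verbatim, so a tab following
the marker still renders according to wherever the marker happens to sit.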


> After thinking that we have settled, a survey of a very large code base
> (many 100Mlcs) leads us to wonder if String.stripIndent would be invoked in
> almost every case of multi-line RSL, with a few cases of here-document.
> Note that String.stripIndent does not affect here-document if the close
> backtick is on a newline. If String.stripIndent would almost always be
> called, why not always apply the generic incidental indentation stripping
> at compile time? We’re not looking for a change of plan, just a discussion
> of pros and cons.
>

What I can say so far is that:

* Our style rule has been that continuation lines must be indented at least
+4 from the start line, and Sun/Oracle's since the 90s has been the same but
+8. Nearly everyone has always accepted and appreciated that rule, for tons
of very good reasons, and none of those reasons actually vanish just
because now you have a multi-line string to express. The complication to
pasting is probably too slight in comparison given that it is easy to shift
lines over in any editor. Therefore, the style you refer to as
"here-document" will *likely* be banned in our style guide (you could say it
already is).

* On the other hand, I'm sure our developers would be unhappy with a rule
saying that .stripIndent() must *always* be called. It's just too ugly when
done regularly and on top of the other methods like format() you also need
to use. The outcome will be that every single time I write an RSL I have to
judge how secure I feel that the things I'm passing it to will safely
ignore the stowaway indentation characters. I am still researching what
percentage of the time I think that will be the case. It is certainly
appreciable, so I will get to save the .stripIndent() a fair amount of the
time. That creates some risk of bugs, but the second-order effect is
perhaps *more* sad: We have a formatter-first culture here, using
google-java-format, and we'll have google-java-format happily maintain your
indentation for you if it sees that you are calling stripIndent() or a
similarly indentation-neutralizing method. But if you're not, it will have
to throw up its hands and refuse. Any other formatter would probably have
to do the same; it can't risk changing program behavior. There *should be* no
conflict between source file layout and program behavior; these should be
independent... and we have a way to make them so!


QUESTIONS:
>
> - Should the javac compiler remove incidental indentation at compile time?
>

I may have a more convincing argument in the future, but provisionally,
*yes*, I think we have a solid *pragmatic* case for making stripIndent()
behavior automatic. It is basically never harmful, my evidence suggests it
is usually what you want, it lets annotation fields in on the fun, rescues
the starts-with-backtick case as John points out, and it isolates program
behavior from source code reindentation, which lets tools like
google-java-format and IDEs do their job of laying out your code for you.

I acknowledge that the pragmatic impact is not the only consideration; it
is true that the language feature becomes harder to explain, and the
precise stripping behavior would no longer be just a javadoc click away.
Most other languages seem to have gone with raw-is-raw and sure, they
didn't explode.



> - What is the rule set used?
>
>     - Should the last line be a determining line?
>

I think yes. For one thing, it is quite beautiful that to opt out of
stripping is as simple as kicking the closing tick all the way left.



>     - Should trailing whitespace be stripped?
>

It seems a completely orthogonal concern, so I would say no.

I have already hit a case where a trailing whitespace character was
intended, and to be honest, I'm not actually sure whether that makes me
feel better or worse.  It's mostly nice that this code still gets to use an
RSL, but it's also scary that it's depending on something being there that
is invisible.


>     - Should the first or last line be removed if blank?
>

I believe if the first newline has nothing but whitespace before it, then
it and the preceding whitespace should be removed. For reasons explained
above. (Or just forbid that whitespace as John suggested, I don't feel
strongly either way about that.)

I tend to think it is the right call that if the final newline of the
content has nothing but whitespace after it, then the newline and the
following whitespace should be removed.  That's after that whitespace was
already examined to determine indentation removal.
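
A minimal sketch of just these two edge rules, in Java, assuming spaces and
tabs are the only whitespace of interest and '\n' is the only line
terminator. The class and method names are hypothetical, and the indentation
computation is deliberately left out; as noted, it would have to look at the
final line's whitespace before this step discards it.

```java
// Hypothetical sketch of the leading/trailing newline rules; names are illustrative only.
final class EdgeNewlineSketch {

    // True if the string is empty or contains only spaces and tabs.
    static boolean isOnlySpacesOrTabs(String s) {
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch != ' ' && ch != '\t') {
                return false;
            }
        }
        return true;
    }

    static String trimEdgeLines(String body) {
        // If only whitespace precedes the first newline, drop both.
        int first = body.indexOf('\n');
        if (first >= 0 && isOnlySpacesOrTabs(body.substring(0, first))) {
            body = body.substring(first + 1);
        }
        // If only whitespace follows the final newline, drop both.
        int lastNewline = body.lastIndexOf('\n');
        if (lastNewline >= 0 && isOnlySpacesOrTabs(body.substring(lastNewline + 1))) {
            body = body.substring(0, lastNewline);
        }
        return body;
    }

    public static void main(String[] args) {
        // The header table from earlier, written with the delimiters on their own lines.
        String header = "\n+--------------+----------+\n"
                + "| Name         | Date     |\n"
                + "+--------------+----------+\n";
        System.out.println(trimEdgeLines(header));
    }
}
```

With both rules in place the table can start on the line after the opening
delimiter and still come out as exactly the three lines shown, which is the
rectangle rule at work.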


-- 
Kevin Bourrillion | Java Librarian | Google, Inc. | kevinb at google.com

