[raw-string] indentation stripping

John Rose john.r.rose at oracle.com
Wed May 2 05:31:59 UTC 2018


On May 1, 2018, at 2:19 PM, Guy Steele <guy.steele at oracle.com> wrote:
> 
> the convention that if the last line consists entirely of whitespace and does not
> end in a newline, then it should be stripped  _and furthermore the exact same
> amount of whitespace should be stripped from all other lines in the literal_

Seconded.  (And see discussion of case y in my earlier note to amber-dev,
where the final line is the control line.  Relevant part copied below.)

I keep coming back to the idea that the final line of the quote is the
best place to control indentation stripping.

Here's a rule we could make:  If the trailing line of the literal is blank
(except for indentation) then it is treated as part of the payload delimiter.
In that case, that whitespace must be uniformly present as leading
indentation on all other lines, which is also stripped from every line
of the quote body.  The leading newline (if any) is also stripped.

   String y = `
..___line one
..line fifty-two
..___line ninety-nine
::`;

The payload starts with "___line one" and ends with "___line ninety-nine".

(Here underbar _ is a non-stripped space and colon : is the controlling
stripped space, while periods are the non-controlling copies of the stripped
space.)
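
For concreteness, here is a minimal Java sketch of that rule. It assumes the
literal is well formed: it opens with a newline, and its trailing line is pure
indentation that every other line repeats. Since raw string literals are only
a proposal, the text between the backticks is passed in as an ordinary string.
The class and method names are made up for illustration, and the result is a
list of lines so that the sketch takes no position on a final newline.

```java
import java.util.ArrayList;
import java.util.List;

final class TrailingLineExdent {
    /**
     * Returns the payload lines of a well-formed literal body.
     * Precondition (not checked here): the body starts with a newline and
     * every later line starts with the whitespace found on the trailing line.
     */
    static List<String> payloadLines(String body) {
        // Limit -1 keeps empty leading/trailing pieces, so the blank first
        // line and the trailing control line both show up in the array.
        String[] lines = body.split("\n", -1);
        String indent = lines[lines.length - 1];   // the controlling line

        List<String> payload = new ArrayList<>();
        // Skip the leading line (index 0) and the trailing control line.
        for (int i = 1; i < lines.length - 1; i++) {
            payload.add(lines[i].substring(indent.length()));
        }
        return payload;
    }
}
```

Applied to the body of y above, with two stripped spaces and three kept
spaces, the sketch yields the three lines "___line one", "line fifty-two",
and "___line ninety-nine".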

By declaring that the final line gets stripped, the literal's payload
is fully and exactly contained in the displayed source code lines between
the two stripped lines of the literal.  That is not possible unless the last
line is stripped as well as the first.

More:  *We can specify that it is an error if the identical leading
whitespace is not present on every payload line and also the stripped
trailing line.*  This means there are no invisible surprises:  What you
see at the end of the string is the same as everywhere else.  Previous
versions of the stripping rule distribute the responsibility across all
the lines, but make it difficult or impossible to find the line with the
shortest indent, since (a) it might be in the middle of the literal,
or (b) it might even display the same as other lines with a different
combination of spaces and tabs.  By contrast, making the trailing
line uniquely responsible for controlling the stripping removes
situation (a) and requiring other lines to have the same leading
space substantially removes situation (b).

   String y_err = `
..___line one
_line fifty-two
..___line ninety-nine
::`;  // error:  unaligned indent before "line fifty-two"
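
A sketch of how such a check might look, again in Java with the body passed as
an ordinary string. It assumes the trailing line is all whitespace and
compares prefixes character by character, so a tab never matches a run of
spaces even when the two display alike. The names are hypothetical.

```java
final class IndentCheck {
    /**
     * Returns the 1-based number of the first payload line that does not
     * begin with the trailing line's whitespace, or -1 if all lines agree.
     * Assumes the trailing line consists only of whitespace.
     */
    static int firstMisalignedLine(String body) {
        String[] lines = body.split("\n", -1);
        String indent = lines[lines.length - 1];   // the controlling line

        // Index 0 is the (blank) leading line, so index i is also the
        // 1-based payload line number.
        for (int i = 1; i < lines.length - 1; i++) {
            if (!lines[i].startsWith(indent)) {
                return i;
            }
        }
        return -1;
    }
}
```

For y_err the sketch reports line 2. Read literally, the rule also rejects an
empty line in the middle of the payload, since it lacks the prefix.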

I think we should do this.  It would make it a little (a *very* little)
harder to correctly write indented 

What if the trailing line has non-space characters?  Fine; don't
exdent that literal (that's option A).  Or (option B) exdent by all
the leading whitespace characters, and tack on the remaining
part of the final line to the payload; that gives a hook for ending
a multi-line literal without a newline but keeping the indent feature.
(We have to favor one and disfavor the other, among the two
options of trailing newline and non-trailing newline.)  There's
a third choice (option E) to reserve that condition for future use.
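
To make option B concrete, here is a rough Java sketch under my reading of it:
the leading whitespace of the trailing line is the exdent, and whatever
follows that whitespace becomes the last payload line, with no newline after
it. The class name is invented, the body is passed as an ordinary string, and
treating a missing leading newline as "leave it raw" is an assumption.

```java
final class OptionBExdent {
    static String exdent(String body) {
        String[] lines = body.split("\n", -1);
        // Assumption: without a leading newline the literal stays raw.
        if (lines.length < 2 || !lines[0].isEmpty()) {
            return body;
        }
        // The exdent is the run of leading whitespace on the trailing line.
        String last = lines[lines.length - 1];
        int n = 0;
        while (n < last.length() && Character.isWhitespace(last.charAt(n))) {
            n++;
        }
        String indent = last.substring(0, n);

        StringBuilder payload = new StringBuilder();
        // Unlike option A, the trailing line is part of the payload here.
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].startsWith(indent)) {
                throw new IllegalArgumentException("unaligned indent on line " + i);
            }
            if (i > 1) {
                payload.append('\n');
            }
            payload.append(lines[i], indent.length(), lines[i].length());
        }
        return payload.toString();
    }
}
```

With an all-blank trailing line the remainder is empty, so the result ends in
a newline; with "line 100" on the trailing line it does not. That is the hook
mentioned above.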

FWIW I like option A as the simplest back-off from fancy stuff:

   String y_A = `
_____line one
__line fifty-two
_____line ninety-nine
__line 100`;  // => really raw no indent stripping

Rationale:  The trailing line controls exdenting, but *only if*
it is all whitespace:  All the exdent and nothing else.

What about the leading line?  Should it have its indent stripped?
No, because that doesn't help make clean indented rectangles of
source code; stripping that space would be pure puzzler with no upside.

In fact, any non-empty first line is *not* going to align with the rest of
the lines, if indentation is in play.  Therefore, we have similar options
as dealing with non-blanks in the trailing line:  Option A1 is to turn off
exdenting altogether if the first line is non-empty (that's S. Colebourne's
proposal too I think).  Option B1 is to keep the first line as-is, even though
it won't align with the rest of the rectangle, and exdent the rest.
Option E1 is to disallow a non-empty leading line.  For completeness,
option E1A is to disallow a leading line *if it begins with whitespace*,
but if it begins with non-whitespace turn off exdenting.  (Note that
under these rules if the trailing line begins with non-whitespace
exdenting is also turned off.)

I think B1 is bad:  It breaks up the rectangle.  I'd like to say that we
don't ever break up rectangles; if a proper text rectangle can't be
formed in the source code, then exdenting is turned off.  No partial
exdents.  I guess A1 is consistent with the previous A, but so is E1A.

   String y_E1A = `__spacey
..___line one
..line fifty-two
..___line ninety-nine
::line 100`;  // => error: unaligned indent before "spacey"

Bottom line suggestions:

1. Control indent/exdent string by defining it precisely as the *trailing* line.
2. Omit that trailing line (if it is all-blank), because it is pure control, not payload.
3. If the trailing line has non-blanks, it's not indent control so don't strip or omit anything.
   (B: Or split such a trailing line into leading blanks for indent control and payload.)
4. If stripping, require that *every* payload line without exception have the same prefix.
5. If the leading line is not empty, don't strip anything.  (Rectangles wouldn't align anyway.)
6. Conversely, if stripping, omit the leading line:  It can't contribute anything to a rectangle.
7. Make some edge conditions errors (as in 4) and others "do not strip" cases (as in 3, 5).

Net model is you are either raw-means-raw or raw-is-a-rectangle.  The latter mode
is the only way lines are omitted or left-indents are stripped.  To get into the latter
mode, you have to have a well-formed rectangle with no oddities.  If there's an oddity,
you get an error (if it would be hard to read) or you back off to raw-means-raw.
A multi-line string that doesn't have a leading newline is raw-means-raw, no exceptions.
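
The two-mode model, following suggestions 1 through 7 with option A, could be
sketched in Java roughly as follows. This is an illustration, not a proposed
implementation: the names are invented, the body is passed as an ordinary
string, and keeping the newline after the last payload line is an assumption
that the suggestions above do not settle.

```java
final class RectangleExdent {
    static String process(String body) {
        String[] lines = body.split("\n", -1);

        // (5) Leading line not empty (or no newline at all): raw-means-raw.
        if (lines.length < 2 || !lines[0].isEmpty()) {
            return body;
        }
        // (1) The trailing line is the indent control string...
        String indent = lines[lines.length - 1];
        // (3) ...unless it has non-blanks, in which case raw-means-raw.
        if (!isBlank(indent)) {
            return body;
        }
        StringBuilder payload = new StringBuilder();
        // (2), (6) The leading and trailing lines are omitted from the payload.
        for (int i = 1; i < lines.length - 1; i++) {
            // (4) Every payload line must carry the identical prefix.
            if (!lines[i].startsWith(indent)) {
                throw new IllegalArgumentException("unaligned indent on line " + i);
            }
            // Assumption: each payload line keeps its line terminator.
            payload.append(lines[i], indent.length(), lines[i].length()).append('\n');
        }
        return payload.toString();
    }

    private static boolean isBlank(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch y becomes a rectangle, y_A comes back untouched, and y_err
is rejected with an exception standing in for the compile-time error.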

One downside to putting all the weight on the trailing line:  You don't get all of
Kevin's style choices.  You have to indent the trailing quote the amount you expect
to have stripped.  But on balance this is IMO a feature not a bug:  The exdent
level is defined in one unique place.

— John

P.S. And for the record, here's my errant message to amber-dev:

From: John Rose <john.r.rose at oracle.com>
Subject: Re: Raw String Literals (RSL) - indent stripping of multi-line strings
Date: April 23, 2018 at 12:20:02 PM PDT
To: Jim Laskey <james.laskey at oracle.com>
Cc: amber-dev <amber-dev at openjdk.java.net>

> 
>    - Should trailing whitespace be stripped?

As with the "all-indented" case above, trailing space should be
stripped only if there is a way to opt out of stripping.  I think the
trimMarkers API is the way to cover this use case, since it is
rather specialized.

>    - Should the first or last line be removed if blank?

Yes.  In essence, the syntax of a quote sequence includes
a line terminator.  This BTW allows non-periodic quote sequences,
which as a corollary allows leading and trailing quote sequences
to be encoded in the RSL:

var hasLeadingAndTrailingTick = ``
   `I went for a walk in the tall brush and picked up some riders.`
   ``;

Also, the removal of leading and trailing blank lines gives users
some degrees of stylistic freedom that seem to be customary,
along with the indent-stripping.

Here's a new point along these lines, if I may be so bold:

If we are sticking non-payload stylistic inputs into RSLs,
we should consider opening up a reservation for future use,
in the form of RSL configurations which are declared to be
illegal.  We could declare that some obviously pathological
subset of near-misses to indent-stripped RSLs is illegal,
and reserved for future extension.

On the other hand, we are trying very hard to accept every
RSL the user could randomly type in, which is incompatible
with reserving a set of constructs for future use.  This isn't
logically necessary in the style-control use cases; we
can simply declare that some style-control is just illegal,
if we think there's a chance of using that coding space
in the future.

By obviously pathological I mean something like one or
all of these:

   String x = `_
..line one
..line two
..`;

   String y = `
..___line one
..line fifty-two
..___line ninety-nine
..`;

   String z = `
..line one
..line two
..___`;

(Here underbar _ is a non-stripped space.)

In case x, there is a whitespace character on the non-determining blank first line.
Surprisingly, this space doesn't get stripped (under the proposed rules).

In case y, line 52 determines the indent to strip, and this is true even
if it is buried in the middle of 100 lines.  Luckily, in this case, the determining
last line (just before the close-quote) ratifies this choice, so there is a
unique place to look for the stripped indent, without searching the whole string.

In case z, the stripped last line, while a determining line, has extra
whitespace.  This is easy to miss.

I suggest placing a structural constraint on stripped indents, that the
last line, if blank, is stripped, and if stripped, must be of length exactly
zero after the leading indent is trimmed.  That would rule out z and
ameliorate y.

I also suggest ruling out x by requiring that the first line, which is
non-determining, must not have leading whitespace at all.
This doesn't break any of your examples a-f.
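
As an illustration of these two constraints, here is a small Java checker. It
assumes that the stripped indent is the longest run of leading whitespace
shared by every line after the first, which is my reading of the proposed
rules rather than a quotation of them. All names are hypothetical, and the
body is passed as an ordinary string.

```java
final class EdgeCaseRules {
    enum Verdict { OK, LEADING_SPACE_ON_FIRST_LINE, EXTRA_SPACE_ON_LAST_LINE }

    static Verdict check(String body) {
        String[] lines = body.split("\n", -1);
        if (lines.length < 2) {
            return Verdict.OK;                       // single-line literal
        }
        // Case x: the first line must not begin with whitespace.
        String first = lines[0];
        if (!first.isEmpty() && Character.isWhitespace(first.charAt(0))) {
            return Verdict.LEADING_SPACE_ON_FIRST_LINE;
        }
        String last = lines[lines.length - 1];
        if (!leadingWhitespace(last).equals(last)) {
            return Verdict.OK;                       // last line is not blank
        }
        // Assumed rule: indent = whitespace prefix common to all later lines.
        String indent = last;
        for (int i = 1; i < lines.length - 1; i++) {
            indent = commonPrefix(indent, leadingWhitespace(lines[i]));
        }
        // Case z: nothing may remain on the last line after trimming.
        return indent.length() == last.length()
                ? Verdict.OK
                : Verdict.EXTRA_SPACE_ON_LAST_LINE;
    }

    private static String leadingWhitespace(String s) {
        int n = 0;
        while (n < s.length() && Character.isWhitespace(s.charAt(n))) {
            n++;
        }
        return s.substring(0, n);
    }

    private static String commonPrefix(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length() && a.charAt(n) == b.charAt(n)) {
            n++;
        }
        return a.substring(0, n);
    }
}
```

Cases x and z are rejected by this sketch, while case y passes because its
last line is exactly the shared indent.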

Removing cases x and z might remove a class of puzzler about the
significance of leading white spaces near the ends of RSLs.
(Can anyone see a positive use case for them that can't be easily
adjusted to a less pathological form?)

And (getting back to extensions) ruling out x also gives us a tidy
little subspace of RSLs to reserve for future use.  In other words,
an RSL with multiple lines whose leading line begins with a space
can be defined, in future iterations of this feature, to include
envelope information about the RSL, after that space.
Something like this:

   String q = `_{cool RSL header invented by our successors}
..line one
..line two
..`;

This envelope information would *not* be included in the payload,
but would be stripped as if the leading line were purely blank.
It would somehow control the processing of the RSL payload,
and/or the parsing of the rest of the RSL.

So in this future feature, the first line would still not be a determining
line, and would be stripped completely, and the stuff between
braces would be used in some way we can't define at present.

I suppose it could have to do with processing embedded escapes.

   String r = `_{cool RSL header invented by our successors}
..line one
..line two {cool embedded stuff enabled by RSL header}
..`;

But there's no way to say at this moment what such a future syntax
would look like, and that's my point:  For now we can reserve a corner
of the RSL encoding space for futures.

We might never exercise the option, but it seems wise to buy the
option, if it can be bought cheaply as a side effect of restrictions
on pathological indent management.

I didn't raise this earlier though it was on my mind, but as you see
the complexity trade-offs change with built-in indent stripping.
And, obviously, there are other ways to extend RSLs in the future
which may seem better, such as by adding prefixes before the string
quote.  If we don't put constraints on cases x and z above, we still
have options for future extension.

Conversely, even if we are sure we want to make other choices
regarding futures, I think it is a safe move to exclude x and z above.

— John
