Wrapping up the first two courses
John Rose
john.r.rose at oracle.com
Thu Apr 25 21:37:59 UTC 2019
On Apr 25, 2019, at 8:55 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
> A few more questions have been raised:
>
> - Do we do alignment before, or after, escape processing?
> - What about single-line “fat” strings?
> - What is the effect of text on the first line on alignment?
> - What about opt-out?
> - What about \<newline> ?
TL;DR: - Before, and watch out for \u00XX, - Disallow,
- Disallow, - (see next) - Support \LineTerminator as
both explicit layout control and opt-out (no \-).
> Suggested answers:
>
> 1. Escape processing. If alignment is about removing _incidental indentation_, it seems hard to believe that a \t escape is intended to be incidental; this feels like payload, not envelope. Which suggests to me that we should be doing alignment on the escaped string, and then doing escape processing.
I agree with this; I think it is much more intuitive to make sure
that any escaped thing is classified as payload. Basically, if it
doesn't look like whitespace, it won't be treated as the envelope
of the rectangle but rather as payload inside the rectangle.
What could be simpler?
At first I thought this might make implementation and specification
more complex, but actually it makes it simpler. Here's why: If
you treat rectangle extraction as a process of grabbing a bunch
of escape-sequence-laden payload, you can treat the expansion
of escape sequences as a pure library function, a mapping from
String to String, where LineTerminator shows up as \n (\u000A).
I think Jim may already favor this approach?
Making a clean separation between rectangle extraction (first)
and escape sequence expansion (second) may also clarify the
opt-out question; see below.
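A minimal sketch of that two-phase ordering, under loud assumptions: the class and method names are hypothetical, only a handful of escapes are handled, and lines are assumed to be already normalized to \n. It is meant to show why a \t written as an escape survives extraction, not to describe any actual implementation.

```java
final class FatStringPipeline {
    // Phase 1: strip the common leading whitespace from the raw lines.
    // Escapes are still unexpanded text here, so a written \t is payload.
    static String extractRectangle(String raw) {
        String[] lines = raw.split("\n", -1);
        int indent = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue; // blank lines do not set the margin
            indent = Math.min(indent, leadingWhitespace(line));
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            if (!lines[i].trim().isEmpty()) out.append(lines[i].substring(indent));
        }
        return out.toString();
    }

    static int leadingWhitespace(String line) {
        int n = 0;
        while (n < line.length() && (line.charAt(n) == ' ' || line.charAt(n) == '\t')) n++;
        return n;
    }

    // Phase 2: a pure String-to-String expansion of a few C-like escapes.
    // Unknown escapes are copied through unchanged (an assumption of this sketch).
    static String expandEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 == s.length()) { out.append(c); continue; }
            char e = s.charAt(++i);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case '\\': out.append('\\'); break;
                case '"': out.append('"'); break;
                default: out.append('\\').append(e);
            }
        }
        return out.toString();
    }

    // Extraction first, escape expansion second.
    static String process(String raw) {
        return expandEscapes(extractRectangle(raw));
    }

    public static void main(String[] args) {
        String raw = "    \\tname\n      value";
        System.out.println(process(raw));
    }
}
```

With the opposite order, the expanded tab would be counted as part of the margin and stripped along with the spaces, which is the surprise the ordering is meant to avoid.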
One confounding factor we've hesitated to touch is the status
of \uXXXX escapes, which look the same as \OOO escapes to
most users but are completely different in order of processing.
We could make our lives simpler with respect to \uXXXX escapes
if we were to modify the rules for them inside of fat strings,
so that (somehow) they were always interpreted as payload,
and not as envelope. (We can't modify the rules inside of
plain strings, sadly.)
The JLS warns about \uXXXX escapes aliasing to surprising
syntax characters, in 3.3 (\u005c = \), 3.10.4 (\u0027 = '),
and 3.10.5 (\u0022 = "). The net result is that you can
obfuscate your Java program horribly if you use any of those
unicode escapes. With the rectangle extraction feature of
fat strings, the list grows to include \u0020 and other
whitespace.
As a matter of style programmers should scrupulously
avoid unicode escapes for lexically significant code points.
(Some puzzlers: What role can the unicode escape \u000A
play in Java program text today? Hint: It's not a LineTerminator.
What could it mean in a multi-line string? Same questions about
\u000D? How should those characters interact with rectangle
extraction?)
At this point we could consider going farther, and make
a mechanically checked guarantee against puzzlers
in fat strings. I'm not sure about this, but I want to
put a proposal out there FTR:
Limitation on \uXXXX escapes: Inside of fat strings, any
unicode escape sequence (which is necessarily of the form
\u*XXXX, a backslash and repeated u followed by four hex digits)
is forbidden to specify a hexadecimal number in the range 0000 to
001F inclusive. (Reduced limitation: U-escapes must not
alias to characters significant to the envelope, which are
those in "\"\\ \t", quote+backslash+space+tab.)
Effect: All remaining \uXXXX escapes are safe to retain
during rectangle extraction and can be interpreted along
with other string escapes in the same post-pass. In
particular, a String library method can handle such escapes
along with other C-like string escapes. Processing \u
at the same time and in the same method as other escapes
seems like a win to me, independently of the exclusion
of puzzlers. This extra win made me speak up, in fact.
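As a rough illustration of how the stricter limitation could be checked mechanically, here is a sketch that scans the raw body of a fat string for a unicode escape in the range 0000 to 001F. The class and method names are hypothetical. It applies the JLS 3.3 rule that a backslash can begin a unicode escape only when preceded by an even number of backslashes, and it ignores corner cases such as a backslash that is itself produced by a unicode escape.

```java
final class UnicodeEscapeCheck {
    // Returns the index of the first forbidden unicode escape, or -1 if none.
    static int firstForbidden(String raw) {
        int backslashes = 0; // contiguous backslashes seen just before index i
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) != '\\') { backslashes = 0; continue; }
            boolean eligible = backslashes % 2 == 0;
            backslashes++;
            if (!eligible) continue;
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') j++; // one or more u's
            if (j == i + 1) continue; // not a unicode escape at all
            int value = hexValue(raw, j);
            if (value >= 0 && value <= 0x001F) return i;
        }
        return -1;
    }

    // Parses exactly four hex digits starting at 'from'; returns -1 if malformed.
    static int hexValue(String s, int from) {
        if (from + 4 > s.length()) return -1;
        int v = 0;
        for (int k = from; k < from + 4; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) return -1;
            v = v * 16 + d;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(firstForbidden("tab\\u0009here"));
    }
}
```

The reduced limitation would swap the range test for a membership test against quote, backslash, space, and tab.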
Also, a coordinated limitation on fat string delimiters:
The opening triple-quote of a fat string must not be
derived from a unicode escape, which would have been
of the form \u0022, \uu0022, etc.
> For 2/3, here’s a radical suggestion. Our theory is, a “fat” string is one that is co-mingled with the indentation of the surrounding code, and one which we usually wish the compiler to disentangle for us. By this interpretation, fat single-line strings make no sense, so let’s ban them, and similarly, text on the first line similarly makes little sense, so let’s ban that too. In other words, fat strings (with the possible exception of the trailing delimiter) must exist within a “Kevin Rectangle.”
Yep. Put rectangle extraction front and center.
Alternative theory for 2: Allow single-line fat strings.
Perform analogous "line extraction" on them, by removing
all unescaped whitespace after the open quote and before
the close quote. This is like rectangle extraction, but in
one dimension.
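A small sketch of what such one-dimensional "line extraction" might look like, assuming it runs on the raw body before escape processing (names are hypothetical). Because escapes are still plain text at this point, whitespace written as an escape is never trimmed.

```java
final class LineExtraction {
    // Drops literal spaces and tabs at both ends of a single-line body.
    static String extractLine(String rawBody) {
        int start = 0;
        int end = rawBody.length();
        while (start < end && isLayoutBlank(rawBody.charAt(start))) start++;
        while (end > start && isLayoutBlank(rawBody.charAt(end - 1))) end--;
        return rawBody.substring(start, end);
    }

    private static boolean isLayoutBlank(char c) {
        return c == ' ' || c == '\t';
    }

    public static void main(String[] args) {
        System.out.println("[" + extractLine("   hello world  ") + "]");
    }
}
```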
> For 4 (opt out), I think it is OK to allow a self-stripping escape on the first line (e.g., \-), which expands to nothing, but suppresses stripping. This effectively becomes a “here doc”.
I agree with the desire for a clear opt-out.
Here's a question we should answer: When a user opts
out of 2D layout with rectangle extraction, what should
we call the alternative? Surely it comes with more intensive
control from the user. Maybe that leads to odder-looking
code, but maybe also it leads to code which the user has
"beautified" in some way apart from rectangle extraction.
I'd like to think of this opt-out scenario not just negatively
("don't auto-strip that white space") but positively ("I want
to organize the form of my program more freely"). Not
sure if that's possible, but read on.
Anyway, I think an ad hoc escape \- at the front of the string
is not such a clear win, and if we tweak the rules we can
gain more than just a single dead-end quasi-escape.
> For 5, since \<newline> is not valid today, we don’t have to decide this now, we can add it later if desired.
(This is more accurately called \LineTerminator, since
escapes are processed after <newline> has been tokenized.)
It's true we can defer this, but let's look at combining it with
the opt-out feature and see if we like what we get.
Thesis: The opt-out feature, which asks for all leading
blanks (and bracketing newlines) to be retained, is a special
case of intensified user control over 2D program layout.
Such intensified user control over 2D layout very often
(in languages we all know, like makefiles and shell)
includes breaking of long lines, using escape sequences
or other special control over the envelope (as opposed
to payload). The user is taking more control over a
complex payload, not just giving up on the rectangle rule.
Proposal: Allow newlines to be marked (somehow) as
non-payload, so users can have more intense control over
program layout without "leaking" newlines used for layout
into their payloads (string body characters).
If we frame this feature as an escape sequence, which
marks newlines for elision, then it can be rolled into the
escape processing pass. If (see above) escape processing
comes *after* rectangle extraction, then newline control
could potentially co-exist with rectangle extraction, depending
on the presence or absence of an opt-out condition. I think
that could be a bonus, although that could be misused also.
There is a range of possible rules for the opt-out from
rectangle extraction, all with slightly different outcomes:
- Opt out if the string body contains \LineTerminator anywhere.
- Opt out if the string body contains \n or \r anywhere.
- Opt out if the string body contains \n or \r or \LineTerminator
anywhere.
- Any of the previous rules, applied only between the open
triple-quote and first LineTerminator.
- Opt out if any visible character (not whitespace) occurs
between the open triple-quote and first LineTerminator.
- Allow any single escape sequence, possibly accompanied by
whitespace, between the open triple-quote and first
LineTerminator, and opt out if that occurs.
(As you can see, the opt-out rule can be more or less specific,
and can either co-exist with arbitrary "stuff" appearing after
the open-quote, or with restrictions that allow only an opt-out
to occur in the privileged position.)
Specific proposal: The sequence \ LineTerminator followed
by any amount of unescaped spaces and tabs is elided.
This happens during escape processing, which means after
rectangle extraction.
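The elision rule above can be sketched as a single pass over the already-extracted body. This is only an illustration with hypothetical names. It assumes any other backslash pair is copied through untouched for a later escape pass, so that an escaped backslash followed by a newline is not mistaken for a continuation.

```java
final class LineContinuation {
    // Removes each backslash + LineTerminator, plus the spaces and tabs after it.
    static String elide(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '\n' || next == '\r') {
                    i += 2;
                    // Treat CR LF as one LineTerminator.
                    if (next == '\r' && i < s.length() && s.charAt(i) == '\n') i++;
                    while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t')) i++;
                    continue;
                }
                // Some other escape: keep both characters for a later pass.
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(elide("one \\\n    two"));
    }
}
```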
Rectangle extraction is inhibited (opted out) by the presence
of any escape sequence between the open triple-quote and
the first following LineTerminator.
Optionally: Other than whitespace and escape sequences,
nothing is allowed between the open triple-quote and the
first following LineTerminator.
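One way to read the opt-out rule is as a scan of the text between the open triple-quote and the first LineTerminator: if a backslash appears there, some escape sequence is present and rectangle extraction is skipped. The sketch below assumes exactly that reading (names are hypothetical) and does not enforce the optional restriction on other visible characters. Note that a backslash sitting directly before the first LineTerminator also counts, so \LineTerminator in that position doubles as the opt-out.

```java
final class OptOutRule {
    // 'body' is everything after the opening triple-quote.
    static boolean optsOut(String body) {
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\n' || c == '\r') return false; // reached the first LineTerminator
            if (c == '\\') return true;               // an escape starts on the first line
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(optsOut("\\\n    kept as written"));
    }
}
```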
If rectangle extraction occurs, and escape processing
encounters \ LineTerminator sequences, then additional
leading whitespace is stripped. The escape sequence is
ignorant of whether any leading whitespace (or none)
was removed during rectangle extraction (if it occurred).
Such two-step removal seems complicated but is easy
to justify: The rectangle extraction isolates a visible
block of source code from the containing context, and
then the escape sequences do their work. If rectangle
extraction is opted out of, the escape sequences would
do the same work anyway.
I think a set of decisions like this would hang together nicely
and give users very good control over the layout of their
programs. The resulting programs would (barring intentional
obfuscation) read clearly, in both rectangular layouts and
more ad hoc free-flowing formats.