[raw-strings] Indentation problem
Brian Goetz
brian.goetz at oracle.com
Mon Feb 5 20:55:24 UTC 2018
OK, let's take a step back. We have identified at least three degrees
of freedom that have been sources of friction with existing string literals:
- Sometimes we don't want traditional escaping (\n, etc);
- Sometimes we don't want unicode escaping (\unnnn);
- Sometimes we want to represent multiple lines of text as a single
String.
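To make the three axes concrete, here is a small sketch of how each one
shows up with traditional Java literals. The class name LiteralFriction
and the sample values are illustrative assumptions, not something taken
from the discussion.

```java
class LiteralFriction {
    // Axis 1: backslashes meant for the regex engine must be doubled,
    // because the compiler consumes one level of escaping first.
    static final String DECIMAL_REGEX = "\\d+\\.\\d+";

    // Axis 2: a Unicode escape is translated before the literal is
    // tokenized, so this is the one-character string "A".
    static final String UNICODE = "\u0041";

    // Axis 3: multi-line text has to be stitched together by hand,
    // with an explicit line terminator on every line.
    static final String HTML =
            "<html>\n" +
            "  <body>hello</body>\n" +
            "</html>\n";

    public static void main(String[] args) {
        System.out.println("3.14".matches(DECIMAL_REGEX));
        System.out.println(UNICODE.length());
        System.out.print(HTML);
    }
}
```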
Traditional strings could be described as (false, false, false) on these
axes; the proposed raw strings are (true, true, true). As a first
evaluation (if these really are the axes), this is encouraging; if
you're going to pick 2 of 2^N prepackaged options, it's often best to
pick the ones with the biggest Hamming distance.
I have a hard time imagining that people really need, for example,
traditional escaping but not unicode escaping, with any frequency. So
offering all 2^N combinations is not likely to carry its weight.
I think what you are suggesting is that it's fine to lump the first two,
but it might have been a premature move to lump them with the third. (A
second question is: are these the only axes we should be concerned with
right now?) So, let's examine that.
We explored allowing double-quoted strings to span lines too; this gives
you a different stacking: { escaping multi-line, raw multi-line }. But
I think the part that's still unexplored is: do we need to explicitly
surface how source lines are combined into strings?
The assumption we've been working off of is: \n has won (this wasn't
true when Java got started). Is this wishful thinking? And if not, can
the library approach serve this purpose here too:
`a long
string`.toPlatformLineEnding()
(which, as has been observed, can be optimized either by compile-time
evaluation or by link-time evaluation using LDC and ConstantDynamic, so
I think we can ignore the "but then I'm doing work at runtime" aspect of
this.)
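A minimal Java sketch of that library approach, under stated
assumptions: toPlatformLineEnding is not an existing String method, so
it is modeled here as a static helper in a hypothetical LineEndings
class, and the input is assumed to use \n as its only line terminator.
The two-argument overload is there only so the behavior can be shown
independently of the machine it runs on.

```java
final class LineEndings {
    private LineEndings() {}

    // Hypothetical stand-in for the proposed String method.
    // Assumes the input uses "\n" as its only line terminator.
    static String toPlatformLineEnding(String s) {
        return toLineEnding(s, System.lineSeparator());
    }

    // Same conversion with an explicit target terminator.
    static String toLineEnding(String s, String terminator) {
        return s.replace("\n", terminator);
    }

    public static void main(String[] args) {
        String text = "a long\nstring";
        String crlf = toLineEnding(text, "\r\n");
        System.out.println(crlf.equals("a long\r\nstring")); // prints "true"
        System.out.println(toPlatformLineEnding(text));
    }
}
```

Because the conversion is a pure function of a constant, it is the kind
of call that could in principle be folded ahead of time, as noted above.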
On 2/5/2018 1:39 PM, Guy Steele wrote:
>> On Feb 5, 2018, at 1:39 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>>
>>
>>> However, I also note that the broad problem may have two or three distinct symptoms, and:
>>> (1) A solution that addresses one symptom may not address the others, and
>>> (2) On the other hand, it may (or may not) be perfectly reasonable to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all.
>> Indeed so. This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time. Another way to ask this question is: are we yet sufficiently minimal? We boiled it down quite a lot already, but are we at "minimal" yet? Or, did we take a wrong turn in boiling it down, and find ourselves only at a local minimum?
>>
>>> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets. The reason is that in both these cases the painful symptom is visual in nature rather than logical. That’s why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem). We may want to use ```…``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems.
>> OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"?
> Bingo.
>
> Back in the day (I’m talking 1960s) it was ugly and wasteful but predictable: if there were line breaks at all (as opposed to record-oriented I/O), they were represented by two characters, CR and then LF, held over from the mechanical abilities/requirements of Teletype machines.
>
> Then in the mid-1960s an ISO standard allowed plain LF (eventually semi-renamed Newline) as an alternative, and Multics and then Unix spread this idea (eventually to Apple as well).
>
> But another branch of the world, notably the CP/M to MS-DOS to Windows line, continued to use CR/LF. Worse yet, some software came to use CR alone (perhaps a natural enough theory when you consider that the “Return” key on keyboards usually generates the CR character rather than the LF character).
>
> It is simply impossible to be compatible with everyone on this issue, and we are fooling ourselves if we think that raw string representations can solve this problem in all contexts. Much better, I think, in the absence of consensus, to have explicit software gatekeepers at the points where data transitions among these disparate worlds.
>