Raw string literals -- where we are, how we got here
Stephen Colebourne
scolebourne at joda.org
Wed Mar 28 13:25:49 UTC 2018
On 27 March 2018 at 20:15, Brian Goetz <brian.goetz at oracle.com> wrote:
> ## The proposal
>
> Where we are now is that a raw string literal consists of an opening
> delimiter which is a sequence of N consecutive backticks, for some N > 0, a
> body which may contain any characters (including newlines) except for a
> sequence of N consecutive backticks, and a closing delimiter of N
> consecutive backticks. Any line-end sequences (CR, LF, CRLF) are normalized
> to a single newline (LF), and the remainder of the body is treated without
> any further transformation (including without unicode escape processing),
> and placed in a String. No other processing is done on the contents.
> #### Can't these be fixed?
>
> Because we start with such a simple rule (any number of consecutive ticks),
> pretty much any tweak is going to be complexity-increasing. It seems a poor
> tradeoff to make the feature more complex and less convenient for everyone,
> just to cater to empty strings.
Brian has asked me privately to record an alternate proposal I put
forward on amber-dev.
Raw strings would be formed of two variants - single line and extended.
The single line form has:
- a single backtick delimiter
- no newlines
- no escaping
- cannot embed a backtick
- may be empty
It serves the needs of regular expressions, Windows file paths etc.
The single line form has one difference to a single line instance of
the current proposal:
1) a tick cannot be embedded (the extended form is for that)
The extended form has:
- 3 or more backticks as a delimiter
- may include newlines, normalised to LF
- no escaping
- can embed backticks by using more ticks in the delimiter than in any
run of ticks in the content
- cannot be empty
It serves the needs of DSLs, code snippets, etc.
The extended form has two differences to the current proposal:
1) a minimum of 3 ticks
2) the delimiter must be longer than any embedded run of contiguous
ticks; it can no longer be shorter
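As a rough illustration of the two forms described above, here is a
minimal Java sketch of how a complete literal might be decoded under
the alternate proposal. The class and method names (AltRawString,
decode) are hypothetical, and rejecting a run of ticks longer than the
delimiter is my reading of difference 2, not something any compiler
implements.

```java
public final class AltRawString {
    private AltRawString() {}

    /** Decodes one complete literal (delimiters included) into its string value. */
    public static String decode(String literal) {
        int n = 0; // number of ticks in the opening run
        while (n < literal.length() && literal.charAt(n) == '`') n++;
        if (n == 0) throw new IllegalArgumentException("literal must start with a backtick");
        if (n == 2) {
            // Extended delimiters need 3+ ticks, so two ticks can only be the empty string.
            if (literal.length() == 2) return "";
            throw new IllegalArgumentException("text after the empty literal");
        }
        return n == 1 ? decodeSingleLine(literal) : decodeExtended(literal, n);
    }

    private static String decodeSingleLine(String literal) {
        // The first backtick after the opener must be the closing delimiter.
        int close = literal.indexOf('`', 1);
        if (close != literal.length() - 1) {
            throw new IllegalArgumentException("single-line form must end at its first closing backtick");
        }
        String body = literal.substring(1, close);
        if (body.indexOf('\n') >= 0 || body.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("single-line form cannot contain a newline");
        }
        return body;
    }

    private static String decodeExtended(String literal, int n) {
        StringBuilder out = new StringBuilder();
        int i = n;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '`') {
                int run = 0;
                while (i + run < literal.length() && literal.charAt(i + run) == '`') run++;
                if (run == n) { // closing delimiter
                    if (i + run != literal.length()) {
                        throw new IllegalArgumentException("text after the closing delimiter");
                    }
                    return out.toString();
                }
                if (run > n) {
                    throw new IllegalArgumentException("embedded ticks must be shorter than the delimiter");
                }
                out.append(literal, i, i + run); // a shorter run is ordinary content
                i += run;
            } else if (c == '\r') {
                out.append('\n'); // normalise CR and CRLF to LF
                i += (i + 1 < literal.length() && literal.charAt(i + 1) == '\n') ? 2 : 1;
            } else {
                out.append(c);
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated extended literal");
    }
}
```

The point of the sketch is that an opening run of exactly two ticks is
never ambiguous: it cannot open an extended literal, so it must be the
empty string.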
I like this approach more than the current proposal for various reasons:
1) It scales down to empty, avoiding nasty edge cases when reading code.
2) It supports every raw string that the current proposal can handle
(with a different number of ticks in the delimiter), plus the empty
string.
3) Style guides and teaching would focus on there being two distinct
variants - a 1-tick single-line form and a 3+ tick multi-line form.
While you can have a single-line 3+ tick raw string, it would be very
rare to actually need it.
4) The two forms match the two strands of use cases for raw strings.
While clearly subjective, I find that helpful to encourage sensible
use of raw strings.
5) When defining a raw string over multiple lines, a 3-tick delimiter
has greater visual weight than the 1-tick form of the current proposal.
While clearly subjective, I find this helpful when identifying
literals spread over multiple lines which necessarily fit poorly into
Java code (C-style curly brace languages in particular). Forcing
multi-line raw strings to have at least 3 ticks is a positive for me,
not a negative.
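To make point 2 concrete, here is a small Java sketch that picks a
delimiter for given content under the alternate rules. The names
(RawStringQuoter, quote) are hypothetical. Content that starts or ends
with a backtick is rejected, on the assumption that neither proposal
can express it because the tick would merge with the delimiter.

```java
public final class RawStringQuoter {
    private RawStringQuoter() {}

    /** Returns the content wrapped in the shortest delimiter the alternate rules allow. */
    public static String quote(String content) {
        if (content.isEmpty()) return "``"; // empty single-line form
        if (content.startsWith("`") || content.endsWith("`")) {
            // Assumption: a leading or trailing tick would be read as part of the delimiter.
            throw new IllegalArgumentException("content cannot start or end with a backtick");
        }
        int longest = 0;  // longest run of contiguous ticks in the content
        int run = 0;
        boolean multiLine = false;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            run = (c == '`') ? run + 1 : 0;
            longest = Math.max(longest, run);
            if (c == '\n' || c == '\r') multiLine = true;
        }
        if (longest == 0 && !multiLine) {
            return "`" + content + "`"; // single-line form: no ticks, no newlines
        }
        // Extended form: at least 3 ticks, and longer than any embedded run.
        String delimiter = "`".repeat(Math.max(3, longest + 1));
        return delimiter + content + delimiter;
    }
}
```

For example, content with a single embedded tick needs a 3-tick
delimiter here, where the current proposal would accept a 2-tick one;
the string is still expressible, just with a different tick count.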
The question is whether this makes the alternate proposal "more
complex and less convenient". It is clearly more complex in the sense
that there are more rules. The question is whether those additional
rules justify themselves.
My judgement is that if the alternate proposal only provided empty
strings, it would not justify itself. But because it has other effects
which I subjectively view as positive, I find the balance is tipped in
its favour.
thanks
Stephen