Raw strings starting/ending with backtick

John Rose john.r.rose at oracle.com
Mon Nov 26 07:28:01 UTC 2018


On Nov 24, 2018, at 11:39 PM, Cay Horstmann <cay.horstmann at sjsu.edu> wrote:
> 
> I agree that it is inelegant that there is no good syntax for raw strings starting with a backtick. Some time ago (http://horstmann.com/unblog/2018-06-01), I suggested that an initial newline after the backticks could count as part of the raw string delimiter:
> 
>      String myNameInABox = ```
> +-----+
> | Cay |
> +-----+```; // This string starts with +
> 
> Ok, maybe it's not brilliant, but it solves two problems: (1) how to format multiline strings that should be aligned, without having to strip out the initial newline (2) how to declare strings that start with a backtick.

The basic reality here is that we are trying to keep the quotes as simple as possible,
while allowing them to quote anything at all, including their shorter siblings.
A close-quote can't both appear in a string and end that string for obvious reasons.

Result:  There must be an infinite set of close-quotes available, so that even if a
string has the first N-1 close-quotes inside it, it can be terminated with the Nth one.
This also implies there must be a corresponding infinite set of open-quotes.
(Opens and closes can be pairwise identical, as in the proposed feature.)

Next, we have the problem of designing a set of open quotes which can
be differentiated from each other before the string body proper is scanned.
(You have to determine the close-quote before scanning the string body.)

This really means that open-quotes must be self-delimiting, or else that
there are some substrings that are forbidden to follow at least some
open-quotes.  If an open-quote syntax is not fully self-delimiting, there
are two open-quotes Q, QR for which Q is a proper prefix of QR.  In this
case, a quoted string body cannot begin with R and be quoted with Q.
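
As a rough illustration of the "self-delimiting" property (this sketch is
not part of the proposal), the check below tests a finite sample of
open-quotes for the Q / QR situation described above.  The class and method
names are made up for the example, and a finite sample can only refute the
property for an infinite family, never prove it.

```java
import java.util.List;

final class PrefixFreeCheck {
    /**
     * Returns true if no element of quotes is a proper prefix of another
     * element, i.e. the sample contains no pair Q, QR with R non-empty.
     */
    static boolean isSelfDelimiting(List<String> quotes) {
        for (String q : quotes) {
            for (String other : quotes) {
                if (other.length() > q.length() && other.startsWith(q)) {
                    return false; // q is a proper prefix of other
                }
            }
        }
        return true;
    }
}
```

For the sample of one, two and three backticks the check returns false; if
each of those is followed by a terminating helper character such as "|"
(an arbitrary choice for the example), it returns true.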

In the present case, we allow the open-quotes to be composed of an
alphabet of only one letter, the backtick, but allow any positive number
of them.  That's pretty good (and really, really simple) but it does have
the observed defect, that none of the open-quotes are self-delimiting,
because for any N>0, "`".times(N) is a proper prefix of the next open-quote,
"`".times(N+1).  Thus, for no open-quote (in the present scheme) can
the string body begin with backtick.
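
To make the defect concrete, here is a minimal sketch of a scanner for the
one-character scheme.  It assumes that the open-quote is the maximal run of
backticks at the start of the input, and that the close-quote is the next
run of exactly the same length (shorter or longer runs count as body text).
The class and method names are hypothetical.

```java
final class BacktickScanner {
    /**
     * Scans a raw-string literal at the start of src and returns its body,
     * or null if src does not start with a backtick or no matching
     * close-quote is found.
     */
    static String scanBody(String src) {
        // The open-quote is the maximal leading run of backticks.
        int n = 0;
        while (n < src.length() && src.charAt(n) == '`') {
            n++;
        }
        if (n == 0) {
            return null;
        }
        int i = n;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            // Measure this run of backticks inside the literal.
            int start = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            if (i - start == n) {
                return src.substring(n, start); // matching close-quote
            }
            // A run of any other length is part of the body.
        }
        return null;
    }
}
```

Under these assumptions, trying to quote the body `x with single backticks
produces the input ``x` and the scanner reads the first two backticks as a
two-backtick open-quote, so the intended body is never recovered.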

(There is a mirror-image problem with the end-quotes, if they are
not self-delimiting.  It must be possible, for any given string, to
choose an end-quote which (a) isn't in the string, and (b) when
appended to the string does not create an earlier instance of itself.
Again, having an alphabet of one character Q for the end-quotes
means that the string cannot end in Q.)

Can an infinite set of strings which are repetitions of a single character
be made self-delimiting?  Never, since any given member of that set
is the proper prefix of some longer member.

Making such a set self-delimiting is simple once you leave the
one-character alphabet:  Add another character,
and allow it to be a terminating character for the open-quote.
Or, allow the open-quote to include an optional numeric count
that determines the length of the rest of the quote.  Or, allow
the open-quote to have arbitrary (quoted) substructure.

(And for each open-quote define a corresponding close-quote.
Then given a string, choose the shortest close-quote that does not
occur in the string, and which when appended to the string will
not create an earlier instance of itself.  Begin the quoted string
with that close-quote's open-quote.)
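
The selection rule in the parenthetical above can be sketched for the
backtick-only scheme as follows.  This is an illustration under stated
assumptions rather than the compiler's behavior: it treats any occurrence
of the candidate quote as a substring of the body as disqualifying (which
is more conservative than an exact-run-length rule), and it also rejects
empty bodies and bodies that begin with a backtick, since both would merge
with the open-quote.  All names are hypothetical.

```java
import java.util.Optional;

final class DelimiterChooser {
    /**
     * Returns the shortest backtick run that can quote body, or empty if
     * no run of backticks can.
     */
    static Optional<String> chooseDelimiter(String body) {
        // An empty body would make the two quotes read as one longer quote;
        // a leading or trailing backtick merges with the adjacent quote.
        if (body.isEmpty() || body.startsWith("`") || body.endsWith("`")) {
            return Optional.empty();
        }
        // A run of body.length() + 1 backticks cannot occur in body,
        // so the loop always finds a candidate.
        for (int n = 1; n <= body.length() + 1; n++) {
            String quote = "`".repeat(n);
            // The first occurrence in body + quote must be the appended
            // one: the quote is not in the body, and appending it does
            // not create an earlier instance of itself.
            if ((body + quote).indexOf(quote) == body.length()) {
                return Optional.of(quote);
            }
        }
        return Optional.empty(); // not reached given the checks above
    }

    /** Wraps body in the chosen quotes, if any exist. */
    static Optional<String> quote(String body) {
        return chooseDelimiter(body).map(q -> q + body + q);
    }
}
```

For example, the body a`b gets two-backtick quotes, while `abc and abc`
get no answer at all, which is the limitation under discussion.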

Suppose that Q is the main quote character and R is a helper character
(or two or more) which helps size the end-quote.  Examples of
these three approaches would be:

  OQ1 = { Q.times(N) + R | N > 0 }
  OQ2 = { R + String.valueOf(N) + Q.times(N) | N > 0 }
  OQ3 = { Q + S + R | S in (Universe - R).star() }
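
For readers who want to see why each of these families is self-delimiting,
here is a hedged sketch of how a scanner might recognize them.  It takes Q
to be the backtick and picks "|" as R purely for illustration; neither
choice, nor any of the method names, comes from the proposal.  Each reader
returns the index at which the string body would start, and the matching
close-quotes are left unspecified.

```java
final class OpenQuoteReaders {
    static final char Q = '`';
    static final char R = '|'; // hypothetical helper character

    /** OQ1: Q.times(N) + R.  Returns the body start index, or -1. */
    static int readOq1(String src) {
        int i = 0;
        while (i < src.length() && src.charAt(i) == Q) {
            i++;
        }
        if (i == 0 || i == src.length() || src.charAt(i) != R) {
            return -1;
        }
        return i + 1; // R ends the open-quote
    }

    /** OQ2: R + String.valueOf(N) + Q.times(N).  Returns the body start index, or -1. */
    static int readOq2(String src) {
        if (src.length() < 2 || src.charAt(0) != R || src.charAt(1) == '0') {
            return -1; // String.valueOf(N) has no leading zero for N > 0
        }
        int i = 1;
        long n = 0;
        while (i < src.length() && src.charAt(i) >= '0' && src.charAt(i) <= '9') {
            n = n * 10 + (src.charAt(i) - '0');
            if (n > src.length()) {
                return -1; // count cannot be satisfied by this input
            }
            i++;
        }
        if (n == 0) {
            return -1; // no count present
        }
        for (long k = 0; k < n; k++, i++) {
            if (i >= src.length() || src.charAt(i) != Q) {
                return -1;
            }
        }
        return i; // exactly N quote characters consumed
    }

    /** OQ3: Q + S + R, where S contains no R.  Returns the body start index, or -1. */
    static int readOq3(String src) {
        if (src.isEmpty() || src.charAt(0) != Q) {
            return -1;
        }
        int r = src.indexOf(R, 1); // the first R ends the open-quote
        return r < 0 ? -1 : r + 1;
    }
}
```

In all three cases the reader knows where the open-quote ends without
looking at the body, so a body such as `x can follow immediately.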

Such schemes are more powerful, but much harder to describe than
what we have now:

  OQ0 = { Q.times(N) | N > 0, Q = "`" }

Coming up with these schemes is simple.  Coming up with a scheme
that feels simple to use seems to be impossible.  Tuning and tweaking
these schemes is *NOT* a hill-climbing activity that ascends to better
and better solutions.  Creating self-delimiting string syntaxes is a
frustrating exercise in pushing the complexities and corner cases
into darker and darker holes.

We settled on OQ0 (alphabet of one character) because it is simple
and easy to understand.  We looked carefully at other OQ schemes
and did not find that their specification and learning complexity
was paid for by removing the practical complexity of encoding a
few odd-looking strings.  OQ1, etc., have their own sharp edges
which we think users will run into more often than they will run
into the sharp edges of OQ0.  Trying to "fix" OQ0 just makes it
messier, like rubbing your finger over that single speck of lint on the
lens of your new binoculars.

— John