From brian.goetz at oracle.com  Mon Feb  5 15:53:41 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 10:53:41 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
Message-ID: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>

Sorry for the delay getting back to this.

> Hello!
>
> Every language which implements the multiline strings has problems
> with indentation.

Indeed.  The fundamental problem here is that the indentation of 
embedded snippets is serving two masters: the nesting of the surrounding 
code, and the snippet itself.  Sometimes the user cares about one; 
sometimes the other, and there's no one-size-fits-all set of rules that 
any language has come up with that makes both camps happy.

Sometimes it doesn't really matter; a few extra spaces in an HTML 
document or SQL query is often an acceptable price to pay for 
clean-looking code.  But sometimes it does matter.  Which raises two 
questions:
 - What should programmers do?
 - What should the language help them do?

> E.g. consider something like this:

So, in light of the above questions, let's ask: is this the right way to 
generate an HTML document?  It not only has "holes" to be filled in, but 
it has entire sections whose presence or absence depends on state.  I 
think the mess of this example goes far deeper than indentation.  (But 
yes, people will write code like this, with whatever tools we give 
them.)  To the second question, what should the language do to help this 
code?  Some would say "of course, the problem is you don't support 
interpolation."  But as this example shows, interpolation only helps 
with the trivial bits; it doesn't help with the conditional inclusion, 
so it only gets you a small part of the way to this example.  For that, 
you either need something with more structure, or a templating engine, 
or a builder, or one of the zillion other tools we've invented for this 
sort of thing.

So, without ignoring your fundamental question about indentation, I'll 
just point out that this example is about way more than indentation, and 
move on ...

> Now we have broken formatting in the generated HTML, which ruins the
> idea of multiline strings

I think "multiline strings" (or even "raw strings") are a bit of a 
misleading name.  What we're going for here is the ability to embed an 
arbitrary snippet of a "program" (shell script, SQL query, JSON doc) in 
a Java program, without having to mangle the embedded snippet.  This 
enhances readability (not mucked up with escapes and extra quotes) and 
reduces errors (because you can just cut and paste that snippet of 
script from the editor in which you've probably already written it, 
without risking breaking it via syntactic mangling.)  But, as you say, 
there are issues with indentation, when it matters.  (Surely it matters 
for snippets of python.)

Secondarily, the design center for this feature is: _short_ snippets -- 
those for which putting them in a separate document would be 
obfuscatory.  To see this, we have to approach it from both sides. On 
the short side, imagine Java didn't have string literals at all. Having 
to read "yes" and "no" out of a file would be ridiculously obfuscatory; 
eliminating this indirection makes code easier to read and less 
error-prone.  But on the long side, using raw strings to embed a 
million-line snippet in a Java program is also ridiculous; it would be 
far easier for maintainers of both the Java part and the embedded part 
to have their own uniform artifacts to maintain.  So the sweet spot for 
this feature is somewhere in the middle -- snippets that are short 
enough that indirecting to a file impairs readability, but not so long 
that there's any question where the embedded snippet ends and the Java 
code resumes.  (Subjectively, I'd say that this sweet spot is in the 
5-10 line range.)

> (why bother to generate \n in output HTML if
> it looks like a mess anyways?) Moreover, the structure of Java program
> now affects the output. E.g. if you add several more nested "if" or
> "switch" statement, you will need to indent <p> even more.

My answer to those people is: then don't do that ;)  They're already 
well outside the design center (as outlined above).  They should be 
using a templating mechanism, a builder, or something else to decouple 
the static content from the dynamic content.  Of course, they will, but 
I'm not sure bending over backwards to accommodate them is the winning move.

> Many languages provide library methods to handle this.

Good, now we're back to indentation.  All things being equal, it is 
better to do things in libraries than in the language; it is cheaper, 
more flexible, faster to market, less risky, and can support a broader 
range of preferences (you can have different libraries for different 
preferences.)  So I like this direction.

> E.g.
> trimIndent() could be provided to remove leading spaces of every line,
> but this would kill the HTML indents at all. Another possibility is to
> provide a method like trimMargin() on Kotlin [1] which trims all
> spaces before a special character (pipe by default) including a
> special character itself.

Now that we're in library world, we can have _all_ of these.  We can 
trim indents to the first indent, or trim a specified number of spaces 
off, or trim to a user-selected marker.  And if the users don't like the 
ones we include, they can write their own.
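
As a rough illustration of the three trimming styles mentioned above, here is a minimal sketch written against ordinary Java strings. The class name `IndentTrimmers` and the method names `trimIndent`, `trimSpaces`, and `trimMargin` are hypothetical, chosen only for this example; they are not an actual or proposed JDK API.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class IndentTrimmers {

    // Remove the smallest leading-space count found on any non-blank line
    // from every line. Lines shorter than that count become empty.
    public static String trimIndent(String text) {
        String[] lines = text.split("\n", -1);
        int min = Arrays.stream(lines)
                .filter(line -> !line.trim().isEmpty())
                .mapToInt(IndentTrimmers::leadingSpaces)
                .min()
                .orElse(0);
        return Arrays.stream(lines)
                .map(line -> line.length() >= min ? line.substring(min) : "")
                .collect(Collectors.joining("\n"));
    }

    // Remove up to n leading spaces from every line.
    public static String trimSpaces(String text, int n) {
        return Arrays.stream(text.split("\n", -1))
                .map(line -> line.substring(Math.min(n, leadingSpaces(line))))
                .collect(Collectors.joining("\n"));
    }

    // On lines whose first non-space character is the marker, drop the
    // leading spaces and the marker itself; other lines are left unchanged.
    public static String trimMargin(String text, char marker) {
        return Arrays.stream(text.split("\n", -1))
                .map(line -> {
                    int i = leadingSpaces(line);
                    return i < line.length() && line.charAt(i) == marker
                            ? line.substring(i + 1)
                            : line;
                })
                .collect(Collectors.joining("\n"));
    }

    private static int leadingSpaces(String line) {
        int i = 0;
        while (i < line.length() && line.charAt(i) == ' ') {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(trimIndent("    <p>\n      Hello\n    </p>"));
        System.out.println(trimMargin("    |<p>\n    |  Hello\n    |</p>", '|'));
    }
}
```

Because each variant is just a method with its own specification, a user who dislikes all three can write a fourth without any change to the language.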

> This is almost nice. Even without syntax highlighting you can easily
> distinguish between Java code and injected HTML code, you can indent
> Java and HTML independently and HTML code does not clash with Java
> code structure.

Pushing this to a library gives users the option, but not the obligation 
to do this.  That's good.

> The only problem is the necessity to call the
> trimMargin() method.

For some meaning of "only" :)  Like most syntactic conventions, some 
users will say "this is great" and others will say "yuck".  I prefer the 
semantic transparency of calling a method that has a clear specification 
-- especially when there are multiple possible options.

Remember that we're already in a corner case with respect to indentation 
-- in many cases, the users don't care at all about the extra spaces, 
they're just building up a SQL query that is going to be sent to a 
database, and the database doesn't care either.

> This means that original line is preserved in the
> bytecode and during runtime and the trimming is processed every time
> the method is called causing performance and memory handicap. This
> problem could be minimized making trimMargin() a javac intrinsic.

There are multiple layers at which this can be optimized (the JIT may be 
able to observe that this is a pure function applied to a constant), but 
indeed, this is a great candidate for compile-time constant folding.
(You can even see experiments related to compile-time constant folding 
going on in the condy-folding branch of the amber repo.)  Note too that 
we're now in corner-case-of-corner-case territory -- those who care 
about the indentation and the cost of runtime string processing.

> However even in this case it would be hard to enforce usage of this
> method and I expect that tons of hard-to-read Java code will appear in
> the wild, despite I believe that Java is about readability.

Developers' ability to combine simple features to produce unreadable code 
far outstrips the ability of language designers to do anything about it ...

> So I propose to enforce such (or similar) format on language level
> instead of adding a library method like "trimMargin()".

I think this would be a language design mistake.  This is taking one 
arbitrary convention and burning it into the language.  That convention 
might be fine for some situations, but terrible for others; not only 
might it not be the most readable choice in all cases, but it could be 
an actual conflict -- what if the | character is meaningful in the 
embedded language, such as Markdown tables? Now we're back to escaping 
-- which we were trying to avoid.

The language shouldn't pick favorites here; it should provide a simple, 
clear mechanism, which can be usefully composed with other mechanisms to 
get the job done.  Polluting the language to avoid the method call is a 
bad trade.

> I see some advantages with such syntax:
> 1. You can comment (or comment out!) a part of multiline string
> without terminating it

Rather than framing this as a property of a proposed solution, let's 
frame it as a question.  What should be the interaction with comments in 
a raw string?  Should you be able to embed comments? Should you be able 
to comment lines out?  (Note that many languages support comments, so it 
may be possible to do this by embedding a comment, rather than using the 
Java-level commenting.)  While I can surely see the utility of 
interaction with commenting, I also think that these "requirements" are 
only in play when the string in question is too long in the first place.

> 2. Looking into code fragment out of context (e.g. diff log) you
> understand that you are inside a multiline literal.
> reviewing a diff like
>
>              | x++;
> +           | if (x == 10) break;
>              | foo(x);
>
> Without pipes you could think that it's Java code without any further
> consideration.

This is true, but this is also true of large block comments; you can't 
tell whether the added line is part of a commented out block or of 
executable code.

Again, with raw strings, this is more of a problem when used with 
too-long blocks.

So, there are two things I don't like about this proposal: it's too 
"opinionated", and at the same time, it loses the fundamental goal we 
were trying to get to -- not having to muck up an embedded block with 
escaping.  (Sure, IDEs could (and should) help on pasting here, but that 
only helps writing, not reading.)

> The only disadvantage I see in forcing a pipe prefix is inability to
> just paste a big snippet from somewhere to the middle of Java program
> in a plain text editor.

As mentioned, we think this is most of the point, so this is a pretty 
big disadvantage indeed.


From james.laskey at oracle.com  Mon Feb  5 16:58:48 2018
From: james.laskey at oracle.com (Jim Laskey)
Date: Mon, 5 Feb 2018 12:58:48 -0400
Subject: Raw String Literals Revisions
Message-ID: <BA391506-C7DC-4448-B9FE-D785FB89E667@oracle.com>

Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004
Raw String Literals, we propose making the following changes.

- We will be extending the definition of raw string literals to allow
  repeating backticks (ala Markdown.)
  - A raw string literal will begin with a sequence of one or more
    backticks.  The raw string literal will end when a matching sequence of
    backticks is encountered. Any other sequence of backticks is treated
    as raw characters.
  - There will no longer be an empty raw string literal (redundant and
    conflicting.)
  - There will no longer be a need for an embedded backtick sequence (double
    backtick.)

  Ex.
    String a = `xyz`;              // "xyz"
    String b = ``xyz``;            // "xyz"
    String c = ```x`y``z```;       // "x`y``z"
    String d = ``;                 // unmatched opening raw string sequence?
    String e = ``
               SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
               WHERE `CITY` = 'INDIANAPOLIS'
               ORDER BY `EMP_ID`, `LAST_NAME`;
               ``;
    String f = ```````````````````````````````````````````````
               SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
               WHERE `CITY` = 'INDIANAPOLIS'
               ORDER BY `EMP_ID`, `LAST_NAME`;
               ```````````````````````````````````````````````;

- The naming of the "escape" and "unescape" String methods will be
  reversed such that "unescape" converts escape sequences to characters
  and "escape" converts worthy characters to escape sequences. Some more
  thought could be given to these names 1) to address the overloaded use
  in other languages, ex. JavaScript HTML escaping, 2) to truly make the
  direction of conversion clear.

- There was some good discussion about allowing multi-line traditional
  strings. The best argument was using the multi-line traditional strings
  as a stepping stone to multi-line raw strings; simple -> multi-line ->
  raw. It was also mentioned that multi-line traditional strings would
  lessen the need for the unescape/escape methods.
  
  Ex.
  
  String g = "
              public class Example{  
                public static void main(String... args){  
                  System.out.println(\"Hello World\");  
                }
              }
             ";
             
  Ultimately, ignoring escapes, we would end up with two ways of doing the
  same thing. Thus, we will not be supporting multi-line traditional
  strings at this time.

I will be updating the JEP accordingly.

Cheers,

-- Jim


From brian.goetz at oracle.com  Mon Feb  5 17:09:33 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 12:09:33 -0500
Subject: Raw String Literals Revisions
In-Reply-To: <BA391506-C7DC-4448-B9FE-D785FB89E667@oracle.com>
References: <BA391506-C7DC-4448-B9FE-D785FB89E667@oracle.com>
Message-ID: <b17dfcbb-608a-662b-5deb-482c21a5ade4@oracle.com>


> Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004
> Raw String Literals, we propose making the following changes.
>
> - We will be extending the definition of raw string literals to allow
>    repeating backticks (ala Markdown.)

The benefit of this is that, for a suitably chosen delimiter, any 
document can be embedded with no loss of fidelity.  For embedded 
documents that use ` in them, choose a suitable delimiter (usually `` 
will be enough) and paste away.  We stated earlier that it was a goal to 
make raw string literals truly free of interpretation by the lexer; this 
removes one of the remaining bits of non-raw-ness, that embedded 
backticks required some minor escaping.  (The other remaining bit is 
the treatment of newlines; not sure how much it's worth doing here to 
support platform-specific line endings.)  If people really want 
platform-specific newlines, they can toss a .replace("\n", "\r\n") on 
the end (which is amenable to the same optimizations as .trimIndent()).
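
To make "a suitably chosen delimiter" concrete, here is a small sketch that picks one for a given document. It assumes the rule as stated in the revision (only a backtick run of exactly the delimiter's length closes the literal) and uses the simple strategy of going one longer than the longest run in the content. The class and method names are hypothetical. It does not handle content that itself begins or ends with a backtick, which would merge with the delimiter.

```java
public class BacktickDelimiter {

    // Length of the longest run of consecutive backticks in the content.
    public static int longestBacktickRun(String content) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < content.length(); i++) {
            current = content.charAt(i) == '`' ? current + 1 : 0;
            longest = Math.max(longest, current);
        }
        return longest;
    }

    // A delimiter one backtick longer than any run inside the content
    // cannot be matched by anything in the content itself.
    public static String chooseDelimiter(String content) {
        return "`".repeat(longestBacktickRun(content) + 1);
    }

    public static void main(String[] args) {
        // Single backticks inside, so two are enough.
        System.out.println(chooseDelimiter("SELECT `EMP_ID` FROM `EMPLOYEE_TB`"));
        // A run of two inside, so three are needed.
        System.out.println(chooseDelimiter("x`y``z"));
    }
}
```

This is sufficient rather than minimal: a shorter delimiter also works whenever no run of exactly that length appears in the content.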

> - The naming of the "escape" and "unescape" String methods will be
>    reversed such that "unescape" converts escape sequences to characters
>    and "escape" converts worthy characters to escape sequences. Some more
>    thought could be given to these names 1) to address the overloaded use
>    in other languages, ex. JavaScript HTML escaping, 2) to truly make the
>    direction of conversion clear.

I think this is a better polarity, but I think this exercise shows that 
"escape" and "unescape" may still be too-confusing names.  I suggest we 
continue the search for names that shout out their directionality.
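
For readers trying to keep the two directions straight, here is a minimal sketch of the revised polarity. It is a hand-rolled illustration covering only four escapes (\n, \t, \" and \\); the class name `EscapeDirection` is hypothetical and this is not the proposed String API.

```java
public class EscapeDirection {

    // "unescape" in the revised sense: escape sequences become characters.
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                switch (next) {
                    case 'n':  out.append('\n'); break;
                    case 't':  out.append('\t'); break;
                    case '"':  out.append('"');  break;
                    case '\\': out.append('\\'); break;
                    // Unknown escape: keep it exactly as written.
                    default:   out.append('\\').append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // "escape" in the revised sense: special characters become escape sequences.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\n': out.append("\\n");  break;
                case '\t': out.append("\\t");  break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String twoChars = "a\\nb";                 // a, backslash, n, b
        String withNewline = unescape(twoChars);   // a, newline, b
        System.out.println(withNewline.length());
        System.out.println(escape(withNewline).equals(twoChars));
    }
}
```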

>              
>    Ultimately, ignoring escapes, we would end up with two ways of doing the
>    same thing. Thus, we will not be supporting multi-line traditional
>    strings at this time.

I think this is the right move, though there are arguments on both sides 
here.  The new feature is raw string literals, which are designed for 
embedding longer (but not too long) snippets of text free of Java 
interpretation.  For the rare cases where people want both string 
escaping and multi-line, there are library tools for adding back the 
escaping.


From guy.steele at oracle.com  Mon Feb  5 16:48:19 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Mon, 5 Feb 2018 11:48:19 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
Message-ID: <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>

While the proposal to use pipe characters in multiline literals is ingenious, I guess I am astonished that the discussion so far has not at least mentioned and compared a solution that has been available in C for more than two decades: implicit concatenation of string literals.

public class Multiline {
static String createHtml(String message) {
  String html = "<html>\n"
                "  <head>\n"
                "    <title>Message</title>\n"
                "  </head>\n"
                "  <body>\n";
  if (message != null) {
    html += "    <p>\n"
            "      Message: "+message+"\n"
            "    </p>\n";
  }
  html += "  </body>\n"
          "</html>\n";
  return html;
}
}

But, wait!  We don't even have to add that to Java, because we have a string concatenation operator, `+`:

public class Multiline {
static String createHtml(String message) {
  String html = "<html>\n"+
                "  <head>\n"+
                "    <title>Message</title>\n"+
                "  </head>\n"+
                "  <body>\n";
  if (message != null) {
    html += "    <p>\n"+
            "      Message: "+message+"\n"+
            "    </p>\n";
  }
  html += "  </body>\n"+
          "</html>\n";
  return html;
}
}

It's dead simple:
Whitespace inside double quotes belongs to the included code snippet. 
Whitespace outside double quotes belongs to the containing program.

No need for a trimming method.  No changes needed to the Java language.

Any IDE smart enough to provide pipe characters in a special pasting operation could just as easily provide the necessary double quotes and newline escapes and plus signs.

And while we are at it, we can get even more creative with the indentation of the containing program to better highlight the relative indentation of the included code snippets:

public class Multiline {
static String createHtml(String message) {
  String html =           "<html>\n"+
                          "  <head>\n"+
                          "    <title>Message</title>\n"+
                          "  </head>\n"+
                          "  <body>\n";
  if (message != null) {
    html +=               "    <p>\n"+
                          "      Message: "+message+"\n"+
                          "    </p>\n";
  }
  html +=                 "  </body>\n"+
                          "</html>\n";
  return html;
}
}

Does this strategy not meet all the stated desiderata?

--Guy

P.S. I have to admit that all the \n escapes are a bit ugly.  In Fortress, we explored having several string concatenation operators, one of which would supply an extra space character and one of which would supply an extra newline character.  Supposing `/` to be a string concatenation operator that adds a space character, and `//` to be a string concatenation operator that adds a newline character (in Fortress, we also allowed it to be a postfix operator that just adds a newline character), then we could write:

public class Multiline {
static String createHtml(String message) {
  String html =           "<html>"                     //
                          "  <head>"                   //
                          "    <title>Message</title>" //
                          "  </head>"                  //
                          "  <body>"                   //;
  if (message != null) {
    html +=               "    <p>"                    //
                          "      Message:" / message   //
                          "    </p>"                   //;
  }
  html +=                 "  </body>"                  //
                          "</html>"                    //;
  return html;
}
}

Or we could be even more clever, and say that the // operator has the additional effect of trimming trailing spaces and tabs from its left-hand operand.  Then we get those trailing double quotes out of the way by writing:

public class Multiline {
static String createHtml(String message) {
  String html =           "<html>                     " //
                          "  <head>                   " //
                          "    <title>Message</title> " //
                          "  </head>                  " //
                          "  <body>                   " //;
  if (message != null) {
    html +=               "    <p>                    " //
                          "      Message:" / message    //
                          "    </p>                   " //;
  }
  html +=                 "  </body>                  " //
                          "</html>                    " //;
  return html;
}
}

I am not advocating adding such operators to the Java language.  I am just pointing out that there are other interesting parts of the design space that might address the indentation-of-nested-snippets problem while requiring either fewer changes to the language (possibly none) or changes that might have broader applicability.


From brian.goetz at oracle.com  Mon Feb  5 17:37:31 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 12:37:31 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
 <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
Message-ID: <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>


> But, wait!  We don't even have to add that to Java, because we have a 
> string concatenation operator, `+`:

For years, this was exactly my answer regarding "Why don't we have 
multi-line strings."  What turned me around was Jim convincing me that 
the problem was not the inability to express a multi-line string (which 
we've been able to do since day 1, as you point out), but the 
higher-level issue -- the accidental friction of embedding a (small) 
foreign document (JSON snippet, SQL snippet, etc) in a Java program 
without Java's string proclivities mangling the embedded document.
Multi-line is one aspect of this, but if this were all there was, I'd 
still be with you on "we already have this, let's move on."  The bigger 
aspect is intrusion on things like regexes (lots of double-escaping, 
since \ is used extensively by regex), which are not even multi-line, 
and the introduction of errors into embedded documents while trying to 
turn them into something the Java lexer will accept.
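
The double-escaping friction is easy to see with java.util.regex in current Java. A short sketch (the class and constant names are made up for the example): the regex one means to write is shown in each comment, and the traditional string literal needed to express it is on the line below.

```java
import java.util.regex.Pattern;

public class RegexEscaping {

    // Intended regex:  \d+\.\d+   (a simple decimal number)
    public static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // Intended regex:  \\   (one literal backslash)
    // Each regex backslash needs its own Java-level escape, giving four.
    public static final Pattern BACKSLASH = Pattern.compile("\\\\");

    public static void main(String[] args) {
        System.out.println(DECIMAL.matcher("3.14").matches());
        // The input here is a one-character string holding a single backslash.
        System.out.println(BACKSLASH.matcher("\\").matches());
    }
}
```

Neither pattern spans more than one line, which is the point being made: the mangling comes from escape processing, not from line breaks.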

So, I prefer to think of this feature not as "multi-line strings" or 
even as "raw strings", but "embedded strings"; things that look like 
strings from the outside but look like whatever you want them to on the 
inside.


From guy.steele at oracle.com  Mon Feb  5 17:31:43 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Mon, 5 Feb 2018 12:31:43 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
 <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
 <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>
Message-ID: <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com>


> On Feb 5, 2018, at 12:37 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
>> But, wait!  We don't even have to add that to Java, because we have a string concatenation operator, `+`:
> 
> For years, this was exactly my answer regarding "Why don't we have multi-line strings."  What turned me around was Jim convincing me that the problem was not the inability to express a multi-line string (which we've been able to do since day 1, as you point out), but the higher-level issue -- the accidental friction of embedding a (small) foreign document (JSON snippet, SQL snippet, etc) in a Java program without Java's string proclivities mangling the embedded document.  Multi-line is one aspect of this, but if this were all there was, I'd still be with you on "we already have this, let's move on."  The bigger aspect is intrusion on things like regexes (lots of double-escaping, since \ is used extensively by regex), which are not even multi-line, and the introduction of errors into embedded documents while trying to turn them into something the Java lexer will accept.
> 
> So, I prefer to think of this feature not as "multi-line strings" or even as "raw strings", but "embedded strings"; things that look like strings from the outside but look like whatever you want them to on the inside.

Good, that's a better characterization of the broad problem.

However, I also note that the broad problem may have two or three distinct symptoms, and:
(1) A solution that addresses one symptom may not address the others, and
(2) On the other hand, it may (or may not) be perfectly reasonable to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all.

In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets.  The reason is that in both these cases the painful symptom is visual in nature rather than logical.  That's why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem).  We may want to use ```...``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems.


From brian.goetz at oracle.com  Mon Feb  5 18:39:53 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 13:39:53 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
 <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
 <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>
 <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com>
Message-ID: <b86fabeb-9965-09cc-5981-14fe618cf887@oracle.com>


> However, I also note that the broad problem may have two or three distinct symptoms, and:
> (1) A solution that addresses one symptom may not address the others, and
> (2) On the other hand, it may (or may not) be perfectly reasonable to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all.

Indeed so.  This is one reason why we resisted the call to do string 
interpolation (which many developers conflate with multi-line strings, 
as many languages with one also have the other) at the same time.
Another way to ask this question is: are we yet sufficiently minimal?
We boiled it down quite a lot already, but are we at "minimal" yet?  Or, 
did we take a wrong turn in boiling it down, and find ourselves only at a 
local minimum?

> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets.  The reason is that in both these cases the painful symptom is visual in nature rather than logical.  That's why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem).  We may want to use ```...``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems.

OK, so what you're saying here is that it might be a clever 
self-deception to count newline handling as "just another aspect of 
raw-ness"?

From guy.steele at oracle.com  Mon Feb  5 18:39:04 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Mon, 5 Feb 2018 13:39:04 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <b86fabeb-9965-09cc-5981-14fe618cf887@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
 <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
 <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>
 <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com>
 <b86fabeb-9965-09cc-5981-14fe618cf887@oracle.com>
Message-ID: <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com>


> On Feb 5, 2018, at 1:39 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
>> However, I also note that the broad problem may have two or three distinct symptoms, and:
>> (1) A solution that addresses one symptom may not address the others, and
>> (2) On the other hand, it may (or may not) be perfectly reasonable to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all.
> 
> Indeed so.  This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time.  Another way to ask this question is: are we yet sufficiently minimal?  We boiled it down quite a lot already, but are we at "minimal" yet?  Or, did we take a wrong turn in boiling it down, and find ourselves only at a local minimum?
> 
>> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets.  The reason is that in both these cases the painful symptom is visual in nature rather than logical.  That's why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem).  We may want to use ```...``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems.
> 
> OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"?

Bingo.

Back in the day (I'm talking 1960s) it was ugly and wasteful but predictable: if there were line breaks at all (as opposed to record-oriented I/O), they were represented by two characters, CR and then LF, held over from the mechanical abilities/requirements of Teletype machines.

Then in the mid-1960s an ISO standard allowed plain LF (eventually semi-renamed Newline) as an alternative, and Multics and then Unix spread this idea (and eventually to Apple).

But another branch of the world, notably the CP/M to MS-DOS to Windows line, continued to use CR/LF.  Worse yet, some software came to use CR alone (perhaps a natural enough theory when you consider that the "Return" key on keyboards usually generates the CR character rather than the LF character).

It is simply impossible to be compatible with everyone on this issue, and we are fooling ourselves if we think that raw string representations can solve this problem in all contexts.  Much better, I think, in the absence of consensus to have explicit software gatekeepers at the points where data transitions among these disparate worlds.


From brian.goetz at oracle.com  Mon Feb  5 20:55:24 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 15:55:24 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com>
References: <CAE+3fjZGiMoQ7SRgb6LEoo4eqiRS2Gs7YFgGmy4DAizmjkDjRQ@mail.gmail.com>
 <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>
 <EE1AD4A3-B24F-4061-BA8B-66D45E6730DE@oracle.com>
 <a6d9afad-c152-0c2d-635f-a5b1789d56f4@oracle.com>
 <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com>
 <b86fabeb-9965-09cc-5981-14fe618cf887@oracle.com>
 <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com>
Message-ID: <39218091-4267-59e1-1de1-a792c1702cbb@oracle.com>

OK, let's take a step back.  We have identified at least three degrees 
of freedom that have been sources of friction with existing string literals:

 - Sometimes we don't want traditional escaping (\n, etc);
 - Sometimes we don't want unicode escaping (\unnnn);
 - Sometimes we want to represent multiple lines of text as a single 
String.

Traditional strings could be described as (false, false, false) on these 
axes; the propose raw strings are (true, true, true).? As a first 
evaluation (if these really are the axes), this is encouraging; if 
you're going to pick 2 of 2^N prepackaged options, its often best to 
pick the ones with the biggest hamming distance.

I have a hard time imagining that people really need, for example,
traditional escaping but not Unicode escaping, with any frequency. So
offering all 2^N combinations is not likely to carry its weight.

I think what you are suggesting is that it's fine to lump the first two,
but it might have been a premature move to lump them with the third. (A
second question is: are these the only axes we should be concerned with
right now?) So, let's examine that.

We explored allowing double-quoted strings to span lines too; this gives
you a different stacking: { escaping multi-line, raw multi-line }. But
I think the part that's still unexplored is: do we need to explicitly
surface how source lines are combined into strings?

The assumption we've been working off of is: \n has won (this wasn't
true when Java got started). Is this wishful thinking? And if not, can
the library approach serve this purpose here too:

    `a long
     string`.toPlatformLineEnding()

(which, as has been observed, can be optimized either by compile-time 
evaluation or by link-time evaluation using LDC and ConstantDynamic, so 
I think we can ignore the "but then I'm doing work at runtime" aspect of 
this.)
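
As a rough illustration of that library approach (a sketch only: 
toPlatformLineEnding() is just the name floated above, and the class 
LineEndings and the helper toLineEnding are hypothetical names made up 
for this example), the conversion could be an ordinary static helper 
that first collapses every terminator to LF and then rewrites LF as the 
target ending:

```java
final class LineEndings {
    private LineEndings() {}

    /** Rewrites every CRLF, bare CR, or bare LF in s as the given line ending. */
    static String toLineEnding(String s, String eol) {
        // Collapse CRLF and bare CR to LF first so no terminator is counted twice.
        String lf = s.replace("\r\n", "\n").replace('\r', '\n');
        return lf.replace("\n", eol);
    }

    /** Hypothetical stand-in for the toPlatformLineEnding() idea. */
    static String toPlatformLineEnding(String s) {
        return toLineEnding(s, System.lineSeparator());
    }
}
```

A caller that needs a specific convention rather than the platform 
default would pass the ending explicitly, for example 
LineEndings.toLineEnding(text, "\r\n").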


On 2/5/2018 1:39 PM, Guy Steele wrote:
>> On Feb 5, 2018, at 1:39 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>>
>>
>>> However, I also note that the broad problem may have two or three distinct symptoms, and:
>>> (1) A solution that addresses one symptom may not address the others, and
>>> (2) On the other hand, it may (or may not) be perfectly reasonable to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all.
>> Indeed so.  This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time.  Another way to ask this question is: are we yet sufficiently minimal?  We boiled it down quite a lot already, but are we at "minimal" yet?  Or, did we take a wrong turn in boiling it down, and find ourselves at only a local minimum?
>>
>>> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets.  The reason is that in both these cases the painful symptom is visual in nature rather than logical.  That's why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem).  We may want to use ```...``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems.
>> OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"?
> Bingo.
>
> Back in the day (I'm talking 1960s) it was ugly and wasteful but predictable: if there were line breaks at all (as opposed to record-oriented I/O), they were represented by two characters, CR and then LF, held over from the mechanical abilities/requirements of Teletype machines.
>
> Then in the mid-1960s an ISO standard allowed plain LF (eventually semi-renamed Newline) as an alternative, and Multics and then Unix spread this idea (eventually to Apple as well).
>
> But another branch of the world, notably the CP/M to MS-DOS to Windows line, continued to use CR/LF.  Worse yet, some software came to use CR alone (perhaps a natural enough theory when you consider that the "Return" key on keyboards usually generates the CR character rather than the LF character).
>
> It is simply impossible to be compatible with everyone on this issue, and we are fooling ourselves if we think that raw string representations can solve this problem in all contexts.  Much better, I think, in the absence of consensus, to have explicit software gatekeepers at the points where data transitions among these disparate worlds.
>


From daniel.smith at oracle.com  Wed Feb  7 00:42:16 2018
From: daniel.smith at oracle.com (Dan Smith)
Date: Tue, 6 Feb 2018 17:42:16 -0700
Subject: Specification for JEP 323: Local-Variable Syntax for Lambda Parameters
Message-ID: <575B787D-3350-4414-8DE0-91D9467AFB30@oracle.com>

Here's a proposed specification for JEP 323. Mainly a few small tweaks to allow use of 'var' in the lambda syntax.

http://cr.openjdk.java.net/~dlsmith/lambda-parameters.html

From john.r.rose at oracle.com  Tue Feb 13 17:53:18 2018
From: john.r.rose at oracle.com (John Rose)
Date: Tue, 13 Feb 2018 09:53:18 -0800
Subject: Raw String Literals Revisions
In-Reply-To: <BA391506-C7DC-4448-B9FE-D785FB89E667@oracle.com>
References: <BA391506-C7DC-4448-B9FE-D785FB89E667@oracle.com>
Message-ID: <AB12FE93-7998-474B-8E1E-765AECB142AA@oracle.com>

On Feb 5, 2018, at 8:58 AM, Jim Laskey <james.laskey at oracle.com> wrote:
> 
> Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004
> Raw String Literals, we propose making the following changes.

I have written a draft edit to the JLS to support this feature.
Please find it here:

http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf

Best wishes,
-- John

From alex.buckley at oracle.com  Tue Feb 13 17:58:55 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Tue, 13 Feb 2018 09:58:55 -0800
Subject: Raw string literals and Unicode escapes
Message-ID: <5A83275F.80802@oracle.com>

I suspect the trickiest part of specifying raw string literals will be 
the lexer's modal behavior for Unicode escapes. As such, I am going to 
put the behavior under the microscope. Here is what the JEP has to say:

-----
Unicode escapes, in the form \uxxxx, are processed as part of character 
input prior to interpretation by the lexer. To support the raw string 
literal as-is requirement, Unicode escape processing is disabled when 
the lexer encounters an opening backtick and reenabled when encountering 
a closing backtick.
-----

I would like to assume that if the lexer comes across the six tokens \ u 
0 0 6 0  then it should interpret them as a Unicode escape representing 
a backtick _and then continue as if consuming the tokens of a raw string 
literal_. However, the mention of _an_ opening backtick and _a_ closing 
backtick gave me pause, given that repeated backticks can serve as the 
opening delimiter and the closing delimiter. For absolute clarity, let's 
write out examples to confirm intent: (Jim, please confirm or deny as 
you see fit!)

1.  String s = \u0060`;

Illegal. The RHS is lexed as ``;   which is disallowed by the grammar.

2.  String s = \u0060Hello\u0060;

Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest of 
the compilation unit -- the six tokens \ u 0 0 6 0 are not treated as a 
Unicode escape since we're lexing a raw string literal. And without a 
closing delimiter before the end of the compilation unit, a compile-time 
error occurs.

3a.  String s = \u0060Hello`;

Legal. The RHS is lexed as `Hello`;   which is well formed.

3b.  String s = \u0060\u0060Hello`;

Depends! If you take the JEP literally, then just the Unicode escape 
which serves as the first opening backtick ("_an_ opening backtick") is 
enough to enter raw-string mode. That makes the code legal: the RHS is 
lexed as `\u0060Hello`;   which is well formed. On the other hand, you 
might think that we shouldn't enter raw-string mode until the lexer in 
traditional mode has lexed the opening delimiter fully (i.e. ALL the 
opening backticks). Then, the code in 3b is illegal, because the opening 
delimiter (``) and the closing delimiter (`) are not symmetric.

I think we should take the JEP literally, so that 3b is legal. And then, 
some more examples:

4a.  String s = \u0060`Hello``;

Legal. The RHS is lexed as ``Hello``;   which is well formed.

4b.  String s = \u0060\u0060Hello``;

Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed by 
the grammar. A raw string literal containing 11 tokens is immediately 
followed by a ` token and a ; token which are not expected.

4c.  String s = \u0060\u0060Hello`\u0060;

Depends! If you take the JEP literally, where _a_ closing backtick is 
enough to re-enable Unicode escape processing, then the RHS is lexed as 
`\u0060Hello``;  which is illegal per 4b. On the other hand, if you 
think that we shouldn't re-enter traditional mode until the lexer in 
raw-string mode has lexed the closing delimiter fully (i.e. ALL the 
closing backticks), then presumably you think analogously about the 
opening delimiter, so the RHS would be lexed as ``Hello`\u0060;   which 
is illegal per 2 (no closing delimiter `` before the end of the 
compilation unit).

5.  String s = \u0060`Hello`\u0060;

I put this here because it looks nice. It hits the same issues as 3b and 4c.

Alex

From john.r.rose at oracle.com  Tue Feb 13 22:11:06 2018
From: john.r.rose at oracle.com (John Rose)
Date: Tue, 13 Feb 2018 14:11:06 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A83275F.80802@oracle.com>
References: <5A83275F.80802@oracle.com>
Message-ID: <B1C373CE-006A-493B-AFCA-F837B7CC3326@oracle.com>

On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
> 
> I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope.

For an approach to this see:
  http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf

In short:  We define a so-called "preimage" for each token,
which is the unambiguously defined sequence of UTF-16
code points that translate to that token via \u substitution
and line terminator normalization.

For raw strings (only) the preimage of a token is significant.
The backticks of a raw string (both opening and closing)
are required to be their own preimage (no \u0060 allowed).
And the raw string body contents are the preimage of the
string token, not the normal token image.

I think preimage is the trick we need here, and it settles
a number of questions, such as those you raised.
All of the tricky examples you raised are uniformly illegal,
under the preimage rule for raw-string quotes.
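
To make the image/preimage distinction concrete, here is a small sketch
(not the text of the draft; the class Preimage and its method names are
hypothetical). It translates Unicode escapes the way JLS 3.3 describes,
where a backslash may begin an escape only if it is preceded by an even
number of contiguous backslashes. The input is the preimage and the
result is the image the lexer normally sees. Line terminator
normalization is left out to keep the example short.

```java
final class Preimage {
    private Preimage() {}

    /** Translates Unicode escapes in raw source text (the preimage) into its image. */
    static String image(String preimage) {
        StringBuilder out = new StringBuilder();
        int slashes = 0; // contiguous backslashes seen just before position i
        int i = 0;
        while (i < preimage.length()) {
            char c = preimage.charAt(i);
            boolean eligible = c == '\\' && slashes % 2 == 0;
            if (eligible && i + 1 < preimage.length() && preimage.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < preimage.length() && preimage.charAt(j) == 'u') {
                    j++; // an escape may contain one or more 'u' characters
                }
                if (j + 4 > preimage.length()) {
                    throw new IllegalArgumentException("truncated Unicode escape at " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    char h = preimage.charAt(k);
                    int digit = Character.digit(h, 16);
                    if (digit < 0 || h > 'f') { // accept ASCII hex digits only
                        throw new IllegalArgumentException("bad hex digit at " + k);
                    }
                    value = value * 16 + digit;
                }
                out.append((char) value);
                i = j + 4;
                slashes = 0;
                continue;
            }
            slashes = (c == '\\') ? slashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** True if the text contains no Unicode escapes, i.e. it is its own preimage. */
    static boolean isOwnPreimage(String preimage) {
        return image(preimage).equals(preimage);
    }
}
```

Under the rule above, a raw-string quote is acceptable only when
isOwnPreimage holds for it, and the body of the literal is taken from
the preimage rather than from the result of image.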

-- John

From james.laskey at oracle.com  Tue Feb 13 22:19:03 2018
From: james.laskey at oracle.com (Jim Laskey)
Date: Tue, 13 Feb 2018 18:19:03 -0400
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A83275F.80802@oracle.com>
References: <5A83275F.80802@oracle.com>
Message-ID: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>

10a. String s = `abc`;
10b. String s = \u0060abc`;

As it stands both are legal. This decision has been mostly taken away from us because the lookahead of the previous token has "consumed" the character. There is little hope of finding out which form the backtick was derived from. That is not technically true in javac, since we can sift back through the input buffer; other tools may differ.  I'm going to ignore this remark in a second.

Choice: do we turn off escape processing on the first open backtick or the last open backtick? It doesn't really matter as long as we do it before consuming the first non-backtick character.

Choice: do we turn on escape processing on the first close backtick or the last close backtick? It doesn't matter as long as we do it before consuming the next non-backtick character. If we have an aborted close sequence (too few or too many backticks) then we have to turn it off again.

What about embedding \u0060 in a raw string?  If we treat it the same as a backtick then the user is limited in the ways to express untranslated escapes. Note: We can always convert manually in the scanner by looking ahead for '\', 'u', '0', '0', '6', '0'.

That all said, I think we should not allow \u0060 to represent a backtick in a raw string literal, ever. It complicates things unnecessarily and limits what the user can embed in the raw string.

So, change the scanner to

A) Peek back to make sure the first open backtick was exactly a backtick.
B) Turn off Unicode escapes immediately so that only backtick characters can be part of the delimiter.
C) Turn on Unicode escapes only after a valid closing delimiter is encountered.

Based on this all your examples are illegal.
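
A minimal sketch of what rules A to C amount to, assuming the scanner is handed the raw source characters before any Unicode-escape translation (this is not javac's scanner; the class RawStringScanner and its scan method are hypothetical). Only literal backticks count toward a delimiter, the body is returned untranslated, and a backtick run of the wrong length is treated as part of the body:

```java
import java.util.Optional;

final class RawStringScanner {
    private RawStringScanner() {}

    /**
     * Scans a raw string literal that starts at index 0 of src, where src holds
     * raw source characters. Returns the body, or empty if no valid literal starts there.
     */
    static Optional<String> scan(String src) {
        // Rules A and B: the opening delimiter is a run of literal backticks only.
        int n = 0;
        while (n < src.length() && src.charAt(n) == '`') {
            n++;
        }
        if (n == 0) {
            return Optional.empty(); // first character is not a literal backtick
        }
        int i = n;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++; // ordinary body character, taken as-is (no escape translation)
                continue;
            }
            int runStart = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            // Rule C: only a run of exactly n backticks closes the literal.
            if (i - runStart == n) {
                return Optional.of(src.substring(n, runStart));
            }
            // Too few or too many backticks: the run belongs to the body.
        }
        return Optional.empty(); // no closing delimiter before end of input
    }
}
```

With this sketch the escaped-delimiter forms are rejected because their first raw character is a backslash, while a \u0060 sequence inside the body is simply six ordinary characters.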

-- Jim



From john.r.rose at oracle.com  Tue Feb 13 22:27:30 2018
From: john.r.rose at oracle.com (John Rose)
Date: Tue, 13 Feb 2018 14:27:30 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
Message-ID: <E674FC28-A966-4743-B12E-05FE88EE4508@oracle.com>

On Feb 13, 2018, at 2:19 PM, Jim Laskey <james.laskey at oracle.com> wrote:
> 
> So, change the scanner to
> 
> A) Peek back to make sure the first open backtick was exactly a backtick.
> B) Turn off Unicode escapes immediately so that only backtick characters can be part of the delimiter.
> C) Turn on Unicode escapes only after a valid closing delimiter is encountered.
> 
> Based on this all your examples are illegal.

+1

I think this is also the simplest behavior to explain to users.


From alex.buckley at oracle.com  Wed Feb 14 19:46:23 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Wed, 14 Feb 2018 11:46:23 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <B1C373CE-006A-493B-AFCA-F837B7CC3326@oracle.com>
References: <5A83275F.80802@oracle.com>
 <B1C373CE-006A-493B-AFCA-F837B7CC3326@oracle.com>
Message-ID: <5A84920F.20201@oracle.com>

On 2/13/2018 2:11 PM, John Rose wrote:
> On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>> I suspect the trickiest part of specifying raw string literals will be
>> the lexer's modal behavior for Unicode escapes. As such, I am going to
>> put the behavior under the microscope.
>
> For an approach to this see:
> http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf
>
> In short:  We define a so-called "preimage" for each token,
> which is the unambiguously defined sequence of UTF-16
> code points that translate to that token via \u substitution
> and line terminator normalization.
>
> For raw strings (only) the preimage of a token is significant.
> The backticks of a raw string (both opening and closing)
> are required to be their own preimage (no \u0060 allowed).
> And the raw string body contents are the preimage of the
> string token, not the normal token image.
>
> I think preimage is the trick we need here, and it settles
> a number of questions, such as those you raised.
> All of the tricky examples you raised are uniformly illegal,
> under the preimage rule for raw-string quotes.

I agree that holding on to the preimage of each InputElement (JLS 3.5) 
is necessary because ` can legitimately appear in some kinds of 
InputElement as an ordinary InputCharacter (derived from either the 
RawInputCharacter ` or the UnicodeEscape \u0060):

1.  Comment

     // This Markdown processor treats ` specially.
     /* This Markdown processor treats \u0060 specially. */

2.  Token (and more specifically, StringLiteral)

     "Hi `Bob`"
     "Hi \u0060Bob\u0060"

Only if the InputElement is a Token, and more specifically a 
RawStringLiteral, do we need to take the sequence of InputCharacters and 
LineTerminators that constitute its RawStringBody and replace that 
sequence with its preimage.

I want to say something about the delimiters of the raw string literal 
now, but I'll do that in response to Jim's mail.

Alex

From alex.buckley at oracle.com  Wed Feb 14 20:24:36 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Wed, 14 Feb 2018 12:24:36 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
Message-ID: <5A849B04.8030503@oracle.com>

On 2/13/2018 2:19 PM, Jim Laskey wrote:
> 10a. String s = `abc`;
> 10b. String s = \u0060abc`;
>...
> So, change the scanner to
>
> A) Peek back to make sure the first open backtick was exactly a
> backtick.
> B) Turn off Unicode escapes immediately so that only backtick
> characters can be part of the delimiter.
> C) Turn on Unicode escapes only after a valid closing delimiter is
> encountered.
>
> Based on this all your examples are illegal.

I am not opposed to saying that a delimiter must be constructed from 
actual ` characters (that is, the RawInputCharacter ` rather than the 
UnicodeEscape \u0060). It would be silly if the opening delimiter was 
\u0060 because the closing delimiter cannot be identical -- that hurts 
readability. (Clearly the six characters \ u 0 0 6 0 inside a raw string 
literal get no special processing.)

Unfortunately, there is nothing in the lexical grammar that prevents 
\u0060Hello` or \u0060Hello\u0060 or in fact any of the examples below
from being lexed as a RawStringLiteral. The JLS will need a semantic 
rule to force each RawStringDelimiter to be composed of actual ` 
characters. As you say, this will make all the examples below illegal.

There is plenty of precedent for semantic rules ("It is a compile-time 
error ...") in the interpretation of Literal tokens, so that's fine. In 
fact, JLS 3.10.4 already has a semantic rule that appears to constrain a 
delimiter in a CharacterLiteral token:

   It is a compile-time error for the character following the
   SingleCharacter or EscapeSequence to be other than a '.

although it doesn't mean to force an actual ' character (that is, the 
RawInputCharacter ' and not the UnicodeEscape \u0027). It means:

   It is a compile-time error for the character following the
   SingleCharacter or EscapeSequence to be other than a ' (or the
   Unicode escape thereof).
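
To illustrate that reading with today's language (a small sketch; the 
class name EscapedDelimiters is made up for the example), both 
delimiters of a traditional literal can already be written as Unicode 
escapes, because the escapes are translated before tokens are formed:

```java
final class EscapedDelimiters {
    private EscapedDelimiters() {}

    // Both delimiters of this string literal are written as Unicode escapes for ".
    static String greeting() {
        return \u0022Hi\u0022;
    }

    // Both delimiters of this character literal are written as Unicode escapes for '.
    static char letter() {
        return \u0027x\u0027;
    }
}
```

This compiles exactly as if the methods returned "Hi" and 'x', which is 
the behavior a rule requiring actual ` characters would deliberately 
not extend to raw string delimiters.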

Alex

>> On Feb 13, 2018, at 1:58 PM, Alex Buckley <Alex.Buckley at oracle.com>
>> wrote:
>>
>> I suspect the trickiest part of specifying raw string literals will
>> be the lexer's modal behavior for Unicode escapes. As such, I am
>> going to put the behavior under the microscope. Here is what the
>> JEP has to say:
>>
>> -----
>> Unicode escapes, in the form \uxxxx, are processed as part of
>> character input prior to interpretation by the lexer. To support
>> the raw string literal as-is requirement, Unicode escape processing
>> is disabled when the lexer encounters an opening backtick and
>> reenabled when encountering a closing backtick.
>> -----
>>
>> I would like to assume that if the lexer comes across the six
>> tokens \ u 0 0 6 0  then it should interpret them as a Unicode
>> escape representing a backtick _and then continue as if consuming
>> the tokens of a raw string literal_. However, the mention of _an_
>> opening backtick and _a_ closing backtick gave me pause, given that
>> repeated backticks can serve as the opening delimiter and the
>> closing delimiter. For absolute clarity, let's write out examples
>> to confirm intent: (Jim, please confirm or deny as you see fit!)
>>
>> 1.  String s = \u0060`;
>>
>> Illegal. The RHS is lexed as ``;   which is disallowed by the
>> grammar.
>>
>> 2.  String s = \u0060Hello\u0060;
>>
>> Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest
>> of the compilation unit -- the six tokens \ u 0 0 6 0 are not
>> treated as a Unicode escape since we're lexing a raw string
>> literal. And without a closing delimiter before the end of the
>> compilation unit, a compile-time error occurs.
>>
>> 3a.  String s = \u0060Hello`;
>>
>> Legal. The RHS is lexed as `Hello`;   which is well formed.
>>
>> 3b.  String s = \u0060\u0060Hello`;
>>
>> Depends! If you take the JEP literally, then just the Unicode
>> escape which serves as the first opening backtick ("_an_ opening
>> backtick") is enough to enter raw-string mode. That makes the code
>> legal: the RHS is lexed as `\u0060Hello`;   which is well formed.
>> On the other hand, you might think that we shouldn't enter
>> raw-string mode until the lexer in traditional mode has lexed the
>> opening delimiter fully (i.e. ALL the opening backticks). Then, the
>> code in 3b is illegal, because the opening delimiter (``) and the
>> closing delimiter (`) are not symmetric.
>>
>> I think we should take the JEP literally, so that 3b is legal. And
>> then, some more examples:
>>
>> 4a.  String s = \u0060`Hello``;
>>
>> Legal. The RHS is lexed as ``Hello``;   which is well formed.
>>
>> 4b.  String s = \u0060\u0060Hello``;
>>
>> Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed
>> by the grammar. A raw string literal containing 11 tokens is
>> immediately followed by a ` token and a ; token which are not
>> expected.
>>
>> 4c.  String s = \u0060\u0060Hello`\u0060;
>>
>> Depends! If you take the JEP literally, where _a_ closing backtick
>> is enough to re-enable Unicode escape processing, then the RHS is
>> lexed as `\u0060Hello``;  which is illegal per 4b. On the other
>> hand, if you think that we shouldn't re-enter traditional mode
>> until the lexer in raw-string mode has lexed the closing delimiter
>> fully (i.e. ALL the closing backticks), then presumably you think
>> analogously about the opening delimiter, so the RHS would be lexed
>> as ``Hello`\u0060;   which is illegal per 2 (no closing delimiter
>> `` before the end of the compilation unit).
>>
>> 5.  String s = \u0060`Hello`\u0060;
>>
>> I put this here because it looks nice. It hits the same issues as
>> 3b and 4c.
>>
>> Alex
>

From john.r.rose at oracle.com  Wed Feb 14 20:42:01 2018
From: john.r.rose at oracle.com (John Rose)
Date: Wed, 14 Feb 2018 12:42:01 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A849B04.8030503@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
Message-ID: <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>

On Feb 14, 2018, at 12:24 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
> 
> There is plenty of precedent for semantic rules

In my draft version this is done with "where" clauses on the
grammar rules:

>  
> RawStringLiteral:
> 
>   RawQuote RawStringBody RawQuote
>   where the two raw-quotes are constrained to be identical
> 
> RawQuote:
>   ` {`}
>   where the preimage is constrained to be unescaped


From alex.buckley at oracle.com  Wed Feb 14 21:43:27 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Wed, 14 Feb 2018 13:43:27 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
Message-ID: <5A84AD7F.5030803@oracle.com>

On 2/14/2018 12:42 PM, John Rose wrote:
> On Feb 14, 2018, at 12:24 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>> There is plenty of precedent for semantic rules
>
> In my draft version this is done with "where" clauses on the
> grammar rules:
>
>> RawStringLiteral:
>>
>>   RawQuote RawStringBody RawQuote
>>   where the two raw-quotes are constrained to be identical
>>
>> RawQuote:
>>   ` {`}
>>   where the preimage is constrained to be unescaped

We're dancing on the head of a pin now, but as a matter of 
specificational style I'm wary of too many rules in the grammar itself, 
especially a context-sensitive rule like raw-quotes-must-balance.

JLS 3.10.5 is a good specimen to study: there is a context-free rule in 
the grammar:

   StringCharacter:
     InputCharacter but not " or \

and a context-sensitive semantic rule:

   It is a compile-time error for a line terminator to appear
   after the opening " and before the closing matching ".

Strictly speaking, the semantic rule is unnecessary because 
InputCharacter is DEFINED to exclude the CR and LF line terminators! But 
the semantic rule makes the intent very very clear. Writing rules in 
this form also prevents the spec from becoming a soup of statements that 
are more than just observations but less than full-throated assertions.

Anyway, the draft was very useful, thanks!

Alex

From john.r.rose at oracle.com  Wed Feb 14 21:48:20 2018
From: john.r.rose at oracle.com (John Rose)
Date: Wed, 14 Feb 2018 13:48:20 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A84AD7F.5030803@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
Message-ID: <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com>

On Feb 14, 2018, at 1:43 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
> 
> Strictly speaking, the semantic rule is unnecessary because InputCharacter is DEFINED to exclude the CR and LF line terminators! But the semantic rule makes the intent very very clear. Writing rules in this form also prevents the spec from becoming a soup of statements that are more than just observations but less than full-throated assertions.

That makes sense.

> Anyway, the draft was very useful, thanks!

Glad to hear it!

-- John

P.S. I posted another version that takes a slightly different
tack on the restriction of "cannot begin with a backquote".
It basically lifts the whole design of Markdown code quotes.

http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v5.pdf


From alex.buckley at oracle.com  Wed Feb 14 22:42:50 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Wed, 14 Feb 2018 14:42:50 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com>
Message-ID: <5A84BB6A.60102@oracle.com>

On 2/14/2018 1:48 PM, John Rose wrote:
> P.S. I posted another version that takes a slightly different
> tack on the restriction of "cannot begin with a backquote".
> It basically lifts the whole design of Markdown code quotes.
>
> http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v5.pdf

The inclusion of RawSP means that you are fully delivering on your 
trailer from Jan 30: "Spoiler: I think I can prove that Markdown code 
quoting is appropriately minimal in its design, in a way Jim's is not."

Let me first recognize the power of RawSP in lifting TWO restrictions: 
cannot begin with a backtick, and cannot end with a backtick:

   String s = ``Hi `Bob```;       // Error, unbalanced delimiters
   String s = ``Hi `Bob`` + "`";  // OK
   String s = `` Hi `Bob` ``;     // OK with RawSP trick

However, since the JEP's goal is to allow copy-paste of arbitrary text 
without interpretation, I think the RawSP trick of assigning meaning to 
whitespace is out of place. To most people, the raw string literal:

   ` and `

denotes a perfectly good five-character string that will probably be 
inserted between two other strings. Explaining that, no, it's really a 
three-character string will not be popular.
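
To spell out the arithmetic, here is a sketch under a stated assumption: 
that RawSP amounts to dropping one space directly after the opening 
delimiter and one space directly before the closing delimiter. The exact 
rule is in the v5 draft and is not reproduced here; the class RawSpSketch 
and the method stripRawSp are hypothetical names for this example only.

```java
final class RawSpSketch {
    private RawSpSketch() {}

    /**
     * Assumed RawSP behavior: remove at most one leading and one trailing
     * space from the text found between the delimiters.
     */
    static String stripRawSp(String between) {
        int start = between.startsWith(" ") ? 1 : 0;
        // Do not strip the same single space twice when the text is just " ".
        boolean trailing = between.endsWith(" ") && between.length() > start;
        int end = trailing ? between.length() - 1 : between.length();
        return between.substring(start, end);
    }
}
```

Under that assumption the five characters between the backticks in 
` and ` come out as the three-character string "and", and the text 
between the delimiters in `` Hi `Bob` `` comes out as "Hi `Bob`".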

Also, the inclusion of RawSP makes the lexing of RawStringLiteral 
ambiguous, since RawStringBody allows opening and closing whitespace. No 
doubt this can be fixed with rules involving "If the first character 
after RawSP is a backtick ...", but now being like Markdown is getting 
expensive.

Alex

From john.r.rose at oracle.com  Wed Feb 14 22:46:54 2018
From: john.r.rose at oracle.com (John Rose)
Date: Wed, 14 Feb 2018 14:46:54 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A84BB6A.60102@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
Message-ID: <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>

On Feb 14, 2018, at 2:42 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
> 
> Also, the inclusion of RawSP makes the lexing of RawStringLiteral ambiguous, since RawStringBody allows opening and closing whitespace. No doubt this can be fixed with rules involving "If the first character after RawSP is a backtick ...", but now being like Markdown is getting expensive.

These matters are already covered in the draft, under the blanket
provision that the RSB cannot contain a close-quote sequence.
So I don't think I swept anything under the covers there.

A similar effect could be gotten by replacing RawSP with any
other raw character (fixed by the JLS), such as period ``.asdf.``,
double-quote ``"asdf"``, etc.


From brian.goetz at oracle.com  Sun Feb 18 21:44:31 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Sun, 18 Feb 2018 13:44:31 -0800
Subject: Updated records doc
Message-ID: <8B9F27FE-7A72-4050-B945-A85C78A610B3@oracle.com>

I've updated the records doc at:

    http://cr.openjdk.java.net/~briangoetz/amber/datum.html

to reflect comments and discussion to date.  

From brian.goetz at oracle.com  Fri Feb 23 21:00:31 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Fri, 23 Feb 2018 16:00:31 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A84BB6A.60102@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
Message-ID: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>


>
> However, since the JEP's goal is to allow copy-paste of arbitrary text 
> without interpretation, I think the RawSP trick of assigning meaning 
> to whitespace is out of place. To most people, the raw string literal:
>
>   ` and `
>
> denotes a perfectly good five-character string that will probably be 
> inserted between two other strings. Explaining that, no, it's really a 
> three-character string will not be popular.

+100.  The RawSP trick is clever, but too much so.  There are ample 
simpler approaches for beginning/ending with BT:

    String s = BACKTICK + `a raw string` + BACKTICK;
    String s = `` `a raw string` ``.trim();

These move the cognitive load on the user to the corner case, rather 
than landing it on the general case.


From guy.steele at oracle.com  Fri Feb 23 21:07:53 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Fri, 23 Feb 2018 16:07:53 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>
Message-ID: <DDB63A4D-8753-4ED8-ABE4-2B1F6AF966C6@oracle.com>

+200.  Or even

    String s = "`" + `a raw string` + "`";

It's perfectly okay to use both kinds of string in one expression.
	
> On Feb 23, 2018, at 4:00 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
> 
>> 
>> However, since the JEP's goal is to allow copy-paste of arbitrary text without interpretation, I think the RawSP trick of assigning meaning to whitespace is out of place. To most people, the raw string literal:
>> 
>>   ` and `
>> 
>> denotes a perfectly good five-character string that will probably be inserted between two other strings. Explaining that, no, it's really a three-character string will not be popular.
> 
> +100.  The RawSP trick is clever, but too much so.  There are ample simpler approaches for beginning/ending with BT:
> 
>     String s = BACKTICK + `a raw string` + BACKTICK;
>     String s = `` `a raw string` ``.trim();
> 
> These move the cognitive load on the user to the corner case, rather than landing it on the general case.
> 
> 


From john.r.rose at oracle.com  Sat Feb 24 06:34:35 2018
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 23 Feb 2018 22:34:35 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>
Message-ID: <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com>

On Feb 23, 2018, at 1:00 PM, Brian Goetz <Brian.Goetz at Oracle.COM> wrote:
> 
>> 
>> 
>> However, since the JEP's goal is to allow copy-paste of arbitrary text without interpretation, I think the RawSP trick of assigning meaning to whitespace is out of place. To most people, the raw string literal:
>> 
>>   ` and `
>> 
>> denotes a perfectly good five-character string that will probably be inserted between two other strings. Explaining that, no, it's really a three-character string will not be popular.
> 
> +100.  The RawSP trick is clever, but too much so.  There are ample simpler approaches for beginning/ending with BT:
> 
>     String s = BACKTICK + `a raw string` + BACKTICK;
>     String s = `` `a raw string` ``.trim();
> 
> These move the cognitive load on the user to the corner case, rather than landing it on the general case.

Note that the "trim" trick moves the problem elsewhere:  It can remove
more than just the one extra space, so the string "`xxx " needs a different
technique.  A bunch of only-partially-applicable tricks like that is also a kind
of cognitive load, isn't it?
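
To make the trim caveat concrete, here is a minimal sketch. It uses an
ordinary string literal as a stand-in for the padded raw literal, since the
backtick syntax is still only a proposal. The class name TrimDemo and the
helper stripGuards are hypothetical, introduced just for illustration; they
are not part of any proposed API.

```java
public class TrimDemo {
    // Hypothetical helper: removes exactly one guard character from each end.
    // Assumes the padded string has at least two characters.
    static String stripGuards(String padded) {
        return padded.substring(1, padded.length() - 1);
    }

    public static void main(String[] args) {
        // Intended content: a backtick, "xxx", and one trailing space.
        String intended = "`xxx ";
        // What a literal padded with one guard space on each side would hold.
        String padded = " " + intended + " ";

        // trim() removes all leading and trailing whitespace, so the
        // trailing space that belongs to the content is lost as well.
        System.out.println("[" + padded.trim() + "]");

        // Removing exactly one character per side keeps the content intact.
        System.out.println("[" + stripGuards(padded) + "]");
    }
}
```

The brackets in the output only make the surviving whitespace visible.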

Here's one I also kind of like:  If the string has no embedded newlines,
then do ``|`a raw string`|``.trimMargins().

The + operator is a more robust solution, although it requires
parentheses also if used with a postfix method of any sort.

Maybe better is trimLines, where a newline is the "guard" to
be stripped, but of which at most only one is stripped.

I suppose reasonable people might differ on whether a fixed aperiodic
quote (like "``` " or "```|") is more surprising than a grab bag of methods
for fixing edge effects.

But, I do agree that libraries can fix such edge effects.

And, I am very happy that, in lengthening the opening and
closing quotes, we are making it possible to paste an arbitrary
sequence of unicode without having to hunt around inside
the sequence to find stuff that needs extra quoting, as is
the case with today's strings.

The thing we are discussing here, the need to give special
handling to leading and trailing backticks is (crucially)
an edge effect (only at the two ends of the string) and
not a bulk effect (something that needs attention throughout
the string).  That means we have won on the key issue,
and are just disagreeing about how to collect our winnings.

(Yes you do have to look at the string bulk, but only to
choose a "strong enough fence" to enclose that bulk.
And then adjust the fence to handle edge effects from
backticks.  The amount of escaping is O(1) not O(N).)
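
As a rough illustration of choosing a "strong enough fence", here is a
sketch that scans the body once and picks a delimiter one backtick longer
than the longest run it finds. The names FencePicker, longestBacktickRun and
fenceFor are hypothetical. The sketch deliberately ignores the edge effects
discussed above (a body that begins or ends with a backtick).

```java
public class FencePicker {
    // Length of the longest run of consecutive backticks in the body.
    static int longestBacktickRun(String body) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '`') {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0;
            }
        }
        return longest;
    }

    // A fence one backtick longer than any run in the body cannot
    // appear inside the body, so the body needs no changes at all.
    static String fenceFor(String body) {
        int length = longestBacktickRun(body) + 1;
        StringBuilder fence = new StringBuilder();
        for (int i = 0; i < length; i++) {
            fence.append('`');
        }
        return fence.toString();
    }

    public static void main(String[] args) {
        String body = "Hi `Bob`";
        String fence = fenceFor(body);
        System.out.println(fence + body + fence);
    }
}
```

The scan reads the whole body, but the only thing that changes is the
fence; nothing inside the body is rewritten.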

-- John

From brian.goetz at oracle.com  Sat Feb 24 15:28:22 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Sat, 24 Feb 2018 10:28:22 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com>
 <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com>
Message-ID: <96a9a046-b63f-6d8c-797b-7ca1f0535583@oracle.com>

> And, I am very happy that, in lengthening the opening and
> closing quotes, we are making it possible to paste an arbitrary
> sequence of unicode without having to hunt around inside
> the sequence to find stuff that needs extra quoting, as is
> the case with today's strings.

That's the high order bit here; paste an arbitrary snippet.

> The thing we are discussing here, the need to give special
> handling to leading and trailing backticks is (crucially)
> an edge effect (only at the two ends of the string) and
> not a bulk effect (something that needs attention throughout
> the string).  That means we have won on the key issue,
> and are just disagreeing about how to collect our winnings.

I already collected my winnings, and am now spending them in the bar.  
Join me for a drink :)


From forax at univ-mlv.fr  Sun Feb 25 12:19:05 2018
From: forax at univ-mlv.fr (Remi Forax)
Date: Sun, 25 Feb 2018 13:19:05 +0100 (CET)
Subject: Raw string literals and Unicode escapes
In-Reply-To: <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
Message-ID: <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>

I'm late to the game, but why not use the same system as Perl, PHP, and Ruby to solve LTS [1]? That is, 
you have a sequence that marks the start of a raw string (%Q, qq, m), then a character (from a predefined list), then the raw string, and at the end of the raw string the same character as at the beginning (or its mirror). 

For example, with 'raw' as the prefix for a raw string: 
raw`this is a raw string` 
raw'this is another raw string' 
raw[yet another raw string] 

Rémi 

[1] https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome 

> From: "John Rose" <john.r.rose at oracle.com>
> To: "Alex Buckley" <alex.buckley at oracle.com>
> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
> Sent: Wednesday, February 14, 2018 23:46:54
> Subject: Re: Raw string literals and Unicode escapes

> On Feb 14, 2018, at 2:42 PM, Alex Buckley <alex.buckley at oracle.com> wrote:

>> Also, the inclusion of RawSP makes the lexing of RawStringLiteral ambiguous,
>> since RawStringBody allows opening and closing whitespace. No doubt this can be
>> fixed with rules involving "If the first character after RawSP is a backtick
>> ...", but now being like Markdown is getting expensive.

> These matters are already covered in the draft, under the blanket
> provision that the RSB cannot contain a close-quote sequence.
> So I don't think I swept anything under the covers there.

> A similar effect could be gotten by replacing RawSP with any
> other raw character (fixed by the JLS), such as period ``.asdf.``,
> double-quote ``"asdf"``, etc.

From alex.buckley at oracle.com  Mon Feb 26 18:43:46 2018
From: alex.buckley at oracle.com (Alex Buckley)
Date: Mon, 26 Feb 2018 10:43:46 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
Message-ID: <5A945562.8080108@oracle.com>

On 2/25/2018 4:19 AM, Remi Forax wrote:
> I'm late in the game but why not using the same system as Perl, PHP,
> Ruby to solve the Lts [1], i.e
> you have a sequence that says this is the starts of a raw string (%Q,
> qq, m) then a character (in a predefined list), the raw string and at
> the end of the raw string the same character as at the beginning (or its
> mirror).
>
> By example, this 'raw' as prefix for a raw string
> raw`this is a raw string`
> raw'this is another raw string'
> raw[yet another raw string]

See "Choice of Delimiters" in the "Alternatives" section of the JEP.

Alex

From john.r.rose at oracle.com  Mon Feb 26 20:17:13 2018
From: john.r.rose at oracle.com (John Rose)
Date: Mon, 26 Feb 2018 12:17:13 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5A945562.8080108@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
Message-ID: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>

On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
> 
> On 2/25/2018 4:19 AM, Remi Forax wrote:
>> I'm late in the game but why not using the same system as Perl, PHP,
>> Ruby to solve the Lts [1], i.e
>> you have a sequence that says this is the starts of a raw string (%Q,
>> qq, m) then a character (in a predefined list), the raw string and at
>> the end of the raw string the same character as at the beginning (or its
>> mirror).
>> 
>> By example, this 'raw' as prefix for a raw string
>> raw`this is a raw string`
>> raw'this is another raw string'
>> raw[yet another raw string]
> 
> See "Choice of Delimiters" in the "Alternatives" section of the JEP.

The JEP doesn't clearly call out the goal of *no* escapes in the bulk
of the raw string, but that requirement (which we have adopted)
affects the choice of quotes in a decisive manner.  Let me try to
lay out the "string physics" that underlie this decision.

*Any* single-character end-quote will have a significant probability
of showing up inside the bulk of a (randomly selected) raw string.

How significant?  Well, let's say conservatively that raw strings
can have all possible characters, but the end-quote sequence
only shows up one out of a hundred times, per character position,
in raw strings.  If you are using a series of ten-character raw
strings (to say nothing of bigger ones), you have about a 10%
chance for any given raw string to contain an inconvenient
end-quote.

That percentage is significant, especially given that in some
cases strings will be longer and quote characters will be more
common, both factors increasing the failure rate beyond 10%.
But even a 0.1% failure rate is noticeable to users, making a
feature feel unreliable.

This generalizes to any fixed multi-character end-quote, with a
reduction of probability exponential in the length of the end-quote,
but still with a non-zero probability, of occurring in the bulk of
a randomly selected string.  A two-character end-quote might
have a probability of 10^-4, and that means you have a more
modest but still significant chance of failure of 10% across a
suite of 100 random 10-character strings, or for one random
1000-character string.

Any *finite choice* of end-quotes has the same problem, with
a non-zero probability that decreases (but does not vanish)
with the number of available end-quotes.  The only way to
break out of the box is to allow the user an unlimited range
of successively "stronger" end-quotes (i.e., less likely ones).

(Randomly selected raw strings are easy to model, although
the numbers used above are an approximation to a binomial
distribution.  In fact, though, strings which show up non-randomly
in real code are *more* likely to mention end-quotes, since their
contents are somehow correlated to the enclosing language.)
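
The figures above can be checked with a few lines of arithmetic. This is a
sketch under the same simplifying assumption used in the argument
(independent character positions, each matching the end-quote with a fixed
probability); the class and method names are hypothetical.

```java
public class QuoteClash {
    // Probability that at least one of `positions` independent positions
    // matches the end-quote, given a per-position probability p.
    static double clashProbability(double p, int positions) {
        return 1.0 - Math.pow(1.0 - p, positions);
    }

    public static void main(String[] args) {
        // One-character end-quote, p = 1/100, ten-character string:
        // a little under 10%.
        System.out.println(clashProbability(0.01, 10));

        // Two-character end-quote, p = 10^-4, about 1000 positions
        // (one 1000-character string, or 100 strings of 10 characters):
        // again a little under 10%.
        System.out.println(clashProbability(1e-4, 1000));
    }
}
```

Both results land near 0.095, which is where the "about 10%" figures come
from.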

You can easily demonstrate this issue by nesting Java code
which uses raw quotes inside of a containing raw quote.  An
easy first test of a proposed quoting mechanism is, "will it
nest?"  If not, then the quoting mechanism does not meet
a key requirement for raw quotes.

This key requirement is unconstrained pasting *without* fixups
(escape sequences embedded in the bulk of the quote).
Anything else, with some epsilon probability of requiring escapes,
is not truly raw, just "mostly raw".

In the case you propose, Remi, the probability of having an
un-quotable bulk string is quite high, since all of the end-quotes
are single characters.

Only a convention with an end-quote of arbitrary length is strong
enough to "fence in" arbitrary raw strings.  The simplest possible
such convention is to allow replication of a single character to
serve as the end-quote.  This decision toward simplicity
influences other features in Java raw strings, including the
decision to use a new character and to disallow certain
edge cases, notably null strings.
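
To show how a replicated single character can act as an end-quote, here is
a minimal scanner sketch. It assumes the rule that a literal opened by a run
of n backticks is closed by the next run of exactly n backticks, and that
runs of any other length are part of the body. The names RawScan, runLength
and body are hypothetical, and this is not the actual javac lexer.

```java
public class RawScan {
    // Counts consecutive backticks starting at index i.
    static int runLength(String src, int i) {
        int n = 0;
        while (i + n < src.length() && src.charAt(i + n) == '`') {
            n++;
        }
        return n;
    }

    // Returns the body of a raw literal that starts at index 0 of src,
    // or null if there is no opening fence or no matching closing fence.
    static String body(String src) {
        int open = runLength(src, 0);
        if (open == 0) {
            return null;
        }
        int i = open;
        while (i < src.length()) {
            int run = runLength(src, i);
            if (run == open) {
                return src.substring(open, i);
            }
            // Skip an ordinary character, or a non-matching run as a whole.
            i += (run == 0) ? 1 : run;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(body("``Hi `Bob``"));   // body contains a single backtick
        System.out.println(body("``Hi `Bob```"));  // unbalanced: no run of exactly two
        System.out.println(body("``"));            // read as an opening fence only
    }
}
```

Under this rule two adjacent backticks are read as an opening fence of
length two, which is one way to see why the empty string ends up as an
excluded edge case.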

-- John

P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
which will resize themselves to contain whatever users throw into
the raw string body.  At that point backticks will feel like magic tokens
that never accidentally match raw string bodies.


From maurizio.cimadamore at oracle.com  Mon Feb 26 21:29:31 2018
From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore)
Date: Mon, 26 Feb 2018 21:29:31 +0000
Subject: Raw string literals and Unicode escapes
In-Reply-To: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
Message-ID: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>


On 26/02/18 20:17, John Rose wrote:
> Any *finite choice* of end-quotes has the same problem, with
> a non-zero probability that decreases (but does not vanish)
> with the number of available end-quotes.  The only way to
> break out of the box is to allow the user an unlimited range
> of successively "stronger" end-quotes (i.e., less likely ones).
In reality there is a 'finite' upper bound for this length, which is 
given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you could 
encode in a Java String which you can also symmetrically close - and 
it's an edge case, as it will contain the empty string.

So, yes, on paper, I agree with the argument; in practice, I guess I'd 
be more in favor of limiting the number of repetitions - I wouldn't like 
to open the door to puzzlers:

`````````````````````````````````````````````````````````````````````````hello`````````````````````````````````````````````````````````````````````````

(which might leave some ASCII art lovers a bit unhappy :-))

I think limiting to 8 or some other reasonably small number will 
probably reduce the clash probability enough? And, even if it's not 
enough, I guess we'd still be left with the question of whether a long 
(possibly unbounded?) escaping sequence is something we'd like to see in Java.

Maurizio

From james.laskey at oracle.com  Mon Feb 26 21:54:01 2018
From: james.laskey at oracle.com (Jim Laskey)
Date: Mon, 26 Feb 2018 17:54:01 -0400
Subject: Raw string literals and Unicode escapes
In-Reply-To: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
Message-ID: <DAFAE9E5-8726-4669-9380-29AC1DA753E1@oracle.com>

Why introduce an artificial limit? Identifiers don't have a limit. JLS 3.8, Identifiers: "An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter."

-- Jim

> On Feb 26, 2018, at 5:29 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> 
> 
> 
> On 26/02/18 20:17, John Rose wrote:
>> Any *finite choice* of end-quotes has the same problem, with
>> a non-zero probability that decreases (but does not vanish)
>> with the number of available end-quotes.  The only way to
>> break out of the box is to allow the user an unlimited range
>> of successively "stronger" end-quotes (i.e., less likely ones).
> In reality there is a 'finite' upper bound for this length, which is given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you could encode in a Java String which you can also symmetrically close - and it's an edge case, as it will contain the empty string.
> 
> So, yes, on paper, I agree with the argument, in practice, I guess I'd be more in favor of limiting the number of repetitions - I wouldn't like to open the door to puzzlers:
> 
> `````````````````````````````````````````````````````````````````````````hello`````````````````````````````````````````````````````````````````````````
> 
> (which might leave some Ascii art lovers a bit unhappy :-))
> 
> I think limiting to 8 or some other reasonable small number will probably reduce the clash probability enough? And, even if it's not enough, I guess we'd still be left with the question if a long (possibly unbounded?) escaping sequence is something we'd like to see in Java.
> 
> Maurizio


From john.r.rose at oracle.com  Mon Feb 26 22:01:51 2018
From: john.r.rose at oracle.com (John Rose)
Date: Mon, 26 Feb 2018 14:01:51 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
Message-ID: <0FC7FA01-A7CB-4D67-80BE-A4B19D0306F6@oracle.com>

On Feb 26, 2018, at 1:29 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> 
> On 26/02/18 20:17, John Rose wrote:
>> Any *finite choice* of end-quotes has the same problem, with
>> a non-zero probability that decreases (but does not vanish)
>> with the number of available end-quotes.  The only way to
>> break out of the box is to allow the user an unlimited range
>> of successively "stronger" end-quotes (i.e., less likely ones).
> In reality there is a 'finite' upper bound for this length, which is given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you could encode in a Java String which you can also symmetrically close - and it's an edge case, as it will contain the empty string.

That's only true for constant pool strings; there is no such defined limit
in the JLS.  And condy lifts the limit in the constant pool.

This is the point at which we need to note that there is a soft upper bound
to raw string literals, which is the amount of stuff you are willing to paste
into your Java source file before it isn't Java source any more.  Probably
a half-page of code will be the usual jumbo size, with occasional multi-page
outliers.  That's maybe 1kb (30 lines of 30 chars).  Still, that is plenty long
enough to encounter lots of odd corner-case end quotes.

> So, yes, on paper, I agree with the argument, in practice, I guess I'd be more in favor of limiting the number of repetitions

Pick your puzzler.  I'd rather not leave a single string un-representable; that
would lead to a different kind of puzzler, *as well as* a hard limitation.

Test question:  Does the JLS have a maximum length for identifiers?
It does not.  Even though long identifiers hypothetically lead to "puzzlers"
involving abuses of the notation.  Ridiculously long or difficult-to-read
identifiers are naturally avoided in practice, and the same will be true
for ridiculously long end-quotes.

http://merriam-webster.com/dictionary/abusus+non+tollit+usum

-- John

From maurizio.cimadamore at oracle.com  Mon Feb 26 22:57:37 2018
From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore)
Date: Mon, 26 Feb 2018 22:57:37 +0000
Subject: Raw string literals and Unicode escapes
In-Reply-To: <DAFAE9E5-8726-4669-9380-29AC1DA753E1@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
 <DAFAE9E5-8726-4669-9380-29AC1DA753E1@oracle.com>
Message-ID: <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com>

Of course - delimiters are not part of the string length - I see now why 
you can have (in theory) an unbounded prefix/suffix.

Personally, I find the argument - "because you can have unlimited-length 
identifiers" - not a great fit. From a lexer writer's perspective, I can see 
why it is used as a candidate - after all it is a token whose size is 
unbounded. But I find it hard to ignore that the roles played by 
identifiers and delimiters in the grammar are quite different.

At least there were other cases where we found a different trade-off 
between expressiveness and practicality - see Project Coin's use of 
repeated underscores in binary literals (subsequently banned):

private static final int BOND =
  0000_____________0000________0000000000000000__000000000000000000+
  00000000_________00000000______000000000000000__0000000000000000000+
   000____000_______000____000_____000_______0000__00______0+
  000______000_____000______000_____________0000___00______0+
0000______0000___0000______0000___________0000_____0_____0+
0000______0000___0000______0000__________0000___________0+
0000______0000___0000______0000_________0000__0000000000+
0000______0000___0000______0000________0000+
  000______000_____000______000________0000+
   000____000_______000____000_______00000+
    00000000_________00000000_______0000000+
      0000_____________0000________000000007;

(Example courtesy of Joshua Bloch)

Maurizio


On 26/02/18 21:54, Jim Laskey wrote:
> Why introduce an artificial limit? Identifiers don't have a 
> limit. 3.8. Identifiers: An identifier is an *unlimited-length 
> sequence* of Java letters and Java digits, the first of which must be 
> a Java letter.
>
> -- Jim
>
>> On Feb 26, 2018, at 5:29 PM, Maurizio Cimadamore 
>> <maurizio.cimadamore at oracle.com 
>> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>
>>
>>
>> On 26/02/18 20:17, John Rose wrote:
>>> Any *finite choice* of end-quotes has the same problem, with
>>> a non-zero probability that decreases (but does not vanish)
>>> with the number of available end-quotes.  The only way to
>>> break out of the box is to allow the user an unlimited range
>>> of successively "stronger" end-quotes (i.e., less likely ones).
>> In reality there is a 'finite' upper bound for this length, which is 
>> given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you 
>> could encode in a Java String which you can also symmetrically close 
>> - and it's an edge case, as it will contain the empty string.
>>
>> So, yes, on paper, I agree with the argument, in practice, I guess 
>> I'd be more in favor of limiting the number of repetitions - I 
>> wouldn't like to open the door to puzzlers:
>>
>> `````````````````````````````````````````````````````````````````````````hello`````````````````````````````````````````````````````````````````````````
>>
>> (which might leave some Ascii art lovers a bit unhappy :-))
>>
>> I think limiting to 8 or some other reasonable small number will 
>> probably reduce the clash probability enough? And, even if it's not 
>> enough, I guess we'd still be left with the question if a long 
>> (possibly unbounded?) escaping sequence is something we'd like to see 
>> in Java.
>>
>> Maurizio
>


From maurizio.cimadamore at oracle.com  Mon Feb 26 23:02:10 2018
From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore)
Date: Mon, 26 Feb 2018 23:02:10 +0000
Subject: Raw string literals and Unicode escapes
In-Reply-To: <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com>
References: <5A83275F.80802@oracle.com>
 <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com>
 <5A849B04.8030503@oracle.com>
 <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com>
 <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com>
 <DAFAE9E5-8726-4669-9380-29AC1DA753E1@oracle.com>
 <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com>
Message-ID: <b3ca73a9-b6cd-60ae-1232-8815883ddbce@oracle.com>

I stand corrected - repeated underscores are allowed - but Josh's 
example reminded me of the state of affairs with raw strings.

Maurizio


On 26/02/18 22:57, Maurizio Cimadamore wrote:
>
> At least there were other cases where we found a different trade-off 
> between expressiveness and practicality - see Project Coin's use of 
> repeated underscores in binary literals (subsequently banned):
>


From forax at univ-mlv.fr  Tue Feb 27 08:16:04 2018
From: forax at univ-mlv.fr (forax at univ-mlv.fr)
Date: Tue, 27 Feb 2018 09:16:04 +0100 (CET)
Subject: Raw string literals and Unicode escapes
In-Reply-To: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
Message-ID: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>

Hi John,
see below.

----- Original message -----
> From: "John Rose" <john.r.rose at oracle.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
> Sent: Monday, February 26, 2018 21:17:13
> Subject: Re: Raw string literals and Unicode escapes

> On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>> 
>> On 2/25/2018 4:19 AM, Remi Forax wrote:
>>> I'm late in the game but why not using the same system as Perl, PHP,
>>> Ruby to solve the Lts [1], i.e
>>> you have a sequence that says this is the starts of a raw string (%Q,
>>> qq, m) then a character (in a predefined list), the raw string and at
>>> the end of the raw string the same character as at the beginning (or its
>>> mirror).
>>> 
>>> By example, this 'raw' as prefix for a raw string
>>> raw`this is a raw string`
>>> raw'this is another raw string'
>>> raw[yet another raw string]
>> 
>> See "Choice of Delimiters" in the "Alternatives" section of the JEP.
> 
> The JEP doesn't clearly call out the goal of *no* escapes in the bulk
> of the raw string, but that requirement (which we have adopted)
> affects the choice of quotes in a decisive manner.  Let me try to
> lay out the "string physics" that underlie this decision.
> 
> *Any* single-character end-quote will have a significant probability
> of showing up inside the bulk of a (randomly selected) raw string.
> 
> How significant?  Well, let's say conservatively that raw strings
> can have all possible characters, but the end-quote sequence
> only shows up one out of a hundred times, per character position,
> in raw strings.  If you are using a series of ten-character raw
> strings (to say nothing of bigger ones), you have about a 10%
> chance for any given raw string to contain an inconvenient
> end-quote.
> 
> That percentage is significant, especially given that in some
> cases strings will be longer and quote characters will be more
> common, both factors increasing the failure rate beyond 10%.
> But even a 0.1% failure rate is noticeable to users, making a
> feature feel unreliable.
> 
> This generalizes to any fixed multi-character end-quote, with a
> reduction of probability exponential in the length of the end-quote,
> but still with a non-zero probability, of occurring in the bulk of
> a randomly selected string.  A two-character end-quote might
> have a probability of 10^-4, and that means you have a more
> modest but still significant chance of failure of 10% across a
> suite of 100 random 10-character strings, or for one random
> 1000-character string.
> 
> Any *finite choice* of end-quotes has the same problem, with
> a non-zero probability that decreases (but does not vanish)
> with the number of available end-quotes.  The only way to
> break out of the box is to allow the user an unlimited range
> of successively "stronger" end-quotes (i.e., less likely ones).
> 
> (Randomly selected raw strings are easy to model, although
> the numbers used above are an approximation to a binomial
> distribution.  In fact, though, strings which show up non-randomly
> in real code are *more* likely to mention end-quotes, since their
> contents are somehow correlated to the enclosing language.)
> 
> You can easily demonstrate this issue by nesting Java code
> which uses raw quotes inside of a containing raw quote.  An
> easy first test of a proposed quoting mechanism is, "will it
> nest?"  If not, then the quoting mechanism does not meet
> a key requirement for raw quotes.
> 
> This key requirement is unconstrained pasting *without* fixups
> (escape sequences embedded in the bulk of the quote).
> Anything else, with some epsilon probability of requiring escapes,
> is not truly raw, just "mostly raw".
> 
> In the case you propose, Remi, the probability of having an
> un-quotable bulk string is quite high, since all of the end-quotes
> are single characters.
> 
> Only a convention with an end-quote of arbitrary length is strong
> enough to "fence in" arbitrary raw strings.  The simplest possible
> such convention is to allow replication of a single character to
> serve as the end-quote.  This decision toward simplicity
> influences other features in Java raw strings, including the
> decision to use a new character and to disallow certain
> edge cases, notably null strings.
> 
> -- John


I understand your point, but I disagree with your analysis.
My own experience is that raw strings follow what I call the 'embedded languages' hypothesis:
for any application, there is a length such that all raw strings longer than that length contain only embedded programming languages.
So beyond this length, instead of the probability of seeing a given character being virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.

> 
> P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
> which will resize themselves to contain whatever users throw into
> the raw string body.  At that point backticks will feel like magic tokens
> that never accidentally match raw string bodies.

regards,
Rémi

From maurizio.cimadamore at oracle.com  Tue Feb 27 10:55:53 2018
From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore)
Date: Tue, 27 Feb 2018 10:55:53 +0000
Subject: Raw string literals and Unicode escapes
In-Reply-To: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
Message-ID: <706e24f3-713d-4494-b6a9-ce7d9a591a00@oracle.com>


On 27/02/18 08:16, forax at univ-mlv.fr wrote:
> Hi John,
> see below.
>
> ----- Original Message -----
>> From: "John Rose" <john.r.rose at oracle.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>
>> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
>> Sent: Monday, 26 February 2018 21:17:13
>> Subject: Re: Raw string literals and Unicode escapes
>> On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>> On 2/25/2018 4:19 AM, Remi Forax wrote:
>>>> I'm late in the game but why not using the same system as Perl, PHP,
>>>> Ruby to solve the Lts [1], i.e
>>>> you have a sequence that says this is the starts of a raw string (%Q,
>>>> qq, m) then a character (in a predefined list), the raw string and at
>>>> the end of the raw string the same character as at the beginning (or its
>>>> mirror).
>>>>
>>>> By example, this 'raw' as prefix for a raw string
>>>> raw`this is a raw string`
>>>> raw'this is another raw string'
>>>> raw[yet another raw string]
>>> See "Choice of Delimiters" in the "Alternatives" section of the JEP.
>> The JEP doesn't clearly call out the goal of *no* escapes in the bulk
>> of the raw string, but that requirement (which we have adopted)
>> affects the choice of quotes in a decisive manner.  Let me try to
>> lay out the "string physics" that underlie this decision.
>>
>> *Any* single-character end-quote will have a significant probability
>> of showing up inside the bulk of a (randomly selected) raw string.
>>
>> How significant?  Well, let's say conservatively that raw strings
>> can have all possible characters, but the end-quote sequence
>> only shows up one out of a hundred times, per character position,
>> in raw strings.  If you are using a series of ten-character raw
>> strings (to say nothing of bigger ones), you have about a 10%
>> chance for any given raw string to contain an inconvenient
>> end-quote.
>>
>> That percentage is significant, especially given that in some
>> cases strings will be longer and quote characters will be more
>> common, both factors increasing the failure rate beyond 10%.
>> But even a 0.1% failure rate is noticeable to users, making a
>> feature feel unreliable.
>>
>> This generalizes to any fixed multi-character end-quote, with a
>> reduction of probability exponential in the length of the end-quote,
>> but still with a non-zero probability, of occurring in the bulk of
>> a randomly selected string.  A two-character end-quote might
>> have a probability of 10^-4, and that means you have a more
>> modest but still significant chance of failure of 10% across a
>> suite of 100 random 10-character strings, or for one random
>> 1000-character string.
>>
>> Any *finite choice* of end-quotes has the same problem, with
>> a non-zero probability that decreases (but does not vanish)
>> with the number of available end-quotes.  The only way to
>> break out of the box is to allow the user an unlimited range
>> of successively "stronger" end-quotes (i.e., less likely ones).
>>
>> (Randomly selected raw strings are easy to model, although
>> the numbers used above are an approximation to a binomial
>> distribution.  In fact, though, strings which show up non-randomly
>> in real code are *more* likely to mention end-quotes, since their
>> contents are somehow correlated to the enclosing language.)
>>
>> You can easily demonstrate this issue by nesting Java code
>> which uses raw quotes inside of a containing raw quote.  An
>> easy first test of a proposed quoting mechanism is, "will it
>> nest?"  If not, then the quoting mechanism does not meet
>> a key requirement for raw quotes.
>>
>> This key requirement is unconstrained pasting *without* fixups
>> (escape sequences embedded in the bulk of the quote).
>> Anything else, with some epsilon probability of requiring escapes,
>> is not truly raw, just "mostly raw".
>>
>> In the case you propose, Remi, the probability of having an
>> un-quotable bulk string is quite high, since all of the end-quotes
>> are single characters.
>>
>> Only a convention with an end-quote of arbitrary length is strong
>> enough to "fence in" arbitrary raw strings.  The simplest possible
>> such convention is to allow replication of a single character to
>> serve as the end-quote.  This decision toward simplicity
>> influences other features in Java raw strings, including the
>> decision to use a new character and to disallow certain
>> edge cases, notably null strings.
>>
>> -- John
>
> I understand your point but i disagree with your analysis.
> My own experience is that raw strings follow what i call the 'embedded languages' hypothesis,
> i.e. for any application, there is a length such all raw strings with a length greater than this length contain only embedded programming languages.
> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough.
Without diving too deep into the repeated vs. 'single but customizable' 
choice, I'm also a bit suspicious of the fact that John's analysis 
conservatively assumes that a snippet of text embedded in a raw string 
is a truly random sequence of characters. To me, this just seems the 
wrong assumption: by definition, something truly random has high 
entropy, and something with high entropy is usually associated with low 
information content, which is just not compatible with the use case of 
'pasting in a code snippet' (example: it's highly likely that the 
prefix 'cla' will be followed by 'ss' in a Java-like snippet). I would 
expect the entropy of the embedded snippet to be quite low compared to 
the assumption made here, which greatly affects the probability 
calculations. For the analysis to be correct, it should take into 
account the _frequency_ with which a given delimiter can appear in the 
various kinds of snippets that could be pasted in (and there's one such 
frequency for each snippet kind). Otherwise we're at risk of 
overestimating the failure rate (if we pick a delimiter symbol whose 
frequency is, in reality, really low), or underestimating it (if we pick 
a symbol that, conversely, occurs very frequently).
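To make the dependence on frequency concrete, here is a minimal Java
sketch of the independent-positions model used in John's estimate,
next to an empirical per-character frequency measured from a sample.
The class and method names are hypothetical, and treating positions as
independent is exactly the simplifying assumption under discussion, so
this is an illustration rather than a proper analysis.

```java
final class EndQuoteOdds {
    /**
     * Probability that at least one of `positions` independent positions
     * matches the end-quote, given a per-position match probability.
     * This is 1 - (1 - p)^n under the (assumed) independence model.
     */
    static double collisionProbability(double perPositionProbability, int positions) {
        return 1.0 - Math.pow(1.0 - perPositionProbability, positions);
    }

    /**
     * Fraction of characters in `sample` equal to `delimiter`.
     * Returns 0.0 for an empty sample.
     */
    static double observedFrequency(String sample, char delimiter) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        long hits = sample.chars().filter(c -> c == delimiter).count();
        return (double) hits / sample.length();
    }
}
```

With p = 0.01 and n = 10 the first method gives roughly 0.096, in line
with the "about 10%" figure quoted above; feeding it a frequency
measured from real snippets of a given kind would move that number up
or down accordingly.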

Maurizio
>
>> P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
>> which will resize themselves to contain whatever users throw into
>> the raw string body.  At that point backticks will feel like magic tokens
>> that never accidentally match raw string bodies.
> regards,
> Rémi


From brian.goetz at oracle.com  Tue Feb 27 19:48:04 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Tue, 27 Feb 2018 14:48:04 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
Message-ID: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>


> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough.

The problem is not that it's enough; it's that it is too much.  Having 
nine ways to say the same thing is too many; having infinitely many 
(e.g., nonces) is worse.  Having used the "pick your delimiter" approach 
taken by Perl, I find that you are *still* often bitten by the inability 
to find a good delimiter for embedding a snippet of a program written in 
a language similar to the outer language.  And it surely makes code less 
readable, because many more things can be interpreted as quotes.


From guy.steele at oracle.com  Tue Feb 27 19:56:57 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 14:56:57 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
Message-ID: <8D9A42BF-41A0-49C3-B2F8-1B556A41BBBB@oracle.com>


> On Feb 27, 2018, at 2:48 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough.
> 
> The problem is not that it's enough, its that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse.  Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language.  And it surely makes code less readable, because many more things can be interpreted as quotes.
> 

I agree with the comments that in practice many raw strings are much more likely to be some sort of code rather than relatively random strings.

That said, here is a perfectly plausible bit of Java code:

	final String uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
	final String lowercase = "abcdefghijklmnopqrstuvwxyz";
	final String enclosers = "(){}[]";
	final String punctuation = "`~!@#$%^&*_+-=|\\:\";'<>,.?/";

I can quote it easily using `` and ``.  But it's at least a little less easy, as John has argued, to quote it using the "raw|...|" convention: there is no character on my keyboard that does not occur in the code to be quoted, so I have to go in and muck with the middle of the string.  Which makes it less easy to read the embedded code: in order to interpret it, one must be mindful that the syntactic requirements of the containing language may have intruded (requiring doubling or escaping of certain characters, for example).
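As a rough illustration of the search a programmer has to do under a "pick your delimiter" convention, here is a small Java sketch that reports which candidate delimiter characters never occur in the text to be quoted.  The class name, method name, and the idea of passing the candidate set as a string are all hypothetical; no proposal in this thread specifies such a list.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimiterCandidates {
    /**
     * Returns, in order, the characters of `candidates` that do not occur
     * anywhere in `body`.  An empty result means every candidate delimiter
     * collides with the body, so the body would need fixups.
     */
    static List<Character> unusedDelimiters(String body, String candidates) {
        List<Character> unused = new ArrayList<>();
        for (char c : candidates.toCharArray()) {
            if (body.indexOf(c) < 0) {
                unused.add(c);
            }
        }
        return unused;
    }
}
```

When the result is empty for every punctuation candidate, as in the snippet above, the only remaining option under that convention is to edit the body itself.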

The nice thing about

	``<body tag="foo">He said `<i>what</i>'?</body>``

is that I can be _completely_ sure that the syntax of Java has not intruded _at all_ into the middle of the HTML syntax, so that's one less thing to worry about while puzzling over the HTML.  This becomes even more important if the raw-string syntax is nested:

	System.out.println("\t" + ```final String htmlSnippet = "\t" + ``<body tag="foo">He said `<i>what</i>'?</body>`` + "\n";``` + "\n");

-- Guy


From john.r.rose at oracle.com  Tue Feb 27 21:20:56 2018
From: john.r.rose at oracle.com (John Rose)
Date: Tue, 27 Feb 2018 13:20:56 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
Message-ID: <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>

On Feb 27, 2018, at 11:48 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
>> 
>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough.
> 
> The problem is not that it's enough, its that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse.  Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language.  And it surely makes code less readable, because many more things can be interpreted as quotes.

My experience tracks with Brian's.  That's why I think the random string
model is more robust than some vague hope that languages won't overlap.

Yes, random strings are an outlier, but less so than you'd think.  A typical
compression ratio for code is 5x, which means that if you replace "random
string of length 10" with "random code snippet of length 50" you get the
same analytic results.  In order to exclude a close-quote, you need an
additional constraint, which in practical terms results in folks having to
grub around inside their raw strings looking for accidental quotes.

-- John

From guy.steele at oracle.com  Tue Feb 27 21:12:14 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 16:12:14 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
 <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
Message-ID: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>


> On Feb 27, 2018, at 4:20 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> On Feb 27, 2018, at 11:48 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
>> 
>>> 
>>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough.
>> 
>> The problem is not that it's enough, its that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse.  Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language.  And it surely makes code less readable, because many more things can be interpreted as quotes.
> 
> My experience tracks with Brian's.  That's why I think the random string
> model is more robust than some vague hope that languages won't overlap.
> 
> Yes, random strings are an outlier, but less so that you'd think.  A typical
> compression ratio for code is 5x, which means that if you replace "random
> string of length 10" with "random code snippet of length 50" you get the
> same analytic results.  In order to exclude a close-quote, you need an
> additional constraint, which in practical terms results in folks having to
> grub around inside their raw strings looking for accidentall quotes.

Which leads us to the following theoretical result: the ```` mechanism does not require you to grub around in the interior of the string AT ALL if you don't want to.  All you need to know is the length.  If the length of the raw string is n, and it does not begin or end with ` (a necessary check in any case), then it can contain at most n-2 consecutive backquotes, so using n-1 backquote characters before and after will always do the job.

In practice, many programmers (and programs) will be willing to do a quick search to see whether "```" or failing that "````" happens to be absent from the raw string. :-)
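The "quick search" can be sketched in a few lines of Java.  This is only an illustration under stated assumptions: the names `BacktickFence`, `longestBacktickRun`, and `quote` are hypothetical, and the sketch picks a fence one backquote longer than the longest run in the body, which is always sufficient but not necessarily the shortest fence a given delimiter rule would accept.

```java
final class BacktickFence {
    /** Length of the longest run of consecutive backquotes in `body` (0 if none). */
    static int longestBacktickRun(String body) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '`') {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0;
            }
        }
        return longest;
    }

    /**
     * Wraps `body` in a backquote fence one longer than its longest
     * interior run, so the fence cannot occur inside the body.
     * Rejects empty bodies and bodies that begin or end with a backquote,
     * the edge cases mentioned in the thread.
     */
    static String quote(String body) {
        if (body.isEmpty()
                || body.charAt(0) == '`'
                || body.charAt(body.length() - 1) == '`') {
            throw new IllegalArgumentException(
                    "body must be non-empty and must not begin or end with a backquote");
        }
        int fenceLength = longestBacktickRun(body) + 1;
        StringBuilder fence = new StringBuilder();
        for (int i = 0; i < fenceLength; i++) {
            fence.append('`');
        }
        return fence.toString() + body + fence.toString();
    }
}
```

For example, `BacktickFence.quote("a`b")` returns the body wrapped in two backquotes on each side, which matches the n-1 bound for n = 3; an IDE's "stretchy quotes" could do the same scan on paste.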


From brian.goetz at oracle.com  Tue Feb 27 21:33:24 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Tue, 27 Feb 2018 16:33:24 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
 <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
 <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
Message-ID: <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>


> Which leads us to the following theoretical result: the ```` mechanism 
> does not require you to grub around in the interior of the string AT 
> ALL if you don't want to.  All you need to know is the length.  If the 
> length of the raw string is n, and it does not begin or end with ` (a 
> necessary check in any case), then using n-1 backquote characters 
> before and after will always do the job.
>
> In practice, many programmers (and programs) will be willing to do a 
> quick search to see whether "```" or failing that "````" happens to be 
> absent from the raw string. :-)
>

Or the IDE will helpfully suggest a sensible number of quotes when you 
do quote-quote-paste.

From guy.steele at oracle.com  Tue Feb 27 21:19:15 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 16:19:15 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com>
 <B4F0A087-477C-49B4-9B01-172A8E67A096@oracle.com> <5A84BB6A.60102@oracle.com>
 <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
 <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
 <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
 <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>
Message-ID: <F1291B56-790B-4E6E-A29B-E92DE2AA1BC2@oracle.com>


> On Feb 27, 2018, at 4:33 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
>> Which leads us to the following theoretical result: the ```` mechanism does not require you to grub around in the interior of the string AT ALL if you don't want to.  All you need to know is the length.  If the length of the raw string is n, and it does not begin or end with ` (a necessary check in any case), then using n-1 backquote characters before and after will always do the job.
>> 
>> In practice, many programmers (and programs) will be willing to do a quick search to see whether "```" or failing that "````" happens to be absent from the raw string. :-)
>> 
> 
> Or the IDE will helpfully suggest a sensible number of quotes when you do quote-quote-paste.

The IDE is a program.  I refer to my previous statement.  :-)  But thanks for the clarification.


From forax at univ-mlv.fr  Tue Feb 27 21:46:38 2018
From: forax at univ-mlv.fr (Remi Forax)
Date: Tue, 27 Feb 2018 22:46:38 +0100 (CET)
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
References: <5A83275F.80802@oracle.com>
 <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr>
 <5A945562.8080108@oracle.com>
 <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com>
 <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
 <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
 <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
 <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
Message-ID: <1324412928.831260.1519767998348.JavaMail.zimbra@u-pem.fr>

> From: "Guy Steele" <guy.steele at oracle.com>
> To: "John Rose" <john.r.rose at oracle.com>
> Cc: "amber-spec-experts" <amber-spec-experts at openjdk.java.net>
> Sent: Tuesday, 27 February 2018 22:12:14
> Subject: Re: Raw string literals and Unicode escapes

>> On Feb 27, 2018, at 4:20 PM, John Rose <john.r.rose at oracle.com> wrote:

>> On Feb 27, 2018, at 11:48 AM, Brian Goetz <brian.goetz at oracle.com> wrote:

>>>> So after this length instead of having the probability to see a character to be
>>>> virtually 1, you have the opposite effect, because programming languages (a
>>>> human construct) are very regular in the set of chars they use. So you do not
>>>> need to a repetition of a character to avoid a statistical effect that does not
>>>> occur. Being able to choose the escape character, is enough.

>>> The problem is not that it's enough, its that it is too much. Having nine ways
>>> to say the same thing is too many; having infinitely many (e.g., nonces) is
>>> worse. Having used the "pick your delimiter" approach taken by Perl, I find
>>> that you are *still* often bitten by the inability to find a good delimiter for
>>> embedding a snippet of a program written in a language similar to the outer
>>> language. And it surely makes code less readable, because many more things can
>>> be interpreted as quotes.

>> My experience tracks with Brian's. That's why I think the random string
>> model is more robust than some vague hope that languages won't overlap.

>> Yes, random strings are an outlier, but less so that you'd think. A typical
>> compression ratio for code is 5x, which means that if you replace "random
>> string of length 10" with "random code snippet of length 50" you get the
>> same analytic results. In order to exclude a close-quote, you need an
>> additional constraint, which in practical terms results in folks having to
>> grub around inside their raw strings looking for accidentall quotes.

> Which leads us to the following theoretical result: the ```` mechanism does not
> require you to grub around in the interior of the string AT ALL if you don't
> want to. All you need to know is the length. If the length of the raw string is
> n, and it does not begin or end with ` (a necessary check in any case), then
> using n-1 backquote characters before and after will always do the job.

> In practice, many programmers (and programs) will be willing to do a quick
> search to see whether "```" or failing that "````" happens to be absent from
> the raw string. :-)

OK, I'm clearly in the minority here; the repetition pattern wins.

Rémi