From brian.goetz at oracle.com Mon Feb 5 15:53:41 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 10:53:41 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To:
References:
Message-ID: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com>

Sorry for the delay getting back to this.

> Hello!
>
> Every language which implements the multiline strings has problems
> with indentation.

Indeed. The fundamental problem here is that the indentation of embedded snippets is serving two masters: the nesting of the surrounding code, and the snippet itself. Sometimes the user cares about one; sometimes the other, and there's no one-size-fits-all set of rules that any language has come up with that makes both camps happy.

Sometimes it doesn't really matter; a few extra spaces in an HTML document or SQL query is often an acceptable price to pay for clean-looking code. But sometimes it does matter. Which raises two questions:

 - What should programmers do?
 - What should the language help them do?

> E.g. consider something like this:

So, in light of the above questions, let's ask: is this the right way to generate an HTML document? It not only has "holes" to be filled in, but it has entire sections whose presence or absence depends on state. I think the mess of this example goes far deeper than indentation. (But yes, people will write code like this, with whatever tools we give them.)

To the second question, what should the language do to help this code? Some would say "of course, the problem is you don't support interpolation." But as this example shows, interpolation only helps with the trivial bits; it doesn't help with the conditional inclusion, so it only gets you a small part of the way to this example. For that, you either need something with more structure, or a templating engine, or a builder, or one of the zillion other tools we've invented for this sort of thing.
So, without ignoring your fundamental question about indentation, I'll just point out that this example is about way more than indentation, and move on ...

> Now we have broken formatting in the generated HTML, which ruins the
> idea of multiline strings

I think "multiline strings" (or even "raw strings") is a bit of a misleading name. What we're going for here is the ability to embed an arbitrary snippet of a "program" (shell script, SQL query, JSON doc) in a Java program, without having to mangle the embedded snippet. This enhances readability (not mucked up with escapes and extra quotes) and reduces errors (because you can just cut and paste that snippet of script from the editor in which you've probably already written it, without risking breaking it via syntactic mangling). But, as you say, there are issues with indentation, when it matters. (Surely it matters for snippets of Python.)

Secondarily, the design center for this feature is: _short_ snippets -- those for which putting them in a separate document would be obfuscatory. To see this, we have to approach it from both sides. On the short side, imagine Java didn't have string literals at all. Having to read "yes" and "no" out of a file would be ridiculously obfuscatory; eliminating this indirection makes code easier to read and less error-prone. But on the long side, using raw strings to embed a million-line snippet in a Java program is also ridiculous; it would be far easier for maintainers of both the Java part and the embedded part to have their own uniform artifacts to maintain. So the sweet spot for this feature is somewhere in the middle -- snippets that are short enough that indirecting to a file impairs readability, but not so long that there's any question where the embedded snippet ends and the Java code resumes. (Subjectively, I'd say that this sweet spot is in the 5-10 line range.)

> (why bother to generate \n in output HTML if
> it looks like a mess anyways?)
> Moreover, the structure of Java program
> now affects the output. E.g. if you add several more nested "if" or
> "switch" statement, you will need to indent even more.

My answer to those people is: then don't do that ;) They're already well outside the design center (as outlined above). They should be using a templating mechanism, a builder, or something else to decouple the static content from the dynamic content. Of course, they will, but I'm not sure bending over backwards to accommodate them is the winning move.

> Many languages provide library methods to handle this.

Good, now we're back to indentation. All things being equal, it is better to do things in libraries than in the language; it is cheaper, more flexible, faster to market, less risky, and can support a broader range of preferences (you can have different libraries for different preferences). So I like this direction.

> E.g.
> trimIndent() could be provided to remove leading spaces of every line,
> but this would kill the HTML indents at all. Another possibility is to
> provide a method like trimMargin() on Kotlin [1] which trims all
> spaces before a special character (pipe by default) including a
> special character itself.

Now that we're in library world, we can have _all_ of these. We can trim indents to the first indent, or trim a specified number of spaces off, or trim to a user-selected marker. And if the users don't like the ones we include, they can write their own.

> This is almost nice. Even without syntax highlighting you can easily
> distinguish between Java code and injected HTML code, you can indent
> Java and HTML independently and HTML code does not clash with Java
> code structure.

Pushing this to a library gives users the option, but not the obligation, to do this. That's good.

> The only problem is the necesity to call the
> trimMargin() method.

For some meaning of "only" :) Like most syntactic conventions, some users will say "this is great" and others will say "yuck". I prefer the semantic transparency of calling a method that has a clear specification -- especially when there are multiple possible options.
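To make the "do it in a library" option concrete, here is a minimal sketch of two such trimming methods. The class and method names (IndentHelpers, trimMargin, trimIndent) are hypothetical; they are loosely modeled on the Kotlin methods mentioned in the quoted text, not on any existing Java API. The sketch assumes LF line endings and counts only spaces as indentation in trimIndent.

```java
public class IndentHelpers {

    // Hypothetical helper: on each line, drop leading spaces/tabs up to and
    // including the first marker character. Lines without a marker are kept as-is.
    static String trimMargin(String text, char marker) {
        String[] lines = text.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int j = 0;
            while (j < line.length() && (line.charAt(j) == ' ' || line.charAt(j) == '\t')) {
                j++;
            }
            if (j < line.length() && line.charAt(j) == marker) {
                line = line.substring(j + 1);
            }
            out.append(line);
            if (i < lines.length - 1) {
                out.append('\n');
            }
        }
        return out.toString();
    }

    // Hypothetical helper: remove the smallest common run of leading spaces,
    // measured over the non-blank lines only.
    static String trimIndent(String text) {
        String[] lines = text.split("\n", -1);
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not constrain the common indent
            }
            int j = 0;
            while (j < line.length() && line.charAt(j) == ' ') {
                j++;
            }
            min = Math.min(min, j);
        }
        if (min == Integer.MAX_VALUE) {
            min = 0; // every line was blank
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            out.append(line.substring(Math.min(min, line.length())));
            if (i < lines.length - 1) {
                out.append('\n');
            }
        }
        return out.toString();
    }
}
```

Because both are ordinary methods, a user who dislikes the pipe marker can pass a different character, or write a third variant, without any change to the language.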
Remember that we're already in a corner case with respect to indentation -- in many cases, the users don't care at all about the extra spaces, they're just building up a SQL query that is going to be sent to a database, and the database doesn't care either.

> This means that original line is preserved in the
> bytecode and during runtime and the trimming is processed every time
> the method is called causing performance and memory handicap. This
> problem could be minimized making trimMargin() a javac intrinsic.

There are multiple layers at which this can be optimized (the JIT may be able to observe that this is a pure function applied to a constant), but indeed, this is a great candidate for compile-time constant folding. (You can even see experiments related to compile-time constant folding going on in the condy-folding branch of the amber repo.) Note too that we're now in corner-case-of-corner-case territory -- those who care about the indentation and the cost of runtime string processing.

> Hoever even in this case it would be hard to enforce usage of this
> method and I expect that tons of hard-to-read Java code will appear in
> the wild, despite I believe that Java is about readability.

Developers' ability to combine simple features to produce unreadable code far outstrips the ability of language designers to do anything about it ...

> So I propose to enforce such (or similar) format on language level
> instead of adding a library method like "trimMargin()".

I think this would be a language design mistake. This is taking one arbitrary convention and burning it into the language. That convention might be fine for some situations, but terrible for others; not only might it not be the most readable choice in all cases, but it could be an actual conflict -- what if the | character is meaningful in the embedded language, such as Markdown tables? Now we're back to escaping -- which we were trying to avoid.
The language shouldn't pick favorites here; it should provide a simple, clear mechanism, which can be usefully composed with other mechanisms to get the job done. Polluting the language to avoid the method call is a bad trade.

> I see some advantages with such syntax:
> 1. You can comment (or comment out!) a part of multiline string
> without terminating it

Rather than framing this as a property of a proposed solution, let's frame it as a question. What should be the interaction with comments in a raw string? Should you be able to embed comments? Should you be able to comment lines out? (Note that many languages support comments, so it may be possible to do this by embedding a comment, rather than using the Java-level commenting.) While I can surely see the utility of interaction with commenting, I also think that these "requirements" are only in play when the string in question is too long in the first place.

> 2. Looking into code fragment out of context (e.g. diff log) you
> understand that you are inside a multiline literal.
> reviewing a diff like
>
>   | x++;
> + | if (x == 10) break;
>   | foo(x);
>
> Without pipes you could think that it's Java code without any further
> consideration.

This is true, but this is also true of large block comments; you can't tell whether the added line is part of a commented out block or of executable code. Again, with raw strings, this is more of a problem when used with too-long blocks.

So, there are two things I don't like about this proposal: it's too "opinionated", and at the same time, it loses the fundamental goal we were trying to get to -- not having to muck up an embedded block with escaping. (Sure, IDEs could (and should) help on pasting here, but that only helps writing, not reading.)

> The only disadvantage I see in forcing a pipe prefix is inability to
> just paste a big snippet from somewhere to the middle of Java program
> in a plain text editor.
As mentioned, we think this is most of the point, so this is a pretty big disadvantage indeed.

From james.laskey at oracle.com Mon Feb 5 16:58:48 2018
From: james.laskey at oracle.com (Jim Laskey)
Date: Mon, 5 Feb 2018 12:58:48 -0400
Subject: Raw String Literals Revisions
Message-ID:

Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004 Raw String Literals, we propose making the following changes.

- We will be extending the definition of raw string literals to allow repeating backticks (a la Markdown.)

- A raw string literal will begin with a sequence of one or more backticks. The raw string literal will end when a matching sequence of backticks is encountered. Any other sequence of backticks is treated as raw characters.

- There will no longer be an empty raw string literal (redundant and conflicting.)

- There will no longer be a need for an embedded backtick sequence (double backtick.)

Ex.

    String a = `xyz`;           // "xyz"
    String b = ``xyz``;         // "xyz"
    String c = ```x`y``z```;    // "x`y``z"
    String d = ``;              // unmatched opening raw string sequence
    String e = ``
    SELECT `EMP_ID`, `LAST_NAME`
    FROM `EMPLOYEE_TB`
    WHERE `CITY` = 'INDIANAPOLIS'
    ORDER BY `EMP_ID`, `LAST_NAME`;
    ``;
    String f = ```````````````````````````````````````````````
    SELECT `EMP_ID`, `LAST_NAME`
    FROM `EMPLOYEE_TB`
    WHERE `CITY` = 'INDIANAPOLIS'
    ORDER BY `EMP_ID`, `LAST_NAME`;
    ```````````````````````````````````````````````;

- The naming of the "escape" and "unescape" String methods will be reversed such that "unescape" converts escape sequences to characters and "escape" converts worthy characters to escape sequences. Some more thought could be given to these names 1) to address the overloaded use in other languages, ex. JavaScript HTML escaping, 2) truly make the direction of conversion clear.

- There was some good discussion about allowing multi-line traditional strings.
The best argument was using the multi-line traditional strings as a stepping stone to multi-line raw strings; simple -> multi-line -> raw. It was also mentioned that multi-line traditional strings would lessen the need for the unescape/escape methods.

Ex.

    String g = "
    public class Example{
        public static void main(String... args){
            System.out.println(\"Hello World\");
        }
    }
    ";

Ultimately, ignoring escapes, we would end up with two ways of doing the same thing. Thus, we will not be supporting multi-line traditional strings at this time.

I will be updating the JEP accordingly.

Cheers,

-- Jim

From brian.goetz at oracle.com Mon Feb 5 17:09:33 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 12:09:33 -0500
Subject: Raw String Literals Revisions
In-Reply-To:
References:
Message-ID:

> Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004
> Raw String Literals, we propose making the following changes.
>
> - We will be extending the definition of raw string literals to allow
> repeating backticks (ala Markdown.)

The benefit of this is that, for a suitably chosen delimiter, any document can be embedded with no loss of fidelity. For embedded documents that use ` in them, choose a suitable delimiter (usually `` will be enough) and paste away. We stated earlier that it was a goal to make raw string literals truly free of interpretation by the lexer; this removes one of the remaining bits of non-raw-ness, that embedded backticks required some minor escaping. (The other remaining bit is the treatment of newlines; not sure how much it's worth doing here to support platform-specific line endings.) If people really want platform-specific newlines, they can toss a .replace("\n", "\r\n") on the end (which is amenable to the same optimizations as .trimIndent()).
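The repeated-backtick delimiter rule proposed above can be sketched as a small scanner. This is only an illustration of the rule as stated in the proposal (open with a run of n backticks, close at the next run of exactly n backticks, treat every other run as content); the class and method names are hypothetical, and it is not the javac implementation.

```java
public class RawStringScanner {

    // Hypothetical helper: src must start at the opening delimiter.
    // Returns the raw string body, or null if no matching closing run exists.
    static String scan(String src) {
        // Measure the opening delimiter: a run of one or more backticks.
        int n = 0;
        while (n < src.length() && src.charAt(n) == '`') {
            n++;
        }
        if (n == 0) {
            throw new IllegalArgumentException("input does not start with a backtick");
        }
        int p = n;
        while (p < src.length()) {
            if (src.charAt(p) != '`') {
                p++;
                continue;
            }
            // Measure this run of backticks inside the body.
            int runStart = p;
            while (p < src.length() && src.charAt(p) == '`') {
                p++;
            }
            if (p - runStart == n) {
                return src.substring(n, runStart); // matching closing delimiter
            }
            // Any run of a different length is ordinary raw content; keep scanning.
        }
        return null; // unmatched opening raw string sequence
    }
}
```

Under this reading, two backticks followed directly by a semicolon never close, which is why the proposal notes that there is no longer an empty raw string literal.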
> - The naming of the the ?escape" and ?unescape" String methods will be > reversed such that ?unescape" converts escape sequences to characters > and ?escape" converts worthy characters to escape sequences. Some more > thought could be given to these names 1) to address the overloaded use > in other languages, ex. JavaScript HTML escaping, 2) truly make the > direction of conversion clear. I think this is a better polarity, but I think this exercise shows that "escape" and "unescape" may still be too-confusing names.? I suggest we continue the search for names that shout out their directionality. > > Ultimately, ignoring escapes, we would end up with two ways of do the > same thing. Thus, we will not be supporting multi-line traditional > strings at this time. I think this is the right move, though there are arguments on both sides here.? The new feature is raw string literals, which are designed for embedding longer (but not too long) snippets of text free of Java interpretation.? For the rare cases where people want both string escaping and multi-line, there are library tools for adding back the escaping. From guy.steele at oracle.com Mon Feb 5 16:48:19 2018 From: guy.steele at oracle.com (Guy Steele) Date: Mon, 5 Feb 2018 11:48:19 -0500 Subject: [raw-strings] Indentation problem In-Reply-To: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> Message-ID: While the proposal to use pipe characters in multiline literals is ingenious, I guess I am astonished that the discussion so far has not at least mentioned and compared a solution that has been available in C for more than two decades: implicit concatenation of string literals. public class Multiline { static String createHtml(String message) { String html = "\n" " \n" " Message\n" " \n" " \n"; if (message != null) { html += "

\n" " Message: "+message+"\n" "

\n"; } html += " \n" "\n"; return html; } } But, wait! We don?t even have to add that to Java, because we have a string concatenation operator, `+`: public class Multiline { static String createHtml(String message) { String html = "\n"+ " \n"+ " Message\n"+ " \n"+ " \n"; if (message != null) { html += "

\n"+ " Message: "+message+"\n"+ "

\n?; } html += " \n"+ "\n"; return html; } } It?s dead simple: Whitespace inside double quotes belongs to the included code snippet. Whitespace outside double quotes belongs to the containing program. No need for a trimming method. No changes needed to the Java language. Any IDE smart enough to provide pipe characters in a special pasting operation could just as easily provide the necessary double quotes and newline escapes and plus signs. And while we are at it, we can get even more creative with the indentation of the containing program to better highlight the relative indentation of the included code snippets: public class Multiline { static String createHtml(String message) { String html = "\n"+ " \n"+ " Message\n"+ " \n"+ " \n"; if (message != null) { html += "

\n"+ " Message: "+message+"\n"+ "

\n?; } html += " \n"+ "\n"; return html; } } Does this strategy not meet all the stated desiderata? ?Guy P.S. I have to admit that all the \n escapes are a big ugly. In Fortress, we explored having several string concatenation operators, one of which would supply an extra space character and one of which would supply an extra newline character. Supposing `/` to be a string concatenation operator that adds a space character, and `//` to be a string concatenation operator that adds a newline character (in Fortress, we also allowed it to be a postfix operator that just adds a newline character), then we could write: public class Multiline { static String createHtml(String message) { String html = "" // " " // " Message" // " " // " " //; if (message != null) { html += "

" // " Message:" / message // "

" //; } html += " " // "" //; return html; } } Or we could be even more clever, and say that the // operator has the additional effect of trimming trailing spaces and tabs from its left-hand operand. Then we get those trailing double quotes out of the way by writing: public class Multiline { static String createHtml(String message) { String html = " " // " " // " Message " // " " // " " //; if (message != null) { html += "

" // " Message:" / message // "

" //; } html += " " // " " //; return html; } } I am not advocating adding such operators to the Java language. I am just pointing out that there are other interesting parts of the design space that might address the indentation-of-nested-snippets problem while requiring either fewer changes to the language (possibly none) or changes that might have broader applicability. From brian.goetz at oracle.com Mon Feb 5 17:37:31 2018 From: brian.goetz at oracle.com (Brian Goetz) Date: Mon, 5 Feb 2018 12:37:31 -0500 Subject: [raw-strings] Indentation problem In-Reply-To: References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> Message-ID: > But, wait! ?We don?t even have to add that to Java, because we have a > string concatenation operator, `+`: For years, this was exactly my answer regarding "Why don't we have multi-line strings."? What turned me around was Jim convincing me that the problem was not the inability to express a multi-line string (which we've been able to do since day 1, as you point out), but the higher-level issue -- the accidental friction of embedding a (small) foreign document (JSON snippet, SQL snippet, etc) in a Java program without Java's string proclivities mangling the embedded document.? Multi-line is one aspect of this, but if this were all there was, I'd still be with you on "we already have this, let's move on."? The bigger aspect is intrusion on things like regexes (lots of double-escaping, since \ is used extensively by regex), which are not even multi-line, and the introduction of errors into embedded documents while trying to turn them into something the Java lexer will accept. So, I prefer to think of this feature not as "multi-line strings" or even as "raw strings", but "embedded strings"; things that look like strings from the outside but look like whatever you want them to on the inside. 
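The double-escaping friction for regexes can be seen in a short example written with today's string literals. The class and method names are made up for illustration; only java.util.regex.Pattern is a real API here.

```java
import java.util.regex.Pattern;

public class RegexEscapes {

    // The regex we actually want is:  \d{3}-\d{4}
    // In a traditional literal every backslash must be doubled.
    static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{4}");

    // To match one literal backslash, the regex needs two backslashes,
    // and the Java literal therefore needs four.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    static String toForwardSlashes(String path) {
        return BACKSLASH.matcher(path).replaceAll("/");
    }
}
```

Neither pattern spans more than one line, which is the point being made: the mangling comes from escape processing, not from line breaks.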
From guy.steele at oracle.com Mon Feb 5 17:31:43 2018 From: guy.steele at oracle.com (Guy Steele) Date: Mon, 5 Feb 2018 12:31:43 -0500 Subject: [raw-strings] Indentation problem In-Reply-To: References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> Message-ID: <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com> > On Feb 5, 2018, at 12:37 PM, Brian Goetz wrote: > > >> But, wait! We don?t even have to add that to Java, because we have a string concatenation operator, `+`: > > For years, this was exactly my answer regarding "Why don't we have multi-line strings." What turned me around was Jim convincing me that the problem was not the inability to express a multi-line string (which we've been able to do since day 1, as you point out), but the higher-level issue -- the accidental friction of embedding a (small) foreign document (JSON snippet, SQL snippet, etc) in a Java program without Java's string proclivities mangling the embedded document. Multi-line is one aspect of this, but if this were all there was, I'd still be with you on "we already have this, let's move on." The bigger aspect is intrusion on things like regexes (lots of double-escaping, since \ is used extensively by regex), which are not even multi-line, and the introduction of errors into embedded documents while trying to turn them into something the Java lexer will accept. > > So, I prefer to think of this feature not as "multi-line strings" or even as "raw strings", but "embedded strings"; things that look like strings from the outside but look like whatever you want them to on the inside. Good, that?s a better characterization of the broad problem. However, I also note that the broad problem may two or three distinct symptoms, and: (1) A solution that addresses one symptom may not address the others, and (2) On the other hand, it may (or may not) be perfectly reason to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all. 
In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets. The reason is that in both these cases the painful symptom is visual in nature rather than logical. That?s why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem). We may want to use ```?``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems. From brian.goetz at oracle.com Mon Feb 5 18:39:53 2018 From: brian.goetz at oracle.com (Brian Goetz) Date: Mon, 5 Feb 2018 13:39:53 -0500 Subject: [raw-strings] Indentation problem In-Reply-To: <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com> References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com> Message-ID: > However, I also note that the broad problem may two or three distinct symptoms, and: > (1) A solution that addresses one symptom may not address the others, and > (2) On the other hand, it may (or may not) be perfectly reason to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all. Indeed so.? This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time.? Another way to ask this question is: are we yet sufficiently minimal?? We boiled it down quite a lot already, but are we at "minimal" yet?? Or, did we take a wrong turn in boiling it down, and find ourselves only a local minimum? > In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets. 
The reason is that in both these cases the painful symptom is visual in nature rather than logical. That?s why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem). We may want to use ```?``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems. OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"? From guy.steele at oracle.com Mon Feb 5 18:39:04 2018 From: guy.steele at oracle.com (Guy Steele) Date: Mon, 5 Feb 2018 13:39:04 -0500 Subject: [raw-strings] Indentation problem In-Reply-To: References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com> Message-ID: <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com> > On Feb 5, 2018, at 1:39 PM, Brian Goetz wrote: > > >> However, I also note that the broad problem may two or three distinct symptoms, and: >> (1) A solution that addresses one symptom may not address the others, and >> (2) On the other hand, it may (or may not) be perfectly reason to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all. > > Indeed so. This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time. Another way to ask this question is: are we yet sufficiently minimal? We boiled it down quite a lot already, but are we at "minimal" yet? Or, did we take a wrong turn in boiling it down, and find ourselves only a local minimum? > >> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets. 
The reason is that in both these cases the painful symptom is visual in nature rather than logical. That?s why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem). We may want to use ```?``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems. > > OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"? Bingo. Back in the day (I?m talking 1960s) it was ugly and wasteful but predictable: if there were line breaks at all (as opposed to record-oriented I/O), they were represented by two characters, CR and then LF, held over from the mechanical abilities/requirements of Teletype machines. Then in mid-1960s an ISO standard allowed plain LF (eventually semi-renamed Newline) as an alternative, and Multics and then Unix spread this idea (and eventually to Apple). But another branch of the world, notably the CP/M to MS-DOS to Windows line, continued to use CR/LF. Worse yet, some software came to use CR along (perhaps a natural enough theory when you consider that the ?Return? key on keyboards usually generates the CR character rather than the LF character). It is simply impossible to be compatible with everyone on this issue, and we are fooling ourselves if we think that raw string representations can solve this problem in all contexts. Much better, I think, in the absence of consensus to have explicit software gatekeepers at the points where data transitions among these disparate worlds. 
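One way to picture an explicit gatekeeper is a pair of small conversion methods applied where text enters and leaves the program. This is a sketch under the assumption that LF is the internal convention; the class and method names are hypothetical.

```java
public class LineEndings {

    // Inbound gate: accept CR LF, lone CR, or LF, and normalize everything to LF.
    static String toLf(String s) {
        return s.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Outbound gate: expects LF-normalized input and emits the requested
    // line terminator, for example System.lineSeparator() or "\r\n".
    static String fromLf(String s, String eol) {
        return s.replace("\n", eol);
    }
}
```

The CR LF replacement has to run before the lone-CR replacement; otherwise each CR LF pair would turn into two line breaks.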
From brian.goetz at oracle.com Mon Feb 5 20:55:24 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Mon, 5 Feb 2018 15:55:24 -0500
Subject: [raw-strings] Indentation problem
In-Reply-To: <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com>
References: <7a14daa0-1eca-23b4-839d-a5942d51a0cb@oracle.com> <5D30B3C8-6CDB-4CB9-8456-ECEC24CD73A5@oracle.com> <8A0CACD4-A850-45E0-A00B-D0F8A959432C@oracle.com>
Message-ID: <39218091-4267-59e1-1de1-a792c1702cbb@oracle.com>

OK, let's take a step back. We have identified at least three degrees of freedom that have been sources of friction with existing string literals:

 - Sometimes we don't want traditional escaping (\n, etc);
 - Sometimes we don't want unicode escaping (\unnnn);
 - Sometimes we want to represent multiple lines of text as a single String.

Traditional strings could be described as (false, false, false) on these axes; the proposed raw strings are (true, true, true). As a first evaluation (if these really are the axes), this is encouraging; if you're going to pick 2 of 2^N prepackaged options, it's often best to pick the ones with the biggest Hamming distance.

I have a hard time imagining that people really need, for example, traditional escaping but not unicode escaping, with any frequency. So offering all 2^N combinations is not likely to carry its weight.

I think what you are suggesting is that it's fine to lump the first two, but it might have been a premature move to lump them with the third. (A second question is: are these the only axes we should be concerned with right now.) So, let's examine that.

We explored allowing double-quoted strings to span lines too; this gives you a different stacking: { escaping multi-line, raw multi-line }. But I think the part that's still unexplored is: do we need to explicitly surface how source lines are combined into strings?

The assumption we've been working off of is: \n has won (this wasn't true when Java got started.) Is this wishful thinking?
And if not, can the library approach serve this purpose here too: ??? `a long ???? string`.toPlatformLineEnding() (which, as has been observed, can be optimized either by compile-time evaluation or by link-time evaluation using LDC and ConstantDynamic, so I think we can ignore the "but then I'm doing work at runtime" aspect of this.) On 2/5/2018 1:39 PM, Guy Steele wrote: >> On Feb 5, 2018, at 1:39 PM, Brian Goetz wrote: >> >> >>> However, I also note that the broad problem may two or three distinct symptoms, and: >>> (1) A solution that addresses one symptom may not address the others, and >>> (2) On the other hand, it may (or may not) be perfectly reason to address the most painful symptoms in different ways, rather than insisting that a single solution cover them all. >> Indeed so. This is one reason why we resisted the call to do string interpolation (which many developers conflate with multi-line strings, as many languages with one also have the other) at the same time. Another way to ask this question is: are we yet sufficiently minimal? We boiled it down quite a lot already, but are we at "minimal" yet? Or, did we take a wrong turn in boiling it down, and find ourselves only a local minimum? >> >>> In particular, I happen to think that the problem of distinguishing snippet indentation from encoding-program indentation may require a rather different kind of solution from the problem of escape characters in embedded snippets. The reason is that in both these cases the painful symptom is visual in nature rather than logical. That?s why I can understand what drove Tagir to pursue the pipe-character approach (even though I think it may not be the best solution to the problem). We may want to use ```?``` to enclose regexes but also want to use some other approach to solve the multi-line / indentation problems. >> OK, so what you're saying here is that it might be a clever self-deception to count newline handling as "just another aspect of raw-ness"? > Bingo. 
> > Back in the day (I?m talking 1960s) it was ugly and wasteful but predictable: if there were line breaks at all (as opposed to record-oriented I/O), they were represented by two characters, CR and then LF, held over from the mechanical abilities/requirements of Teletype machines. > > Then in mid-1960s an ISO standard allowed plain LF (eventually semi-renamed Newline) as an alternative, and Multics and then Unix spread this idea (and eventually to Apple). > > But another branch of the world, notably the CP/M to MS-DOS to Windows line, continued to use CR/LF. Worse yet, some software came to use CR along (perhaps a natural enough theory when you consider that the ?Return? key on keyboards usually generates the CR character rather than the LF character). > > It is simply impossible to be compatible with everyone on this issue, and we are fooling ourselves if we think that raw string representations can solve this problem in all contexts. Much better, I think, in the absence of consensus to have explicit software gatekeepers at the points where data transitions among these disparate worlds. > From daniel.smith at oracle.com Wed Feb 7 00:42:16 2018 From: daniel.smith at oracle.com (Dan Smith) Date: Tue, 6 Feb 2018 17:42:16 -0700 Subject: Specification for JEP 323: Local-Variable Syntax for Lambda Parameters Message-ID: <575B787D-3350-4414-8DE0-91D9467AFB30@oracle.com> Here's a proposed specification for JEP 323. Mainly a few small tweaks to allow use of 'var' in the lambda syntax. http://cr.openjdk.java.net/~dlsmith/lambda-parameters.html From john.r.rose at oracle.com Tue Feb 13 17:53:18 2018 From: john.r.rose at oracle.com (John Rose) Date: Tue, 13 Feb 2018 09:53:18 -0800 Subject: Raw String Literals Revisions In-Reply-To: References: Message-ID: On Feb 5, 2018, at 8:58 AM, Jim Laskey wrote: > > Based on input received since the reveal of https://bugs.openjdk.java.net/browse/JDK-8196004 > Raw String Literals, we propose making the following changes. 
I have written a draft edit to the JLS to support this feature. Please find it here: http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf Best wishes, -- John From alex.buckley at oracle.com Tue Feb 13 17:58:55 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Tue, 13 Feb 2018 09:58:55 -0800 Subject: Raw string literals and Unicode escapes Message-ID: <5A83275F.80802@oracle.com> I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope. Here is what the JEP has to say: ----- Unicode escapes, in the form \uxxxx, are processed as part of character input prior to interpretation by the lexer. To support the raw string literal as-is requirement, Unicode escape processing is disabled when the lexer encounters an opening backtick and reenabled when encountering a closing backtick. ----- I would like to assume that if the lexer comes across the six tokens \ u 0 0 6 0 then it should interpret them as a Unicode escape representing a backtick _and then continue as if consuming the tokens of a raw string literal_. However, the mention of _an_ opening backtick and _a_ closing backtick gave me pause, given that repeated backticks can serve as the opening delimiter and the closing delimiter. For absolute clarity, let's write out examples to confirm intent: (Jim, please confirm or deny as you see fit!) 1. String s = \u0060`; Illegal. The RHS is lexed as ``; which is disallowed by the grammar. 2. String s = \u0060Hello\u0060; Illegal. The RHS is lexed as `Hello\u0060; and so on for the rest of the compilation unit -- the six tokens \ u 0 0 6 0 are not treated as a Unicode escape since we're lexing a raw string literal. And without a closing delimiter before the end of the compilation unit, a compile-time error occurs. 3a. String s = \u0060Hello`; Legal. The RHS is lexed as `Hello`; which is well formed. 3b.
String s = \u0060\u0060Hello`; Depends! If you take the JEP literally, then just the Unicode escape which serves as the first opening backtick ("_an_ opening backtick") is enough to enter raw-string mode. That makes the code legal: the RHS is lexed as `\u0060Hello`; which is well formed. On the other hand, you might think that we shouldn't enter raw-string mode until the lexer in traditional mode has lexed the opening delimiter fully (i.e. ALL the opening backticks). Then, the code in 3b is illegal, because the opening delimiter (``) and the closing delimiter (`) are not symmetric. I think we should take the JEP literally, so that 3b is legal. And then, some more examples: 4a. String s = \u0060`Hello``; Legal. The RHS is lexed as ``Hello``; which is well formed. 4b. String s = \u0060\u0060Hello``; Illegal. The RHS is lexed as `\u0060Hello``; which is disallowed by the grammar. A raw string literal containing 11 tokens is immediately followed by a ` token and a ; token which are not expected. 4c. String s = \u0060\u0060Hello`\u0060; Depends! If you take the JEP literally, where _a_ closing backtick is enough to re-enable Unicode escape processing, then the RHS is lexed as `\u0060Hello``; which is illegal per 4b. On the other hand, if you think that we shouldn't re-enter traditional mode until the lexer in raw-string mode has lexed the closing delimiter fully (i.e. ALL the closing backticks), then presumably you think analogously about the opening delimiter, so the RHS would be lexed as ``Hello`\u0060; which is illegal per 2 (no closing delimiter `` before the end of the compilation unit). 5. String s = \u0060`Hello`\u0060; I put this here because it looks nice. It hits the same issues as 3b and 4c. 
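The premise behind these examples, that Unicode escapes are translated before the lexer sees the characters, can be checked with ordinary string literals in today's Java. The following is only an illustrative sketch: the class name UnicodeEscapeDemo and its methods are hypothetical, and it uses no raw string syntax, which is still just a proposal in this thread.

```java
// Hypothetical demo class; it uses only standard Java and no raw string syntax.
final class UnicodeEscapeDemo {

    // The escape for U+0060 is translated to a backtick before lexing,
    // so this literal contains two real backtick characters.
    static String viaEscape() {
        return "Hi \u0060Bob\u0060";
    }

    // The same string written with literal backticks.
    static String viaLiteral() {
        return "Hi `Bob`";
    }

    public static void main(String[] args) {
        System.out.println(viaEscape().equals(viaLiteral())); // true
        System.out.println(viaEscape().length());             // 8

        // Escapes are translated everywhere in the source, not only inside
        // literals: the next line declares a local variable named s.
        String \u0073 = "declared through an escape";
        System.out.println(s);
    }
}
```

Because the translation happens everywhere, an escaped backtick could in principle stand in for a delimiter, which is exactly the ambiguity the numbered examples probe.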
Alex From john.r.rose at oracle.com Tue Feb 13 22:11:06 2018 From: john.r.rose at oracle.com (John Rose) Date: Tue, 13 Feb 2018 14:11:06 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A83275F.80802@oracle.com> References: <5A83275F.80802@oracle.com> Message-ID: On Feb 13, 2018, at 9:58 AM, Alex Buckley wrote: > > I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope. For an approach to this see: http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf In short: We define a so-called "preimage" for each token, which is the unambiguously defined sequence of UTF-16 code points that translate to that token via \u substitution and line terminator normalization. For raw strings (only) the preimage of a token is significant. The backticks of a raw string (both opening and closing) are required to be their own preimage (no \u0060 allowed). And the raw string body contents are the preimage of the string token, not the normal token image. I think preimage is the trick we need here, and it settles a number of questions, such as those you raised. All of the tricky examples you raised are uniformly illegal, under the preimage rule for raw-string quotes. -- John From james.laskey at oracle.com Tue Feb 13 22:19:03 2018 From: james.laskey at oracle.com (Jim Laskey) Date: Tue, 13 Feb 2018 18:19:03 -0400 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A83275F.80802@oracle.com> References: <5A83275F.80802@oracle.com> Message-ID: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> 10a. String s = `abc`; 10b. String s = \u0060abc`; As it stands both are legal. This decision has been mostly taken away from us because the lookahead of the previous token has "consumed" the character. There is little hope of finding out from which form the backtick was derived.
Not technically true in javac since we can sift back through the input buffer. Other tools may differ. I'm going to ignore this remark in a second. Choice: do we turn off escape processing on the first open backtick or the last open backtick? It doesn't really matter as long as we do it before consuming the first non-backtick character. Choice: do we turn on escape processing on the first close backtick or the last close backtick? It doesn't matter as long as we do it before consuming the next non-backtick character. If we have an aborted close sequence (too few or too many backticks) then we have to turn it off again. What about embedding \u0060 in a raw string? If we treat them the same as backtick then the user is limited in the ways to express untranslated escapes. Note: We can always convert manually in the scanner by looking ahead for '\', 'u', '0', '0', '6', '0'. That all said, I think we should not allow \u0060 to represent a backtick in a raw string literal, ever. It complicates things unnecessarily and limits what the user can embed in the raw string. So, change the scanner to A) Peek back to make sure the first open backtick was exactly a backtick. B) Turn off Unicode escapes immediately so that only backtick characters can be part of the delimiter. C) Turn on Unicode escapes only after a valid closing delimiter is encountered. Based on this all your examples are illegal. -- Jim > On Feb 13, 2018, at 1:58 PM, Alex Buckley wrote: > > I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope. Here is what the JEP has to say: > > ----- > Unicode escapes, in the form \uxxxx, are processed as part of character input prior to interpretation by the lexer. To support the raw string literal as-is requirement, Unicode escape processing is disabled when the lexer encounters an opening backtick and reenabled when encountering a closing backtick.
> ----- > > I would like to assume that if the lexer comes across the six tokens \ u 0 0 6 0 then it should interpret them as a Unicode escape representing a backtick _and then continue as if consuming the tokens of a raw string literal_. However, the mention of _an_ opening backtick and _a_ closing backtick gave me pause, given that repeated backticks can serve as the opening delimiter and the closing delimiter. For absolute clarity, let's write out examples to confirm intent: (Jim, please confirm or deny as you see fit!) > > 1. String s = \u0060`; > > Illegal. The RHS is lexed as ``; which is disallowed by the grammar. > > 2. String s = \u0060Hello\u0060; > > Illegal. The RHS is lexed as `Hello\u0060; and so on for the rest of the compilation unit -- the six tokens \ u 0 0 6 0 are not treated as a Unicode escape since we're lexing a raw string literal. And without a closing delimiter before the end of the compilation unit, a compile-time error occurs. > > 3a. String s = \u0060Hello`; > > Legal. The RHS is lexed as `Hello`; which is well formed. > > 3b. String s = \u0060\u0060Hello`; > > Depends! If you take the JEP literally, then just the Unicode escape which serves as the first opening backtick ("_an_ opening backtick") is enough to enter raw-string mode. That makes the code legal: the RHS is lexed as `\u0060Hello`; which is well formed. On the other hand, you might think that we shouldn't enter raw-string mode until the lexer in traditional mode has lexed the opening delimiter fully (i.e. ALL the opening backticks). Then, the code in 3b is illegal, because the opening delimiter (``) and the closing delimiter (`) are not symmetric. > > I think we should take the JEP literally, so that 3b is legal. And then, some more examples: > > 4a. String s = \u0060`Hello``; > > Legal. The RHS is lexed as ``Hello``; which is well formed. > > 4b. String s = \u0060\u0060Hello``; > > Illegal. The RHS is lexed as `\u0060Hello``; which is disallowed by the grammar. 
A raw string literal containing 11 tokens is immediately followed by a ` token and a ; token which are not expected. > > 4c. String s = \u0060\u0060Hello`\u0060; > > Depends! If you take the JEP literally, where _a_ closing backtick is enough to re-enable Unicode escape processing, then the RHS is lexed as `\u0060Hello``; which is illegal per 4b. On the other hand, if you think that we shouldn't re-enter traditional mode until the lexer in raw-string mode has lexed the closing delimiter fully (i.e. ALL the closing backticks), then presumably you think analogously about the opening delimiter, so the RHS would be lexed as ``Hello`\u0060; which is illegal per 2 (no closing delimiter `` before the end of the compilation unit). > > 5. String s = \u0060`Hello`\u0060; > > I put this here because it looks nice. It hits the same issues as 3b and 4c. > > Alex From john.r.rose at oracle.com Tue Feb 13 22:27:30 2018 From: john.r.rose at oracle.com (John Rose) Date: Tue, 13 Feb 2018 14:27:30 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> Message-ID: On Feb 13, 2018, at 2:19 PM, Jim Laskey wrote: > > So, change the scanner to > > A) Peek back to make sure the first open backtick was exactly a backtick. > B) Turn off Unicode escapes immediately so that only backtick characters can be part of the delimiter. > C) Turn on Unicode escapes only after a valid closing delimiter is encountered. > > Based on this all your examples are illegal. +1 I think this is also the simplest behavior to explain to users. 
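The three scanner rules (A, B, C) agreed on above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not javac's scanner: the class RawStringScanSketch and the method scanRawString are hypothetical names, the input is taken to be the untranslated source text (so an escaped backtick appears as six separate characters), and runs of backticks whose length differs from the opening delimiter are treated as body content.

```java
// Hypothetical sketch of rules A, B and C; not the javac implementation.
final class RawStringScanSketch {

    /**
     * Returns the body of the raw string literal that starts at src[start].
     * The input is the raw, untranslated source text.
     */
    static String scanRawString(String src, int start) {
        // Rule A: the first opening character must be a literal backtick.
        // An escaped backtick shows up here as a backslash and is rejected.
        if (start >= src.length() || src.charAt(start) != '`') {
            throw new IllegalArgumentException("raw string must open with a literal backtick");
        }

        // Rule B: escapes are off from here on, so only literal backticks
        // can extend the opening delimiter.
        int i = start;
        while (i < src.length() && src.charAt(i) == '`') {
            i++;
        }
        int quoteLength = i - start;
        int bodyStart = i;

        // Rule C: the literal ends at the first run of literal backticks
        // with exactly the opening length. Shorter or longer runs are
        // aborted close sequences and stay part of the body.
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int runStart = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == quoteLength) {
                return src.substring(bodyStart, runStart);
            }
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    public static void main(String[] args) {
        System.out.println(scanRawString("`abc`;", 0));              // abc
        System.out.println(scanRawString("``Hi `Bob` there``;", 0)); // Hi `Bob` there
    }
}
```

Under this sketch an escape sequence inside the body is returned untouched as six characters, and an escaped backtick in delimiter position is rejected, which is consistent with "all your examples are illegal".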
From alex.buckley at oracle.com Wed Feb 14 19:46:23 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Wed, 14 Feb 2018 11:46:23 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: References: <5A83275F.80802@oracle.com> Message-ID: <5A84920F.20201@oracle.com> On 2/13/2018 2:11 PM, John Rose wrote: > On Feb 13, 2018, at 9:58 AM, Alex Buckley > wrote: >> >> I suspect the trickiest part of specifying raw string literals will be >> the lexer's modal behavior for Unicode escapes. As such, I am going to >> put the behavior under the microscope. > > For an approach to this see: > http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf > > In short: We define a so-called "preimage" for each token, > which is the unambiguously defined sequence of UTF-16 > code points that translate to that token via \u substitution > and line terminator normalization. > > For raw strings (only) the preimage of a token is significant. > The backticks of a raw string (both opening and closing) > are required to be their own preimage (no \u0060 allowed). > And the raw string body contents are the preimage of the > string token, not the normal token image. > > I think preimage is the trick we need here, and it settles > a number of questions, such as those you raised. > All of the tricky examples you raised are uniformly illegal, > under the preimage rule for raw-string quotes. I agree that holding on to the preimage of each InputElement (JLS 3.5) is necessary because ` can legitimately appear in some kinds of InputElement as an ordinary InputCharacter (derived from either the RawInputCharacter ` or the UnicodeEscape \u0060): 1. Comment // This Markdown processor treats ` specially. /* This Markdown processor treats \u0060 specially. */ 2. 
Token (and more specifically, StringLiteral) "Hi `Bob`" "Hi \u0060Bob\u0060" Only if the InputElement is a Token, and more specifically a RawStringLiteral, do we need to take the sequence of InputCharacters and LineTerminators that constitute its RawStringBody and replace that sequence with its preimage. I want to say something about the delimiters of the raw string literal now, but I'll do that in response to Jim's mail. Alex From alex.buckley at oracle.com Wed Feb 14 20:24:36 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Wed, 14 Feb 2018 12:24:36 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> Message-ID: <5A849B04.8030503@oracle.com> On 2/13/2018 2:19 PM, Jim Laskey wrote: > 10a. String s = `abc`; 10b. String s = \u0060abc`; >... > So, change the scanner to > > A) Peek back to make sure the first open backtick was exactly a > backtick. B) Turn off Unicode escapes immediately so that only > backtick characters can be part of the delimiter. C) Turn on Unicode > escapes only after a valid closing delimiter is encountered. > > Based on this all your examples are illegal. I am not opposed to saying that a delimiter must be constructed from actual ` characters (that is, the RawInputCharacter ` rather than the UnicodeEscape \u0060). It would be silly if the opening delimiter was \u0060 because the closing delimiter cannot be identical -- that hurts readability. (Clearly the six characters \ u 0 0 6 0 inside a raw string literal get no special processing.) Unfortunately, there is nothing in the lexical grammar that prevents \u0060Hello` or \u0060Hello\u0060 or in fact any of the examples below from being lexed as a RawStringLiteral. The JLS will need a semantic rule to force each RawStringDelimiter to be composed of actual ` characters. As you say, this will make all the examples below illegal. 
There is plenty of precedent for semantic rules ("It is a compile-time error ...") in the interpretation of Literal tokens, so that's fine. In fact, JLS 3.10.4 already has a semantic rule that appears to constrain a delimiter in a CharacterLiteral token: It is a compile-time error for the character following the SingleCharacter or EscapeSequence to be other than a '. although it doesn't mean to force an actual ' character (that is, the RawInputCharacter ' and not the UnicodeEscape \u0027). It means: It is a compile-time error for the character following the SingleCharacter or EscapeSequence to be other than a ' (or the Unicode escape thereof). Alex >> On Feb 13, 2018, at 1:58 PM, Alex Buckley >> wrote: >> >> I suspect the trickiest part of specifying raw string literals will >> be the lexer's modal behavior for Unicode escapes. As such, I am >> going to put the behavior under the microscope. Here is what the >> JEP has to say: >> >> ----- Unicode escapes, in the form \uxxxx, are processed as part of >> character input prior to interpretation by the lexer. To support >> the raw string literal as-is requirement, Unicode escape processing >> is disabled when the lexer encounters an opening backtick and >> reenabled when encountering a closing backtick. ----- >> >> I would like to assume that if the lexer comes across the six >> tokens \ u 0 0 6 0 then it should interpret them as a Unicode >> escape representing a backtick _and then continue as if consuming >> the tokens of a raw string literal_. However, the mention of _an_ >> opening backtick and _a_ closing backtick gave me pause, given that >> repeated backticks can serve as the opening delimiter and the >> closing delimiter. For absolute clarity, let's write out examples >> to confirm intent: (Jim, please confirm or deny as you see fit!) >> >> 1. String s = \u0060`; >> >> Illegal. The RHS is lexed as ``; which is disallowed by the >> grammar. >> >> 2. String s = \u0060Hello\u0060; >> >> Illegal. 
The RHS is lexed as `Hello\u0060; and so on for the rest >> of the compilation unit -- the six tokens \ u 0 0 6 0 are not >> treated as a Unicode escape since we're lexing a raw string >> literal. And without a closing delimiter before the end of the >> compilation unit, a compile-time error occurs. >> >> 3a. String s = \u0060Hello`; >> >> Legal. The RHS is lexed as `Hello`; which is well formed. >> >> 3b. String s = \u0060\u0060Hello`; >> >> Depends! If you take the JEP literally, then just the Unicode >> escape which serves as the first opening backtick ("_an_ opening >> backtick") is enough to enter raw-string mode. That makes the code >> legal: the RHS is lexed as `\u0060Hello`; which is well formed. >> On the other hand, you might think that we shouldn't enter >> raw-string mode until the lexer in traditional mode has lexed the >> opening delimiter fully (i.e. ALL the opening backticks). Then, the >> code in 3b is illegal, because the opening delimiter (``) and the >> closing delimiter (`) are not symmetric. >> >> I think we should take the JEP literally, so that 3b is legal. And >> then, some more examples: >> >> 4a. String s = \u0060`Hello``; >> >> Legal. The RHS is lexed as ``Hello``; which is well formed. >> >> 4b. String s = \u0060\u0060Hello``; >> >> Illegal. The RHS is lexed as `\u0060Hello``; which is disallowed >> by the grammar. A raw string literal containing 11 tokens is >> immediately followed by a ` token and a ; token which are not >> expected. >> >> 4c. String s = \u0060\u0060Hello`\u0060; >> >> Depends! If you take the JEP literally, where _a_ closing backtick >> is enough to re-enable Unicode escape processing, then the RHS is >> lexed as `\u0060Hello``; which is illegal per 4b. On the other >> hand, if you think that we shouldn't re-enter traditional mode >> until the lexer in raw-string mode has lexed the closing delimiter >> fully (i.e. 
ALL the closing backticks), then presumably you think >> analogously about the opening delimiter, so the RHS would be lexed >> as ``Hello`\u0060; which is illegal per 2 (no closing delimiter >> `` before the end of the compilation unit). >> >> 5. String s = \u0060`Hello`\u0060; >> >> I put this here because it looks nice. It hits the same issues as >> 3b and 4c. >> >> Alex > From john.r.rose at oracle.com Wed Feb 14 20:42:01 2018 From: john.r.rose at oracle.com (John Rose) Date: Wed, 14 Feb 2018 12:42:01 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A849B04.8030503@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> Message-ID: <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> On Feb 14, 2018, at 12:24 PM, Alex Buckley wrote: > > There is plenty of precedent for semantic rules In my draft version this is done with "where" clauses on the grammar rules: > > RawStringLiteral: > > RawQuote RawStringBody RawQuote > where the two raw-quotes are constrained to be identical > > RawQuote: > ` {`} > where the preimage is constrained to be unescaped From alex.buckley at oracle.com Wed Feb 14 21:43:27 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Wed, 14 Feb 2018 13:43:27 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> Message-ID: <5A84AD7F.5030803@oracle.com> On 2/14/2018 12:42 PM, John Rose wrote: > On Feb 14, 2018, at 12:24 PM, Alex Buckley > wrote: >> >> There is plenty of precedent for semantic rules > > In my draft version this is done with "where" clauses on the > grammar rules: > >> RawStringLiteral: >> >> RawQuote RawStringBody RawQuote >> where the two raw-quotes are constrained to be identical >> >> RawQuote: >> ` 
{`} >> where the preimage is constrained to be unescaped We're dancing on the head of a pin now, but as a matter of specificational style I'm wary of too many rules in the grammar itself, especially a context-sensitive rule like raw-quotes-must-balance. JLS 3.10.5 is a good specimen to study: there is a context-free rule in the grammar: StringCharacter: InputCharacter but not " or \ and a context-sensitive semantic rule: It is a compile-time error for a line terminator to appear after the opening " and before the closing matching ". Strictly speaking, the semantic rule is unnecessary because InputCharacter is DEFINED to exclude the CR and LF line terminators! But the semantic rule makes the intent very very clear. Writing rules in this form also prevents the spec from becoming a soup of statements that are more than just observations but less than full-throated assertions. Anyway, the draft was very useful, thanks! Alex From john.r.rose at oracle.com Wed Feb 14 21:48:20 2018 From: john.r.rose at oracle.com (John Rose) Date: Wed, 14 Feb 2018 13:48:20 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A84AD7F.5030803@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> Message-ID: On Feb 14, 2018, at 1:43 PM, Alex Buckley wrote: > > Strictly speaking, the semantic rule is unnecessary because InputCharacter is DEFINED to exclude the CR and LF line terminators! But the semantic rule makes the intent very very clear. Writing rules in this form also prevents the spec from becoming a soup of statements that are more than just observations but less than full-throated assertions. That makes sense. > Anyway, the draft was very useful, thanks! Glad to hear it! -- John P.S. I posted another version that takes a slightly different tack on the restriction of "cannot begin with a backquote".
It basically lifts the whole design of Markdown code quotes. http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v5.pdf From alex.buckley at oracle.com Wed Feb 14 22:42:50 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Wed, 14 Feb 2018 14:42:50 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> Message-ID: <5A84BB6A.60102@oracle.com> On 2/14/2018 1:48 PM, John Rose wrote: > P.S. I posted another version that takes a slightly different > tack on the restriction of "cannot begin with a backquote". > It basically lifts the whole design of Markdown code quotes. > > http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v5.pdf The inclusion of RawSP means that you are fully delivering on your trailer from Jan 30: "Spoiler: I think I can prove that Markdown code quoting is appropriately minimal in its design, in a way Jim's is not." Let me first recognize the power of RawSP in lifting TWO restrictions: cannot begin with a backtick, and cannot end with a backtick: String s = ``Hi `Bob```; // Error, unbalanced delimiters String s = ``Hi `Bob`` + "`"; // OK String s = `` Hi `Bob` ``; // OK with RawSP trick However, since the JEP's goal is to allow copy-paste of arbitrary text without interpretation, I think the RawSP trick of assigning meaning to whitespace is out of place. To most people, the raw string literal: ` and ` denotes a perfectly good five-character string that will probably be inserted between two other strings. Explaining that, no, it's really a three-character string will not be popular. Also, the inclusion of RawSP makes the lexing of RawStringLiteral ambiguous, since RawStringBody allows opening and closing whitespace. 
No doubt this can be fixed with rules involving "If the first character after RawSP is a backtick ...", but now being like Markdown is getting expensive. Alex From john.r.rose at oracle.com Wed Feb 14 22:46:54 2018 From: john.r.rose at oracle.com (John Rose) Date: Wed, 14 Feb 2018 14:46:54 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A84BB6A.60102@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> Message-ID: <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> On Feb 14, 2018, at 2:42 PM, Alex Buckley wrote: > > Also, the inclusion of RawSP makes the lexing of RawStringLiteral ambiguous, since RawStringBody allows opening and closing whitespace. No doubt this can be fixed with rules involving "If the first character after RawSP is a backtick ...", but now being like Markdown is getting expensive. These matters are already covered in the draft, under the blanket provision that the RSB cannot contain a close-quote sequence. So I don't think I swept anything under the covers there. A similar effect could be gotten by replacing RawSP with any other raw character (fixed by the JLS), such as period ``.asdf.``, double-quote ``"asdf"``, etc. From brian.goetz at oracle.com Sun Feb 18 21:44:31 2018 From: brian.goetz at oracle.com (Brian Goetz) Date: Sun, 18 Feb 2018 13:44:31 -0800 Subject: Updated records doc Message-ID: <8B9F27FE-7A72-4050-B945-A85C78A610B3@oracle.com> I've updated the records doc at: http://cr.openjdk.java.net/~briangoetz/amber/datum.html to reflect comments and discussion to date.
From brian.goetz at oracle.com Fri Feb 23 21:00:31 2018 From: brian.goetz at oracle.com (Brian Goetz) Date: Fri, 23 Feb 2018 16:00:31 -0500 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A84BB6A.60102@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> Message-ID: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> > > However, since the JEP's goal is to allow copy-paste of arbitrary text > without interpretation, I think the RawSP trick of assigning meaning > to whitespace is out of place. To most people, the raw string literal: > > ` and ` > > denotes a perfectly good five-character string that will probably be > inserted between two other strings. Explaining that, no, it's really a > three-character string will not be popular. +100. The RawSP trick is clever, but too much so. There are ample simpler approaches for beginning/ending with BT: String s = BACKTICK + `a raw string` + BACKTICK; String s = `` `a raw string` ``.trim(); These move the cognitive load on the user to the corner case, rather than landing it on the general case. From guy.steele at oracle.com Fri Feb 23 21:07:53 2018 From: guy.steele at oracle.com (Guy Steele) Date: Fri, 23 Feb 2018 16:07:53 -0500 Subject: Raw string literals and Unicode escapes In-Reply-To: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> Message-ID: +200. Or even String s = "`" + `a raw string` + "`"; It's perfectly okay to use both kinds of string in one expression.
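The concatenation and trim() approaches suggested above can be compared in today's Java, with ordinary string literals standing in for the proposed backtick syntax. This is only a sketch under that assumption: the class BacktickEdgeSketch and its methods are hypothetical names. It also shows one property of String.trim() worth keeping in mind, namely that it strips every leading and trailing character at or below U+0020, not just a single guard space.

```java
// Hypothetical sketch; ordinary literals stand in for raw string literals.
final class BacktickEdgeSketch {

    static final String BACKTICK = "`";

    // Concatenation: wrap an arbitrary body in one backtick on each side.
    static String wrapInBackticks(String body) {
        return BACKTICK + body + BACKTICK;
    }

    // The trim() approach: the body is written with one guard space on each
    // side, and trim() strips the leading and trailing whitespace.
    static String stripGuards(String padded) {
        return padded.trim();
    }

    public static void main(String[] args) {
        String viaConcat = wrapInBackticks("a raw string");
        String viaTrim = stripGuards(" `a raw string` ");
        System.out.println(viaConcat.equals(viaTrim)); // true

        // trim() does not stop after one space: both trailing spaces are
        // removed here, so a body meant to end in a space loses it.
        System.out.println("[" + stripGuards(" `xxx  ") + "]"); // [`xxx]
    }
}
```

Concatenation leaves the body untouched, so it also works when the body itself begins or ends with whitespace.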
> On Feb 23, 2018, at 4:00 PM, Brian Goetz wrote: > > > >> >> However, since the JEP's goal is to allow copy-paste of arbitrary text without interpretation, I think the RawSP trick of assigning meaning to whitespace is out of place. To most people, the raw string literal: >> >> ` and ` >> >> denotes a perfectly good five-character string that will probably be inserted between two other strings. Explaining that, no, it's really a three-character string will not be popular. > > +100. The RawSP trick is clever, but too much so. There are ample simpler approaches for beginning/ending with BT: > > String s = BACKTICK + `a raw string` + BACKTICK; > String s = `` `a raw string` ``.trim(); > > These move the cognitive load on the user to the corner case, rather than landing it on the general case. > > From john.r.rose at oracle.com Sat Feb 24 06:34:35 2018 From: john.r.rose at oracle.com (John Rose) Date: Fri, 23 Feb 2018 22:34:35 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> Message-ID: <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com> On Feb 23, 2018, at 1:00 PM, Brian Goetz wrote: > >> >> >> However, since the JEP's goal is to allow copy-paste of arbitrary text without interpretation, I think the RawSP trick of assigning meaning to whitespace is out of place. To most people, the raw string literal: >> >> ` and ` >> >> denotes a perfectly good five-character string that will probably be inserted between two other strings. Explaining that, no, it's really a three-character string will not be popular. > > +100. The RawSP trick is clever, but too much so. 
There are ample simpler approaches for beginning/ending with BT: > > String s = BACKTICK + `a raw string` + BACKTICK; > String s = `` `a raw string` ``.trim(); > > These move the cognitive load on the user to the corner case, rather than landing it on the general case. Note that the "trim" trick moves the problem elsewhere: It can remove more than just the one extra space, so the string "`xxx " needs a different technique. A bunch of only-partially-applicable tricks like that is also a kind of cognitive load, isn't it? Here's one I also kind of like: If the string has no embedded newlines, then do ``|`a raw string`|``.trimMargins(). The + operator is a more robust solution, although it requires parentheses also if used with a postfix method of any sort. Maybe better is trimLines, where a newline is the "guard" to be stripped, but of which at most only one is stripped. I suppose reasonable people might differ on whether a fixed aperiodic quote (like "``` " or "```|") is more surprising than a grab bag of methods for fixing edge effects. But, I do agree that libraries can fix such edge effects. And, I am very happy that, in lengthening the opening and closing quotes, we are making it possible to paste an arbitrary sequence of Unicode without having to hunt around inside the sequence to find stuff that needs extra quoting, as is the case with today's strings. The thing we are discussing here, the need to give special handling to leading and trailing backticks is (crucially) an edge effect (only at the two ends of the string) and not a bulk effect (something that needs attention throughout the string). That means we have won on the key issue, and are just disagreeing about how to collect our winnings. (Yes you do have to look at the string bulk, but only to choose a "strong enough fence" to enclose that bulk. And then adjust the fence to handle edge effects from backticks. The amount of escaping is O(1) not O(N).) --
John From brian.goetz at oracle.com Sat Feb 24 15:28:22 2018 From: brian.goetz at oracle.com (Brian Goetz) Date: Sat, 24 Feb 2018 10:28:22 -0500 Subject: Raw string literals and Unicode escapes In-Reply-To: <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <7fd3e3a0-997b-c6c7-ee59-473e84366396@oracle.com> <7CE813FC-53AC-4FB1-A5E6-D892E2BF702E@oracle.com> Message-ID: <96a9a046-b63f-6d8c-797b-7ca1f0535583@oracle.com> > And, I am very happy that, in lengthening the opening and > closing quotes, we are making it possible to paste an arbitrary > sequence of unicode without having to hunt around inside > the sequence to find stuff that needs extra quoting, as is > the case with today's strings. That's the high order bit here; paste an arbitrary snippet. > The thing we are discussing here, the need to give special > handling to leading and trailing backticks is (crucially) > an edge effect (only at the two ends of the string) and > not a bulk effect (something that needs attention throughout > the string). That means we have won on the key issue, > and are just disagreeing about how to collect our winnings. I already collected my winnings, and am now spending them in the bar.
Join me for a drink :) From forax at univ-mlv.fr Sun Feb 25 12:19:05 2018 From: forax at univ-mlv.fr (Remi Forax) Date: Sun, 25 Feb 2018 13:19:05 +0100 (CET) Subject: Raw string literals and Unicode escapes In-Reply-To: <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> Message-ID: <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> I'm late to the game, but why not use the same system as Perl, PHP, and Ruby to solve the LTS [1]? That is, you have a sequence that says this is the start of a raw string (%Q, qq, m), then a character (from a predefined list), then the raw string, and at the end of the raw string the same character as at the beginning (or its mirror). For example, with 'raw' as the prefix for a raw string: raw`this is a raw string` raw'this is another raw string' raw[yet another raw string] Rémi [1] https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome > From: "John Rose" > To: "Alex Buckley" > Cc: "amber-spec-experts" > Sent: Wednesday, February 14, 2018 23:46:54 > Subject: Re: Raw string literals and Unicode escapes > On Feb 14, 2018, at 2:42 PM, Alex Buckley <alex.buckley at oracle.com> wrote: >> Also, the inclusion of RawSP makes the lexing of RawStringLiteral ambiguous, >> since RawStringBody allows opening and closing whitespace. No doubt this can be >> fixed with rules involving "If the first character after RawSP is a backtick >> ...", but now being like Markdown is getting expensive. > These matters are already covered in the draft, under the blanket > provision that the RSB cannot contain a close-quote sequence. > So I don't think I swept anything under the covers there.
> A similar effect could be gotten by replacing RawSP with any > other raw character (fixed by the JLS), such as period ``.asdf.``, > double-quote ``"asdf"``, etc. From alex.buckley at oracle.com Mon Feb 26 18:43:46 2018 From: alex.buckley at oracle.com (Alex Buckley) Date: Mon, 26 Feb 2018 10:43:46 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> Message-ID: <5A945562.8080108@oracle.com> On 2/25/2018 4:19 AM, Remi Forax wrote: > I'm late in the game but why not using the same system as Perl, PHP, > Ruby to solve the Lts [1], i.e > you have a sequence that says this is the starts of a raw string (%Q, > qq, m) then a character (in a predefined list), the raw string and at > the end of the raw string the same character as at the beginning (or its > mirror). > > By example, this 'raw' as prefix for a raw string > raw`this is a raw string` > raw'this is another raw string' > raw[yet another raw string] See "Choice of Delimiters" in the "Alternatives" section of the JEP. 
Alex From john.r.rose at oracle.com Mon Feb 26 20:17:13 2018 From: john.r.rose at oracle.com (John Rose) Date: Mon, 26 Feb 2018 12:17:13 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <5A945562.8080108@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> Message-ID: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> On Feb 26, 2018, at 10:43 AM, Alex Buckley wrote: > > On 2/25/2018 4:19 AM, Remi Forax wrote: >> I'm late in the game but why not using the same system as Perl, PHP, >> Ruby to solve the Lts [1], i.e >> you have a sequence that says this is the starts of a raw string (%Q, >> qq, m) then a character (in a predefined list), the raw string and at >> the end of the raw string the same character as at the beginning (or its >> mirror). >> >> By example, this 'raw' as prefix for a raw string >> raw`this is a raw string` >> raw'this is another raw string' >> raw[yet another raw string] > > See "Choice of Delimiters" in the "Alternatives" section of the JEP. The JEP doesn't clearly call out the goal of *no* escapes in the bulk of the raw string, but that requirement (which we have adopted) affects the choice of quotes in a decisive manner. Let me try to lay out the "string physics" that underly this decision. *Any* single-character end-quote will have a significant probability of showing up inside the bulk of a (randomly selected) raw string. How significant? Well, let's say conservatively that raw strings can have all possible characters, but the end-quote sequence only shows up one out of a hundred times, per character position, in raw strings. 
If you are using a series of ten-character raw strings (to say nothing of bigger ones), you have about a 10% chance for any given raw string to contain an inconvenient end-quote. That percentage is significant, especially given that in some cases strings will be longer and quote characters will be more common, both factors increasing the failure rate beyond 10%. But even a 0.1% failure rate is noticeable to users, making a feature feel unreliable. This generalizes to any fixed multi-character end-quote, with a reduction of probability exponential in the length of the end-quote, but still with a non-zero probability, of occurring in the bulk of a randomly selected string. A two-character end-quote might have a probability of 10^-4, and that means you have a more modest but still significant chance of failure of 10% across a suite of 100 random 10-character strings, or for one random 1000-character string. Any *finite choice* of end-quotes has the same problem, with a non-zero probability that decreases (but does not vanish) with the number of available end-quotes. The only way to break out of the box is to allow the user an unlimited range of successively "stronger" end-quotes (i.e., less likely ones). (Randomly selected raw strings are easy to model, although the numbers used above are an approximation to a binomial distribution. In fact, though, strings which show up non-randomly in real code are *more* likely to mention end-quotes, since their contents are somehow correlated to the enclosing language.) You can easily demonstrate this issue by nesting Java code which uses raw quotes inside of a containing raw quote. An easy first test of a proposed quoting mechanism is, "will it nest?" If not, then the quoting mechanism does not meet a key requirement for raw quotes. This key requirement is unconstrained pasting *without* fixups (escape sequences embedded in the bulk of the quote). 
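The failure rates quoted in this message can be checked with a few lines of arithmetic. This is a rough sketch under the same simplifying assumption the message makes, namely that each character position independently starts an end-quote with a fixed probability; the class EndQuoteOdds and the method clashProbability are hypothetical names used only for illustration.

```java
final class EndQuoteOdds {

    /**
     * Probability that at least one of `positions` independent positions
     * begins an end-quote, given a per-position probability.
     */
    static double clashProbability(double perPosition, int positions) {
        return 1.0 - Math.pow(1.0 - perPosition, positions);
    }

    public static void main(String[] args) {
        // Single-character end-quote seen at 1 position in 100,
        // in one 10-character string: roughly 0.096, i.e. "about 10%".
        System.out.println(clashProbability(1e-2, 10));

        // Two-character end-quote at 10^-4 per position, over 1000 characters
        // (one long string, or 100 strings of 10 characters): roughly 0.095.
        System.out.println(clashProbability(1e-4, 1000));

        // Back to the single-character end-quote, but with a 1000-character
        // body: a clash is close to certain.
        System.out.println(clashProbability(1e-2, 1000));
    }
}
```

Lengthening a fixed end-quote lowers the per-position probability, but any fixed probability above zero is eventually overwhelmed by a long enough body, which is why the argument asks for end-quotes of unlimited length.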
Anything else, with some epsilon probability of requiring escapes, is not truly raw, just "mostly raw". In the case you propose, Remi, the probability of having an un-quotable bulk string is quite high, since all of the end-quotes are single characters. Only a convention with an end-quote of arbitrary length is strong enough to "fence in" arbitrary raw strings. The simplest possible such convention is to allow replication of a single character to serve as the end-quote. This decision toward simplicity influences other features in Java raw strings, including the decision to use a new character and to disallow certain edge cases, notably null strings. -- John P.S. I expect IDE vendors will quickly supply useful "stretchy quotes" which will resize themselves to contain whatever users throw into the raw string body. At that point backticks will feel like magic tokens that never accidentally match raw string bodies. From maurizio.cimadamore at oracle.com Mon Feb 26 21:29:31 2018 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Mon, 26 Feb 2018 21:29:31 +0000 Subject: Raw string literals and Unicode escapes In-Reply-To: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> Message-ID: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> On 26/02/18 20:17, John Rose wrote: > Any *finite choice* of end-quotes has the same problem, with > a non-zero probability that decreases (but does not vanish) > with the number of available end-quotes.
The only way to > break out of the box is to allow the user an unlimited range > of successively "stronger" end-quotes (i.e., less likely ones). In reality there is a 'finite' upper bound for this length, which is given by 2^16 / 2 = 2^15. That's the maximum delimiter size you could encode in a Java String which you can also symmetrically close - and it's an edge case, as it will contain the empty string. So, yes, on paper, I agree with the argument; in practice, I guess I'd be more in favor of limiting the number of repetitions - I wouldn't like to open the door to puzzlers: `````````````````````````````````````````````````````````````````````````hello````````````````````````````````````````````````````````````````````````` (which might leave some ASCII art lovers a bit unhappy :-)) I think limiting to 8 or some other reasonable small number will probably reduce the clash probability enough? And, even if it's not enough, I guess we'd still be left with the question of whether a long (possibly unbounded?) escaping sequence is something we'd like to see in Java. Maurizio From james.laskey at oracle.com Mon Feb 26 21:54:01 2018 From: james.laskey at oracle.com (Jim Laskey) Date: Mon, 26 Feb 2018 17:54:01 -0400 Subject: Raw string literals and Unicode escapes In-Reply-To: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> Message-ID: Why introduce an artificial limit? Identifiers don't have a limit. 3.8.
Identifiers An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter. ? Jim > On Feb 26, 2018, at 5:29 PM, Maurizio Cimadamore wrote: > > > > On 26/02/18 20:17, John Rose wrote: >> Any *finite choice* of end-quotes has the same problem, with >> a non-zero probability that decreases (but does not vanish) >> with the number of available end-quotes. The only way to >> break out of the box is to allow the user an unlimited range >> of successively "stronger" end-quotes (i.e., less likely ones). > In reality there is a 'finite' upper bound for this length, which is given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you could encode in a Java String which you can also symmetrically close - and it's an edge case, as it will contain the empty string. > > So, yes, on paper, I agree with the argument, in practice, I guess I'd me more in favor of limiting the number of repetitions - I wouldn't like to open the door to puzzlers: > > `````````````````````````````````````````````````````````````````````````hello````````````````````````````````````````````````````````````````````````` > > (which might leave some Ascii art lovers a bit unhappy :-)) > > I think limiting to 8 or some other reasonable small number will probably reduce the clash probability enough? And, even if it's not enough, I guess we'd still be left with the question if a long (possibly unbounded?) escaping sequence is something we'd like to see in Java. 
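The practical consequence of a cap such as "8" can be illustrated with a small check. This sketch assumes the draft rule that a raw string closes only at a backtick run of exactly the same length as the opening run, so that runs of any other length are part of the body; that rule is an assumption here, as are the hypothetical names FenceCap and smallestFence. Leading and trailing backticks in the body are ignored for simplicity.

```java
import java.util.BitSet;

final class FenceCap {

    /**
     * Returns the smallest fence length n in 1..cap such that body contains
     * no maximal backtick run of exactly n, or -1 if every length up to
     * cap is already taken by some run in the body.
     */
    static int smallestFence(String body, int cap) {
        BitSet seenRunLengths = new BitSet();
        int run = 0;
        // Iterate one step past the end so a run ending the body is recorded.
        for (int i = 0; i <= body.length(); i++) {
            if (i < body.length() && body.charAt(i) == '`') {
                run++;
            } else {
                if (run > 0) {
                    seenRunLengths.set(run);
                }
                run = 0;
            }
        }
        int n = seenRunLengths.nextClearBit(1);
        return n <= cap ? n : -1;
    }

    public static void main(String[] args) {
        // Build a body containing backtick runs of every length from 1 to 8.
        StringBuilder sb = new StringBuilder("x");
        for (int len = 1; len <= 8; len++) {
            for (int k = 0; k < len; k++) {
                sb.append('`');
            }
            sb.append('x');
        }
        String body = sb.toString();

        System.out.println(smallestFence("hello", 8)); // plain text: a fence of 1 works
        System.out.println(smallestFence(body, 8));    // no fence of length <= 8 works
        System.out.println(smallestFence(body, 9));    // one more backtick is enough
    }
}
```

Such a body is contrived, but it shows the trade-off being debated: any fixed cap leaves some string that cannot be written as a raw literal, while an unbounded fence never does.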
> > Maurizio From john.r.rose at oracle.com Mon Feb 26 22:01:51 2018 From: john.r.rose at oracle.com (John Rose) Date: Mon, 26 Feb 2018 14:01:51 -0800 Subject: Raw string literals and Unicode escapes In-Reply-To: <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> Message-ID: <0FC7FA01-A7CB-4D67-80BE-A4B19D0306F6@oracle.com> On Feb 26, 2018, at 1:29 PM, Maurizio Cimadamore wrote: > > On 26/02/18 20:17, John Rose wrote: >> Any *finite choice* of end-quotes has the same problem, with >> a non-zero probability that decreases (but does not vanish) >> with the number of available end-quotes. The only way to >> break out of the box is to allow the user an unlimited range >> of successively "stronger" end-quotes (i.e., less likely ones). > In reality there is a 'finite' upper bound for this length, which is given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you could encode in a Java String which you can also symmetrically close - and it's an edge case, as it will contain the empty string. That's only true for constant pool strings; there is no such defined limit in the JLS. And condy lifts the limit in the constant pool. This is the point at which we need to note that there is a soft upper bound to raw string literals, which is the amount of stuff you are willing to paste into your Java source file before it isn't Java source any more. Probably a half-page of code will be the usual jumbo size, with occasional multi-page outliers. That's maybe 1kb (30 lines of 30 chars). 
Still, that is plenty long enough to encounter lots of odd corner-case end quotes. > So, yes, on paper, I agree with the argument, in practice, I guess I'd me more in favor of limiting the number of repetitions Pick your puzzler. I'd rather not leave a single string un-representable; that would lead to a different kind of puzzler, *as well as* a hard limitation. Test question: Does the JLS have a maximum length for identifiers? It does not, even though long identifiers hypothetically lead to "puzzlers" involving abuses of the notation. Ridiculously long or difficult-to-read identifiers are naturally avoided in practice, and the same will be true for ridiculously long end-quotes. http://merriam-webster.com/dictionary/abusus+non+tollit+usum -- John From maurizio.cimadamore at oracle.com Mon Feb 26 22:57:37 2018 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Mon, 26 Feb 2018 22:57:37 +0000 Subject: Raw string literals and Unicode escapes In-Reply-To: References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> Message-ID: <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com> Of course - delimiters are not part of the string length - I see now why you can have (in theory) an unbounded prefix/suffix. Personally, I find the argument - "because you can have unlimited-length identifiers" - not a great fit. From a lexer writer's perspective, I can see why identifiers are used as a comparison - after all, an identifier is a token whose size is unbounded. But I find it hard to ignore that the roles played by identifiers and delimiters in the grammar are quite different.
At least there were other cases were we found different trade off between expressiveness and practicality - see Project Coin's use of repeated underscores in binary literals (subsequently banned): private static final int BOND = ?0000_____________0000________0000000000000000__000000000000000000+ ?00000000_________00000000______000000000000000__0000000000000000000+ ? 000____000_______000____000_____000_______0000__00______0+ ?000______000_____000______000_____________0000___00______0+ 0000______0000___0000______0000___________0000_____0_____0+ 0000______0000___0000______0000__________0000___________0+ 0000______0000___0000______0000_________0000__0000000000+ 0000______0000___0000______0000________0000+ ?000______000_____000______000________0000+ ? 000____000_______000____000_______00000+ ? ?00000000_________00000000_______0000000+ ? ? ?0000_____________0000________000000007; (Example courtesy of Joshua Bloch) Maurizio On 26/02/18 21:54, Jim Laskey wrote: > Why introduce an artificial limit? Identifiers don?t have a > limit.?3.8. Identifiers?An?identifier?is an *unlimited-length > sequence* of?Java letters?and?Java digits, the first of which must be > a?Java letter. > > ? Jim > >> On Feb 26, 2018, at 5:29 PM, Maurizio Cimadamore >> > > wrote: >> >> >> >> On 26/02/18 20:17, John Rose wrote: >>> Any*finite choice* of end-quotes has the same problem, with >>> a non-zero probability that decreases (but does not vanish) >>> with the number of available end-quotes. The only way to >>> break out of the box is to allow the user an unlimited range >>> of successively "stronger" end-quotes (i.e., less likely ones). >> In reality there is a 'finite' upper bound for this length, which is >> given by 2^16 /2 = 2 ^ 15. That's the maximum delimiter size you >> could encode in a Java String which you can also symmetrically close >> - and it's an edge case, as it will contain the empty string. 
>> >> So, yes, on paper, I agree with the argument, in practice, I guess >> I'd me more in favor of limiting the number of repetitions - I >> wouldn't like to open the door to puzzlers: >> >> `````````````````````````````````````````````````````````````````````````hello````````````````````````````````````````````````````````````````````````` >> >> (which might leave some Ascii art lovers a bit unhappy :-)) >> >> I think limiting to 8 or some other reasonable small number will >> probably reduce the clash probability enough? And, even if it's not >> enough, I guess we'd still be left with the question if a long >> (possibly unbounded?) escaping sequence is something we'd like to see >> in Java. >> >> Maurizio > From maurizio.cimadamore at oracle.com Mon Feb 26 23:02:10 2018 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Mon, 26 Feb 2018 23:02:10 +0000 Subject: Raw string literals and Unicode escapes In-Reply-To: <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com> References: <5A83275F.80802@oracle.com> <5BBEAFEB-D3A9-494D-A01B-5AC56E9B4A2F@oracle.com> <5A849B04.8030503@oracle.com> <9B8AFF30-3CB7-4A49-8513-C15B68E0D697@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <14db28db-0f9d-5c31-3a92-86cc96a604f8@oracle.com> <551f3767-d7b2-6c52-b7a8-dc374c749583@oracle.com> Message-ID: I stand corrected - repeated underscores are allowed - but Josh's example reminded me of the state of affair with raw strings. 
Maurizio On 26/02/18 22:57, Maurizio Cimadamore wrote: > > At least there were other cases were we found different trade off > between expressiveness and practicality - see Project Coin's use of > repeated underscores in binary literals (subsequently banned): > From forax at univ-mlv.fr Tue Feb 27 08:16:04 2018 From: forax at univ-mlv.fr (forax at univ-mlv.fr) Date: Tue, 27 Feb 2018 09:16:04 +0100 (CET) Subject: Raw string literals and Unicode escapes In-Reply-To: <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> Message-ID: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> Hi John, see below. ----- Mail original ----- > De: "John Rose" > ?: "Remi Forax" > Cc: "amber-spec-experts" > Envoy?: Lundi 26 F?vrier 2018 21:17:13 > Objet: Re: Raw string literals and Unicode escapes > On Feb 26, 2018, at 10:43 AM, Alex Buckley wrote: >> >> On 2/25/2018 4:19 AM, Remi Forax wrote: >>> I'm late in the game but why not using the same system as Perl, PHP, >>> Ruby to solve the Lts [1], i.e >>> you have a sequence that says this is the starts of a raw string (%Q, >>> qq, m) then a character (in a predefined list), the raw string and at >>> the end of the raw string the same character as at the beginning (or its >>> mirror). >>> >>> By example, this 'raw' as prefix for a raw string >>> raw`this is a raw string` >>> raw'this is another raw string' >>> raw[yet another raw string] >> >> See "Choice of Delimiters" in the "Alternatives" section of the JEP. > > The JEP doesn't clearly call out the goal of *no* escapes in the bulk > of the raw string, but that requirement (which we have adopted) > affects the choice of quotes in a decisive manner. 
Let me try to > lay out the "string physics" that underly this decision. > > *Any* single-character end-quote will have a significant probability > of showing up inside the bulk of a (randomly selected) raw string. > > How significant? Well, let's say conservatively that raw strings > can have all possible characters, but the end-quote sequence > only shows up one out of a hundred times, per character position, > in raw strings. If you are using a series of ten-character raw > strings (to say nothing of bigger ones), you have about a 10% > chance for any given raw string to contain an inconvenient > end-quote. > > That percentage is significant, especially given that in some > cases strings will be longer and quote characters will be more > common, both factors increasing the failure rate beyond 10%. > But even a 0.1% failure rate is noticeable to users, making a > feature feel unreliable. > > This generalizes to any fixed multi-character end-quote, with a > reduction of probability exponential in the length of the end-quote, > but still with a non-zero probability, of occurring in the bulk of > a randomly selected string. A two-character end-quote might > have a probability of 10^-4, and that means you have a more > modest but still significant chance of failure of 10% across a > suite of 100 random 10-character strings, or for one random > 1000-character string. > > Any *finite choice* of end-quotes has the same problem, with > a non-zero probability that decreases (but does not vanish) > with the number of available end-quotes. The only way to > break out of the box is to allow the user an unlimited range > of successively "stronger" end-quotes (i.e., less likely ones). > > (Randomly selected raw strings are easy to model, although > the numbers used above are an approximation to a binomial > distribution. 
In fact, though, strings which show up non-randomly > in real code are *more* likely to mention end-quotes, since their > contents are somehow correlated to the enclosing language.) > > You can easily demonstrate this issue by nesting Java code > which uses raw quotes inside of a containing raw quote. An > easy first test of a proposed quoting mechanism is, "will it > nest?" If not, then the quoting mechanism does not meet > a key requirement for raw quotes. > > This key requirement is unconstrained pasting *without* fixups > (escape sequences embedded in the bulk of the quote). > Anything else, with some epsilon probability of requiring escapes, > is not truly raw, just "mostly raw". > > In the case you propose, Remi, the probability of having an > un-quotable bulk string is quite high, since all of the end-quotes > are single characters. > > Only a convention with an end-quote of arbitrary length is strong > enough to "fence in" arbitrary raw strings. The simplest possible > such convention is to allow replication of a single character to > serve as the end-quote. This decision toward simplicity > influences other features in Java raw strings, including the > decision to use a new character and to disallow certain > edge cases, notably null strings. > > ? John I understand your point but i disagree with your analysis. My own experience is that raw strings follow what i call the 'embedded languages' hypothesis, i.e. for any application, there is a length such all raw strings with a length greater than this length contain only embedded programming languages. So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need to a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character, is enough. > > P.S. 
I expect IDE vendors will quickly supply useful "stretchy quotes" > which will resize themselves to contain whatever users throw into > the raw string body. At that point backticks will feel like magic tokens > that never accidentally match raw string bodies. regards, R?mi From maurizio.cimadamore at oracle.com Tue Feb 27 10:55:53 2018 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Tue, 27 Feb 2018 10:55:53 +0000 Subject: Raw string literals and Unicode escapes In-Reply-To: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> Message-ID: <706e24f3-713d-4494-b6a9-ce7d9a591a00@oracle.com> On 27/02/18 08:16, forax at univ-mlv.fr wrote: > Hi John, > see below. > > ----- Mail original ----- >> De: "John Rose" >> ?: "Remi Forax" >> Cc: "amber-spec-experts" >> Envoy?: Lundi 26 F?vrier 2018 21:17:13 >> Objet: Re: Raw string literals and Unicode escapes >> On Feb 26, 2018, at 10:43 AM, Alex Buckley wrote: >>> On 2/25/2018 4:19 AM, Remi Forax wrote: >>>> I'm late in the game but why not using the same system as Perl, PHP, >>>> Ruby to solve the Lts [1], i.e >>>> you have a sequence that says this is the starts of a raw string (%Q, >>>> qq, m) then a character (in a predefined list), the raw string and at >>>> the end of the raw string the same character as at the beginning (or its >>>> mirror). >>>> >>>> By example, this 'raw' as prefix for a raw string >>>> raw`this is a raw string` >>>> raw'this is another raw string' >>>> raw[yet another raw string] >>> See "Choice of Delimiters" in the "Alternatives" section of the JEP. 
>> The JEP doesn't clearly call out the goal of *no* escapes in the bulk >> of the raw string, but that requirement (which we have adopted) >> affects the choice of quotes in a decisive manner. Let me try to >> lay out the "string physics" that underly this decision. >> >> *Any* single-character end-quote will have a significant probability >> of showing up inside the bulk of a (randomly selected) raw string. >> >> How significant? Well, let's say conservatively that raw strings >> can have all possible characters, but the end-quote sequence >> only shows up one out of a hundred times, per character position, >> in raw strings. If you are using a series of ten-character raw >> strings (to say nothing of bigger ones), you have about a 10% >> chance for any given raw string to contain an inconvenient >> end-quote. >> >> That percentage is significant, especially given that in some >> cases strings will be longer and quote characters will be more >> common, both factors increasing the failure rate beyond 10%. >> But even a 0.1% failure rate is noticeable to users, making a >> feature feel unreliable. >> >> This generalizes to any fixed multi-character end-quote, with a >> reduction of probability exponential in the length of the end-quote, >> but still with a non-zero probability, of occurring in the bulk of >> a randomly selected string. A two-character end-quote might >> have a probability of 10^-4, and that means you have a more >> modest but still significant chance of failure of 10% across a >> suite of 100 random 10-character strings, or for one random >> 1000-character string. >> >> Any *finite choice* of end-quotes has the same problem, with >> a non-zero probability that decreases (but does not vanish) >> with the number of available end-quotes. The only way to >> break out of the box is to allow the user an unlimited range >> of successively "stronger" end-quotes (i.e., less likely ones). 
>> >> (Randomly selected raw strings are easy to model, although >> the numbers used above are an approximation to a binomial >> distribution. In fact, though, strings which show up non-randomly >> in real code are *more* likely to mention end-quotes, since their >> contents are somehow correlated to the enclosing language.) >> >> You can easily demonstrate this issue by nesting Java code >> which uses raw quotes inside of a containing raw quote. An >> easy first test of a proposed quoting mechanism is, "will it >> nest?" If not, then the quoting mechanism does not meet >> a key requirement for raw quotes. >> >> This key requirement is unconstrained pasting *without* fixups >> (escape sequences embedded in the bulk of the quote). >> Anything else, with some epsilon probability of requiring escapes, >> is not truly raw, just "mostly raw". >> >> In the case you propose, Remi, the probability of having an >> un-quotable bulk string is quite high, since all of the end-quotes >> are single characters. >> >> Only a convention with an end-quote of arbitrary length is strong >> enough to "fence in" arbitrary raw strings. The simplest possible >> such convention is to allow replication of a single character to >> serve as the end-quote. This decision toward simplicity >> influences other features in Java raw strings, including the >> decision to use a new character and to disallow certain >> edge cases, notably null strings. >> >> ? John > > I understand your point but i disagree with your analysis. > My own experience is that raw strings follow what i call the 'embedded languages' hypothesis, > i.e. for any application, there is a length such all raw strings with a length greater than this length contain only embedded programming languages. > So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. 
> So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.

Without diving too deep into the repeated vs. 'single but customizable' choice, I'm also a bit suspicious of the fact that John's analysis conservatively assumes that a snippet of text embedded in a raw string is a random sequence of characters, in the true sense. This, to me, just seems the wrong assumption: by definition something truly random has high entropy, and something with high entropy is usually associated with low information content, which is just not compatible with the use case of 'pasting in a code snippet' (example: it's highly likely that the prefix 'cla' will be followed by 'ss' in a Java-like snippet). I would expect the entropy of the embedded snippet to be quite low compared to the assumption made here, which greatly affects the probability calculations.

For the analysis to be correct, it should take into account the _frequency_ with which a given delimiter can appear in the various kinds of snippets that could be pasted in (and there's one such frequency for each snippet kind). Otherwise we're at risk of overestimating (if we pick a delimiter symbol whose frequency is, in reality, really low) or underestimating (if we pick a symbol that, conversely, occurs very frequently).

Maurizio

>
>> P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
>> which will resize themselves to contain whatever users throw into
>> the raw string body. At that point backticks will feel like magic tokens
>> that never accidentally match raw string bodies.
> regards,
> Rémi

From brian.goetz at oracle.com Tue Feb 27 19:48:04 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Tue, 27 Feb 2018 14:48:04 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr>
Message-ID: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>

> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.

The problem is not that it's enough, it's that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse. Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language. And it surely makes code less readable, because many more things can be interpreted as quotes.
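
The failure rates in John's analysis, quoted earlier in this thread, can be checked with a few lines of Java. This is only a sketch under the same assumption that analysis makes (and that Maurizio questions): every character position independently holds the end-quote with a fixed probability p, so the chance of at least one hit in n positions is 1 - (1 - p)^n. The class and method names are invented for illustration.

```java
// Sketch only: models raw-string bodies as independent random positions,
// which is the simplifying assumption under debate in this thread.
final class EndQuoteOdds {

    /** Chance that at least one of {@code positions} independent slots holds the end-quote. */
    static double collisionProbability(double perPosition, int positions) {
        return 1.0 - Math.pow(1.0 - perPosition, positions);
    }

    public static void main(String[] args) {
        // Single-character end-quote, p = 1/100, one 10-character string: about 0.096.
        System.out.println(collisionProbability(0.01, 10));

        // Two-character end-quote, p = 10^-4, 100 strings of 10 characters
        // (or one string of 1000 characters): about 0.095.
        System.out.println(collisionProbability(1e-4, 1000));
    }
}
```

Both cases come out at roughly the 10% quoted in the analysis. A frequency-based estimate of the kind Maurizio asks for would replace the fixed p with a measured rate for each kind of snippet.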

From guy.steele at oracle.com Tue Feb 27 19:56:57 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 14:56:57 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
Message-ID: <8D9A42BF-41A0-49C3-B2F8-1B556A41BBBB@oracle.com>

> On Feb 27, 2018, at 2:48 PM, Brian Goetz wrote:
>
>
>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.
>
> The problem is not that it's enough, it's that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse. Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language. And it surely makes code less readable, because many more things can be interpreted as quotes.
>

I agree with the comments that in practice many raw strings are much more likely to be some sort of code rather than relatively random strings.
That said, here is a perfectly plausible bit of Java code:

    final String uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    final String lowercase = "abcdefghijklmnopqrstuvwxyz";
    final String enclosers = "(){}[]";
    final String punctuation = "`~!@#$%^&*_+-=|\\:\";'<>,.?/";

I can quote it easily using `` and ``. But it's at least a little less easy, as John has argued, to quote it using the "raw|...|" convention: there is no character on my keyboard that does not occur in the code to be quoted, so I have to go in and muck with the middle of the string. Which makes it less easy to read the embedded code: in order to interpret it, one must be mindful that the syntactic requirements of the containing language may have intruded (requiring doubling or escaping of certain characters, for example).

The nice thing about ``He said `what?'`` is that I can be _completely_ sure that the syntax of Java has not intruded _at all_ into the middle of the HTML syntax, so that's one less thing to worry about while puzzling over the HTML.

This becomes even more important if the raw-string syntax is nested:

    System.out.println("\t" + ```final String htmlSnippet = "\t"
        + ``He said `what?'`` + "\n";``` + "\n");

--Guy

From john.r.rose at oracle.com Tue Feb 27 21:20:56 2018
From: john.r.rose at oracle.com (John Rose)
Date: Tue, 27 Feb 2018 13:20:56 -0800
Subject: Raw string literals and Unicode escapes
In-Reply-To: <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com>
Message-ID: <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>

On Feb 27, 2018, at 11:48 AM, Brian Goetz wrote:
>
>>
>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.
>
> The problem is not that it's enough, it's that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse. Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language. And it surely makes code less readable, because many more things can be interpreted as quotes.

My experience tracks with Brian's. That's why I think the random string model is more robust than some vague hope that languages won't overlap.

Yes, random strings are an outlier, but less so than you'd think.
A typical compression ratio for code is 5x, which means that if you replace "random string of length 10" with "random code snippet of length 50" you get the same analytic results. In order to exclude a close-quote, you need an additional constraint, which in practical terms results in folks having to grub around inside their raw strings looking for accidental quotes.

-- John

From guy.steele at oracle.com Tue Feb 27 21:12:14 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 16:12:14 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com> <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com>
Message-ID: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>

> On Feb 27, 2018, at 4:20 PM, John Rose wrote:
>
> On Feb 27, 2018, at 11:48 AM, Brian Goetz wrote:
>>
>>>
>>> So after this length instead of having the probability to see a character to be virtually 1, you have the opposite effect, because programming languages (a human construct) are very regular in the set of chars they use. So you do not need a repetition of a character to avoid a statistical effect that does not occur. Being able to choose the escape character is enough.
>>
>> The problem is not that it's enough, it's that it is too much. Having nine ways to say the same thing is too many; having infinitely many (e.g., nonces) is worse.
>> Having used the "pick your delimiter" approach taken by Perl, I find that you are *still* often bitten by the inability to find a good delimiter for embedding a snippet of a program written in a language similar to the outer language. And it surely makes code less readable, because many more things can be interpreted as quotes.
>
> My experience tracks with Brian's. That's why I think the random string
> model is more robust than some vague hope that languages won't overlap.
>
> Yes, random strings are an outlier, but less so than you'd think. A typical
> compression ratio for code is 5x, which means that if you replace "random
> string of length 10" with "random code snippet of length 50" you get the
> same analytic results. In order to exclude a close-quote, you need an
> additional constraint, which in practical terms results in folks having to
> grub around inside their raw strings looking for accidental quotes.

Which leads us to the following theoretical result: the ```` mechanism does not require you to grub around in the interior of the string AT ALL if you don't want to. All you need to know is the length. If the length of the raw string is n, and it does not begin or end with ` (a necessary check in any case), then using n-1 backquote characters before and after will always do the job.

In practice, many programmers (and programs) will be willing to do a quick search to see whether "```" or failing that "````" happens to be absent from the raw string.
:-)

From brian.goetz at oracle.com Tue Feb 27 21:33:24 2018
From: brian.goetz at oracle.com (Brian Goetz)
Date: Tue, 27 Feb 2018 16:33:24 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com> <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com> <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
Message-ID: <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>

> Which leads us to the following theoretical result: the ```` mechanism
> does not require you to grub around in the interior of the string AT
> ALL if you don't want to. All you need to know is the length. If the
> length of the raw string is n, and it does not begin or end with ` (a
> necessary check in any case), then using n-1 backquote characters
> before and after will always do the job.
>
> In practice, many programmers (and programs) will be willing to do a
> quick search to see whether "```" or failing that "````" happens to be
> absent from the raw string. :-)
>

Or the IDE will helpfully suggest a sensible number of quotes when you do quote-quote-paste.
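
One way a tool might arrive at that "sensible number" is to find the longest run of backquotes in the pasted text and use one more than that. The sketch below assumes the rule as described in this thread (the body may not begin or end with a backquote, and empty bodies are not allowed); the class and method names are hypothetical, and `String.repeat` requires Java 11 or later.

```java
// Sketch only: chooses the shortest backquote fence that cannot occur inside the body.
final class BacktickFence {

    /** Length of the longest consecutive run of '`' characters in {@code body}. */
    static int longestBacktickRun(String body) {
        int longest = 0;
        int current = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '`') {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0;
            }
        }
        return longest;
    }

    /** Smallest number of backquotes that safely delimits {@code body}. */
    static int fenceLength(String body) {
        if (body.isEmpty()
                || body.charAt(0) == '`'
                || body.charAt(body.length() - 1) == '`') {
            // Assumed restriction, taken from the discussion above.
            throw new IllegalArgumentException("body is empty or touches a backquote at an edge");
        }
        return longestBacktickRun(body) + 1;
    }

    /** Wraps {@code body} in matching fences of the computed length. */
    static String quote(String body) {
        String fence = "`".repeat(fenceLength(body));
        return fence + body + fence;
    }
}
```

The n-1 bound mentioned above is the worst case of the same idea: a body of length n (with n at least 2) that neither begins nor ends with a backquote cannot contain a run longer than n-2, so n-1 backquotes are always enough, and the scan usually finds a far shorter fence.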

From guy.steele at oracle.com Tue Feb 27 21:19:15 2018
From: guy.steele at oracle.com (Guy Steele)
Date: Tue, 27 Feb 2018 16:19:15 -0500
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>
References: <5A83275F.80802@oracle.com> <5A84AD7F.5030803@oracle.com> <5A84BB6A.60102@oracle.com> <6C528C52-9CDB-4D04-A7F9-A88D71FEDCAE@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com> <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com> <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com> <8205ef25-7e3c-47dd-1849-4f71212c324b@oracle.com>
Message-ID:

> On Feb 27, 2018, at 4:33 PM, Brian Goetz wrote:
>
>
>> Which leads us to the following theoretical result: the ```` mechanism does not require you to grub around in the interior of the string AT ALL if you don't want to. All you need to know is the length. If the length of the raw string is n, and it does not begin or end with ` (a necessary check in any case), then using n-1 backquote characters before and after will always do the job.
>>
>> In practice, many programmers (and programs) will be willing to do a quick search to see whether "```" or failing that "````" happens to be absent from the raw string. :-)
>>
>
> Or the IDE will helpfully suggest a sensible number of quotes when you do quote-quote-paste.

The IDE is a program. I refer to my previous statement. :-)

But thanks for the clarification.

From forax at univ-mlv.fr Tue Feb 27 21:46:38 2018
From: forax at univ-mlv.fr (Remi Forax)
Date: Tue, 27 Feb 2018 22:46:38 +0100 (CET)
Subject: Raw string literals and Unicode escapes
In-Reply-To: <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
References: <5A83275F.80802@oracle.com> <974982243.2668211.1519561145219.JavaMail.zimbra@u-pem.fr> <5A945562.8080108@oracle.com> <521F736B-E297-4F6D-8A62-CD44BE36114C@oracle.com> <266500123.456719.1519719364354.JavaMail.zimbra@u-pem.fr> <88c110bf-75cb-e566-8961-b9201c22cc72@oracle.com> <5FAE6846-F83D-400B-9A62-51A11CE80143@oracle.com> <8C4448A0-A0DB-4DBF-A34C-2B014ADA013E@oracle.com>
Message-ID: <1324412928.831260.1519767998348.JavaMail.zimbra@u-pem.fr>

> From: "Guy Steele"
> To: "John Rose"
> Cc: "amber-spec-experts"
> Sent: Tuesday, 27 February 2018 22:12:14
> Subject: Re: Raw string literals and Unicode escapes

>> On Feb 27, 2018, at 4:20 PM, John Rose <john.r.rose at oracle.com> wrote:

>> On Feb 27, 2018, at 11:48 AM, Brian Goetz <brian.goetz at oracle.com> wrote:

>>>> So after this length instead of having the probability to see a character to be
>>>> virtually 1, you have the opposite effect, because programming languages (a
>>>> human construct) are very regular in the set of chars they use. So you do not
>>>> need a repetition of a character to avoid a statistical effect that does not
>>>> occur. Being able to choose the escape character is enough.

>>> The problem is not that it's enough, it's that it is too much. Having nine ways
>>> to say the same thing is too many; having infinitely many (e.g., nonces) is
>>> worse. Having used the "pick your delimiter" approach taken by Perl, I find
>>> that you are *still* often bitten by the inability to find a good delimiter for
>>> embedding a snippet of a program written in a language similar to the outer
>>> language.
>>> And it surely makes code less readable, because many more things can
>>> be interpreted as quotes.

>> My experience tracks with Brian's. That's why I think the random string
>> model is more robust than some vague hope that languages won't overlap.

>> Yes, random strings are an outlier, but less so than you'd think. A typical
>> compression ratio for code is 5x, which means that if you replace "random
>> string of length 10" with "random code snippet of length 50" you get the
>> same analytic results. In order to exclude a close-quote, you need an
>> additional constraint, which in practical terms results in folks having to
>> grub around inside their raw strings looking for accidental quotes.

> Which leads us to the following theoretical result: the ```` mechanism does not
> require you to grub around in the interior of the string AT ALL if you don't
> want to. All you need to know is the length. If the length of the raw string is
> n, and it does not begin or end with ` (a necessary check in any case), then
> using n-1 backquote characters before and after will always do the job.

> In practice, many programmers (and programs) will be willing to do a quick
> search to see whether "```" or failing that "````" happens to be absent from
> the raw string. :-)

OK, I'm clearly in the minority here, the repetition pattern wins.

Rémi
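
To make the repetition pattern concrete, here is a rough sketch of how a scanner could recognize such a literal. It assumes that the closing delimiter is a run of backquotes of exactly the same length as the opening run, and that any shorter or longer run belongs to the body. That is one reading of the discussion above, not a statement of the final specification, and the class and method names are invented for illustration.

```java
// Sketch only: scans one repeated-backquote literal under the assumed matching rule.
final class RawStringScanner {

    /** Number of consecutive '`' characters in {@code s} starting at {@code from}. */
    static int runLength(String s, int from) {
        int i = from;
        while (i < s.length() && s.charAt(i) == '`') {
            i++;
        }
        return i - from;
    }

    /** Returns the body of the literal whose opening fence starts at {@code start}. */
    static String scan(String source, int start) {
        int open = runLength(source, start);
        if (open == 0) {
            throw new IllegalArgumentException("no opening backquote at index " + start);
        }
        int bodyStart = start + open;
        int i = bodyStart;
        while (i < source.length()) {
            if (source.charAt(i) != '`') {
                i++;
                continue;
            }
            int run = runLength(source, i);
            if (run == open) {
                return source.substring(bodyStart, i); // matching fence found
            }
            i += run; // a run of any other length is ordinary body text
        }
        throw new IllegalArgumentException("unterminated raw string");
    }
}
```

Under this rule the nested example from earlier in the thread scans as intended: the two-backquote runs inside a three-backquote literal are skipped as content. It also shows why an empty body is awkward, since two adjacent backquotes read as a single opening fence of length two.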