[raw-strings] Indentation problem

Mon Feb 5 15:53:41 UTC 2018

Sorry for the delay getting back to this.

> Hello!
>
> Every language which implements the multiline strings has problems
> with indentation.

Indeed.  The fundamental problem here is that the indentation of 
embedded snippets is serving two masters: the nesting of the surrounding 
code, and the snippet itself.  Sometimes the user cares about one, 
sometimes the other, and no language has come up with a 
one-size-fits-all set of rules that makes both camps happy.

Sometimes it doesn't really matter; a few extra spaces in an HTML 
document or SQL query is often an acceptable price to pay for 
clean-looking code.  But sometimes it does matter.  Which raises two 
questions:
  - What should programmers do?
  - What should the language help them do?

> E.g. consider something like this:

So, in light of the above questions, let's ask: is this the right way to 
generate an HTML document?  It not only has "holes" to be filled in, but 
it has entire sections whose presence or absence depends on state.  I 
think the mess of this example goes far deeper than indentation.  (But 
yes, people will write code like this, with whatever tools we give 
them.)  To the second question, what should the language do to help this 
code?  Some would say "of course, the problem is you don't support 
interpolation."  But as this example shows, interpolation only helps 
with the trivial bits; it doesn't help with the conditional inclusion, 
so it only gets you a small part of the way to this example.  For that, 
you either need something with more structure, or a templating engine, 
or a builder, or one of the zillion other tools we've invented for this 
sort of thing.
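
To make the distinction concrete, here is a minimal sketch of the 
"builder" end of that spectrum.  The class, the method, and the page 
content are all made up for illustration; this is not the code from the 
original example.  The title is a "hole" that interpolation could fill, 
while the warning paragraph is a whole section that is present or absent 
depending on state, which interpolation alone does not express.

```java
// Hypothetical illustration only; names and markup are invented for this sketch.
final class PageSketch {
    private PageSketch() {}

    // Builds a tiny page. The output indentation is chosen by the literals
    // appended here, not by how deeply this Java code happens to be nested.
    // Note: no HTML escaping is done; a real builder or template engine would.
    static String render(String title, String warning) {
        StringBuilder html = new StringBuilder();
        html.append("<html>\n");
        html.append("  <body>\n");
        // A "hole": plain substitution of a value.
        html.append("    <h1>").append(title).append("</h1>\n");
        // Conditional inclusion: the whole element depends on state.
        if (warning != null) {
            html.append("    <p class=\"warning\">").append(warning).append("</p>\n");
        }
        html.append("  </body>\n");
        html.append("</html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("Report", null));
        System.out.print(render("Report", "Disk almost full"));
    }
}
```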

So, without ignoring your fundamental question about indentation, I'll 
just point out that this example is about way more than indentation, and 
move on ...

> Now we have broken formatting in the generated HTML, which ruins the
> idea of multiline strings

I think "multiline strings" (or even "raw strings") are a bit of a 
misleading name.  What we're going for here is the ability to embed an 
arbitrary snippet of a "program" (shell script, SQL query, JSON doc) in 
a Java program, without having to mangle the embedded snippet.  This 
enhances readability (not mucked up with escapes and extra quotes) and 
reduces errors (because you can just cut and paste that snippet of 
script from the editor in which you've probably already written it, 
without risking breaking it via syntactic mangling.)  But, as you say, 
there are issues with indentation, when it matters.  (Surely it matters 
for snippets of Python.)

Secondarily, the design center for this feature is: _short_ snippets -- 
those for which putting them in a separate document would be 
obfuscatory.  To see this, we have to approach it from both sides. On 
the short side, imagine Java didn't have string literals at all. Having 
to read "yes" and "no" out of a file would be ridiculously obfuscatory; 
eliminating this indirection makes code easier to read and less 
error-prone.  But on the long side, using raw strings to embed a 
million-line snippet in a Java program is also ridiculous; it would be 
far easier for maintainers of both the Java part and the embedded part 
to have their own uniform artifacts to maintain.  So the sweet spot for 
this feature is somewhere in the middle -- snippets that are short 
enough that indirecting to a file impairs readability, but not so long 
that there's any question where the embedded snippet ends and the Java 
code resumes.  (Subjectively, I'd say that this sweet spot is in the 
5-10 line range.)

> (why bother to generate \n in output HTML if
> it looks like a mess anyways?) Moreover, the structure of Java program
> now affects the output. E.g. if you add several more nested "if" or
> "switch" statement, you will need to indent <p> even more.

My answer to those people is: then don't do that ;)  They're already 
well outside the design center (as outlined above).  They should be 
using a templating mechanism, a builder, or something else to decouple 
the static content from the dynamic content.  Of course, some will do it 
anyway, but I'm not sure bending over backwards to accommodate them is 
the winning move.

> Many languages provide library methods to handle this.

Good, now we're back to indentation.  All things being equal, it is 
better to do things in libraries than in the language; it is cheaper, 
more flexible, faster to market, less risky, and can support a broader 
range of preferences (you can have different libraries for different 
preferences.)  So I like this direction.

> E.g.
> trimIndent() could be provided to remove leading spaces of every line,
> but this would kill the HTML indents at all. Another possibility is to
> provide a method like trimMargin() on Kotlin [1] which trims all
> spaces before a special character (pipe by default) including a
> special character itself.

Now that we're in library world, we can have _all_ of these.  We can 
trim indents to the first indent, or trim a specified number of spaces 
off, or trim to a user-selected marker.  And if the users don't like the 
ones we include, they can write their own.
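
As a rough sketch of what such library methods might look like (the 
class and method names below are hypothetical, not a proposed JDK API, 
and for simplicity only spaces are treated as indentation), the three 
options could be written as ordinary static helpers:

```java
// Hypothetical helpers for illustration; not an actual or proposed JDK API.
final class Indents {
    private Indents() {}

    // Removes the first line's indentation from every line
    // (or as many leading spaces as a shallower line actually has).
    static String trimToFirstIndent(String s) {
        String firstLine = s.split("\n", -1)[0];
        return trimSpaces(s, leadingSpaces(firstLine));
    }

    // Removes up to n leading spaces from every line.
    static String trimSpaces(String s, int n) {
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            int cut = Math.min(n, leadingSpaces(lines[i]));
            out.append(lines[i].substring(cut));
        }
        return out.toString();
    }

    // Removes leading spaces up to and including the marker character.
    // Lines that do not start with (spaces +) marker are left unchanged.
    static String trimToMarker(String s, char marker) {
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            String line = lines[i];
            int k = leadingSpaces(line);
            boolean marked = k < line.length() && line.charAt(k) == marker;
            out.append(marked ? line.substring(k + 1) : line);
        }
        return out.toString();
    }

    // Counts leading space characters (tabs are deliberately not handled).
    private static int leadingSpaces(String line) {
        int k = 0;
        while (k < line.length() && line.charAt(k) == ' ') k++;
        return k;
    }

    public static void main(String[] args) {
        String html = "    <ul>\n      <li>one</li>\n    </ul>";
        System.out.println(trimToFirstIndent(html));
        System.out.println(trimSpaces(html, 2));
        System.out.println(trimToMarker("    |<ul>\n    |  <li>one</li>\n    |</ul>", '|'));
    }
}
```

Each variant is a few lines of plain string processing, so a user who 
wants a different convention can write one without any language change.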

> This is almost nice. Even without syntax highlighting you can easily
> distinguish between Java code and injected HTML code, you can indent
> Java and HTML independently and HTML code does not clash with Java
> code structure.

Pushing this to a library gives users the option, but not the obligation 
to do this.  That's good.

> The only problem is the necessity to call the
> trimMargin() method.

For some meaning of "only" :)   Like most syntactic conventions, some 
users will say "this is great" and others will say "yuck".  I prefer the 
semantic transparency of calling a method that has a clear specification 
-- especially when there are multiple possible options.

Remember that we're already in a corner case with respect to indentation 
-- in many cases, the users don't care at all about the extra spaces, 
they're just building up a SQL query that is going to be sent to a 
database, and the database doesn't care either.

> This means that original line is preserved in the
> bytecode and during runtime and the trimming is processed every time
> the method is called causing performance and memory handicap. This
> problem could be minimized making trimMargin() a javac intrinsic.

There are multiple layers at which this can be optimized (the JIT may be 
able to observe that this is a pure function applied to a constant), but 
indeed, this is a great candidate for compile-time constant folding.  
(You can even see experiments related to compile-time constant folding 
going on in the condy-folding branch of the amber repo.)  Note too that 
we're now in corner-case-of-corner-case territory -- those who care 
about the indentation and the cost of runtime string processing.
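
For those who are in that territory today, one manual workaround (a 
sketch only; this is not the compile-time folding mentioned above, and 
the helper and names are hypothetical) is to hoist the trimmed string 
into a static final field, so the trimming runs once at class 
initialization instead of on every call.  The untrimmed literal still 
sits in the class file, so this addresses the repeated runtime work but 
not the bytecode size.

```java
// Hypothetical illustration; trimToMarker is a stand-in for any trimming helper.
final class TrimOnce {
    private TrimOnce() {}

    // Counts calls to the trimming helper, purely to make the cost visible.
    static int trimCalls;

    // Removes leading spaces up to and including the marker on each line.
    static String trimToMarker(String s, char marker) {
        trimCalls++;
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            String line = lines[i];
            int k = 0;
            while (k < line.length() && line.charAt(k) == ' ') k++;
            boolean marked = k < line.length() && line.charAt(k) == marker;
            out.append(marked ? line.substring(k + 1) : line);
        }
        return out.toString();
    }

    // Trimmed exactly once, when the class is initialized.
    static final String FIND_USER = trimToMarker(
            "    |SELECT id, name\n"
          + "    |FROM users\n"
          + "    |WHERE id = ?", '|');

    // Callers get the already-trimmed string; no per-call processing.
    static String findUserQuery() {
        return FIND_USER;
    }

    public static void main(String[] args) {
        System.out.println(findUserQuery());
        System.out.println(findUserQuery());
        System.out.println("trim calls: " + trimCalls);
    }
}
```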

> However even in this case it would be hard to enforce usage of this
> method and I expect that tons of hard-to-read Java code will appear in
> the wild, despite I believe that Java is about readability.

Developers' ability to combine simple features to produce unreadable code 
far outstrips the ability of language designers to do anything about it ...

> So I propose to enforce such (or similar) format on language level
> instead of adding a library method like "trimMargin()".

I think this would be a language design mistake.  This is taking one 
arbitrary convention and burning it into the language.  That convention 
might be fine for some situations, but terrible for others; not only 
might it not be the most readable choice in all cases, but it could be 
an actual conflict -- what if the | character is meaningful in the 
embedded language, such as Markdown tables? Now we're back to escaping 
-- which we were trying to avoid.
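
A small sketch of that conflict, using a hypothetical marker-based trim 
(the helper name and the sample table are assumptions for illustration): 
a Markdown table pasted as-is already starts every row with |, so a 
fixed pipe margin eats the table's own leading pipe, whereas a 
caller-chosen marker leaves the snippet intact.

```java
// Hypothetical illustration of a margin marker clashing with snippet content.
final class MarginConflict {
    private MarginConflict() {}

    // Removes leading spaces up to and including the marker on each line.
    static String trimToMarker(String s, char marker) {
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) out.append('\n');
            String line = lines[i];
            int k = 0;
            while (k < line.length() && line.charAt(k) == ' ') k++;
            boolean marked = k < line.length() && line.charAt(k) == marker;
            out.append(marked ? line.substring(k + 1) : line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Markdown table pasted unchanged: its own leading '|' is mistaken
        // for the margin marker and stripped, breaking the table.
        String pasted = "  | name | qty |\n  |------|-----|\n  | nut  | 3   |";
        System.out.println(trimToMarker(pasted, '|'));

        // With a marker chosen by the caller, the table survives untouched.
        String marked = "  >| name | qty |\n  >|------|-----|\n  >| nut  | 3   |";
        System.out.println(trimToMarker(marked, '>'));
    }
}
```

With a library method the caller can pick a marker that does not clash; 
a marker fixed by the language offers no such escape hatch.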

The language shouldn't pick favorites here; it should provide a simple, 
clear mechanism, which can be usefully composed with other mechanisms to 
get the job done.  Polluting the language to avoid the method call is a 
bad trade.

> I see some advantages with such syntax:
> 1. You can comment (or comment out!) a part of multiline string
> without terminating it

Rather than framing this as a property of a proposed solution, let's 
frame it as a question.  What should be the interaction with comments in 
a raw string?  Should you be able to embed comments? Should you be able 
to comment lines out?  (Note that many embedded languages have their own 
comment syntax, so it may be possible to do this with a comment inside 
the snippet itself, rather than using Java-level commenting.)  While I 
can surely see the utility of 
interaction with commenting, I also think that these "requirements" are 
only in play when the string in question is too long in the first place.

> 2. Looking into code fragment out of context (e.g. diff log) you
> understand that you are inside a multiline literal.
> reviewing a diff like
>
>              | x++;
> +           | if (x == 10) break;
>              | foo(x);
>
> Without pipes you could think that it's Java code without any further
> consideration.

This is true, but this is also true of large block comments; you can't 
tell whether the added line is part of a commented out block or of 
executable code.

Again, with raw strings, this is more of a problem when used with 
too-long blocks.

So, there are two things I don't like about this proposal: it's too 
"opinionated", and at the same time, it loses the fundamental goal we 
were trying to get to -- not having to muck up an embedded block with 
escaping.  (Sure, IDEs could (and should) help on pasting here, but that 
only helps writing, not reading.)

> The only disadvantage I see in forcing a pipe prefix is inability to
> just paste a big snippet from somewhere to the middle of Java program
> in a plain text editor.

As mentioned, we think this is most of the point, so this is a pretty 
big disadvantage indeed.