Raw String Literal Library Support
John Rose
john.r.rose at oracle.com
Tue Mar 13 22:49:16 UTC 2018
On Mar 13, 2018, at 6:47 AM, Jim Laskey <james.laskey at oracle.com> wrote:
>
> …
> A. Line support.
>
> public Stream<String> lines()
>
Suggest factoring this as:
public Stream<String> splits(String regex) { }
public Stream<String> lines() { return splits(`\n|\r\n?`); }
The reason is that "splits" is useful with several other patterns.
For raw strings, splits(`\n`) is a more efficient way to get the same
result (because they normalize CR NL? to NL). There's also a
nifty unicode-oriented pattern splits(`\R`) which matches a larger
set of line terminations. And of course splits(":") or splits(`\s`) will
be old friends. A new friend might be paragraph splitting splits(`\n\n`).
Splitting is old, as Remi points out, but the new thing is supplying the
stream-style fluent notation starting from a (potentially) large string
constant.
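
A rough sketch of that factoring, as a static helper rather than a
String method. The class name and the two-argument form are made up
for illustration, and ordinary string literals stand in for raw string
literals. It leans on the existing Pattern.splitAsStream, so it shows
the shape of the API rather than a tuned implementation:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SplitsSketch {
    // Hypothetical stand-in for the proposed String.splits(regex).
    static Stream<String> splits(String s, String regex) {
        return Pattern.compile(regex).splitAsStream(s);
    }

    // lines() expressed through splits, using the pattern suggested above.
    static Stream<String> lines(String s) {
        return splits(s, "\\n|\\r\\n?");
    }

    public static void main(String[] args) {
        List<String> got = lines("a\r\nb\rc\nd").collect(Collectors.toList());
        System.out.println(got); // [a, b, c, d]
    }
}
```

Note that splitAsStream drops trailing empty strings, so a final line
terminator does not produce an extra empty element.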
> B. Additions to basic trim methods. In addition to margin methods trimIndent and trimMarkers described below in Section C, it would be worth introducing trimLeft and trimRight to augment the longstanding trim method. A key question is how trimLeft and trimRight should detect whitespace, because different definitions of whitespace exist in the library.
> ...
> That sets up several kinds of whitespace; trim's whitespace (TWS), Character whitespace (CWS) and the union of the two (UWS). TWS is a fast test. CWS is a slow test. UWS is fast for Latin1 and slow-ish for UTF-16.
For the record, even though we are not talking performance much,
CWS is not significantly slower than UWS. You can use a 64-bit int
constant for a bitmask and check for an arbitrary subset of the first
64 ASCII code points in one or two machine instructions.
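
To make the bitmask idea concrete, here is a minimal sketch. The class
and method names are hypothetical. The chosen set is the part of
Character.isWhitespace that falls below code point 64; any other
subset of the first 64 code points would work the same way:

```java
class WhitespaceMask {
    // Bit i is set when code point i (0..63) counts as whitespace.
    // Set chosen here: TAB, LF, VT, FF, CR, FS, GS, RS, US, SPACE.
    static final long WS_MASK =
          (1L << '\t') | (1L << '\n') | (1L << 0x0B) | (1L << '\f') | (1L << '\r')
        | (1L << 0x1C) | (1L << 0x1D) | (1L << 0x1E) | (1L << 0x1F) | (1L << ' ');

    static boolean isLowWhitespace(int ch) {
        // Range check first: Java masks long shift counts to 6 bits,
        // so without it code point 73 would alias code point 9.
        return (ch & ~63) == 0 && ((WS_MASK >>> ch) & 1L) != 0;
    }
}
```

Code points at or above 64 still need a separate (slower) path; the
mask only covers the common low range.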
> We are recommending that trimLeft and trimRight use UWS, leave trim alone to avoid breaking the world and then possibly introduce trimWhitespace that uses UWS.
Putting aside the performance question, I have to ask if compatibility
with TWS is at all important. (Don't know the answer, suspect not.)
> …
> C. Margin management. With introduction of multi-line Raw String Literals, developers will have to deal with the extraneous spacing introduced by indenting and formatting string bodies.
>
> Note that for all the methods in this group, if the first line is empty then it is removed and if the last is empty then it is removed. This removal provides a means for developers that use delimiters on separate lines to bracket string bodies. Also note, that all line separators are replaced with \n.
(As a bonus, margin management gives a story for escaping leading and trailing
backticks. If your string is a single line, surround it with pipe characters `|asdf|`.
If your string is multiple lines, surround it with blank lines, which is easy to do. Either
pipes or newlines will protect backticks from merging into quotes.)
There's a sort of beauty contest going on here between indents and
markers. I often prefer markers, but I see how indents will often win
the contest. I'll pre-emptively disagree with anyone who observes
that we only need one of the two.
> public String trimMarkers(String leftMarker, String rightMarker)
I like this function and anticipate using it. (I use similar things in
shell script here-files.) Thanks for including end-of-line markers
in the mix. This allows lines with significant *trailing* whitespace
to protect that whitespace as well as *leading* whitespace.
Suggestion: Give users a gentle nudge toward the pipe character by
making it a default argument so trimMarkers() => trimMarkers("|","|").
Suggestion: Allow the markers to be regular expressions.
(So `\|` would be the default.)
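
A sketch of how the default-argument suggestion might look, written as
static helpers. The exact semantics of trimMarkers are not spelled out
in this message, so the behavior below is an assumption: each line is
trimmed, one leading and one trailing marker are removed if present, a
blank first or last line is dropped, and lines are joined with \n.
Markers are treated as literal strings here, not regular expressions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimMarkersSketch {
    // Hypothetical static stand-in for String.trimMarkers(left, right).
    static String trimMarkers(String s, String leftMarker, String rightMarker) {
        // Split on any line terminator; -1 keeps a trailing empty line.
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\\R", -1)));
        // Assumption: a whitespace-only first or last line counts as empty.
        if (!lines.isEmpty() && lines.get(0).trim().isEmpty()) {
            lines.remove(0);
        }
        if (!lines.isEmpty() && lines.get(lines.size() - 1).trim().isEmpty()) {
            lines.remove(lines.size() - 1);
        }
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith(leftMarker)) {
                t = t.substring(leftMarker.length());
            }
            if (t.endsWith(rightMarker)) {
                t = t.substring(0, t.length() - rightMarker.length());
            }
            out.add(t);
        }
        return String.join("\n", out);
    }

    // The suggested default: pipes on both sides.
    static String trimMarkers(String s) {
        return trimMarkers(s, "|", "|");
    }
}
```

With this sketch, whitespace between the markers survives, which is
the point of having an end-of-line marker at all.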
>
> D. Escape management. Since Raw String Literals do not interpret Unicode escapes (\unnnn) or escape sequences (\n, \b, etc), we need to provide a scheme for developers who just want multi-line strings but still have escape sequences interpreted.
This all looks good.
Thanks,
— John