RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines

Tue Dec 17 18:44:33 UTC 2019

Inline...

On 12/17/2019 05:51 AM, Pavel Rappo wrote:
> Jon,
>
> Could you please explain why "generated files should only contain \n"?

I don't have a robust spec-based answer, so the answer is more along the 
lines of ...

* because we've always done it this way ... related, back in the day, 
Apache was the
    gold-standard webserver, and was typically running on Linux systems, 
so favoring that
    config seemd a good idea
* for consistency in the output between user-provided newlines and 
generated newlines
     ... I did find a spec reference that said to not be inconsistent
* for self-defence in our self-tests, maybe predating back to before 
Mercurial and before we
   were so rigorous about enforcing Unix-newlines only in our source code

>
> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check first thing in
> normalizeNewlines?

Are you suggesting it would be better to always call normalizeNewlines 
and so move the
check into the method?  I guess I was thinking it was worth checking to 
detect whether it was
necessary to call the method in the first place.

>
> New mechanics of string normalization is more readable, albeit is probably slower in a typical case.
> Old mechanics used bulk append, whereas the new one goes char by char. Not sure if it is of any
> practical significance though.

I don't see why it will be slower in any case.   In all previous cases, 
we were unconditionally
scanning the string character-by-character in Utils.normalizeNewlines 
and building up a
new StringBuilder, which was then converted to a String.

Now, in StringContent, the scan is merged with the scan to escape the 
HTML characters.
In RawHtml content, the scan/copy is only done if a \r is found. Yes, we 
have to do a read-scan
but we only copy/update the string if needed, which is going to be 
faster than the always-copy
that was done before.

One possible suggestion/improvement, ... but I don't know the 
performance cost

* remove RawHtml.normalizeNewlines completely
* change the RawHtml constructor to a regex, as in

52 rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml : 
rawHtml.replaceAll("\\R", "\n");

I must admit, I sorta like this, even if it might be a bit slower, to 
use a regex.

Note, as an aside, we use RawHtml way more than we should. The design 
intent was that it
should be a Content-of-last-resort.  I have ideas on how to 
significantly reduce the usage,
which is the general direction and motivation for a lot of this round of 
cleanup work.

>
> I wonder if normalization should be deferred until the latest possible moment. The reason is that
> this operation doesn't seem to be additive, i.e.
>
>      normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).

My initial thought had been to do the normalization on the fly while 
writing out the characters,
but that would mean writing the file character at a time, instead of 
bulk-write.  It would also mean
a bigger semantic change to leave un-normalized newlines in the data 
members ... note
that there is code to check if items "end with newline" which is done by 
checking if the last
character is DocletConstants.NL.  It also seemed inconsistent to be 
scanning for HTML
characters while leaving newlines alone.   If we were to defer 
normalization until writing out Content
objects, I would consider deferring the <>& translation as well. That 
might be a good idea but could
be done separately as a later more-localized  changeset.

Note that there is a hidden assumption that newlines are completely 
represented in
each call to a method or constructor for any subtype of Content. Even if 
we deferred
normalization to the very last possible moment, (i.e. writing out) we 
would not be able
to correctly handle a \r\n pair split across a pair of consecutive 
Content nodes. I don't
think this is a case worth worrying about.

>
> -Pavel
>
>> On 17 Dec 2019, at 02:30, Jonathan Gibbons <jonathan.gibbons at oracle.com> wrote:
>>
>> Please review a moderately simple change for javadoc, to improve the way that newlines in user input (doc comments and values for command-line options) are handled, reducing the amount of overall string copying.  The underlying issue is that user newlines may be \n, \r or \r\n, but the generated files should only contain \n.
>>
>> Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number of sites, the normalization is now done in StringContent and RawHtml. In the case of StringContent, the normalization is done while checking and handling the escape sequences for < > &; in the case of RawHtml, it is done when constructing the object, but only if the argument contains \r.
>>
>> There are no good pre-existing tests for testing newline behavior, and so a new test is provided that runs javadoc on input containing the different kinds of newline.  The test revealed a previously unknown bug, that non-standard newslines in command-line option values were not handled correctly.
>>
>> The output is also the same (before-and-after) for the generated JDK API docs.
>>
>> -- Jon
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8236048
>> Webrev: http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/javadoc-dev/attachments/20191217/8b78999a/attachment.htm>