RFR: [15] JDK-8236048: Cleanup use of Utils.normalizeNewlines
Jonathan Gibbons
jonathan.gibbons at oracle.com
Wed Dec 18 14:47:49 UTC 2019
OK, thanks; I'll revert back to using \R.
-- Jon
On 12/18/19 6:46 AM, Jim Laskey wrote:
> Possibly, since the match is likely more expensive than the replace,
> which is just appending to the sb in both cases.
>
>
>> On Dec 18, 2019, at 10:38 AM, Jonathan Gibbons
>> <jonathan.gibbons at oracle.com <mailto:jonathan.gibbons at oracle.com>> wrote:
>>
>> Hi Jim,
>>
>> I started off using \R, but that unnecessarily matches \n as well as
>> \r and \r\n. In this context, we're only looking to replace
>> newlines containing \r. Do you still think using \R would be better?
>>
>> -- Jon
>>
>> On 12/17/19 5:00 PM, James Laskey wrote:
>>> Note: you can use the regular expression \R to match line
>>> terminators. A bit faster than your pattern. I found this out when
>>> testing String::lines() (which was faster still.)
>>>
>>> Cheers,
>>>
>>> — Jim
>>>
>>> On the road.
>>>
>>>> On Dec 17, 2019, at 8:29 PM, Jonathan Gibbons
>>>> <Jonathan.Gibbons at oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> Updated webrev. This version uses a regular expression and
>>>> .replaceAll to replace \r\n? with \n in RawHtml. No other changes.
>>>>
>>>> http://cr.openjdk.java.net/~jjg/8236048/webrev.01/
>>>>
>>>> -- Jon
>>>>
>>>>
>>>> On 12/17/2019 10:44 AM, Jonathan Gibbons wrote:
>>>>>
>>>>> Inline...
>>>>>
>>>>>
>>>>> On 12/17/2019 05:51 AM, Pavel Rappo wrote:
>>>>>> Jon,
>>>>>>
>>>>>> Could you please explain why "generated files should only contain \n"?
>>>>>
>>>>> I don't have a robust spec-based answer, so the answer is more
>>>>> along the lines of ...
>>>>>
>>>>> * because we've always done it this way ... related, back in the
>>>>> day, Apache was the
>>>>> gold-standard webserver, and was typically running on Linux
>>>>> systems, so favoring that
>>>>> config seemd a good idea
>>>>> * for consistency in the output between user-provided newlines and
>>>>> generated newlines
>>>>> ... I did find a spec reference that said to not be inconsistent
>>>>> * for self-defence in our self-tests, maybe predating back to
>>>>> before Mercurial and before we
>>>>> were so rigorous about enforcing Unix-newlines only in our
>>>>> source code
>>>>>
>>>>>
>>>>>> Is it possible to perform the `rawHtml.indexOf('\r') == -1` optimization check first thing in
>>>>>> normalizeNewlines?
>>>>>
>>>>> Are you suggesting it would be better to always call
>>>>> normalizeNewlines and so move the
>>>>> check into the method? I guess I was thinking it was worth
>>>>> checking to detect whether it was
>>>>> necessary to call the method in the first place.
>>>>>
>>>>>> New mechanics of string normalization is more readable, albeit is probably slower in a typical case.
>>>>>> Old mechanics used bulk append, whereas the new one goes char by char. Not sure if it is of any
>>>>>> practical significance though.
>>>>>
>>>>> I don't see why it will be slower in any case. In all previous
>>>>> cases, we were unconditionally
>>>>> scanning the string character-by-character in
>>>>> Utils.normalizeNewlines and building up a
>>>>> new StringBuilder, which was then converted to a String.
>>>>>
>>>>> Now, in StringContent, the scan is merged with the scan to escape
>>>>> the HTML characters.
>>>>> In RawHtml content, the scan/copy is only done if a \r is found.
>>>>> Yes, we have to do a read-scan
>>>>> but we only copy/update the string if needed, which is going to be
>>>>> faster than the always-copy
>>>>> that was done before.
>>>>>
>>>>> One possible suggestion/improvement, ... but I don't know the
>>>>> performance cost
>>>>>
>>>>> * remove RawHtml.normalizeNewlines completely
>>>>> * change the RawHtml constructor to a regex, as in
>>>>> 52 rawHtmlContent = rawHtml.indexOf('\r') == -1 ? rawHtml :
>>>>> rawHtml.replaceAll("\\R", "\n");
>>>>> I must admit, I sorta like this, even if it might be a bit slower,
>>>>> to use a regex.
>>>>>
>>>>> Note, as an aside, we use RawHtml way more than we should. The
>>>>> design intent was that it
>>>>> should be a Content-of-last-resort. I have ideas on how to
>>>>> significantly reduce the usage,
>>>>> which is the general direction and motivation for a lot of this
>>>>> round of cleanup work.
>>>>>
>>>>>> I wonder if normalization should be deferred until the latest possible moment. The reason is that
>>>>>> this operation doesn't seem to be additive, i.e.
>>>>>>
>>>>>> normalizeNewlines(A) + normalizeNewlines(B) != normalizeNewlines(A + B).
>>>>>
>>>>> My initial thought had been to do the normalization on the fly
>>>>> while writing out the characters,
>>>>> but that would mean writing the file character at a time, instead
>>>>> of bulk-write. It would also mean
>>>>> a bigger semantic change to leave un-normalized newlines in the
>>>>> data members ... note
>>>>> that there is code to check if items "end with newline" which is
>>>>> done by checking if the last
>>>>> character is DocletConstants.NL. It also seemed inconsistent to
>>>>> be scanning for HTML
>>>>> characters while leaving newlines alone. If we were to defer
>>>>> normalization until writing out Content
>>>>> objects, I would consider deferring the <>& translation as well.
>>>>> That might be a good idea but could
>>>>> be done separately as a later more-localized changeset.
>>>>>
>>>>> Note that there is a hidden assumption that newlines are
>>>>> completely represented in
>>>>> each call to a method or constructor for any subtype of Content.
>>>>> Even if we deferred
>>>>> normalization to the very last possible moment, (i.e. writing out)
>>>>> we would not be able
>>>>> to correctly handle a \r\n pair split across a pair of consecutive
>>>>> Content nodes. I don't
>>>>> think this is a case worth worrying about.
>>>>>
>>>>>
>>>>>
>>>>>> -Pavel
>>>>>>
>>>>>>> On 17 Dec 2019, at 02:30, Jonathan Gibbons<jonathan.gibbons at oracle.com> wrote:
>>>>>>>
>>>>>>> Please review a moderately simple change for javadoc, to improve the way that newlines in user input (doc comments and values for command-line options) are handled, reducing the amount of overall string copying. The underlying issue is that user newlines may be \n, \r or \r\n, but the generated files should only contain \n.
>>>>>>>
>>>>>>> Instead of using Utils.normalizeNewlines (which does an unconditional copy) at a number of sites, the normalization is now done in StringContent and RawHtml. In the case of StringContent, the normalization is done while checking and handling the escape sequences for < > &; in the case of RawHtml, it is done when constructing the object, but only if the argument contains \r.
>>>>>>>
>>>>>>> There are no good pre-existing tests for testing newline behavior, and so a new test is provided that runs javadoc on input containing the different kinds of newline. The test revealed a previously unknown bug, that non-standard newslines in command-line option values were not handled correctly.
>>>>>>>
>>>>>>> The output is also the same (before-and-after) for the generated JDK API docs.
>>>>>>>
>>>>>>> -- Jon
>>>>>>>
>>>>>>> JBS:https://bugs.openjdk.java.net/browse/JDK-8236048
>>>>>>> Webrev:http://cr.openjdk.java.net/~jjg/8236048/webrev.00/index.html
>>>>>>>
>>>>>
>>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/javadoc-dev/attachments/20191218/34c0ea7c/attachment-0001.htm>
More information about the javadoc-dev
mailing list