RFR: JDK-8224225 - Tokenizer improvements
James Laskey
james.laskey at oracle.com
Fri Aug 14 22:05:19 UTC 2020
Agree. Will work it up.
> On Aug 14, 2020, at 6:30 PM, Jonathan Gibbons <Jonathan.Gibbons at oracle.com> wrote:
>
>
> Re SubChar.java
>
> The correct solution is to dynamically generate any files with non-standard whitespace, even if it is simple as
> Files.writeString(tmpPath, Files.readString(srcPath).trim())
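A minimal sketch of Jon's suggestion, wrapped in a helper for clarity. The class and method names, and the paths, are hypothetical; `trim()` strips the trailing newline (and any leading whitespace) so the generated copy ends exactly on the crafted escape sequence even though the checked-in template does not.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class SubCharGen {
    // Regenerate a test input whose trailing newline must be absent.
    // The checked-in template may safely end with a newline; this strips it.
    static void regenerate(Path srcPath, Path tmpPath) throws Exception {
        Files.writeString(tmpPath, Files.readString(srcPath).trim());
    }
}
```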
>
> -- Jon
>
>> On 8/14/20 11:09 AM, Jim Laskey wrote:
>> Thank you.
>>
>> The commentary for SubChar.java reads:
>>
>> "Note: this source file has been crafted very carefully to end with the
>> unicode escape sequence for the control-Z character without a
>> following newline. The scanner is specified to allow control-Z there.
>> If you edit this source file, please make sure that your editor does
>> not insert a newline after that trailing line."
>>
>> The person who checked in the test did exactly that: added a newline after that trailing line (this happened during the move to Mercurial in 2007).
>>
>> This patch is to remove that newline. Note that jcheck doesn't check tests for missing last newline.
>>
>> Cheers,
>>
>> -- Jim
>>
>>
>>> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Jim,
>>> this is a very good cleanup. I like how the new code makes the tokenizers a lot less verbose than before, and I like how you have refactored the various UnicodeReader vs. DocReader (now PositionTrackingReader), as the status quo was messy, and we had a lot of flexibility on paper that wasn't really used in practice and ended up making the code more complex than it needed to be.
>>>
>>> Big thumbs up from me.
>>>
>>> Minor comment: what's up with SubChar.java? Webrev is empty, but patch reports following diff:
>>>
>>> diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
>>> --- a/test/langtools/tools/javac/unicode/SubChar.java
>>> +++ b/test/langtools/tools/javac/unicode/SubChar.java
>>> @@ -45,4 +45,4 @@
>>> return;
>>> }
>>> }
>>> -/* \u001A */
>>> +/* \u001A */
>>> \ No newline at end of file
>>>
>>>
>>> Is that deliberate?
>>>
>>>
>>>
>>> Maurizio
>>>
>>>
>>>
>>> On 13/08/2020 18:32, Jim Laskey wrote:
>>>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>>>>
>>>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.
>>>>
>>>>
>>>> Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>>>>
>>>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>>>> - A tokenizer shouldn't need to be aware of the unicode translations.
>>>> - There is no need for peek ahead.
>>>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>>>> - Names of existing methods were often confusing.
>>>>
>>>> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
>>>>
>>>> Some details:
>>>>
>>>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>>>> - characters are read from the buffer.
>>>> - if the character is a '\' then check to see if it begins a unicode escape sequence. If so, then translate.
>>>> - if the character is a high surrogate then check to see if the next character is the low surrogate. If so, then combine.
>>>> - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>>> The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode "logic."
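The steps above can be sketched as follows. This is a simplified stand-in, not the actual UnicodeReader API: it handles escape translation, surrogate combination, and the EOI-on-read-past-end behavior mentioned below, but skips the escaped-backslash rule discussed later in this mail.

```java
// Simplified sketch: the reader hands the tokenizer translated code points.
class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) { buf = source.toCharArray(); }

    private char rawChar() {
        // Reading past the end yields EOI instead of growing the buffer.
        return pos < buf.length ? buf[pos++] : '\u001A';
    }

    int nextCodePoint() {
        char ch = rawChar();
        // Translate \uXXXX escapes (the JLS also allows \uu...XXXX).
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;
            if (p + 3 < buf.length) {
                ch = (char) Integer.parseInt(new String(buf, p, 4), 16);
                pos = p + 4;
            }
        }
        // Combine a surrogate pair into a single code point.
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);
        }
        return ch;
    }
}
```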
>>>>
>>>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>>>> - This meant the buffer had to be large enough (grown) to accommodate it.
>>>> - There really was no need since we can simply return an EOI when trying to read past the end of buffer.
>>>>
>>>> - The only buffer mutability left behind is when reading digits.
>>>> - Unicode digits are still replaced with ASCII digits.
>>>> - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
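That remaining mutation might look roughly like the following sketch, assuming the goal is simply to normalize any decimal digit recognized by Character.digit to its ASCII form; the class and method names are hypothetical.

```java
class DigitNormalizer {
    // Sketch of the one remaining buffer mutation described above: a
    // unicode decimal digit is replaced by its ASCII equivalent so that
    // downstream literal parsing can assume '0'-'9'.
    static char asciiDigit(char ch) {
        int d = Character.digit(ch, 10);       // e.g. ARABIC-INDIC '\u0663' -> 3
        return d >= 0 ? (char) ('0' + d) : ch; // non-digits pass through
    }
}
```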
>>>>
>>>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>>>> - Thus, tokenizers don't have to special case '\\' (this happened frequently in the JavadocTokenizer).
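A sketch of why '\\' needs that special case: after an escaped backslash, a following uXXXX must stay literal text rather than being translated. This simplified loop (not the real implementation, which follows the full JLS even-backslash rule) shows the idea.

```java
class BackslashDemo {
    // Simplified translation loop: an escaped backslash "consumes" its
    // partner, so a following uXXXX is not mistaken for a unicode escape.
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i++);
            if (ch == '\\' && i < src.length()) {
                if (src.charAt(i) == '\\') {  // escaped backslash:
                    out.append("\\\\");       // emit both, skip escape check
                    i++;
                    continue;
                }
                if (src.charAt(i) == 'u' && i + 4 < src.length()) {
                    out.append((char) Integer.parseInt(src.substring(i + 1, i + 5), 16));
                    i += 5;
                    continue;
                }
            }
            out.append(ch);
        }
        return out.toString();
    }
}
```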
>>>>
>>>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>>>> - This simplified/clarified the code significantly.
>>>>
>>>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
>>>> - Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
>>>>
>>>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using the "if (accept('X')) ..." pattern.
>>>> - Actually, I tightened up a lot of these patterns, as you will see in the code.
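The accept() idiom might be sketched like this (names illustrative, not the actual JavaTokenizer code): the reader advances only when the current character matches, replacing the scattered position bookkeeping.

```java
class MiniScanner {
    private final String buf;
    private int pos;

    MiniScanner(String s) { buf = s; }

    char get() { return pos < buf.length() ? buf.charAt(pos) : '\u001A'; }

    // Advance only if the current character matches; replaces the
    // repeated "if (ch == 'X') bpos++" pattern.
    boolean accept(char ch) {
        if (get() == ch) { pos++; return true; }
        return false;
    }
}
```

Scanning a float exponent, for example, then reads as `if (accept('e') || accept('E')) { accept('+'); accept('-'); ... }`.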
>>>>
>>>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
>>>>
>>>> - The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld compiles 5% slower). However, once warmed up, the new scanner is faster than the existing code. The JDK's Java code compiles 5-10% faster.
>>>>
>>>> All comments, suggestions and contributions welcome.
>>>>
>>>> Cheers,
>>>>
>>>> --- Jim
>>