RFR: JDK-8224225 - Tokenizer improvements

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Aug 14 21:20:05 UTC 2020


On 14/08/2020 19:09, Jim Laskey wrote:
> Thank you.
>
> The commentary for SubChar.java reads:
>
> "Note: this source file has been crafted very carefully to end with the
> unicode escape sequence for the control-Z character without a
> following newline.  The scanner is specified to allow control-Z there.
> If you edit this source file, please make sure that your editor does
> not insert a newline after that trailing line."
>
> The person that checked in the test did exactly that - added a newline 
> after that trailing line (this was done when moving to mercurial in 2007.)
:-D
>
> This patch is to remove that newline. Note that jcheck doesn't check 
> tests for missing last newline.

Very interesting - thanks for the explanation!

Maurizio

>
> Cheers,
>
> -- Jim
>
>
>> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore 
>> <maurizio.cimadamore at oracle.com 
>> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>
>> Hi Jim,
>> this is a very good cleanup. I like how the new code makes the 
>> tokenizers a lot less verbose than before, and I like how you have 
>> refactored the various UnicodeReader vs. DocReader (now 
>> PositionTrackingReader), as the status quo was messy, and we had a 
>> lot of flexibility on paper that wasn't really used in practice and 
>> ended up making the code more complex than it needed to be.
>>
>> Big thumbs up from me.
>>
>> Minor comment: what's up with SubChar.java? Webrev is empty, but the 
>> patch reports the following diff:
>>
>> diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
>> --- a/test/langtools/tools/javac/unicode/SubChar.java
>> +++ b/test/langtools/tools/javac/unicode/SubChar.java
>> @@ -45,4 +45,4 @@
>>           return;
>>       }
>>   }
>> -/* \u001A */
>> +/* \u001A */
>> \ No newline at end of file
>>
>>
>> Is that deliberate?
>>
>>
>> Maurizio
>>
>>
>> On 13/08/2020 18:32, Jim Laskey wrote:
>>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>>>
>>> I recommend looking at the "new" versions of 1. UnicodeReader, then 
>>> 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into 
>>> the diffs.
>>>
>>>
>>> Rationale, under the heading of technical debt: There is a lot 
>>> "going on" in the JavaTokenizer/JavadocTokenizer that needed to be 
>>> cleaned up.
>>>
>>> - The UnicodeReader shouldn't really be accumulating characters for 
>>> literals.
>>> - A tokenizer shouldn't need to be aware of the unicode translations.
>>> - There is no need for peek ahead.
>>> - There were a lot of repetitive tasks that should be done in 
>>> methods instead of complex expressions.
>>> - Names of existing methods were often confusing.
>>>
>>> To avoid disruption, I avoided changing the logic, except in the 
>>> UnicodeReader. There are some relics in the 
>>> JavaTokenizer/JavadocTokenizer that could be cleaned up but require 
>>> deeper analysis.
>>>
>>> Some details:
>>> - UnicodeReader was reworked to provide tokenizers a running stream 
>>> of unicode characters/codepoints. Steps:
>>> - characters are read from the buffer.
>>> - if the character is a '\', check whether it begins a unicode 
>>> escape sequence. If so, translate.
>>> - if the character is a high surrogate, check whether the next 
>>> character is a low surrogate. If so, combine.
>>> - A tokenizer can test a codepoint with the isSurrogate predicate 
>>> (when/if needed.)
>>>   The result of putting this logic on UnicodeReader's shoulders 
>>> means that a tokenizer does not need to have any unicode logic.
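>>> A rough sketch of those reading steps (class and method names here
>>> are illustrative, not javac's actual UnicodeReader API):

```java
// Sketch of the character-translation steps described above: read a
// char, decode unicode escapes, combine surrogate pairs. Illustrative
// only; not javac's code. Assumes well-formed \\uXXXX escapes.
class ReaderSketch {
    static final char EOI = 0x1A;   // control-Z, returned past end of buffer

    private final char[] buf;
    private int pos;

    ReaderSketch(String source) {
        this.buf = source.toCharArray();
    }

    /** Next translated codepoint: escapes decoded, surrogate pairs combined. */
    int nextCodePoint() {
        char ch = nextChar();
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            return unicodeEscape();
        }
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);
        }
        return ch;
    }

    private char nextChar() {
        return pos < buf.length ? buf[pos++] : EOI;
    }

    private int unicodeEscape() {
        // JLS allows one or more 'u' characters after the backslash.
        while (pos < buf.length && buf[pos] == 'u') pos++;
        int code = 0;
        for (int i = 0; i < 4; i++) {
            code = (code << 4) + Character.digit(buf[pos++], 16);
        }
        return code;
    }
}
```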
>>>
>>> - The old UnicodeReader modified the source buffer to insert an EOI 
>>> character at the end to mark the last character.
>>> - This meant the buffer had to be large enough (grown) to accommodate.
>>> - There really was no need, since we can simply return an EOI when 
>>> trying to read past the end of the buffer.
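>>> That end-of-input handling can be as simple as (sketch; names are
>>> illustrative, EOI being the control-Z sentinel):

```java
// Instead of growing the buffer to append a sentinel character,
// out-of-range reads simply yield EOI. Illustrative sketch only.
class EoiSketch {
    static final char EOI = 0x1A;   // control-Z sentinel, as in SubChar.java

    static char charAt(char[] buf, int pos) {
        return pos < buf.length ? buf[pos] : EOI;
    }
}
```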
>>>
>>> - The only buffer mutability left behind is when reading digits.
>>> - Unicode digits are still replaced with ASCII digits.
>>> - This seems unnecessary, but I didn't want to risk messing around 
>>> with the existing logic. Maybe someone can enlighten me.
>>>
>>> - The sequence '\\' is special cased in the UnicodeReader so that 
>>> the sequence "\\uXXXX" is handled properly.
>>> - Thus, tokenizers don't have to special case '\\' (happened 
>>> frequently in the JavadocTokenizer.)
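>>> The rule being centralized here is JLS 3.3: a '\' starts a unicode
>>> escape only if it is preceded by an even number of backslashes. A
>>> sketch of that check (name and shape are hypothetical):

```java
// Eligibility check for a backslash at buf[pos]: it can begin a
// \\uXXXX escape only if an even number of backslashes precede it,
// so the '\\' in "\\uXXXX" is NOT translated. Sketch, not javac's code.
class BackslashSketch {
    static boolean eligibleBackslash(char[] buf, int pos) {
        int slashes = 0;
        for (int i = pos - 1; i >= 0 && buf[i] == '\\'; i--) {
            slashes++;
        }
        return slashes % 2 == 0;
    }
}
```

>>> Handling this once in the reader is what lets the tokenizers drop
>>> their own '\\' special cases.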
>>>
>>> - JavaTokenizer was modified to accumulate scanned literals in a 
>>> StringBuilder.
>>> - This simplified/clarified the code significantly.
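>>> The accumulation pattern looks roughly like this (sketch; method
>>> names are made up for illustration):

```java
// Literal accumulation owned by the tokenizer rather than the reader:
// append characters/codepoints as they are scanned, then take the
// finished literal. Illustrative sketch only.
class LiteralSketch {
    private final StringBuilder sb = new StringBuilder();

    void put(char ch)       { sb.append(ch); }
    void put(int codePoint) { sb.appendCodePoint(codePoint); }

    /** Finished literal text; clears the buffer for the next token. */
    String takeLiteral() {
        String s = sb.toString();
        sb.setLength(0);
        return s;
    }
}
```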
>>>
>>> - Since a lot of the functionality needed by the JavaTokenizer comes 
>>> directly from a UnicodeReader, I made JavaTokenizer a subclass 
>>> of UnicodeReader.
>>> - Otherwise, I would have had to reference "reader." everywhere or 
>>> create JavaTokenizer methods to repeat the same logic. 
>>> This was simpler and cleaner.
>>>
>>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I 
>>> switched to using "if (accept('X')) " patterns.
>>> - Actually, I tightened up a lot of these patterns, as you will see 
>>> in the code.
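>>> The accept pattern in miniature (sketch over a plain char buffer;
>>> not the actual tokenizer API):

```java
// accept(c) advances only when the current character matches, replacing
// the repeated "if (ch == 'X') bpos++" pattern. Illustrative sketch.
class AcceptSketch {
    private final char[] buf;
    private int pos;

    AcceptSketch(String s) { buf = s.toCharArray(); }

    char ch() {
        return pos < buf.length ? buf[pos] : 0x1A;  // EOI past the end
    }

    /** If the current character is c, consume it and report success. */
    boolean accept(char c) {
        if (ch() == c) {
            pos++;
            return true;
        }
        return false;
    }
}
```

>>> A call site like scanning a hex prefix then reads as
>>> "if (accept('0') && (accept('x') || accept('X'))) ...".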
>>>
>>> - There are a lot of great mysteries in JavadocTokenizer, but I 
>>> think I cracked most of them. The code is simpler and more modular.
>>>
>>> - The new scanner is slower to warm up due to new layers of method 
>>> calls (ex. HelloWorld is 5% slower). However, once warmed up, this 
>>> new scanner is faster than the existing code. The JDK java code 
>>> compiles 5-10% faster.
>>>
>>> All comments, suggestions and contributions welcome.
>>>
>>> Cheers,
>>>
>>> --- Jim
>