RFR: 8224225: Tokenizer improvements [v2]

Thu Oct 1 13:41:15 UTC 2020

On Thu, 1 Oct 2020 13:32:42 GMT, Jim Laskey <jlaskey at openjdk.org> wrote:

>> Please review these changes to the javac scanner.
>> 
>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer
>> before venturing into the diffs.
>> Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the
>> JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>> - A tokenizer shouldn't need to be aware of the unicode translations.
>> - There is no need for peek ahead.
>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>> - Names of existing methods were often confusing.
>> 
>> To avoid disruption, I avoided changing logical, except in the UnicodeReader. There are some relics in the
>> JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
>> Some details;
>>   
>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>> 	- characters are read from the buffer.
>> 	- if the character is a '' then check to see if the character is the beginning of an unicode escape sequence, If so,
>> then translate. 	- if the character is a high surrogate then check to see if next character is the low surrogate. If so
>> then combine. 		- A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>   The result of putting this logic on UnicodeReader's shoulders means that a tokenizer does not need have any unicode
>>   "logical."
>> 
>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>> 	- This meant the buffer had to be large enough (grown) to accommodate.
>> 	- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
>> 
>> - The only buffer mutability left behind is when reading digits.
>> 	- Unicode digits are still replaced with ASCII digits.
>> 		- This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can
>> enlighten me.
>> - The sequence '\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>> 	- Thus, tokenizers don't have to special case '\' (happened frequently in the JavadocTokenizer.)
>> 
>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>> 	- This simplified/clarified the code significantly.
>> 
>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer
>>   a subclass of UnicodeReader.
>> 	- Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to
>> repeat the same logic. This was simpler and cleaner.
>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.
>> 	- Actually, I tightened up a lot of these patterns, as you will see in the code.
>> 
>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and
>>   more modular.
>> 
>> - The new scanner is slower to warm up due to new layers of method calls (ex. HelloWorld is 5% slower). However, once
>>   warmed up, this new scanner is faster than the existing code. The JDK java code compiles 5-10% faster.
>> 
>> Previous review: https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014806.html
>
> Jim Laskey has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev
> excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since
> the last revision:
>  - Merge branch 'master' into 8224225
>  - 8224225: Tokenizer improvements

Approved as per:
https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014810.html

-------------

Marked as reviewed by mcimadamore (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/435