Integrated: 8224225: Tokenizer improvements

Jim Laskey jlaskey at
Thu Oct 1 15:43:09 UTC 2020

On Wed, 30 Sep 2020 14:19:21 GMT, Jim Laskey <jlaskey at> wrote:

> Please review these changes to the javac scanner.
> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer
> before venturing into the diffs.
> Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the
> JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
> - The UnicodeReader shouldn't really be accumulating characters for literals.
> - A tokenizer shouldn't need to be aware of the unicode translations.
> - There is no need for peek ahead.
> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
> - Names of existing methods were often confusing.
> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the
> JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
> Some details:
> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
> 	- Characters are read from the buffer.
> 	- If the character is a '\' then check to see if it is the beginning of a unicode escape sequence. If so,
> 	  then translate.
> 	- If the character is a high surrogate then check to see if the next character is the low surrogate. If so,
> 	  then combine.
> 		- A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>   The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any
>   unicode "logic."
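The steps above can be sketched in miniature. This is a hypothetical illustration, not the actual javac UnicodeReader; the class and method names are invented, and the even-number-of-backslashes rule from JLS §3.3 is elided here for brevity:

```java
// Hypothetical sketch of the reading pipeline described above.
public class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) {
        this.buf = source.toCharArray();
    }

    /** Returns the next codepoint, translating \\uXXXX escapes and
     *  combining surrogate pairs, or -1 at end of input. */
    int nextCodePoint() {
        if (pos >= buf.length) return -1;
        char ch = buf[pos++];                        // step 1: read from buffer
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;  // JLS allows \uuu...
            if (p + 4 <= buf.length) {
                int value = Integer.parseInt(new String(buf, p, 4), 16);
                pos = p + 4;
                ch = (char) value;                   // step 2: translate escape
            }
        }
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);  // step 3: combine
        }
        return ch;
    }
}
```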
> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
> 	- This meant the buffer had to be large enough (grown) to accommodate.
> 	- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
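The "no sentinel needed" idea can be shown in a few lines. This is an illustrative sketch, not javac's code; the names are hypothetical, though 0x1A is the traditional end-of-input marker:

```java
// Instead of growing the buffer to append an EOI sentinel, reads past
// the end simply answer EOI -- no mutation, no resize.
final class Buffer {
    static final char EOI = 0x1A;  // traditional end-of-input marker
    private final char[] buf;
    private int pos;

    Buffer(char[] buf) { this.buf = buf; }

    char get() {
        return pos < buf.length ? buf[pos++] : EOI;
    }
}
```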
> - The only buffer mutability left behind is when reading digits.
> 	- Unicode digits are still replaced with ASCII digits.
> 		- This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can
> enlighten me.
> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly
>   (i.e., not mistakenly translated as a unicode escape).
> 	- Thus, tokenizers don't have to special case '\\' (which happened frequently in the JavadocTokenizer.)
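The rule behind that special case comes from JLS §3.3: "\u" begins a unicode escape only when the backslash is preceded by an even number of backslashes, so the second '\' in "\\uXXXX" must not start a translation. An illustrative check (not javac's code; the names are invented):

```java
final class UnicodeEscapes {
    /** True if the '\' at index i begins a unicode escape, applying the
     *  JLS 3.3 even-number-of-preceding-backslashes rule. */
    static boolean startsUnicodeEscape(char[] buf, int i) {
        if (buf[i] != '\\' || i + 1 >= buf.length || buf[i + 1] != 'u')
            return false;
        int backslashes = 0;               // count contiguous '\' before i
        for (int j = i - 1; j >= 0 && buf[j] == '\\'; j--)
            backslashes++;
        return backslashes % 2 == 0;       // odd => it is an escaped backslash
    }
}
```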
> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
> 	- This simplified/clarified the code significantly.
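A toy version of that division of labor, with invented names: the tokenizer owns the StringBuilder and accumulates the literal itself, rather than the reader collecting characters on its behalf.

```java
final class LiteralScan {
    /** Scans an identifier starting at 'start', accumulating it in a
     *  StringBuilder owned by the tokenizer, not the reader. */
    static String scanIdentifier(String src, int start) {
        StringBuilder sb = new StringBuilder();
        int i = start;
        while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
            sb.append(src.charAt(i++));
        }
        return sb.toString();
    }
}
```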
> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer
>   a subclass of UnicodeReader.
> 	- Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to
> repeat the same logic. This was simpler and cleaner.
> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X'))" patterns.
> 	- Actually, I tightened up a lot of these patterns, as you will see in the code.
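A minimal sketch of the accept() idiom described above (field and method names are illustrative, not the actual javac ones): accept tests the next character and consumes it only on a match, so the test and the position bump collapse into one call.

```java
class MiniScanner {
    private final String src;
    private int pos;

    MiniScanner(String src) { this.src = src; }

    char peek() { return pos < src.length() ? src.charAt(pos) : '\u001A'; }

    /** If the next character is ch, consume it and report success. */
    boolean accept(char ch) {
        if (peek() == ch) { pos++; return true; }
        return false;
    }

    /** Example use: consume an optional sign before a number. */
    boolean scanSign() {
        return accept('+') || accept('-');
    }
}
```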
> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and
>   more modular.
> - The new scanner is slower to warm up due to new layers of method calls (ex. HelloWorld is 5% slower). However, once
>   warmed up, this new scanner is faster than the existing code. The JDK java code compiles 5-10% faster.
> Previous review:

This pull request has now been integrated.

Changeset: 90c131f2
Author:    Jim Laskey <jlaskey at>
Stats:     2353 lines in 18 files changed: 1119 ins; 603 del; 631 mod

8224225: Tokenizer improvements

Reviewed-by: mcimadamore



More information about the compiler-dev mailing list