RFR: 8224225: Tokenizer improvements

Jim Laskey jlaskey at openjdk.java.net
Wed Sep 30 14:28:55 UTC 2020

Please review these changes to the javac scanner.

I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer
before venturing into the diffs.

Rationale, under the heading of technical debt and separation of concerns: There is a lot "going on" in the
JavaTokenizer/JavadocTokenizer that needed to be cleaned up.

- The UnicodeReader shouldn't really be accumulating characters for literals.
- A tokenizer shouldn't need to be aware of the unicode translations.
- There is no need for peek ahead.
- There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
- Names of existing methods were often confusing.

To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the
JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.

Some details:
- UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
	- characters are read from the buffer.
	- if the character is a '\' then check to see if the character is the beginning of a unicode escape sequence. If so,
	  then translate.
	- if the character is a high surrogate then check to see if the next character is the low surrogate. If so,
	  then combine.
		- A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
  The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode
  handling of its own.
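
For readers who have not opened the new source yet, the steps above can be sketched roughly as follows. This is a
minimal illustration, not the code under review: the class name CodePointReaderSketch, the methods next and nextUnit,
and the EOI value 0x1A are assumptions made for the example. Error reporting and the special handling of a doubled
backslash are left out.

```java
/** Hypothetical sketch of a reader that hands out translated code points. */
final class CodePointReaderSketch {
    /** End-of-input marker; the value is an assumption for this sketch. */
    static final int EOI = 0x1A;

    private final char[] buffer;
    private int position;

    CodePointReaderSketch(String source) {
        buffer = source.toCharArray();
    }

    /** Reads one UTF-16 unit, translating a unicode escape if one starts here. */
    private int nextUnit() {
        if (position >= buffer.length) {
            return EOI; // nothing is written to the buffer
        }
        char ch = buffer[position++];
        if (ch != '\\' || position >= buffer.length || buffer[position] != 'u') {
            return ch;
        }
        int index = position;
        while (index < buffer.length && buffer[index] == 'u') {
            index++; // an escape may contain several 'u' characters
        }
        if (index + 4 > buffer.length) {
            return ch; // too short; a real scanner would report an error
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int digit = Character.digit(buffer[index + i], 16);
            if (digit < 0) {
                return ch; // malformed; a real scanner would report an error
            }
            value = (value << 4) | digit;
        }
        position = index + 4;
        return value;
    }

    /** Returns the next code point, combining a surrogate pair when present. */
    int next() {
        int unit = nextUnit();
        if (Character.isHighSurrogate((char) unit)) {
            int saved = position;
            int low = nextUnit();
            if (Character.isLowSurrogate((char) low)) {
                return Character.toCodePoint((char) unit, (char) low);
            }
            position = saved; // unpaired high surrogate: return it alone
        }
        return unit;
    }
}
```

A caller loops on next() until it sees EOI and never inspects escapes or surrogate halves itself.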

- The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
	- This meant the buffer had to be large enough (grown) to accommodate.
	- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
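
The idea of returning an EOI instead of storing one can be illustrated with a bounds-checked accessor. This is only a
sketch: EndOfInputSketch and charAt are hypothetical names, and the EOI value 0x1A is an assumption.

```java
/** Hypothetical sketch: report end of input without touching the buffer. */
final class EndOfInputSketch {
    /** End-of-input marker; the value is an assumption for this sketch. */
    static final char EOI = 0x1A;

    /** Returns the character at index, or EOI when index is past the end. */
    static char charAt(char[] buffer, int index) {
        return index < buffer.length ? buffer[index] : EOI;
    }
}
```

The buffer is never grown or written to, so it can be exactly as long as the source text.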

- The only buffer mutability left behind is when reading digits.
	- Unicode digits are still replaced with ASCII digits.
		- This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can
enlighten me.
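
For context, the kind of replacement described above can be expressed with Character.digit. The sketch below is an
illustration only: DigitNormalizerSketch and toAsciiDigits are hypothetical names, and it builds a new string rather
than mutating a buffer in place as the existing code does.

```java
/** Hypothetical sketch: map Unicode decimal digits to their ASCII equivalents. */
final class DigitNormalizerSketch {
    static String toAsciiDigits(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            // Character.digit returns -1 when ch is not a digit in the given radix.
            int value = Character.digit(ch, 10);
            result.append(value >= 0 ? (char) ('0' + value) : ch);
        }
        return result.toString();
    }
}
```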

- The sequence '\\' is special-cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
	- Thus, tokenizers don't have to special-case '\\' (happened frequently in the JavadocTokenizer.)
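
As background, my understanding of the language specification is that a backslash can begin a unicode escape only
when it is preceded by an even number of contiguous backslashes, which is why "\\uXXXX" must not be translated. The
sketch below checks that rule by counting backwards. It is an illustration under that assumption, not the approach
taken in the new UnicodeReader; EscapeEligibilitySketch and startsUnicodeEscape are hypothetical names.

```java
/** Hypothetical sketch: decide whether a backslash may begin a unicode escape. */
final class EscapeEligibilitySketch {
    static boolean startsUnicodeEscape(CharSequence text, int index) {
        if (index + 1 >= text.length()
                || text.charAt(index) != '\\'
                || text.charAt(index + 1) != 'u') {
            return false;
        }
        // Count the contiguous backslashes immediately before this one.
        int preceding = 0;
        for (int i = index - 1; i >= 0 && text.charAt(i) == '\\'; i--) {
            preceding++;
        }
        // Eligible only when that count is even.
        return preceding % 2 == 0;
    }
}
```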

- JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
	- This simplified/clarified the code significantly.
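
To show why a separate accumulator helps, here is a minimal sketch of scanning a string literal into a StringBuilder.
The literal's value differs from its source text once escapes are involved, so the tokenizer needs somewhere other than
the input buffer to build it. LiteralAccumulatorSketch and scanStringLiteral are hypothetical names, and only a few
escapes are handled.

```java
/** Hypothetical sketch: accumulate a string literal's value in a StringBuilder. */
final class LiteralAccumulatorSketch {
    /** Scans a double-quoted literal starting at start and returns its value. */
    static String scanStringLiteral(String source, int start) {
        if (start >= source.length() || source.charAt(start) != '"') {
            throw new IllegalArgumentException("expected opening quote");
        }
        StringBuilder literal = new StringBuilder();
        int position = start + 1;
        while (position < source.length()) {
            char ch = source.charAt(position++);
            if (ch == '"') {
                return literal.toString();
            }
            if (ch == '\\' && position < source.length()) {
                char escaped = source.charAt(position++);
                switch (escaped) {
                    case 'n': literal.append('\n'); break;
                    case 't': literal.append('\t'); break;
                    // Covers an escaped quote or backslash; other escapes are omitted.
                    default: literal.append(escaped); break;
                }
            } else {
                literal.append(ch);
            }
        }
        throw new IllegalArgumentException("unterminated string literal");
    }
}
```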

- Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer
  a subclass of UnicodeReader.
	- Otherwise, I would have had to reference "reader." everywhere or create JavaTokenizer methods that
repeat the same logic. This was simpler and cleaner.

- Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.
	- Actually, I tightened up a lot of these patterns, as you will see in the code.
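
A minimal sketch of the accept pattern, for anyone unfamiliar with it. The names AcceptSketch and scanEquals are
hypothetical, and the real accept in the patch may differ in signature and behaviour.

```java
/** Hypothetical sketch: test-and-advance in one call instead of "if (ch == 'X') bpos++". */
final class AcceptSketch {
    private final String source;
    private int position;

    AcceptSketch(String source) {
        this.source = source;
    }

    /** Consumes the current character and returns true only if it matches. */
    boolean accept(char expected) {
        if (position < source.length() && source.charAt(position) == expected) {
            position++;
            return true;
        }
        return false;
    }

    /** Scans "=" or "==" at the current position; returns null if neither is present. */
    String scanEquals() {
        if (!accept('=')) {
            return null;
        }
        return accept('=') ? "==" : "=";
    }
}
```

The comparison and the position update live in one place, so each call site reads as a single condition.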

- There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and
  more modular.

- The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld is 5% slower). However, once
  warmed up, this new scanner is faster than the existing code. The JDK Java code compiles 5-10% faster.

Previous review: https://mail.openjdk.java.net/pipermail/compiler-dev/2020-August/014806.html


Commit messages:
 - 8224225: Tokenizer improvements

Changes: https://git.openjdk.java.net/jdk/pull/435/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=435&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8224225
  Stats: 2353 lines in 18 files changed: 1119 ins; 603 del; 631 mod
  Patch: https://git.openjdk.java.net/jdk/pull/435.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/435/head:pull/435

PR: https://git.openjdk.java.net/jdk/pull/435

