RFR: JDK-8224225 - Tokenizer improvements

Jim Laskey james.laskey at oracle.com
Thu Aug 13 17:32:54 UTC 2020

webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
jbs: https://bugs.openjdk.java.net/browse/JDK-8224225

I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.

Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.

- The UnicodeReader shouldn't really be accumulating characters for literals.
- A tokenizer shouldn't need to be aware of the unicode translations.
- There is no need for peek ahead.
- There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
- Names of existing methods were often confusing.

To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but would require deeper analysis.

Some details:
- UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
	- characters are read from the buffer.
	- if the character is a '\' then check to see if it begins a unicode escape sequence. If so, then translate.
	- if the character is a high surrogate then check to see if next character is the low surrogate. If so then combine.
		- A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
  The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode logic of its own.
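The steps above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual javac code: a low-level read translates "\\uXXXX" escapes, and the codepoint-level read pairs surrogates, so a tokenizer built on top sees a plain stream of codepoints. The escaped-backslash corner case is ignored here to keep the sketch short.

```java
final class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) { this.buf = source.toCharArray(); }

    /** Next codepoint, or -1 at end of input. */
    int nextCodePoint() {
        if (pos >= buf.length) return -1;
        char hi = readChar();
        if (Character.isHighSurrogate(hi) && pos < buf.length) {
            int save = pos;
            char lo = readChar();
            if (Character.isLowSurrogate(lo)) {
                return Character.toCodePoint(hi, lo);   // combine the pair
            }
            pos = save;                                 // not a pair; put it back
        }
        return hi;
    }

    /** One char, translating a unicode escape if present. */
    private char readChar() {
        char ch = buf[pos++];
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos;
            while (p < buf.length && buf[p] == 'u') p++; // extra 'u's are permitted
            if (p + 4 <= buf.length) {
                int value = Integer.parseInt(new String(buf, p, 4), 16);
                pos = p + 4;
                return (char) value;    // real code must validate the hex digits
            }
        }
        return ch;
    }
}
```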

- The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character. 
	- This meant the buffer had to be large enough (grown) to accommodate.
	- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
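A sketch of that point (assuming 0x1A, ASCII SUB, which is the value javac has historically used for its EOI sentinel): rather than growing the buffer to append a sentinel character, a read past the end simply answers EOI.

```java
final class EoiSketch {
    static final char EOI = 0x1A;   // assumed sentinel value; no buffer mutation needed

    static char charAt(char[] buf, int index) {
        return index < buf.length ? buf[index] : EOI;
    }
}
```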

- The only buffer mutability left behind is when reading digits.
	- Unicode digits are still replaced with ASCII digits.
		- This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
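For context, the kind of normalization meant here can be sketched as follows (a hypothetical helper, not the actual javac code): a non-ASCII decimal digit is rewritten as its ASCII equivalent before the literal text is captured.

```java
final class DigitSketch {
    /** Map any Unicode decimal digit to its ASCII form; leave other chars alone. */
    static char asciiDigit(char ch) {
        int d = Character.digit(ch, 10);
        return d >= 0 ? (char) ('0' + d) : ch;
    }
}
```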

- The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
	- Thus, tokenizers don't have to special case '\\' (this happened frequently in the JavadocTokenizer).
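The reason '\\' needs care: per JLS §3.3, a backslash begins a unicode escape only when it is preceded by an even number of backslashes, so the character sequence "\\u0041" is an escaped backslash followed by the five characters u0041, not the letter A. One way to express the check (an illustrative sketch, not the actual javac code):

```java
final class EscapedBackslashSketch {
    /** True if the backslash at backslashPos begins a unicode escape. */
    static boolean startsUnicodeEscape(char[] buf, int backslashPos) {
        int preceding = 0;  // count backslashes immediately before this one
        for (int i = backslashPos - 1; i >= 0 && buf[i] == '\\'; i--) preceding++;
        return preceding % 2 == 0               // this '\' is not itself escaped
            && backslashPos + 1 < buf.length
            && buf[backslashPos + 1] == 'u';
    }
}
```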

- JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
	- This simplified/clarified the code significantly.
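The accumulation style looks roughly like this (hypothetical names; the real JavaTokenizer is far more involved): the tokenizer appends scanned characters to its own StringBuilder instead of the reader maintaining a shared character array.

```java
final class LiteralSketch {
    /** Scan an identifier-like run starting at 'start', accumulating into a StringBuilder. */
    static String scanIdentifier(String src, int start) {
        StringBuilder sb = new StringBuilder();
        int i = start;
        while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) {
            sb.append(src.charAt(i++));
        }
        return sb.toString();
    }
}
```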

- Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
	- Otherwise, I would have had to reference "reader." everywhere, or create JavaTokenizer methods repeating the same logic. This was simpler and cleaner.

- Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.
	- Actually, I tightened up a lot of these patterns, as you will see in the code.
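The accept() idiom can be sketched like this (a simplified stand-in, not the actual reader): a single method tests the current character and advances in one step, replacing the scattered test-then-bump pattern.

```java
final class AcceptReader {
    private final String src;
    private int pos;

    AcceptReader(String src) { this.src = src; }

    char peek() { return pos < src.length() ? src.charAt(pos) : '\u001A'; }

    /** Test-and-advance in one step: consume ch if it is next, else do nothing. */
    boolean accept(char ch) {
        if (peek() == ch) { pos++; return true; }
        return false;
    }

    int position() { return pos; }
}
```

This keeps call sites like scanning "<=" down to two accept() calls instead of manual position bookkeeping.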

- There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.

- The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld compiles 5% slower). However, once warmed up, the new scanner is faster than the existing code: the JDK's Java sources compile 5-10% faster.

All comments, suggestions and contributions welcome.


--- Jim
