RFR: JDK-8224225 - Tokenizer improvements

Fri Aug 14 17:42:09 UTC 2020

Hi Jim,
this is a very good cleanup. I like how the new code makes the 
tokenizers a lot less verbose than before, and I like how you have 
refactored the split between UnicodeReader and DocReader (now 
PositionTrackingReader), as the status quo was messy, and we had a lot 
of flexibility on paper that wasn't really used in practice and ended up 
making the code more complex than it needed to be.

Big thumbs up from me.

Minor comment: what's up with SubChar.java? The webrev is empty, but the 
patch reports the following diff:

diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java

--- a/test/langtools/tools/javac/unicode/SubChar.java
+++ b/test/langtools/tools/javac/unicode/SubChar.java
@@ -45,4 +45,4 @@
          return;
      }
  }
-/* \u001A */
+/* \u001A */
\ No newline at end of file


Is that deliberate?


Maurizio


On 13/08/2020 18:32, Jim Laskey wrote:
> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>
> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. 
> JavaTokenizer and then 3. JavadocTokenizer before venturing into the 
> diffs.
>
>
> Rationale, under the heading of technical debt: There is a lot "going 
> on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>
> - The UnicodeReader shouldn't really be accumulating characters for 
> literals.
> - A tokenizer shouldn't need to be aware of the unicode translations.
> - There is no need for peek ahead.
> - There were a lot of repetitive tasks that should be done in methods 
> instead of complex expressions.
> - Names of existing methods were often confusing.
>
> To avoid disruption, I avoided changing logic, except in the 
> UnicodeReader. There are some relics in the 
> JavaTokenizer/JavadocTokenizer that could be cleaned up but require 
> deeper analysis.
>
> Some details:
> - UnicodeReader was reworked to provide tokenizers a running stream of 
> unicode characters/codepoints. Steps:
> - characters are read from the buffer.
> - if the character is a '\' then check to see if the character is the 
> beginning of a unicode escape sequence. If so, then translate.
> - if the character is a high surrogate then check to see if the next 
> character is a low surrogate. If so then combine.
> - A tokenizer can test a codepoint with the isSurrogate predicate 
> (when/if needed.)
>   Putting this logic on UnicodeReader's shoulders means that a 
> tokenizer does not need to have any unicode logic.
>
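
The steps listed above can be sketched roughly as follows. This is only an illustration, not the code in the webrev: the class CodePointReaderSketch, its methods and the EOI value 0x1A are assumptions made for the example, and it deliberately ignores the '\\' special case described further down.

```java
// Hypothetical sketch; class, method and constant names are illustrative only.
final class CodePointReaderSketch {
    static final int EOI = 0x1A; // assumed end-of-input marker

    private final char[] buf;
    private int pos;

    CodePointReaderSketch(String source) {
        this.buf = source.toCharArray();
    }

    /** Returns the next code point, or EOI when the buffer is exhausted. */
    int next() {
        if (pos >= buf.length) {
            return EOI;
        }
        char ch = nextChar();
        if (Character.isHighSurrogate(ch) && pos < buf.length) {
            int save = pos;
            char low = nextChar();
            if (Character.isLowSurrogate(low)) {
                return Character.toCodePoint(ch, low);
            }
            pos = save; // no pair: leave the following character unread
        }
        return ch;
    }

    /** Reads one char, translating a backslash-u escape if one starts here. */
    private char nextChar() {
        char ch = buf[pos++];
        if (ch != '\\') {
            return ch;
        }
        int p = pos;
        while (p < buf.length && buf[p] == 'u') {
            p++; // an escape may contain more than one 'u'
        }
        if (p == pos || p + 4 > buf.length) {
            return ch; // plain backslash
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int digit = Character.digit(buf[p + i], 16);
            if (digit < 0) {
                return ch; // malformed escape: treated as a plain backslash here
            }
            value = (value << 4) | digit;
        }
        pos = p + 4;
        return (char) value;
    }
}
```

With this shape a caller only ever sees code points, so a surrogate pair written as two escapes comes back as a single value from next().
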
> - The old UnicodeReader modified the source buffer to insert an EOI 
> character at the end to mark the last character.
> - This meant the buffer had to be large enough (grown) to accommodate.
> - There really was no need since we can simply return an EOI when 
> trying to read past the end of buffer.
>
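
A minimal sketch of the "return an EOI instead of storing one" idea, assuming a hypothetical EndOfInputSketch class and an EOI value of 0x1A (neither is taken from the patch). The source array is never written to or grown.

```java
// Hypothetical sketch: the reader synthesizes EOI rather than appending it.
final class EndOfInputSketch {
    static final char EOI = 0x1A; // assumed end-of-input marker

    private final char[] buf;
    private int pos;

    EndOfInputSketch(char[] buf) {
        this.buf = buf; // used as is: no copy, no extra slot for a sentinel
    }

    /** Current character, or EOI when positioned past the end. */
    char get() {
        return pos < buf.length ? buf[pos] : EOI;
    }

    /** Advances (never beyond the end) and returns the new current character. */
    char next() {
        if (pos < buf.length) {
            pos++;
        }
        return get();
    }
}
```

Reading past the end is then harmless and repeatable, which is all the sentinel was providing.
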
> - The only buffer mutability left behind is when reading digits.
> - Unicode digits are still replaced with ASCII digits.
> - This seems unnecessary, but I didn't want to risk messing around 
> with the existing logic. Maybe someone can enlighten me.
>
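
For reference, an in-place replacement of non-ASCII decimal digits could look like the sketch below. It is an assumption about the general shape only, using the standard Character.isDigit and Character.digit methods. AsciiDigitSketch is a made-up name, and the sketch handles BMP characters only.

```java
// Hypothetical sketch: rewrite non-ASCII decimal digits as '0'..'9' in place.
final class AsciiDigitSketch {
    static void normalizeDigits(char[] buf) {
        for (int i = 0; i < buf.length; i++) {
            char ch = buf[i];
            // ASCII characters are left untouched; only other Unicode digits change.
            if (ch > 0x7F && Character.isDigit(ch)) {
                buf[i] = (char) ('0' + Character.digit(ch, 10));
            }
        }
    }
}
```
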
> - The sequence '\\' is special cased in the UnicodeReader so that the 
> sequence "\\uXXXX" is handled properly.
> - Thus, tokenizers don't have to special case '\\' (happened 
> frequently in the JavadocTokenizer.)
>
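
To illustrate why the '\\' pair needs special treatment: a backslash that directly follows another unpaired backslash cannot start a unicode escape. The sketch below is a standalone translation pass written for this example, with the hypothetical names BackslashPairSketch, translate and escapeEnd. The real reader works incrementally rather than on a whole string.

```java
// Hypothetical sketch: translate escapes, but never inside a backslash pair.
final class BackslashPairSketch {
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i++);
            if (ch == '\\' && i < src.length()) {
                if (src.charAt(i) == '\\') {
                    // Backslash pair: pass both through so the second one
                    // is never considered as the start of an escape.
                    out.append(ch).append('\\');
                    i++;
                    continue;
                }
                int end = escapeEnd(src, i);
                if (end > 0) {
                    out.append((char) Integer.parseInt(src.substring(end - 4, end), 16));
                    i = end;
                    continue;
                }
            }
            out.append(ch);
        }
        return out.toString();
    }

    /** Index just past a well-formed "u...XXXX" tail starting at i, or -1. */
    private static int escapeEnd(String src, int i) {
        int p = i;
        while (p < src.length() && src.charAt(p) == 'u') {
            p++;
        }
        if (p == i || p + 4 > src.length()) {
            return -1;
        }
        for (int k = p; k < p + 4; k++) {
            if (Character.digit(src.charAt(k), 16) < 0) {
                return -1;
            }
        }
        return p + 4;
    }
}
```

One backslash before u0041 yields 'A', two leave the text untouched, and three yield the pair followed by 'A'.
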
> - JavaTokenizer was modified to accumulate scanned literals in a 
> StringBuilder.
> - This simplified/clarified the code significantly.
>
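
A small sketch of what accumulating a literal in a StringBuilder buys, assuming a hypothetical LiteralSketch scanner: characters that belong to the token but not to its value (here the '_' separators of a decimal literal) are simply not appended. It does not validate where the underscores sit.

```java
// Hypothetical sketch: collect the value of a decimal literal, dropping '_'.
final class LiteralSketch {
    private final String src;
    private int pos;
    private final StringBuilder sb = new StringBuilder();

    LiteralSketch(String src) {
        this.src = src;
    }

    /** Scans digits and '_' separators from the current position. */
    String scanDecimal() {
        sb.setLength(0); // reuse the same builder for every literal
        while (pos < src.length()) {
            char ch = src.charAt(pos);
            if (ch >= '0' && ch <= '9') {
                sb.append(ch);
            } else if (ch != '_') {
                break; // first character that is not part of the literal
            }
            pos++;
        }
        return sb.toString();
    }
}
```
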
> - Since a lot of the functionality needed by the JavaTokenizer comes 
> directly from a UnicodeReader, I made JavaTokenizer a subclass 
> of UnicodeReader.
> - Otherwise, I would have had to reference "reader." everywhere or 
> create JavaTokenizer methods to repeat the same logic. 
> This was simpler and cleaner.
>
> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched 
> to using "if (accept('X')) " patterns.
> - Actually, I tightened up a lot of these patterns, as you will see in 
> the code.
>
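
The accept pattern mentioned above, as a minimal sketch. AcceptSketch and scanOperator are hypothetical names and only accept('X') is taken from the description: the test and the position bump live in one method, so call sites read as a sequence of questions.

```java
// Hypothetical sketch of the "if (accept('X'))" style.
final class AcceptSketch {
    private final String src;
    private int pos;

    AcceptSketch(String src) {
        this.src = src;
    }

    /** Consumes the current character if it equals ch. */
    boolean accept(char ch) {
        if (pos < src.length() && src.charAt(pos) == ch) {
            pos++;
            return true;
        }
        return false;
    }

    /** Scans one of "=", "==", "<", "<=" or returns "" if none matches. */
    String scanOperator() {
        if (accept('=')) {
            return accept('=') ? "==" : "=";
        }
        if (accept('<')) {
            return accept('=') ? "<=" : "<";
        }
        return "";
    }
}
```
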
> - There are a lot of great mysteries in JavadocTokenizer, but I think 
> I cracked most of them. The code is simpler and more modular.
>
> - The new scanner is slower to warm up due to new layers of method 
> calls (e.g. HelloWorld is 5% slower). However, once warmed up, this new 
> scanner is faster than the existing code. The JDK java code compiles 
> 5-10% faster.
>
> All comments, suggestions and contributions welcome.
>
> Cheers,
>
> --- Jim