RFR: JDK-8224225 - Tokenizer improvements

Jim Laskey james.laskey at oracle.com
Fri Aug 14 18:09:30 UTC 2020

Thank you.

The commentary for SubChar.java reads.

"Note: this source file has been crafted very carefully to end with the
unicode escape sequence for the control-Z character without a
following newline.  The scanner is specified to allow control-Z there.
If you edit this source file, please make sure that your editor does
not insert a newline after that trailing line."

The person that checked in the test did exactly that - added a newline after that trailing line (this was done when moving to mercurial in 2007.)

This patch is to remove that newline. Note that jcheck doesn't check tests for missing last newline.


-- Jim

> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> Hi Jim,
> this is a very good cleanup. I like how the new code makes the tokenizers a lot less verbose than before, and I like how you have refactored the various UnicodeReader vs. DocReader (now PositionTrackingReader), as the status quo was messy, and we had a lot of flexibility on paper that wasn't really used in practice and ended up making the code more complex than it needed to be.
> Big thumbs up from me.
> Minor comment: what's up with SubChar.java? Webrev is empty, but patch reports following diff:
> iff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
> --- a/test/langtools/tools/javac/unicode/SubChar.java
> +++ b/test/langtools/tools/javac/unicode/SubChar.java
> @@ -45,4 +45,4 @@
>          return;
>      }
>  }
> -/* \u001A */
> +/* \u001A */
> \ No newline at end of file
> Is that deliberate?
> Maurizio
> On 13/08/2020 18:32, Jim Laskey wrote:
>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04 <http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04>
>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225 <https://bugs.openjdk.java.net/browse/JDK-8224225>
>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.
>> Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>> - A tokenizer shouldn't need to be aware of the unicode translations.
>> - There is no need for peek ahead.
>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>> - Names of existing methods were often confusing.
>> To avoid disruption, I avoided changing logical, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
>> Some details;
>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>> 	- characters are read from the buffer.
>> 	- if the character is a '\' then check to see if the character is the beginning of an unicode escape sequence, If so, then translate.
>> 	- if the character is a high surrogate then check to see if next character is the low surrogate. If so then combine.
>> 		- A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>   The result of putting this logic on UnicodeReader's shoulders means that a tokenizer does not need have any unicode "logical."
>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character. 
>> 	- This meant the buffer had to be large enough (grown) to accommodate.
>> 	- There really was no need since we can simply return an EOI when trying to read past the end of buffer.
>> - The only buffer mutability left behind is when reading digits.
>> 	- Unicode digits are still replaced with ASCII digits.
>> 		- This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>> 	- Thus, tokenizers don't have to special case '\\' (happened frequently in the JavadocTokenizer.)
>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>> 	- This simplified/clarified the code significantly.
>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
>> 	- Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using "if (accept('X')) " patterns.
>> 	- Actually, I tightened up a lot of these patterns, as you will see in the code.
>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
>> - The new scanner is slower to warm up due to new layers of method calls (ex. HelloWorld is 5% slower). However, once warmed up, this new scanner is faster than the existing code. The JDK java code compiles 5-10% faster.
>> All comments, suggestions and contributions welcome.
>> Cheers,
>> --- Jim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/compiler-dev/attachments/20200814/2923a357/attachment.htm>

More information about the compiler-dev mailing list