RFR: JDK-8224225 - Tokenizer improvements
James Laskey
james.laskey at oracle.com
Fri Aug 14 22:05:19 UTC 2020
Agree. Will work it up.
> On Aug 14, 2020, at 6:30 PM, Jonathan Gibbons <Jonathan.Gibbons at oracle.com> wrote:
>
>
> Re SubChar.java
>
> The correct solution is to dynamically generate any files with non-standard whitespace, even if it is simple as
> Files.writeString(tmpPath, Files.readString(srcPath).trim())
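A minimal sketch of Jon's suggestion, wrapped in a helper for clarity. The class and method names, and the paths, are hypothetical; `trim()` strips the trailing newline (and any leading whitespace) so the generated copy ends exactly on the crafted escape sequence even though the checked-in template does not.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class SubCharGen {
    // Regenerate a test input whose trailing newline must be absent.
    // The checked-in template may safely end with a newline; this strips it.
    static void regenerate(Path srcPath, Path tmpPath) throws Exception {
        Files.writeString(tmpPath, Files.readString(srcPath).trim());
    }
}
```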
>
> -- Jon
>
>> On 8/14/20 11:09 AM, Jim Laskey wrote:
>> Thank you.
>>
>> The commentary for SubChar.java reads:
>>
>> "Note: this source file has been crafted very carefully to end with the
>> unicode escape sequence for the control-Z character without a
>> following newline. The scanner is specified to allow control-Z there.
>> If you edit this source file, please make sure that your editor does
>> not insert a newline after that trailing line."
>>
>> The person who checked in the test did exactly that: added a newline after that trailing line (this happened during the move to Mercurial in 2007).
>>
>> This patch is to remove that newline. Note that jcheck doesn't check tests for missing last newline.
>>
>> Cheers,
>>
>> -- Jim
>>
>>
>>> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Jim,
>>> this is a very good cleanup. I like how the new code makes the tokenizers a lot less verbose than before, and I like how you have refactored the various UnicodeReader vs. DocReader (now PositionTrackingReader), as the status quo was messy, and we had a lot of flexibility on paper that wasn't really used in practice and ended up making the code more complex than it needed to be.
>>>
>>> Big thumbs up from me.
>>>
>>> Minor comment: what's up with SubChar.java? Webrev is empty, but patch reports following diff:
>>>
>>> diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
>>> --- a/test/langtools/tools/javac/unicode/SubChar.java
>>> +++ b/test/langtools/tools/javac/unicode/SubChar.java
>>> @@ -45,4 +45,4 @@
>>> return;
>>> }
>>> }
>>> -/* \u001A */
>>> +/* \u001A */
>>> \ No newline at end of file
>>>
>>>
>>> Is that deliberate?
>>>
>>>
>>>
>>> Maurizio
>>>
>>>
>>>
>>> On 13/08/2020 18:32, Jim Laskey wrote:
>>>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>>>>
>>>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.
>>>>
>>>>
>>>> Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>>>>
>>>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>>>> - A tokenizer shouldn't need to be aware of the unicode translations.
>>>> - There is no need for peek ahead.
>>>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>>>> - Names of existing methods were often confusing.
>>>>
>>>> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but require deeper analysis.
>>>>
>>>> Some details:
>>>>
>>>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>>>> - characters are read from the buffer.
>>>> - if the character is a '\' then check to see if it begins a unicode escape sequence. If so, then translate.
>>>> - if the character is a high surrogate then check to see if the next character is the low surrogate. If so, then combine.
>>>> - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>>> The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode "logic."
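The steps above can be sketched as follows. This is a simplified stand-in, not the actual UnicodeReader API: it handles escape translation, surrogate combination, and the EOI-on-read-past-end behavior mentioned below, but skips the escaped-backslash rule discussed later in this mail.

```java
// Simplified sketch: the reader hands the tokenizer translated code points.
class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) { buf = source.toCharArray(); }

    private char rawChar() {
        // Reading past the end yields EOI instead of growing the buffer.
        return pos < buf.length ? buf[pos++] : '\u001A';
    }

    int nextCodePoint() {
        char ch = rawChar();
        // Translate \uXXXX escapes (the JLS also allows \uu...XXXX).
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;
            if (p + 3 < buf.length) {
                ch = (char) Integer.parseInt(new String(buf, p, 4), 16);
                pos = p + 4;
            }
        }
        // Combine a surrogate pair into a single code point.
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);
        }
        return ch;
    }
}
```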
>>>>
>>>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>>>> - This meant the buffer had to be large enough (grown) to accommodate it.
>>>> - There really was no need since we can simply return an EOI when trying to read past the end of buffer.
>>>>
>>>> - The only buffer mutability left behind is when reading digits.
>>>> - Unicode digits are still replaced with ASCII digits.
>>>> - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
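That remaining mutation might look roughly like the following sketch, assuming the goal is simply to normalize any decimal digit recognized by Character.digit to its ASCII form; the class and method names are hypothetical.

```java
class DigitNormalizer {
    // Sketch of the one remaining buffer mutation described above: a
    // unicode decimal digit is replaced by its ASCII equivalent so that
    // downstream literal parsing can assume '0'-'9'.
    static char asciiDigit(char ch) {
        int d = Character.digit(ch, 10);       // e.g. ARABIC-INDIC '\u0663' -> 3
        return d >= 0 ? (char) ('0' + d) : ch; // non-digits pass through
    }
}
```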
>>>>
>>>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>>>> - Thus, tokenizers don't have to special case '\\' (this happened frequently in the JavadocTokenizer).
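A sketch of why '\\' needs that special case: after an escaped backslash, a following uXXXX must stay literal text rather than being translated. This simplified loop (not the real implementation, which follows the full JLS even-backslash rule) shows the idea.

```java
class BackslashDemo {
    // Simplified translation loop: an escaped backslash "consumes" its
    // partner, so a following uXXXX is not mistaken for a unicode escape.
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i++);
            if (ch == '\\' && i < src.length()) {
                if (src.charAt(i) == '\\') {  // escaped backslash:
                    out.append("\\\\");       // emit both, skip escape check
                    i++;
                    continue;
                }
                if (src.charAt(i) == 'u' && i + 4 < src.length()) {
                    out.append((char) Integer.parseInt(src.substring(i + 1, i + 5), 16));
                    i += 5;
                    continue;
                }
            }
            out.append(ch);
        }
        return out.toString();
    }
}
```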
>>>>
>>>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>>>> - This simplified/clarified the code significantly.
>>>>
>>>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
>>>> - Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
>>>>
>>>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using the "if (accept('X')) ..." pattern.
>>>> - Actually, I tightened up a lot of these patterns, as you will see in the code.
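The accept() idiom might be sketched like this (names illustrative, not the actual JavaTokenizer code): the reader advances only when the current character matches, replacing the scattered position bookkeeping.

```java
class MiniScanner {
    private final String buf;
    private int pos;

    MiniScanner(String s) { buf = s; }

    char get() { return pos < buf.length() ? buf.charAt(pos) : '\u001A'; }

    // Advance only if the current character matches; replaces the
    // repeated "if (ch == 'X') bpos++" pattern.
    boolean accept(char ch) {
        if (get() == ch) { pos++; return true; }
        return false;
    }
}
```

Scanning a float exponent, for example, then reads as `if (accept('e') || accept('E')) { accept('+'); accept('-'); ... }`.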
>>>>
>>>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
>>>>
>>>> - The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld compiles 5% slower). However, once warmed up, the new scanner is faster than the existing code. The JDK's Java code compiles 5-10% faster.
>>>>
>>>> All comments, suggestions and contributions welcome.
>>>>
>>>> Cheers,
>>>>
>>>> --- Jim
>>