RFR: JDK-8224225 - Tokenizer improvements
Jim Laskey
james.laskey at oracle.com
Tue Sep 1 16:00:04 UTC 2020
I've been waiting for the jdk build to switch to using a jdk-15 bootstrap (soon.)
In the meantime, I've updated the webrev with requested changes.
webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-05
Cheers,
-- Jim
> On Aug 14, 2020, at 7:05 PM, James Laskey <james.laskey at oracle.com> wrote:
>
> Agree. Will work it up.
>
>
>
>> On Aug 14, 2020, at 6:30 PM, Jonathan Gibbons <Jonathan.Gibbons at oracle.com> wrote:
>>
>>
>> Re SubChar.java
>>
>> The correct solution is to dynamically generate any files with non-standard whitespace, even if it is as simple as
>> Files.writeString(tmpPath, Files.readString(srcPath).trim())
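A minimal sketch of the dynamic-generation approach Jon suggests; the class and method names here are illustrative, not part of the patch. It uses stripTrailing() rather than trim() so only trailing whitespace is removed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper: copy a test source file, dropping any trailing
// newline an editor may have appended, so the file still ends with the
// unicode escape for the control-Z character.
public class GenSubChar {
    static Path generate(Path srcPath, Path tmpPath) throws IOException {
        // stripTrailing() removes only trailing whitespace, leaving the
        // significant escape sequence at the end of the comment intact.
        Files.writeString(tmpPath, Files.readString(srcPath).stripTrailing());
        return tmpPath;
    }
}
```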
>>
>> -- Jon
>>
>> On 8/14/20 11:09 AM, Jim Laskey wrote:
>>> Thank you.
>>>
>>> The commentary for SubChar.java reads:
>>>
>>> "Note: this source file has been crafted very carefully to end with the
>>> unicode escape sequence for the control-Z character without a
>>> following newline. The scanner is specified to allow control-Z there.
>>> If you edit this source file, please make sure that your editor does
>>> not insert a newline after that trailing line."
>>>
>>> The person who checked in the test did exactly that - added a newline after that trailing line (this was done when moving to Mercurial in 2007).
>>>
>>> This patch is to remove that newline. Note that jcheck doesn't check tests for missing last newline.
>>>
>>> Cheers,
>>>
>>> -- Jim
>>>
>>>
>>>> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>> Hi Jim,
>>>> this is a very good cleanup. I like how the new code makes the tokenizers a lot less verbose than before, and I like how you have refactored the various UnicodeReader vs. DocReader (now PositionTrackingReader), as the status quo was messy, and we had a lot of flexibility on paper that wasn't really used in practice and ended up making the code more complex than it needed to be.
>>>>
>>>> Big thumbs up from me.
>>>>
>>>> Minor comment: what's up with SubChar.java? Webrev is empty, but patch reports following diff:
>>>>
>>>> diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
>>>> --- a/test/langtools/tools/javac/unicode/SubChar.java
>>>> +++ b/test/langtools/tools/javac/unicode/SubChar.java
>>>> @@ -45,4 +45,4 @@
>>>> return;
>>>> }
>>>> }
>>>> -/* \u001A */
>>>> +/* \u001A */
>>>> \ No newline at end of file
>>>>
>>>>
>>>> Is that deliberate?
>>>>
>>>>
>>>>
>>>> Maurizio
>>>>
>>>>
>>>>
>>>> On 13/08/2020 18:32, Jim Laskey wrote:
>>>>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
>>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>>>>>
>>>>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.
>>>>>
>>>>>
>>>>> Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>>>>>
>>>>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>>>>> - A tokenizer shouldn't need to be aware of the unicode translations.
>>>>> - There is no need for peek ahead.
>>>>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>>>>> - Names of existing methods were often confusing.
>>>>>
>>>>> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but would require deeper analysis.
>>>>>
>>>>> Some details:
>>>>>
>>>>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>>>>> - characters are read from the buffer.
>>>>> - if the character is a '\' then check to see if it is the beginning of a unicode escape sequence. If so, then translate.
>>>>> - if the character is a high surrogate then check to see if the next character is the low surrogate. If so, then combine.
>>>>> - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>>>> The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode logic.
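The steps above can be sketched roughly as follows. This is not javac's actual UnicodeReader, just an illustration of the layering: escapes are translated and surrogate pairs combined before the tokenizer sees anything. (It omits the '\\' special case discussed further down.)

```java
// Minimal illustration of a reader that hands the tokenizer a stream of
// codepoints: \uXXXX escapes translated, surrogate pairs combined, and an
// EOI sentinel returned past the end of the buffer.
public class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) {
        this.buf = source.toCharArray();
    }

    private char rawNext() {
        // No sentinel stored in the buffer; just return EOI past the end.
        return pos < buf.length ? buf[pos++] : (char) 0x1A;
    }

    /** Next codepoint: escapes translated, surrogate pairs combined. */
    int nextCodePoint() {
        char ch = rawNext();
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;  // JLS allows \uu...
            if (p + 4 <= buf.length) {
                int value = 0;
                boolean ok = true;
                for (int i = 0; i < 4; i++) {
                    int d = Character.digit(buf[p + i], 16);
                    if (d < 0) { ok = false; break; }
                    value = (value << 4) | d;
                }
                if (ok) {            // valid escape: consume and translate
                    pos = p + 4;
                    ch = (char) value;
                }
            }
        }
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);
        }
        return ch;
    }
}
```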
>>>>>
>>>>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>>>>> - This meant the buffer had to be large enough (grown) to accommodate.
>>>>> - There really was no need, since we can simply return an EOI when trying to read past the end of the buffer.
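The contrast is easy to show in a couple of lines. Instead of growing the buffer and planting a sentinel at the end, the reader just answers EOI for any out-of-range read (a sketch, not the patch's actual code):

```java
// Sketch: no sentinel is written into the source buffer, so the buffer
// never needs to be grown; reading past the end simply yields EOI.
public class EoiReader {
    static final char EOI = 0x1A;
    private final char[] buf;
    private int pos;

    EoiReader(char[] buf) { this.buf = buf; }

    char get() {
        return pos < buf.length ? buf[pos++] : EOI;
    }
}
```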
>>>>>
>>>>> - The only buffer mutability left behind is when reading digits.
>>>>> - Unicode digits are still replaced with ASCII digits.
>>>>> - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
>>>>>
>>>>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>>>>> - Thus, tokenizers don't have to special case '\\' (happened frequently in the JavadocTokenizer.)
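The rule being centralized here is the JLS one: "\u" begins a unicode escape only when the backslash is preceded by an even number of backslashes, so "\\uXXXX" (an escaped backslash followed by literal text) must not be translated. A hypothetical predicate, purely for illustration:

```java
// Hypothetical helper illustrating the even-backslash rule the reader
// must apply: the '\' at backslashPos starts a unicode escape only when
// it is not itself escaped by a preceding '\'.
public class EscapeRule {
    static boolean startsUnicodeEscape(char[] buf, int backslashPos) {
        if (buf[backslashPos] != '\\') return false;
        if (backslashPos + 1 >= buf.length || buf[backslashPos + 1] != 'u') return false;
        int slashes = 0;
        for (int i = backslashPos - 1; i >= 0 && buf[i] == '\\'; i--) slashes++;
        return slashes % 2 == 0;   // odd count: this '\' is itself escaped
    }
}
```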
>>>>>
>>>>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>>>>> - This simplified/clarified the code significantly.
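The division of labor reads something like this (an illustrative sketch, not the patch itself): the reader stays a pure character stream, and the tokenizer owns a StringBuilder it fills while scanning a literal.

```java
// Sketch: literal accumulation lives in the tokenizer, not the reader.
public class LiteralSketch {
    private final String src;
    private int pos;
    private final StringBuilder sb = new StringBuilder();

    LiteralSketch(String src) { this.src = src; }

    // Hypothetical digit scanner: each accepted character goes into sb,
    // and the finished literal is read back out of the builder.
    String scanDigits() {
        sb.setLength(0);
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            sb.append(src.charAt(pos++));
        }
        return sb.toString();
    }
}
```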
>>>>>
>>>>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
>>>>> - Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
>>>>>
>>>>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using the "if (accept('X'))" pattern.
>>>>> - Actually, I tightened up a lot of these patterns, as you will see in the code.
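The accept() idiom looks roughly like this (names are illustrative, not javac's): a single helper that advances only on a match, replacing the scattered compare-then-bump code.

```java
// Sketch of the accept() idiom: the helper both tests the current
// character and consumes it on a match, so call sites stay one-liners.
public class AcceptSketch {
    private final String src;
    private int pos;

    AcceptSketch(String src) { this.src = src; }

    char peek() {
        return pos < src.length() ? src.charAt(pos) : (char) 0x1A; // EOI
    }

    boolean accept(char ch) {
        if (peek() == ch) { pos++; return true; }
        return false;
    }

    // Example use: scan an optional sign, as a number scanner might.
    String scanSign() {
        if (accept('+')) return "+";
        if (accept('-')) return "-";
        return "";
    }
}
```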
>>>>>
>>>>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
>>>>>
>>>>> - The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld compiles 5% slower). However, once warmed up, the new scanner is faster than the existing code. The JDK's Java code compiles 5-10% faster.
>>>>>
>>>>> All comments, suggestions and contributions welcome.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> --- Jim
>>>