RFR: JDK-8224225 - Tokenizer improvements
Jim Laskey
james.laskey at oracle.com
Tue Sep 1 16:00:04 UTC 2020
I've been waiting for the jdk build to switch to using a jdk-15 bootstrap (soon.)
In the meantime, I've updated the webrev with requested changes.
webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-05
Cheers,
-- Jim
> On Aug 14, 2020, at 7:05 PM, James Laskey <james.laskey at oracle.com> wrote:
>
> Agree. Will work it up.
>
>
>
>> On Aug 14, 2020, at 6:30 PM, Jonathan Gibbons <Jonathan.Gibbons at oracle.com> wrote:
>>
>>
>> Re SubChar.java
>>
>> The correct solution is to dynamically generate any files with non-standard whitespace, even if it is as simple as
>> Files.writeString(tmpPath, Files.readString(srcPath).trim())
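A minimal sketch of the dynamic-generation approach Jon suggests; the class and method names here are illustrative, not part of the patch. It uses stripTrailing() rather than trim() so only trailing whitespace is removed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper: copy a test source file, dropping any trailing
// newline an editor may have appended, so the file still ends with the
// unicode escape for the control-Z character.
public class GenSubChar {
    static Path generate(Path srcPath, Path tmpPath) throws IOException {
        // stripTrailing() removes only trailing whitespace, leaving the
        // significant escape sequence at the end of the comment intact.
        Files.writeString(tmpPath, Files.readString(srcPath).stripTrailing());
        return tmpPath;
    }
}
```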
>>
>> -- Jon
>>
>> On 8/14/20 11:09 AM, Jim Laskey wrote:
>>> Thank you.
>>>
>>> The commentary for SubChar.java reads:
>>>
>>> "Note: this source file has been crafted very carefully to end with the
>>> unicode escape sequence for the control-Z character without a
>>> following newline. The scanner is specified to allow control-Z there.
>>> If you edit this source file, please make sure that your editor does
>>> not insert a newline after that trailing line."
>>>
>>> The person who checked in the test did exactly that - added a newline after that trailing line (this was done when moving to Mercurial in 2007).
>>>
>>> This patch is to remove that newline. Note that jcheck doesn't check tests for missing last newline.
>>>
>>> Cheers,
>>>
>>> -- Jim
>>>
>>>
>>>> On Aug 14, 2020, at 2:42 PM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>> Hi Jim,
>>>> this is a very good cleanup. I like how the new code makes the tokenizers a lot less verbose than before, and I like how you have refactored the various UnicodeReader vs. DocReader (now PositionTrackingReader), as the status quo was messy, and we had a lot of flexibility on paper that wasn't really used in practice and ended up making the code more complex than it needed to be.
>>>>
>>>> Big thumbs up from me.
>>>>
>>>> Minor comment: what's up with SubChar.java? Webrev is empty, but patch reports following diff:
>>>>
>>>> diff --git a/test/langtools/tools/javac/unicode/SubChar.java b/test/langtools/tools/javac/unicode/SubChar.java
>>>> --- a/test/langtools/tools/javac/unicode/SubChar.java
>>>> +++ b/test/langtools/tools/javac/unicode/SubChar.java
>>>> @@ -45,4 +45,4 @@
>>>> return;
>>>> }
>>>> }
>>>> -/* \u001A */
>>>> +/* \u001A */
>>>> \ No newline at end of file
>>>>
>>>>
>>>> Is that deliberate?
>>>>
>>>>
>>>>
>>>> Maurizio
>>>>
>>>>
>>>>
>>>> On 13/08/2020 18:32, Jim Laskey wrote:
>>>>> webrev: http://cr.openjdk.java.net/~jlaskey/8224225/webrev-04
>>>>> jbs: https://bugs.openjdk.java.net/browse/JDK-8224225
>>>>>
>>>>> I recommend looking at the "new" versions of 1. UnicodeReader, then 2. JavaTokenizer and then 3. JavadocTokenizer before venturing into the diffs.
>>>>>
>>>>>
>>>>> Rationale, under the heading of technical debt: There is a lot "going on" in the JavaTokenizer/JavadocTokenizer that needed to be cleaned up.
>>>>>
>>>>> - The UnicodeReader shouldn't really be accumulating characters for literals.
>>>>> - A tokenizer shouldn't need to be aware of the unicode translations.
>>>>> - There is no need for peek ahead.
>>>>> - There were a lot of repetitive tasks that should be done in methods instead of complex expressions.
>>>>> - Names of existing methods were often confusing.
>>>>>
>>>>> To avoid disruption, I avoided changing logic, except in the UnicodeReader. There are some relics in the JavaTokenizer/JavadocTokenizer that could be cleaned up but would require deeper analysis.
>>>>>
>>>>> Some details:
>>>>>
>>>>> - UnicodeReader was reworked to provide tokenizers a running stream of unicode characters/codepoints. Steps:
>>>>> - characters are read from the buffer.
>>>>> - if the character is a '\' then check to see if it is the beginning of a unicode escape sequence. If so, then translate.
>>>>> - if the character is a high surrogate then check to see if the next character is the low surrogate. If so, then combine.
>>>>> - A tokenizer can test a codepoint with the isSurrogate predicate (when/if needed.)
>>>>> The result of putting this logic on UnicodeReader's shoulders is that a tokenizer does not need to have any unicode logic.
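The steps above can be sketched roughly as follows. This is not javac's actual UnicodeReader, just an illustration of the layering: escapes are translated and surrogate pairs combined before the tokenizer sees anything. (It omits the '\\' special case discussed further down.)

```java
// Minimal illustration of a reader that hands the tokenizer a stream of
// codepoints: \uXXXX escapes translated, surrogate pairs combined, and an
// EOI sentinel returned past the end of the buffer.
public class MiniUnicodeReader {
    private final char[] buf;
    private int pos;

    MiniUnicodeReader(String source) {
        this.buf = source.toCharArray();
    }

    private char rawNext() {
        // No sentinel stored in the buffer; just return EOI past the end.
        return pos < buf.length ? buf[pos++] : (char) 0x1A;
    }

    /** Next codepoint: escapes translated, surrogate pairs combined. */
    int nextCodePoint() {
        char ch = rawNext();
        if (ch == '\\' && pos < buf.length && buf[pos] == 'u') {
            int p = pos + 1;
            while (p < buf.length && buf[p] == 'u') p++;  // JLS allows \uu...
            if (p + 4 <= buf.length) {
                int value = 0;
                boolean ok = true;
                for (int i = 0; i < 4; i++) {
                    int d = Character.digit(buf[p + i], 16);
                    if (d < 0) { ok = false; break; }
                    value = (value << 4) | d;
                }
                if (ok) {            // valid escape: consume and translate
                    pos = p + 4;
                    ch = (char) value;
                }
            }
        }
        if (Character.isHighSurrogate(ch) && pos < buf.length
                && Character.isLowSurrogate(buf[pos])) {
            return Character.toCodePoint(ch, buf[pos++]);
        }
        return ch;
    }
}
```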
>>>>>
>>>>> - The old UnicodeReader modified the source buffer to insert an EOI character at the end to mark the last character.
>>>>> - This meant the buffer had to be large enough (grown) to accommodate.
>>>>> - There really was no need, since we can simply return an EOI when trying to read past the end of the buffer.
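The contrast is easy to show in a couple of lines. Instead of growing the buffer and planting a sentinel at the end, the reader just answers EOI for any out-of-range read (a sketch, not the patch's actual code):

```java
// Sketch: no sentinel is written into the source buffer, so the buffer
// never needs to be grown; reading past the end simply yields EOI.
public class EoiReader {
    static final char EOI = 0x1A;
    private final char[] buf;
    private int pos;

    EoiReader(char[] buf) { this.buf = buf; }

    char get() {
        return pos < buf.length ? buf[pos++] : EOI;
    }
}
```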
>>>>>
>>>>> - The only buffer mutability left behind is when reading digits.
>>>>> - Unicode digits are still replaced with ASCII digits.
>>>>> - This seems unnecessary, but I didn't want to risk messing around with the existing logic. Maybe someone can enlighten me.
>>>>>
>>>>> - The sequence '\\' is special cased in the UnicodeReader so that the sequence "\\uXXXX" is handled properly.
>>>>> - Thus, tokenizers don't have to special case '\\' (happened frequently in the JavadocTokenizer.)
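The rule being centralized here is the JLS one: "\u" begins a unicode escape only when the backslash is preceded by an even number of backslashes, so "\\uXXXX" (an escaped backslash followed by literal text) must not be translated. A hypothetical predicate, purely for illustration:

```java
// Hypothetical helper illustrating the even-backslash rule the reader
// must apply: the '\' at backslashPos starts a unicode escape only when
// it is not itself escaped by a preceding '\'.
public class EscapeRule {
    static boolean startsUnicodeEscape(char[] buf, int backslashPos) {
        if (buf[backslashPos] != '\\') return false;
        if (backslashPos + 1 >= buf.length || buf[backslashPos + 1] != 'u') return false;
        int slashes = 0;
        for (int i = backslashPos - 1; i >= 0 && buf[i] == '\\'; i--) slashes++;
        return slashes % 2 == 0;   // odd count: this '\' is itself escaped
    }
}
```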
>>>>>
>>>>> - JavaTokenizer was modified to accumulate scanned literals in a StringBuilder.
>>>>> - This simplified/clarified the code significantly.
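The division of labor reads something like this (an illustrative sketch, not the patch itself): the reader stays a pure character stream, and the tokenizer owns a StringBuilder it fills while scanning a literal.

```java
// Sketch: literal accumulation lives in the tokenizer, not the reader.
public class LiteralSketch {
    private final String src;
    private int pos;
    private final StringBuilder sb = new StringBuilder();

    LiteralSketch(String src) { this.src = src; }

    // Hypothetical digit scanner: each accepted character goes into sb,
    // and the finished literal is read back out of the builder.
    String scanDigits() {
        sb.setLength(0);
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            sb.append(src.charAt(pos++));
        }
        return sb.toString();
    }
}
```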
>>>>>
>>>>> - Since a lot of the functionality needed by the JavaTokenizer comes directly from a UnicodeReader, I made JavaTokenizer a subclass of UnicodeReader.
>>>>> - Otherwise, I would have had to reference "reader." everywhere or would have to create JavaTokenizer methods to repeat the same logic. This was simpler and cleaner.
>>>>>
>>>>> - Since the pattern "if (ch == 'X') bpos++" occurred a lot, I switched to using the "if (accept('X'))" pattern.
>>>>> - Actually, I tightened up a lot of these patterns, as you will see in the code.
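The accept() idiom looks roughly like this (names are illustrative, not javac's): a single helper that advances only on a match, replacing the scattered compare-then-bump code.

```java
// Sketch of the accept() idiom: the helper both tests the current
// character and consumes it on a match, so call sites stay one-liners.
public class AcceptSketch {
    private final String src;
    private int pos;

    AcceptSketch(String src) { this.src = src; }

    char peek() {
        return pos < src.length() ? src.charAt(pos) : (char) 0x1A; // EOI
    }

    boolean accept(char ch) {
        if (peek() == ch) { pos++; return true; }
        return false;
    }

    // Example use: scan an optional sign, as a number scanner might.
    String scanSign() {
        if (accept('+')) return "+";
        if (accept('-')) return "-";
        return "";
    }
}
```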
>>>>>
>>>>> - There are a lot of great mysteries in JavadocTokenizer, but I think I cracked most of them. The code is simpler and more modular.
>>>>>
>>>>> - The new scanner is slower to warm up due to new layers of method calls (e.g., HelloWorld compiles 5% slower). However, once warmed up, the new scanner is faster than the existing code. The JDK's Java code compiles 5-10% faster.
>>>>>
>>>>> All comments, suggestions and contributions welcome.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> --- Jim
>>>