<i18n dev> RFR: 8354968: Replace unicode sequences in comment text with UTF-8 characters [v2]

Naoto Sato naoto at openjdk.org
Tue May 6 17:21:18 UTC 2025


On Tue, 6 May 2025 15:46:03 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:

>> As part of the UTF-8 cleanup done in [JDK-8301971](https://bugs.openjdk.org/browse/JDK-8301971), I looked at where and how we are using Unicode escape sequences (`\uXXXX`). In several string literals, I think the Unicode sequences still have merit if they improve the clarity or readability of the code. Some instances are more of a gray zone. But the places where they do not make sense at all are in comments, as part of running comment text. There they are just disruptive and not helpful at all. I tried to locate all such places (but I might have missed some, since I did not do a proper lexical analysis to find comments) and fix them.
>> 
>> 99% of this fix is to turn poor `Peter von der Ah\u00e9` into `Peter von der Ahé`. 😆 
>> 
>> I checked some random samples of when this was introduced to see if there was a particular commit that mistreated the encoding, but they have been there since the original release of the OpenJDK source code.
>> 
>> There are likely many more places where directly UTF-8 encoded characters are preferable to Unicode sequences, but this seemed like a safe and trivial first step.
>
> Magnus Ihse Bursie has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
> 
>  - Merge branch 'master' into unicode-sequence-in-comments
>  - 8354968: Replace unicode sequences in comment text with UTF-8 characters

src/java.base/share/classes/java/text/Collator.java line 141:

> 139:      * considered significant during comparison. The assignment of strengths
> 140:      * to language features is locale dependent. A common example is for
> 141:      * different accented forms of the same base letter ("a" vs "ä") to be

Since this (and the other one in RuleBasedCollator) is in the explanation of text handling, I think keeping the original code point makes sense. So I'd have both the UTF-8 character and its Unicode escape notation here.
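
For context on why the exact code point matters in this Javadoc, here is a minimal sketch of the behavior the comment describes. The class and method names (`CollatorStrengthSketch`, `compareAt`) and the choice of `Locale.US` are illustrative assumptions, not part of the PR. The accented letter is deliberately written as an escape so the code point under test is unambiguous regardless of source encoding.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK sources under review.
public class CollatorStrengthSketch {

    // Compares two strings with a US-locale collator set to the given strength.
    static int compareAt(int strength, String left, String right) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(left, right);
    }

    public static void main(String[] args) {
        String plain = "a";
        String accented = "\u00e4"; // U+00E4, a with diaeresis

        // At PRIMARY strength only base letters are significant.
        System.out.println("PRIMARY:   " + compareAt(Collator.PRIMARY, plain, accented));
        // At SECONDARY strength accent differences become significant too.
        System.out.println("SECONDARY: " + compareAt(Collator.SECONDARY, plain, accented));
    }
}
```

With the default rules I would expect the two strings to compare as equal at `PRIMARY` and as different at `SECONDARY`, which is the distinction the Javadoc example is trying to convey.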

src/java.base/share/classes/java/text/RuleBasedCollator.java line 594:

> 592:         // a three-digit number, one digit for primary, one for secondary, etc.
> 593:         //
> 594:         // String:              A     a     B   é

Maybe "é (\u00e9, e-acute)"?
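
As an aside, an annotation of that shape can be generated rather than typed by hand. The following is only a sketch under my own assumptions: `EscapeNotationSketch` and `describe` are hypothetical names, the output uses the full Unicode character name rather than the short "e-acute", and the four-digit escape form is only valid for code points in the Basic Multilingual Plane.

```java
// Hypothetical helper, not part of the PR: builds a "char (escape, name)" label.
public class EscapeNotationSketch {

    // Returns the character, its four-digit escape form, and its Unicode name.
    // Only meaningful for BMP code points (U+0000 to U+FFFF).
    static String describe(int codePoint) {
        String literal = new String(Character.toChars(codePoint));
        // The doubled backslash keeps the compiler from treating this as an escape.
        String escape = String.format("\\u%04x", codePoint);
        return literal + " (" + escape + ", " + Character.getName(codePoint) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E9));
    }
}
```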

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24727#discussion_r2075933987
PR Review Comment: https://git.openjdk.org/jdk/pull/24727#discussion_r2075935811

