<i18n dev> RFR: 8364007: Add overload without arguments to codePointCount in String etc.

Tatsunori Uchino duke at openjdk.org
Sat Jul 26 07:30:54 UTC 2025


On Thu, 24 Jul 2025 22:07:38 GMT, Mikhail Yankelevich <myankelevich at openjdk.org> wrote:

>> Adds `codePointCount()` overloads to `String`, `Character`, `(Abstract)StringBuilder`, and `StringBuffer` to make it possible to conveniently retrieve the length of a string as code points without extra boundary checks.
>> 
>> 
>> if (superTremendouslyLongExpressionYieldingAString().codePointCount() > limit) {
>>     throw new Exception("exceeding length");
>> }
>> 
>> 
>> Is a CSR required for this change?
>
> src/java.base/share/classes/java/lang/Character.java line 9969:
> 
>> 9967:         int n = length;
>> 9968:         for (int i = 0; i < length; ) {
>> 9969:             if (isHighSurrogate(seq.charAt(i++)) && i < length &&
> 
> Imo this is quite hard to read, especially with `i++` inside of the if statement. What do you think about changing it to this? 
> ```java 
> for (int i = 1; i < length-1; i++) {
>     if (isHighSurrogate(seq.charAt(i)) &&
>         isLowSurrogate(seq.charAt(i + 1))) {
>         n--;
>         i++;
>     }
> }
> ``` 
> 
> edit: fixed a typo in my example

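First of all, the suggested loop yields an _incorrect_ result for sequences whose first character is a supplementary character: it starts at index 1, so a surrogate pair at index 0 is never examined. In the session below, `len("𠮷")` returns 2 instead of 1 and `len("👍👍")` returns 3 instead of 2.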


jshell> int len(CharSequence seq) {
   ...>     final int length = seq.length();
   ...>     int n = length;
   ...>     for (int i = 1; i < length-1; i++) {
   ...>             if (isHighSurrogate(seq.charAt(i)) &&
   ...>                 isLowSurrogate(seq.charAt(i + 1))) {
   ...>                     n--;
   ...>                     i++;
   ...>             }
   ...>     }
   ...>     return n;
   ...> }
|  created method len(CharSequence), however, it cannot be invoked until method isHighSurrogate(char), and method isLowSurrogate(char) are declared

jshell> boolean isHighSurrogate(char ch) {
   ...>     return 0xd800 <= ch && ch <= 0xdbff;
   ...> }
|  created method isHighSurrogate(char)

jshell> boolean isLowSurrogate(char ch) {
   ...>     return 0xdc00 <= ch && ch <= 0xdfff;
   ...> }
|  created method isLowSurrogate(char)

jshell> len("𠮷");
$5 ==> 2

jshell> len("OK👍");
$6 ==> 3

jshell> len("👍👍");
$7 ==> 3
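
For reference, here is a minimal sketch of the suggested loop with only the start index changed from 1 to 0, so that a leading surrogate pair is examined. The class `CodePointCountSketch` and the method `countCodePoints` are hypothetical names used for illustration and are not part of the PR. The sketch compares its result against the existing `Character.codePointCount(CharSequence, int, int)`, and the sample strings are written with `\u` escapes so the source does not depend on file encoding.

```java
final class CodePointCountSketch {
    // Hypothetical helper, not part of the PR: counts code points in the whole sequence.
    static int countCodePoints(CharSequence seq) {
        final int length = seq.length();
        int n = length;
        // Start at index 0 so a surrogate pair at the very beginning is examined.
        for (int i = 0; i < length - 1; i++) {
            if (Character.isHighSurrogate(seq.charAt(i)) &&
                Character.isLowSurrogate(seq.charAt(i + 1))) {
                n--;   // a valid pair is one code point, not two
                i++;   // skip the low surrogate of the pair
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // U+20BB7, "OK" + U+1F44D, and U+1F44D twice (the inputs from the session above).
        String[] samples = { "\uD842\uDFB7", "OK\uD83D\uDC4D", "\uD83D\uDC4D\uD83D\uDC4D" };
        for (String s : samples) {
            int expected = Character.codePointCount(s, 0, s.length());
            System.out.println(countCodePoints(s) + " (existing overload: " + expected + ")");
        }
    }
}
```

With the loop starting at 0, the three inputs give 1, 3, and 2, matching the existing overload.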


I would rather not change this method on its own; I will change it only if the existing overload `int codePointCount(CharSequence seq, int beginIndex, int endIndex)` is going to be changed as well.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26461#discussion_r2232751973

