RFR: 8291660: Grapheme support in BreakIterator [v4]

Wed Sep 7 23:23:42 UTC 2022

On Fri, 26 Aug 2022 21:48:14 GMT, Naoto Sato <naoto at openjdk.org> wrote:

>> This is to enhance the character break analysis in `java.text.BreakIterator` to conform to the extended grapheme cluster boundaries defined in https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. A corresponding CSR has also been drafted, as there will be behavioral changes with this modification.
>
> Naoto Sato has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Changed the paragraph to @implSpec

src/java.base/share/classes/jdk/internal/util/regex/Grapheme.java line 47:

> 45:      */
> 46:     public static int nextBoundary(CharSequence src, int off, int limit) {
> 47:         Objects.checkFromToIndex(0, limit - off, src.length());

Is this right? The old code's use of the `checkFromToIndex` method seems to be the right way to check that `off` and `limit` form a valid from-to range within `[0, src.length())`. The new code subtracts `off` from both arguments, but the arithmetic seems to let some invalid combinations through. For example, depending on the value of `limit`, this might permit `off` to be a small negative number.
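To make the concern concrete, here is a small standalone sketch. It is not the `Grapheme` code itself: the class and helper names are made up for illustration, and the "direct" form is only my assumption of what a check on `off` and `limit` would look like. It compares the two ways of calling `Objects.checkFromToIndex` for a sequence of length 5.

```java
import java.util.Objects;

// Hypothetical demo class, not part of the JDK sources.
final class RangeCheckDemo {

    // Mirrors the check in the new code: only the difference (limit - off)
    // is validated against the length.
    static boolean shiftedCheckAccepts(int off, int limit, int length) {
        try {
            Objects.checkFromToIndex(0, limit - off, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Assumed "direct" form: off and limit themselves are validated.
    static boolean directCheckAccepts(int off, int limit, int length) {
        try {
            Objects.checkFromToIndex(off, limit, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int length = 5;
        // Negative offset: limit - off = 3, which lies within [0, 5].
        System.out.println("off=-1, limit=2: shifted=" + shiftedCheckAccepts(-1, 2, length)
                + ", direct=" + directCheckAccepts(-1, 2, length));
        // Limit past the end: limit - off = 4, which lies within [0, 5].
        System.out.println("off=3, limit=7: shifted=" + shiftedCheckAccepts(3, 7, length)
                + ", direct=" + directCheckAccepts(3, 7, length));
    }
}
```

In both cases the shifted check passes while the direct check throws, so a negative `off` or a `limit` beyond the end of `src` would not be caught at this point.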

src/java.base/share/classes/sun/util/locale/provider/BreakIteratorProviderImpl.java line 135:

> 133:     public BreakIterator getCharacterInstance(Locale locale) {
> 134:         return new GraphemeBreakIterator();
> 135:     }

It looks like there is some kind of table here. Since `CHARACTER_INDEX` is no longer used, does that mean there is now dead code for the CHARACTER break iterator class, and dead resources for CharacterData and CharacterDictionary? Should these be removed? Or maybe this is all in each locale or something and should be cleaned up later?

-------------

PR: https://git.openjdk.org/jdk/pull/9991