Regexp with word-boundary followed by unicode character doesn't work in 19, 21

Naoto Sato naoto.sato at oracle.com
Fri Dec 15 23:25:29 UTC 2023


Alternatively, use the extended grapheme cluster boundary "\\b{g}" 
instead of "\\b". It correctly matches emoji sequences such as 👨‍👩‍👧‍👧, 
which "\\b" with the Unicode option does not.
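
A minimal sketch of the difference, assuming JDK 9 or later (where "\\b{g}" 
is available). The class name BoundaryDemo, the helper count() and the 
sample strings are made up for illustration; only Pattern, Matcher and 
Pattern.UNICODE_CHARACTER_CLASS come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // MAN is U+1F468. FAMILY is U+1F468 ZWJ U+1F469 ZWJ U+1F467 ZWJ U+1F467.
    // Both are written as escapes so the source file encoding does not matter.
    static final String MAN = "\uD83D\uDC68";
    static final String FAMILY =
            MAN + "\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC67";

    // Hypothetical helper: counts matches of the literal `needle` in `text`
    // when `boundary` is required on both sides of it.
    static int count(String boundary, int flags, String needle, String text) {
        Pattern p = Pattern.compile(
                boundary + Pattern.quote(needle) + boundary, flags);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String swedish = "En sak som \u00F6kas och sedan minskas. Bra va!";
        String emoji = "A family: " + FAMILY + " at home";
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // Unicode-aware word boundary finds the word that starts with o-umlaut.
        System.out.println(count("\\b", unicode, "\u00F6kas", swedish));
        // Emoji are not word characters, so there is no word boundary around them.
        System.out.println(count("\\b", unicode, FAMILY, emoji));
        // Grapheme cluster boundaries surround the whole emoji sequence.
        System.out.println(count("\\b{g}", 0, FAMILY, emoji));
        // No boundary between the first emoji and the following ZWJ,
        // so a lone U+1F468 is not found inside the sequence.
        System.out.println(count("\\b{g}", 0, MAN, emoji));
    }
}
```

Note that "\\b{g}" is not a word boundary: it also matches between any two 
ordinary letters, so "\\b{g}ökas\\b{g}" would find "ökas" inside a longer 
word as well.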

HTH,
Naoto

On 12/15/23 11:29 AM, Stefan Norberg wrote:
> Thanks Raffaello,
> I just found https://bugs.openjdk.org/browse/JDK-8264160 in the release 
> notes for 19.
> Have a great weekend!
> 
> /Stefan
> 
> On Fri, Dec 15, 2023 at 8:24 PM Raffaello Giulietti 
> <raffaello.giulietti at oracle.com> wrote:
> 
>     By default, a word boundary only considers ASCII letters and digits.
>     See "Predefined character classes" in the documentation.
> 
>     To add Unicode support, you can either pass a flag as the second
>     argument to the compile() method
> 
>     Pattern p = Pattern.compile("(\\b" + word + "\\b)",
>     Pattern.UNICODE_CHARACTER_CLASS);
> 
>     or embed the flag in the regex pattern itself, as documented in
>     "Special constructs (named-capturing and non-capturing)"
> 
>     Pattern p = Pattern.compile("(?U)(\\b" + word + "\\b)");
> 
> 
>     Greetings
>     Raffaello
> 
> 
>     On 2023-12-15 20:07, Stefan Norberg wrote:
>      > The following test works in 17 but fails in 19.0.2 and 21.0.1 on
>      > the last assertion. Bug or feature?
>      >
>      > import org.junit.jupiter.api.Assertions;
>      > import org.junit.jupiter.api.Test;
>      >
>      > import java.util.ArrayList;
>      > import java.util.regex.Matcher;
>      > import java.util.regex.Pattern;
>      >
>      > /**
>      >  * Test passes in JDK 17 but fails in JDK 19, 21.
>      >  *
>      >  * The combination of a \b "word boundary" and a unicode char doesn't
>      >  * seem to work in 19, 21.
>      >  */
>      > public class UnicodeTest {
>      >     @Test
>      >     public void testRegexp() throws Exception {
>      >         var text = "En sak som ökas och sedan minskas. Bra va!";
>      >         var word = "ökas";
>      >         Assertions.assertTrue(text.contains(word));
>      >
>      >         Pattern p = Pattern.compile("(\\b" + word + "\\b)");
>      >         Matcher m = p.matcher(text);
>      >         var matches = new ArrayList<>();
>      >
>      >         while (m.find()) {
>      >             String matchString = m.group();
>      >             System.out.println(matchString);
>      >             matches.add(matchString);
>      >         }
>      >         Assertions.assertEquals(1, matches.size());
>      >     }
>      > }
>      >
>      >
>      >
>      > openjdk version "21.0.1" 2023-10-17 LTS
>      > OpenJDK Runtime Environment Corretto-21.0.1.12.1 (build 21.0.1+12-LTS)
>      > OpenJDK 64-Bit Server VM Corretto-21.0.1.12.1 (build 21.0.1+12-LTS,
>      > mixed mode, sharing)
>      >
>      >
>      > System Version: macOS 14.2 (23C64)
>      >
>      > Kernel Version: Darwin 23.2.0
>      >
>      >
>      > Thanks!
>      >
>      >
>      > /Stefan
>      >
> 

