RFR (s) 8158906: JShell: crashes with extremely long result value

ShinyaYoshida bitterfoxc at gmail.com
Fri Aug 19 17:21:14 UTC 2016


Hi Robert,
According to [1] [2] [3], it seems to me that it is enough to test with a
character that takes 3 bytes in UTF-8.

If a Unicode character is above U+FFFF, its UTF-8 representation
is 4 bytes [1].
But such a character is represented by a pair of char values in Java [2],
i.e., a 3-byte UTF-8 character is the worst case for a single Java char value.

Unicode range    # of char values   # of bytes   # of bytes in UTF-8
                 in Java            in UTF-8     per Java char value
U+0000-U+007F    1                  1            1
U+0080-U+07FF    1                  2            2
U+0800-U+FFFF    1                  3            3
U+10000-         2                  4            2
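
The rows above can be checked with a small sketch like the following. The
class and method names (Utf8Widths, charCount, utf8Bytes, bytesPerChar) are
made up for illustration and are not part of JShell or the webrev; the sample
code points are just one arbitrary pick from each range.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not JShell code).
final class Utf8Widths {

    /** Number of Java char values needed for the code point (1 or 2). */
    static int charCount(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** Number of bytes the code point takes in standard UTF-8. */
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** UTF-8 bytes divided by the number of Java char values. */
    static int bytesPerChar(int codePoint) {
        return utf8Bytes(codePoint) / charCount(codePoint);
    }

    public static void main(String[] args) {
        // One sample per range: 'A', e-acute, hiragana a, first supplementary code point.
        int[] samples = {0x41, 0xE9, 0x3042, 0x10000};
        for (int cp : samples) {
            System.out.printf("U+%04X chars=%d bytes=%d bytes/char=%d%n",
                    cp, charCount(cp), utf8Bytes(cp), bytesPerChar(cp));
        }
    }
}
```

None of the samples exceeds 3 bytes per char value, which is why a test
string made of 3-byte characters such as \u3042 should exercise the worst case.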

To build a character at U+10000 or above, we have to use
Character#highSurrogate and Character#lowSurrogate:

jshell> int _4byteInUTF8 = 0x10000
_4byteInUTF8 ==> 65536

jshell> ""+Character.highSurrogate(_4byteInUTF8)+Character.lowSurrogate(_4byteInUTF8)
$23 ==> "𐀀"

jshell> $23.length()
$24 ==> 2

jshell> $23.getBytes().length
$26 ==> 4
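
For reference, here is a rough sketch of how the limit in [3] plays out,
assuming the value is sent with DataOutput#writeUTF as that link suggests.
writeUTF uses modified UTF-8, where every surrogate is written on its own as
3 bytes and \u0000 takes 2 bytes, so the bound of 3 bytes per char value
still holds. The names WriteUtfLimit, modifiedUtf8Length, fitsInWriteUTF and
repeat are hypothetical and only serve this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only (not JShell code).
final class WriteUtfLimit {

    /** writeUTF stores the byte length in 2 bytes, so this is the maximum. */
    static final int MAX_WRITE_UTF_BYTES = 65535;

    /** Bytes writeUTF emits for the string body (without the 2-byte length prefix). */
    static int modifiedUtf8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;
            } else if (c <= 0x07FF) {
                n += 2; // also covers \u0000 in modified UTF-8
            } else {
                n += 3; // includes each half of a surrogate pair
            }
        }
        return n;
    }

    /** True if writeUTF accepts the string, false if it is too long. */
    static boolean fitsInWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // encoded form is longer than 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Repeats s the given number of times. */
    static String repeat(String s, int times) {
        StringBuilder sb = new StringBuilder(s.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }
}
```

With these helpers, 21845 copies of "\u3042" (65535 bytes) still fit, while
21846 copies do not, so a length check in char values has to stay at or
below 65535 / 3 = 21845 to be safe for any content.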

Regards,
shinyafox(Shinya Yoshida)

[1]: https://tools.ietf.org/html/rfc3629
[2]: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html
[3]: https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String-


2016-08-20 0:55 GMT+09:00 Robert Field <robert.field at oracle.com>:

> Thanks shinyafox!
>
> Testing string length doesn't cover the actual encoded UTF-8 length, which
> can be four bytes. I should include a test with four bytes, and the check
> should correspondingly use a smaller number.
>
> Thanks,
> Robert
>
> On August 19, 2016 7:22:34 AM ShinyaYoshida <bitterfoxc at gmail.com> wrote:
>
>> Hi Robert,
>> I think there is still a problem with Japanese (or Chinese?) text.
>>
>> Could you try this test case?
>>     public void testLongRemoteJapaneseStrings() { //8158906
>>         assertEval("import java.util.stream.*;");
>>         assertEval("String m(int x) { return Stream.generate(() -> " +
>>                 "\"\u3042\").limit(x).collect(Collectors.joining()); }");
>>         boolean[] shut = new boolean[1];
>>         getState().onShutdown(j -> {shut[0] = true;} );
>>         List<SnippetEvent> el = assertEval("m(65600);");
>>         assertTrue(shut[0] == false, "JShell died with long value");
>>         assertEquals(el.size(), 1, "Expected one event");
>>         assertTrue(el.get(0).value().length() > 30000,
>>                 "Expected truncated but long String, got: " +
>>                 el.get(0).value());
>>     }
>>
>> "\u3042" is the first Japanese character, あ (a), and it takes 3 bytes in
>> UTF-8.
>> ( https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String- )
>> I think the problem isn't resolved yet because your code limits the String
>> to about 30000 characters,
>> i.e. the number of bytes on the stream can still be over about 60000.
>>
>> If the test case passes, LGTM!
>>
>> Regards,
>> shinyafox(Shinya Yoshida)
>>
>>
>> 2016-08-19 10:15 GMT+09:00 Robert Field <robert.field at oracle.com>:
>>
>>> Please review...
>>>
>>> Bug:
>>>     https://bugs.openjdk.java.net/browse/JDK-8158906
>>>
>>> Webrev:
>>>     http://cr.openjdk.java.net/~rfield/8158906v0.webrev/
>>>
>>> Thanks,
>>> Robert
>>>
>>>
>>

