RFR(10): 8181147: JNU_GetStringPlatformChars should have a fast path for UTF-8

Thu Jun 15 13:29:42 UTC 2017

Hi Sherman,

On 06/15/2017 05:27 AM, Xueming Shen wrote:
> Hi Claes,
>
> The change looks fine. Yes, encoding the 2-byete latin1 at native 
> looks reasonable.

Thanks for reviewing!

>
> One nick picking is that it might be better to initialize the constant 
> "LATIN1" from the String class
> when intializing String_coder_ID? recently I got a "naked constants" 
> complain #JDK-8156530, so
> I guess it might be better to do so at you change, though I don't 
> expect it's going to change in the
> foreseeable future, even we finally add a "utf8" variants there.
>
> 478 static const jbyte LATIN1 = 0;
> 479 static const jbyte UTF16  = 1;

Not sure how the "naked constants" issue applies, but maybe we can save 
us some
future surprises by adding a precondition check in the new test to 
ensure the
constants defined in jni_util matches the constants defined in 
java.lang.String:

http://cr.openjdk.java.net/~redestad/8181147/jdk.06/

Does this suffice?

Thanks!

/Claes

>
> thanks,
> Sherman
>
> On 6/13/17, 1:27 PM, Claes Redestad wrote:
>>
>>
>> On 2017-06-13 21:09, Martin Buchholz wrote:
>>> Looks good to me, although still hoping for more review from others.
>>
>> I expect Sherman to weigh in. :-)
>>
>>>
>>> +    if (bytes != 0) {
>>> Style: use NULL.
>>
>> Done.
>>
>>>
>>> +static jstring newStringJava(JNIEnv *env, const char *str) {
>>>
>>> I expected two versions, one that took a length and one that did 
>>> not, as you did with newString8859_1.
>>
>> Ah, yes. Refactored and updated in-place.
>>
>>> ---
>>> TIL that jbyte is signed char, not plain char, and so is more 
>>> consistent with java "byte", hence less error prone, at least in 
>>> theory.
>>>
>>> ---
>>> Can we test different native environments by using 
>>> -Dsun.jnu.encoding=... ?
>>
>> I did try, but jtreg currently doesn't appear to support specifying 
>> flags for main/native @run:s (and helpfully suggests using 
>> main/othervm instead...) - maybe someone knows a way around this?
>>
>> /Claes
>>
>>>
>>>
>>>
>>> On Tue, Jun 13, 2017 at 7:03 AM, Claes Redestad 
>>> <claes.redestad at oracle.com <mailto:claes.redestad at oracle.com>> wrote:
>>>
>>> <adding build-dev; see
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2017-June/048233.html
>>> <http://mail.openjdk.java.net/pipermail/core-libs-dev/2017-June/048233.html> 
>>>
>>>     for missing context>
>>>
>>>     Updated webrevs:
>>>     http://cr.openjdk.java.net/~redestad/8181147/jdk.05/
>>> <http://cr.openjdk.java.net/%7Eredestad/8181147/jdk.05/>
>>>     http://cr.openjdk.java.net/~redestad/8181147/top.04/
>>> <http://cr.openjdk.java.net/%7Eredestad/8181147/top.04/>
>>>
>>>     In this version there are various cleanups based on Martin's
>>>     comments, improved test coverage and some small changes to
>>>     ensure the updated test builds and runs on all our supported
>>>     platforms.
>>>
>>>     If someone with access to any of the platforms not supported
>>>     by Oracle could take this for a spin and ensure linking the newly
>>>     added test against libjava works as expected (or suggest how
>>>     to make it work if it doesn't) then that'd be much appreciated.
>>>
>>>     Thanks!
>>>
>>>     /Claes
>>>
>>>     On 06/13/2017 02:01 AM, Martin Buchholz wrote:
>>>
>>>         I'm hoping Xueming will review as well.
>>>         +    char asciiCheck;
>>>         +    for (asciiCheck = 0, p = str; *p != '\0'; p++) {
>>>         +        asciiCheck |= *p;
>>>         +    }
>>>         +    len = (p - str);
>>>         Probably conversion from ptrdiff_t to int needs a cast.
>>>
>>>         Someday we might need to worry about input string longer than
>>>         2^31, but that's a pre-existing problem.
>>>
>>>         The signed-ness of char being implementation-defined makes me
>>>         nervous.
>>>         Maybe do all the bit operations using explicit unsigned char,
>>>         then the final test becomes simply
>>>
>>>         asciiCheck <= 0x80
>>>
>>>
>>>         On Mon, Jun 12, 2017 at 4:36 PM, Claes Redestad
>>> <claes.redestad at oracle.com <mailto:claes.redestad at oracle.com>
>>> <mailto:claes.redestad at oracle.com
>>> <mailto:claes.redestad at oracle.com>>> wrote:
>>>
>>>             Hi Martin,
>>>
>>>             thanks for reviewing!
>>>
>>>             On 2017-06-12 22:14, Martin Buchholz wrote:
>>>
>>>                 +/* Optimized for char set UTF-8 */
>>>
>>>                 "charset" is a (poor misnomer!) jargon term, not two
>>>         words.
>>>
>>>
>>>             I got that from the existing use of "char set" in this
>>>         file, but
>>>             will fix it in all places.
>>>
>>>
>>>                 ---
>>>
>>>                 +    for (b = str[len]; b != '\0'; len++, b = 
>>> str[len]) {
>>>                 +        if (isAscii && b & 0x80) {
>>>                 +            isAscii = JNI_FALSE;
>>>                 +        }
>>>                 +    }
>>>
>>>                 I would write this more like
>>>
>>>                 const signed char *p;
>>>                 int isAscii;
>>>
>>>                 for (isAscii = 0, p = (const signed char *) str; *p !=
>>>         '\0';
>>>                 p++) isAscii &= (*p >= 0);
>>>
>>>
>>>             Did you mean for isAscii to be initialized to 1 (true) and
>>>         then be
>>>             cleared to 0 (false) when *p >= 0 is false?
>>>
>>>
>>>                 Then length is (p - str)
>>>
>>>
>>>             How about something like this to hoist the second
>>>         comparison from
>>>             the loop:
>>>
>>>                 int len;
>>>                 char asciiCheck;
>>>                 for (asciiCheck = 0, p = str; *p != '\0'; p++) {
>>>                     asciiCheck |= *p;
>>>                 }
>>>                 len = (p - str);
>>>
>>>                 if (asciiCheck & 0x80) {
>>>                     // ascii fast-path
>>>                     return newSizedString8859_1(env, str, len);
>>>                 }
>>>
>>>                 ...
>>>
>>>                 ---
>>>
>>>                 +    jbyteArray hab = NULL;
>>>
>>>                 I'm having a hard time decoding the name "hab"
>>>
>>>
>>>             Not sure, but my guess is "heap allocated byteArray".
>>>
>>>
>>>                 ---
>>>
>>>                 The code below is not UTF-8 specific. Can it be
>>>         refactored?
>>>
>>>                 +    hab = (*env)->NewByteArray(env, len);
>>>                 +    if (hab != 0) {
>>>                 +        jclass strClazz = JNU_ClassString(env);
>>>                 +        CHECK_NULL_RETURN(strClazz, 0);
>>>                 +        (*env)->SetByteArrayRegion(env, hab, 0, len,
>>>         (jbyte
>>>                 *)str);
>>>                 +        result = (*env)->NewObject(env, strClazz,
>>>                 +  String_init_ID, hab,
>>>                 jnuEncoding);
>>>                 +        (*env)->DeleteLocalRef(env, hab);
>>>                 +        return result;
>>>                 +    }
>>>
>>>
>>>             Yes, probably. The excessive copy-paste here is due to a 
>>> crash
>>>             issue I ran into when
>>>             Charset.isSupported was called without proper
>>>         initialization. It
>>>             has since been
>>>             resolved, but seems I forgot to revisit this part.
>>>
>>>             I also realized I haven't added a test for this method.
>>>         I'll look
>>>             into doing the
>>>             refactoring and adding some sanity testing tomorrow.
>>>
>>>
>>>                 ---
>>>
>>>                 We probably want to use unicode escapes in out java
>>>         sources to
>>>                 keep all source files strictly ASCII.
>>>
>>>
>>>             You mean in the test? Sure.
>>>
>>>             Refreshed the jdk webrev:
>>>         http://cr.openjdk.java.net/~redestad/8181147/jdk.05/
>>> <http://cr.openjdk.java.net/%7Eredestad/8181147/jdk.05/>
>>> <http://cr.openjdk.java.net/%7Eredestad/8181147/jdk.05/
>>> <http://cr.openjdk.java.net/%7Eredestad/8181147/jdk.05/>> -
>>>         more to
>>>             come.
>>>
>>>             Thanks!
>>>
>>>             /Claes
>>>
>>>
>>>
>>>
>>
>