RFR(10): 8181147: JNU_GetStringPlatformChars should have a fast path for UTF-8

Tue Jun 13 00:01:46 UTC 2017

I'm hoping Xueming will review as well.

+    char asciiCheck;
+    for (asciiCheck = 0, p = str; *p != '\0'; p++) {
+        asciiCheck |= *p;
+    }
+    len = (p - str);

Probably conversion from ptrdiff_t to int needs a cast.

Someday we might need to worry about input string longer than 2^31, but
that's a pre-existing problem.

The signed-ness of char being implementation-defined makes me nervous.
Maybe do all the bit operations using explicit unsigned char, then the
final test becomes simply

asciiCheck <= 0x80

On Mon, Jun 12, 2017 at 4:36 PM, Claes Redestad <claes.redestad at oracle.com>
wrote:

> Hi Martin,
>
> thanks for reviewing!
>
> On 2017-06-12 22:14, Martin Buchholz wrote:
>
>> +/* Optimized for char set UTF-8 */
>>
>> "charset" is a (poor misnomer!) jargon term, not two words.
>>
>
> I got that from the existing use of "char set" in this file, but will fix
> it in all places.
>
>
>> ---
>>
>> +    for (b = str[len]; b != '\0'; len++, b = str[len]) {
>> +        if (isAscii && b & 0x80) {
>> +            isAscii = JNI_FALSE;
>> +        }
>> +    }
>>
>> I would write this more like
>>
>> const signed char *p;
>> int isAscii;
>>
>> for (isAscii = 0, p = (const signed char *) str; *p != '\0'; p++) isAscii
>> &= (*p >= 0);
>>
>
> Did you mean for isAscii to be initialized to 1 (true) and then be cleared
> to 0 (false) when *p >= 0 is false?
>
>
>> Then length is (p - str)
>>
>
> How about something like this to hoist the second comparison from the loop:
>
>     int len;
>     char asciiCheck;
>     for (asciiCheck = 0, p = str; *p != '\0'; p++) {
>         asciiCheck |= *p;
>     }
>     len = (p - str);
>
>     if (asciiCheck & 0x80) {
>         // ascii fast-path
>         return newSizedString8859_1(env, str, len);
>     }
>
>     ...
>
> ---
>>
>> +    jbyteArray hab = NULL;
>>
>> I'm having a hard time decoding the name "hab"
>>
>
> Not sure, but my guess is "heap allocated byteArray".
>
>
>> ---
>>
>> The code below is not UTF-8 specific. Can it be refactored?
>>
>> +    hab = (*env)->NewByteArray(env, len);
>> +    if (hab != 0) {
>> +        jclass strClazz = JNU_ClassString(env);
>> +        CHECK_NULL_RETURN(strClazz, 0);
>> +        (*env)->SetByteArrayRegion(env, hab, 0, len, (jbyte *)str);
>> +        result = (*env)->NewObject(env, strClazz,
>> +                                       String_init_ID, hab, jnuEncoding);
>> +        (*env)->DeleteLocalRef(env, hab);
>> +        return result;
>> +    }
>>
>
> Yes, probably. The excessive copy-paste here is due to a crash issue I ran
> into when
> Charset.isSupported was called without proper initialization. It has since
> been
> resolved, but seems I forgot to revisit this part.
>
> I also realized I haven't added a test for this method. I'll look into
> doing the
> refactoring and adding some sanity testing tomorrow.
>
>
>> ---
>>
>> We probably want to use unicode escapes in out java sources to keep all
>> source files strictly ASCII.
>>
>
> You mean in the test? Sure.
>
> Refreshed the jdk webrev: http://cr.openjdk.java.net/~re
> destad/8181147/jdk.05/ - more to come.
>
> Thanks!
>
> /Claes
>
>