RFR (S) 8144940: Broken hash in string table entry in closed/runtime/7158800/BadUtf8.java

Wed Mar 23 02:05:20 UTC 2016

I've done a little investigation on Latin1 encoding of java.lang.String 
and how that relates to this bug. Here's what I found:

  * Latin1 is is a 8-bit encoding of characters.
  * The first 256 Unicode characters are exactly the same as the Latin1
    encoding.
  * So if all the jchars in a java.lang.String are <= 0xff, it can be
    stored in (logically) an unsigned byte array with the upper 16 bits
    truncated.
      o I am not sure if it's REQUIRED for such strings to be stored in
        a byte array. It might be possible to create an equivalent
        String with jchar[] storage. You can certainly do that with
        Unsafe + reflection.

This function, which you removed, should return the exact same hashcode 
regardless whether the String was stored as a byte[] or jchar[]

    unsigned int java_lang_String::hash_string(oop java_string) {
       int          length = java_lang_String::length(java_string);
       // Zero length string doesn't necessarily hash to zero.
       if (length == 0) {
         return StringTable::hash_string((jchar*) NULL, 0);
       }

       typeArrayOop value  = java_lang_String::value(java_string);
       bool      is_latin1 = java_lang_String::is_latin1(java_string);
       if (is_latin1) {
         return StringTable::hash_string(value->byte_at_addr(0), length);
       } else {
         return StringTable::hash_string(value->char_at_addr(0), length);
       }
    }

That's because these two functions should produce the exact same result 
(if all the unsigned arithmetics are correct ...)

       static unsigned int hash_code(const jchar* s, int len) {
         unsigned int h = 0;
         while (len-- > 0) {
           h = 31*h + (unsigned int) *s;
           s++;
         }
         return h;
       }

       static unsigned int hash_code(const jbyte* s, int len) {
         unsigned int h = 0;
         while (len-- > 0) {
           h = 31*h + (((unsigned int) *s) & 0xFF);
           s++;
         }
         return h;
       }

For shared interned strings, we actually use the <jbyte> version during 
dump time (writing into shared strings table), and the <jchar> version 
at run time (look up from shared string table). I wrote a test and 
validated that the the two hashcodes are identical.

So, I believe your fix for 8144940 works not because you force the body 
to be converted to jchar. Rather, it's because 
java_lang_String::hash_string does not consider 
StringTable::use_alternate_hashcode().

Also, I am glad that you removed the <jbyte> version of this template:

    template<typename T>
    unsigned int StringTable::hash_string(const T* s, int len) {
       return use_alternate_hashcode() ? AltHashing::murmur3_32(seed(),
    s, len) :
                                         java_lang_String::hash_code(s,
    len);
    }

    // Explicit instantiation for all supported types.
    template unsigned int StringTable::hash_string<jchar>(const jchar*
    s, int len);
    template unsigned int StringTable::hash_string<jbyte>(const jbyte*
    s, int len);

Otherwise, someone is likely to call it with a UTF8 string and get 
unexpected results. I am not even sure if it would give identical 
results as the <jchar> version if the input was encoded in Latin1.

Thanks
- Ioi

On 3/22/16 10:39 AM, Coleen Phillimore wrote:
>
> Thank you, Jiangli.
> Coleen
>
> On 3/22/16 1:35 PM, Jiangli Zhou wrote:
>> Hi Coleen,
>>
>> Looks good to me. I had same question as Tobias yesterday. Your 
>> answer cleared it.
>>
>> Thanks,
>> Jiangli
>>
>>> On Mar 22, 2016, at 10:07 AM, Coleen Phillimore 
>>> <coleen.phillimore at oracle.com> wrote:
>>>
>>>
>>> Here's another webrev with the changes pointed out by Tobias and 
>>> verified with -XX:+VerifyStringTableAtExit.
>>>
>>> open webrev at http://cr.openjdk.java.net/~coleenp/8144940.02/webrev
>>>
>>> Thanks!
>>> Coleen
>>>
>>> On 3/22/16 12:21 PM, Tobias Hartmann wrote:
>>>> Hi Coleen,
>>>>
>>>> On 22.03.2016 13:40, Coleen Phillimore wrote:
>>>>> On 3/22/16 4:04 AM, Tobias Hartmann wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> On 21.03.2016 22:11, Coleen Phillimore wrote:
>>>>>>> Summary: Fix code broken with compact Strings.
>>>>>>>
>>>>>>> One of the failure modes of an intermittent bug (but this 
>>>>>>> failure is not intermittent).
>>>>>>>
>>>>>>> Tested with the failing test cases that exercise this code. 
>>>>>>> Also, testing in order to find linked bugs.
>>>>>>>
>>>>>>> open webrev at 
>>>>>>> http://cr.openjdk.java.net/~coleenp/8144940.01/webrev
>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8144940
>>>>>> I wonder why the result is different if you first convert the 
>>>>>> latin1 String to Unicode and then use the jchar hash_string() 
>>>>>> version compared to just using the jbyte hash_string() version? 
>>>>>> Is it because the jbyte version of AltHashing::murmur3_32() is used?
>>>>> Yes, I believe it is.
>>>> Okay, thanks for checking.
>>>>
>>>>>> Now we don't need the StringTable::hash_string<jbyte> version 
>>>>>> anymore, right?
>>>>> This one is used by Symbol* which are jbyte.
>>>> I only see jchar uses of StringTable::hash_string() (after your 
>>>> fix). Are you confusing it with java_lang_String::hash_code() which 
>>>> also has a jbyte and jchar version? This one is indeed used by the 
>>>> SymbolTable.
>>>>
>>>>>> Just noticed that there is an unused "latin1_hash_code" in 
>>>>>> javaClasses.hpp which can be removed as well.
>>>>> Thank you, I'll remove it.
>>>> Thanks!
>>>>
>>>> Best regards,
>>>> Tobias
>>>>
>>>>>> Thanks for fixing this!
>>>>> Thanks for reviewing it!
>>>>> Coleen
>>>>>> Best regards,
>>>>>> Tobias
>>>>>>
>>>>>>> Thanks,
>>>>>>> Coleen
>