RFR (S) 8144940: Broken hash in string table entry in closed/runtime/7158800/BadUtf8.java

Wed Mar 23 16:38:06 UTC 2016

Hi, I realized that I didn't describe this well in either the bug or the 
RFR, so I put a description of what the problem is and why this fixes it 
in the bug.

On 3/22/16 10:05 PM, Ioi Lam wrote:
> I've done a little investigation on Latin1 encoding of 
> java.lang.String and how that relates to this bug. Here's what I found:
>
>  * Latin1 is is a 8-bit encoding of characters.
>  * The first 256 Unicode characters are exactly the same as the Latin1
>    encoding.
>  * So if all the jchars in a java.lang.String are <= 0xff, it can be
>    stored in (logically) an unsigned byte array with the upper 16 bits
>    truncated.
>      o I am not sure if it's REQUIRED for such strings to be stored in
>        a byte array. It might be possible to create an equivalent
>        String with jchar[] storage. You can certainly do that with
>        Unsafe + reflection.
>
> This function, which you removed, should return the exact same 
> hashcode regardless whether the String was stored as a byte[] or jchar[]
>
>    unsigned int java_lang_String::hash_string(oop java_string) {
>       int          length = java_lang_String::length(java_string);
>       // Zero length string doesn't necessarily hash to zero.
>       if (length == 0) {
>         return StringTable::hash_string((jchar*) NULL, 0);
>       }
>
>       typeArrayOop value  = java_lang_String::value(java_string);
>       bool      is_latin1 = java_lang_String::is_latin1(java_string);
>       if (is_latin1) {
>         return StringTable::hash_string(value->byte_at_addr(0), length);
>       } else {
>         return StringTable::hash_string(value->char_at_addr(0), length);
>       }
>    }
>
> That's because these two functions should produce the exact same 
> result (if all the unsigned arithmetics are correct ...)
>
>       static unsigned int hash_code(const jchar* s, int len) {
>         unsigned int h = 0;
>         while (len-- > 0) {
>           h = 31*h + (unsigned int) *s;
>           s++;
>         }
>         return h;
>       }
>
>       static unsigned int hash_code(const jbyte* s, int len) {
>         unsigned int h = 0;
>         while (len-- > 0) {
>           h = 31*h + (((unsigned int) *s) & 0xFF);
>           s++;
>         }
>         return h;
>       }
>
> For shared interned strings, we actually use the <jbyte> version 
> during dump time (writing into shared strings table), and the <jchar> 
> version at run time (look up from shared string table). I wrote a test 
> and validated that the the two hashcodes are identical.

For the alternate hashcode, the two hashcodes were different, which 
caused this bug.  I don't think you should rely on the jbyte and jchar 
versions returning the same thing.  I think you should always use the 
jchar version of the hash code for shared intern strings.
>
> So, I believe your fix for 8144940 works not because you force the 
> body to be converted to jchar. Rather, it's because 
> java_lang_String::hash_string does not consider 
> StringTable::use_alternate_hashcode().
>
> Also, I am glad that you removed the <jbyte> version of this template:
>
>    template<typename T>
>    unsigned int StringTable::hash_string(const T* s, int len) {
>       return use_alternate_hashcode() ? AltHashing::murmur3_32(seed(),
>    s, len) :
> java_lang_String::hash_code(s,
>    len);
>    }
>
>    // Explicit instantiation for all supported types.
>    template unsigned int StringTable::hash_string<jchar>(const jchar*
>    s, int len);
>    template unsigned int StringTable::hash_string<jbyte>(const jbyte*
>    s, int len);
>
> Otherwise, someone is likely to call it with a UTF8 string and get 
> unexpected results. I am not even sure if it would give identical 
> results as the <jchar> version if the input was encoded in Latin1.

I think it doesn't get the same results.

Hopefully, I've put enough information in the bug and yeah, removing 
java_lang_String::hash_string() removed some confusion on my part at 
least, since there's also a StringTable::hash_string, which probably 
should be something like hash_string_for_StringTable but not for 
String.hashCode().

Thanks!
Coleen

>
> Thanks
> - Ioi
>
> On 3/22/16 10:39 AM, Coleen Phillimore wrote:
>>
>> Thank you, Jiangli.
>> Coleen
>>
>> On 3/22/16 1:35 PM, Jiangli Zhou wrote:
>>> Hi Coleen,
>>>
>>> Looks good to me. I had same question as Tobias yesterday. Your 
>>> answer cleared it.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>> On Mar 22, 2016, at 10:07 AM, Coleen Phillimore 
>>>> <coleen.phillimore at oracle.com> wrote:
>>>>
>>>>
>>>> Here's another webrev with the changes pointed out by Tobias and 
>>>> verified with -XX:+VerifyStringTableAtExit.
>>>>
>>>> open webrev at http://cr.openjdk.java.net/~coleenp/8144940.02/webrev
>>>>
>>>> Thanks!
>>>> Coleen
>>>>
>>>> On 3/22/16 12:21 PM, Tobias Hartmann wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> On 22.03.2016 13:40, Coleen Phillimore wrote:
>>>>>> On 3/22/16 4:04 AM, Tobias Hartmann wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> On 21.03.2016 22:11, Coleen Phillimore wrote:
>>>>>>>> Summary: Fix code broken with compact Strings.
>>>>>>>>
>>>>>>>> One of the failure modes of an intermittent bug (but this 
>>>>>>>> failure is not intermittent).
>>>>>>>>
>>>>>>>> Tested with the failing test cases that exercise this code. 
>>>>>>>> Also, testing in order to find linked bugs.
>>>>>>>>
>>>>>>>> open webrev at 
>>>>>>>> http://cr.openjdk.java.net/~coleenp/8144940.01/webrev
>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8144940
>>>>>>> I wonder why the result is different if you first convert the 
>>>>>>> latin1 String to Unicode and then use the jchar hash_string() 
>>>>>>> version compared to just using the jbyte hash_string() version? 
>>>>>>> Is it because the jbyte version of AltHashing::murmur3_32() is 
>>>>>>> used?
>>>>>> Yes, I believe it is.
>>>>> Okay, thanks for checking.
>>>>>
>>>>>>> Now we don't need the StringTable::hash_string<jbyte> version 
>>>>>>> anymore, right?
>>>>>> This one is used by Symbol* which are jbyte.
>>>>> I only see jchar uses of StringTable::hash_string() (after your 
>>>>> fix). Are you confusing it with java_lang_String::hash_code() 
>>>>> which also has a jbyte and jchar version? This one is indeed used 
>>>>> by the SymbolTable.
>>>>>
>>>>>>> Just noticed that there is an unused "latin1_hash_code" in 
>>>>>>> javaClasses.hpp which can be removed as well.
>>>>>> Thank you, I'll remove it.
>>>>> Thanks!
>>>>>
>>>>> Best regards,
>>>>> Tobias
>>>>>
>>>>>>> Thanks for fixing this!
>>>>>> Thanks for reviewing it!
>>>>>> Coleen
>>>>>>> Best regards,
>>>>>>> Tobias
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>
>