RFR (S) 8144940: Broken hash in string table entry in closed/runtime/7158800/BadUtf8.java

Wed Mar 23 07:15:55 UTC 2016

Hi Ioi,

On 23.03.2016 03:05, Ioi Lam wrote:
> I've done a little investigation on Latin1 encoding of java.lang.String and how that relates to this bug. Here's what I found:
> 
>  * Latin1 is is a 8-bit encoding of characters.
>  * The first 256 Unicode characters are exactly the same as the Latin1
>    encoding.
>  * So if all the jchars in a java.lang.String are <= 0xff, it can be
>    stored in (logically) an unsigned byte array with the upper 16 bits
>    truncated.
>      o I am not sure if it's REQUIRED for such strings to be stored in
>        a byte array. It might be possible to create an equivalent
>        String with jchar[] storage. You can certainly do that with
>        Unsafe + reflection.

With Compact Strings enabled, it is required that all Latin1 encoded Strings are "compressed", i.e., all jchars are truncated to jbytes and String.coder == LATIN1. For example, String.equals() relies on this by returning false if the coders are different (without looking at the actual content of the Strings). Of course, you can break this with reflection.

> This function, which you removed, should return the exact same hashcode regardless whether the String was stored as a byte[] or jchar[]
> 
>    unsigned int java_lang_String::hash_string(oop java_string) {
>       int          length = java_lang_String::length(java_string);
>       // Zero length string doesn't necessarily hash to zero.
>       if (length == 0) {
>         return StringTable::hash_string((jchar*) NULL, 0);
>       }
> 
>       typeArrayOop value  = java_lang_String::value(java_string);
>       bool      is_latin1 = java_lang_String::is_latin1(java_string);
>       if (is_latin1) {
>         return StringTable::hash_string(value->byte_at_addr(0), length);
>       } else {
>         return StringTable::hash_string(value->char_at_addr(0), length);
>       }
>    }
> 
> That's because these two functions should produce the exact same result (if all the unsigned arithmetics are correct ...)
> 
>       static unsigned int hash_code(const jchar* s, int len) {
>         unsigned int h = 0;
>         while (len-- > 0) {
>           h = 31*h + (unsigned int) *s;
>           s++;
>         }
>         return h;
>       }
> 
>       static unsigned int hash_code(const jbyte* s, int len) {
>         unsigned int h = 0;
>         while (len-- > 0) {
>           h = 31*h + (((unsigned int) *s) & 0xFF);
>           s++;
>         }
>         return h;
>       }
> 
> For shared interned strings, we actually use the <jbyte> version during dump time (writing into shared strings table), and the <jchar> version at run time (look up from shared string table). I wrote a test and validated that the the two hashcodes are identical.
> 
> So, I believe your fix for 8144940 works not because you force the body to be converted to jchar. Rather, it's because java_lang_String::hash_string does not consider StringTable::use_alternate_hashcode().
> 
> Also, I am glad that you removed the <jbyte> version of this template:
> 
>    template<typename T>
>    unsigned int StringTable::hash_string(const T* s, int len) {
>       return use_alternate_hashcode() ? AltHashing::murmur3_32(seed(),
>    s, len) :
>                                         java_lang_String::hash_code(s,
>    len);
>    }
> 
>    // Explicit instantiation for all supported types.
>    template unsigned int StringTable::hash_string<jchar>(const jchar*
>    s, int len);
>    template unsigned int StringTable::hash_string<jbyte>(const jbyte*
>    s, int len);
> 
> Otherwise, someone is likely to call it with a UTF8 string and get unexpected results. I am not even sure if it would give identical results as the <jchar> version if the input was encoded in Latin1.

Yes, that's what I meant by "Is it because the jbyte version of AltHashing::murmur3_32() is used?". Thanks for clarifying.

Best regards,
Tobias

> 
> Thanks
> - Ioi
> 
> On 3/22/16 10:39 AM, Coleen Phillimore wrote:
>>
>> Thank you, Jiangli.
>> Coleen
>>
>> On 3/22/16 1:35 PM, Jiangli Zhou wrote:
>>> Hi Coleen,
>>>
>>> Looks good to me. I had same question as Tobias yesterday. Your answer cleared it.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>> On Mar 22, 2016, at 10:07 AM, Coleen Phillimore <coleen.phillimore at oracle.com> wrote:
>>>>
>>>>
>>>> Here's another webrev with the changes pointed out by Tobias and verified with -XX:+VerifyStringTableAtExit.
>>>>
>>>> open webrev at http://cr.openjdk.java.net/~coleenp/8144940.02/webrev
>>>>
>>>> Thanks!
>>>> Coleen
>>>>
>>>> On 3/22/16 12:21 PM, Tobias Hartmann wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> On 22.03.2016 13:40, Coleen Phillimore wrote:
>>>>>> On 3/22/16 4:04 AM, Tobias Hartmann wrote:
>>>>>>> Hi Coleen,
>>>>>>>
>>>>>>> On 21.03.2016 22:11, Coleen Phillimore wrote:
>>>>>>>> Summary: Fix code broken with compact Strings.
>>>>>>>>
>>>>>>>> One of the failure modes of an intermittent bug (but this failure is not intermittent).
>>>>>>>>
>>>>>>>> Tested with the failing test cases that exercise this code. Also, testing in order to find linked bugs.
>>>>>>>>
>>>>>>>> open webrev at http://cr.openjdk.java.net/~coleenp/8144940.01/webrev
>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8144940
>>>>>>> I wonder why the result is different if you first convert the latin1 String to Unicode and then use the jchar hash_string() version compared to just using the jbyte hash_string() version? Is it because the jbyte version of AltHashing::murmur3_32() is used?
>>>>>> Yes, I believe it is.
>>>>> Okay, thanks for checking.
>>>>>
>>>>>>> Now we don't need the StringTable::hash_string<jbyte> version anymore, right?
>>>>>> This one is used by Symbol* which are jbyte.
>>>>> I only see jchar uses of StringTable::hash_string() (after your fix). Are you confusing it with java_lang_String::hash_code() which also has a jbyte and jchar version? This one is indeed used by the SymbolTable.
>>>>>
>>>>>>> Just noticed that there is an unused "latin1_hash_code" in javaClasses.hpp which can be removed as well.
>>>>>> Thank you, I'll remove it.
>>>>> Thanks!
>>>>>
>>>>> Best regards,
>>>>> Tobias
>>>>>
>>>>>>> Thanks for fixing this!
>>>>>> Thanks for reviewing it!
>>>>>> Coleen
>>>>>>> Best regards,
>>>>>>> Tobias
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Coleen
>>
>