RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS) [v3]

David Holmes dholmes at openjdk.org
Tue Oct 22 04:57:17 UTC 2024


On Mon, 21 Oct 2024 17:34:49 GMT, Casper Norrbin <cnorrbin at openjdk.org> wrote:

>> Hi everyone,
>> 
>> String interning can be done on 4 different types of strings:
>> - oop-strings (unicode)
>> - oop-strings (latin1)
>> - Symbols (non-null-terminated utf8)
>> - null-terminated utf8 char arrays
>> 
>> Currently, when doing interning, all 4 types are first converted to unicode and copied to a jchar array. This array is used when looking in the CDS- and interning tables. If an existing string does not exist, this array is converted to a new string object, which is then inserted into the interning table.
>> 
>> This is less efficient than it has to be. As strings are likely to exist in the table(s), it would be beneficial to avoid the initial jchar array allocation. When inserting into the interning table, there is also a possibility to reuse the original string object, avoiding another allocation.
>> 
>> This change makes it possible to search in the tables using the different string types, avoiding that initial allocation. This is done by wrapping the string and tagging it with a type, with helper functions directing to the correct hashing/lookup/equal functions. When inserting into the table, we can now reuse the original object or go directly from the input type to an object. To do this, functionality had to be added to hash utf8-strings and to compare oop-strings with utf8. These convert utf8 into unicode character by character and operates on those, thus avoiding needing extra allocations.
>> 
>> Some quick rudimentary JMH benchmarks show a ~20% increase in throughput when interning the same string repeatedly, and a ~5% increase in throughput interning only unique strings. (Only tested on my local mac aarch debug build)
>> 
>> 2 new tests have also been added. The first test tests that hash codes and string equality remain consistent when converting between different string types. The second test tests that string interning works as expected when equal strings are interned from different string types.
>> Also tested and passes tiers 1-3.
>
> Casper Norrbin has updated the pull request incrementally with one additional commit since the last revision:
> 
>   missed cleanup

This is a lot clearer now - thanks. Though I confess I still have some trouble following the `len` parameter through all the code paths.

A few minor suggestions and some queries.

Thanks

src/hotspot/share/classfile/javaClasses.hpp line 186:

> 184:   static unsigned int hash_code(const char *utf8_str, int utf8_len) {
> 185:     unsigned int h = 0;
> 186:     int num_unicode_points = UTF8::unicode_length(utf8_str, utf8_len);

Suggestion:

    int unicode_length = UTF8::unicode_length(utf8_str, utf8_len);

It is clearer to always use the same name for the local assigned from `UTF8::unicode_length`.

src/hotspot/share/classfile/javaClasses.hpp line 188:

> 186:     int num_unicode_points = UTF8::unicode_length(utf8_str, utf8_len);
> 187: 
> 188:     jchar a;

Suggestion:

    jchar c;

Use a consistent pattern for this.

src/hotspot/share/classfile/stringTable.cpp line 144:

> 142:     return java_lang_String::hash_code(wrapped_str.utf8_str, len);
> 143:   }
> 144:   ShouldNotReachHere();

Suggestion:

  case default:
    ShouldNotReachHere();
  }

src/hotspot/share/classfile/stringTable.cpp line 159:

> 157:     return java_lang_String::equals(java_string, wrapped_str.utf8_str, len);
> 158:   }
> 159:   ShouldNotReachHere();

Suggestion:

 case default:
   ShouldNotReachHere();
 }

src/hotspot/share/classfile/stringTable.cpp line 434:

> 432:   }
> 433:   }
> 434:   ShouldNotReachHere();

Suggestion:

  case default:
    ShouldNotReachHere();
    }

src/hotspot/share/classfile/stringTable.cpp line 449:

> 447:     return java_lang_String::create_from_str(wrapped_str.utf8_str, THREAD);
> 448:   }
> 449:   ShouldNotReachHere();

Suggestion:

  case default:
    ShouldNotReachHere();
  }

src/hotspot/share/classfile/stringTable.cpp line 473:

> 471: oop StringTable::intern(const char* utf8_string, TRAPS) {
> 472:   if (utf8_string == nullptr) return nullptr;
> 473:   int length = static_cast<int>(strlen(utf8_string));

Is the source for such a string always a constant pool entry, and so known to < 64K in length?

src/hotspot/share/classfile/stringTable.cpp line 491:

> 489:     // Convert to unicode for alt hashing
> 490:     int length = len;
> 491:     const jchar *chars = to_unicode(name, length, CHECK_NULL);

Suggestion:

    const jchar* chars = to_unicode(name, length, CHECK_NULL);

src/hotspot/share/oops/symbol.hpp line 262:

> 260: 
> 261:   // Returns the non-null-terminated utf8 string stored in the symbol
> 262:   const char* get_utf8() const { return reinterpret_cast<const char*>(bytes()); }

Not sure this really helps as it will cause confusion between `as_utf8` and `get_utf8`. A `Symbol` is a canonicalized string in modified UTF8 form - so `bytes` is that raw representation.

test/hotspot/gtest/classfile/test_stringConversion.cpp line 30:

> 28: // Tests that string functions (hash code/equals) stay consistant when comparing equal strings and converting between strings types
> 29: 
> 30: // Simple ASCII string " Java(R)!! "

Suggestion:

// Simple ASCII string "Java(R)!!"

Please remove leading and trailing space within the quotes in all cases.

test/hotspot/gtest/classfile/test_stringConversion.cpp line 35:

> 33: static const jchar static_ascii_unicode_str[] = {0x004A, 0x0061, 0x0076, 0x0061, 0x0028, 0x0052, 0x0029, 0x0021, 0x0021};
> 34: 
> 35: // Complex string " Jāvá®!☺☻ ", UTF8 has character lengths 13122133 = 16

Not sure it is okay to have those special characters in the source code. ??

test/hotspot/gtest/classfile/test_stringConversion.cpp line 36:

> 34: 
> 35: // Complex string " Jāvá®!☺☻ ", UTF8 has character lengths 13122133 = 16
> 36: static const unsigned char static_utf8_str[] = {0x4A, 0x61, 0xCC, 0x84, 0x76, 0xC3, 0xA1, 0xC2, 0xAE, 0x21, 0xE2, 0x98, 0xBA, 0xE2, 0x98, 0xBB};

0x61 is `a` not a-with-tilde ??

-------------

Changes requested by dholmes (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21325#pullrequestreview-2383720565
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809823372
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809823686
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809825027
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809825653
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809900301
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809902426
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809912885
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809913933
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809820730
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809931686
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809927400
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1809938717


More information about the hotspot-dev mailing list