RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)
Johan Sjölen
jsjolen at openjdk.org
Fri Oct 11 13:56:58 UTC 2024
On Thu, 3 Oct 2024 13:54:02 GMT, Casper Norrbin <cnorrbin at openjdk.org> wrote:
> Hi everyone,
>
> String interning can be done on 4 different types of strings:
> - oop-strings (unicode)
> - oop-strings (latin1)
> - Symbols (non-null-terminated utf8)
> - null-terminated utf8 char arrays
>
> Currently, when doing interning, all 4 types are first converted to unicode and copied to a jchar array. This array is used when looking in the CDS- and interning tables. If an existing string does not exist, this array is converted to a new string object, which is then inserted into the interning table.
>
> This is less efficient than it has to be. As strings are likely to exist in the table(s), it would be beneficial to avoid the initial jchar array allocation. When inserting into the interning table, there is also a possibility to reuse the original string object, avoiding another allocation.
>
> This change makes it possible to search in the tables using the different string types, avoiding that initial allocation. This is done by wrapping the string and tagging it with a type, with helper functions directing to the correct hashing/lookup/equal functions. When inserting into the table, we can now reuse the original object or go directly from the input type to an object. To do this, functionality had to be added to hash utf8-strings and to compare oop-strings with utf8. These convert utf8 into unicode character by character and operates on those, thus avoiding needing extra allocations.
>
> Some quick rudimentary JMH benchmarks show a ~20% increase in throughput when interning the same string repeatedly, and a ~5% increase in throughput interning only unique strings. (Only tested on my local mac aarch debug build)
>
> 2 new tests have also been added. The first test tests that hash codes and string equality remain consistent when converting between different string types. The second test tests that string interning works as expected when equal strings are interned from different string types.
> Also tested and passes tiers 1-3.
Some style comments.
Some more comments. The general outline looks good and I like how we avoid an allocation. We do now take the cost of a virtual call, but I think that's negligible. If we can have some sort of micro benchmark for this, just something simple, that showcases the difference, that would be great.
I'm not sure, but when I look at the code it seems like the `virtual equals` method is unnecessary and can just be removed and we will regain static calls in the `ConcurrentHashTable`.
src/hotspot/share/classfile/javaClasses.cpp line 765:
> 763: }
> 764:
> 765: bool java_lang_String::equals(oop java_string, const char* utf8_string, int len) {
I find it super confusing to use generic `len` here and would very much appreciate it if we called this `nr_unicode_points` or something like that. You could even add a comment. Honestly, I think that maybe we should change all of these functions for additional clarity.
src/hotspot/share/classfile/javaClasses.cpp line 775:
> 773: bool is_latin1 = java_lang_String::is_latin1(java_string);
> 774: jchar c;
> 775: const char *ptr = utf8_string;
No need for the `ptr`, just use `utf8_str` directly.
src/hotspot/share/classfile/javaClasses.hpp line 184:
> 182: }
> 183:
> 184: static unsigned int hash_code(const char *utf8_str, int num_chars) {
`num_unicode_points`
src/hotspot/share/classfile/javaClasses.hpp line 185:
> 183:
> 184: static unsigned int hash_code(const char *utf8_str, int num_chars) {
> 185: const char *ptr = utf8_str;
No need for the `ptr`, just use `utf8_str` directly.
src/hotspot/share/classfile/javaClasses.hpp line 200:
> 198:
> 199: static bool equals(oop java_string, const jchar* chars, int len);
> 200: static bool equals(oop java_string, const char* chars, int len);
`num_unicode_points`
src/hotspot/share/classfile/stringTable.cpp line 143:
> 141: return java_lang_String::hash_code(wrapped_str.utf8_str, len);
> 142: }
> 143: return 0;
Add `ShouldNotReachHere();`
src/hotspot/share/classfile/stringTable.cpp line 156:
> 154: case StringType::utf8_str:
> 155: return java_lang_String::equals(java_string, wrapped_str.utf8_str, len);
> 156: }
Add `ShouldNotReachHere();`
src/hotspot/share/classfile/stringTable.cpp line 212:
> 210: };
> 211:
> 212: class StringTableLookupJchar : public StringTableLookup {
`StringTableLookupUnicode`?
src/hotspot/share/classfile/stringTable.cpp line 236:
> 234: };
> 235:
> 236: class StringTableLookupChar : public StringTableLookup {
Maybe call `StringTableLookupUtf8` instead?
src/hotspot/share/classfile/stringTable.cpp line 438:
> 436: return java_lang_String::create_from_str(wrapped_str.utf8_str, THREAD);
> 437: }
> 438: return Handle();
`ShouldNotReachHere`
src/hotspot/share/classfile/stringTable.hpp line 59:
> 57: static double get_dead_factor(size_t num_dead);
> 58:
> 59: typedef enum { obj_str, unicode_str, symbol_str, utf8_str } StringType;
Style:
```c++
enum class StringType {
OopStr, UnicodeStr, SymbolStr, Utf8Str
};
You can also add a short comment for each case to differentiate them further. Note: `OopStr`, not `Objstr`, I think that makes the most sense.
src/hotspot/share/classfile/stringTable.hpp line 73:
> 71: StringWrapperInternal(const jchar* unicodeStr) : unicode_str(unicodeStr), type(StringType::unicode_str) {}
> 72: StringWrapperInternal(const Symbol* symbolStr) : symbol_str(symbolStr), type(StringType::symbol_str) {}
> 73: StringWrapperInternal(const char* utf8Str) : utf8_str(utf8Str), type(StringType::utf8_str) {}
Style: Use `snake_case` for names which aren't types.
src/hotspot/share/oops/symbol.hpp line 262:
> 260:
> 261: // Returns the non-null-terminated utf8 string stored in the symbol
> 262: const char* get_utf8() const { return (char*)bytes(); }
I'd prefer to see `static_cast<char*>(bytes)`
-------------
PR Review: https://git.openjdk.org/jdk/pull/21325#pullrequestreview-2347681196
PR Review: https://git.openjdk.org/jdk/pull/21325#pullrequestreview-2347998553
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787677952
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787688260
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787684958
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787687909
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787688754
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787690968
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787691668
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787701230
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787700656
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787707958
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787512169
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787501944
PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1787493537
More information about the hotspot-dev
mailing list