RFR: 8327156: Avoid copying in StringTable::intern(oop, TRAPS)

Wed Oct 16 00:17:11 UTC 2024

On Tue, 15 Oct 2024 21:12:11 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> Hi everyone,
>> 
>> String interning can be done on 4 different types of strings:
>> - oop-strings (unicode)
>> - oop-strings (latin1)
>> - Symbols (non-null-terminated utf8)
>> - null-terminated utf8 char arrays
>> 
>> Currently, when doing interning, all 4 types are first converted to unicode and copied to a jchar array. This array is used when looking in the CDS- and interning tables. If no matching string exists, this array is converted to a new string object, which is then inserted into the interning table.
>> 
>> This is less efficient than it has to be. As strings are likely to exist in the table(s), it would be beneficial to avoid the initial jchar array allocation. When inserting into the interning table, there is also a possibility to reuse the original string object, avoiding another allocation.
>> 
>> This change makes it possible to search in the tables using the different string types, avoiding that initial allocation. This is done by wrapping the string and tagging it with a type, with helper functions directing to the correct hashing/lookup/equal functions. When inserting into the table, we can now reuse the original object or go directly from the input type to an object. To do this, functionality had to be added to hash utf8-strings and to compare oop-strings with utf8. These convert utf8 into unicode character by character and operate on those, thus avoiding extra allocations.
>> 
>> Some quick rudimentary JMH benchmarks show a ~20% increase in throughput when interning the same string repeatedly, and a ~5% increase in throughput interning only unique strings. (Only tested on my local mac aarch debug build)
>> 
>> 2 new tests have also been added. The first test tests that hash codes and string equality remain consistent when converting between different string types. The second test tests that string interning works as expected when equal strings are interned from different string types.
>> Also tested and passes tiers 1-3.
>
> src/hotspot/share/classfile/stringTable.hpp line 88:
> 
>> 86:   static const jchar *to_unicode(StringWrapper wrapped_str, int len, TRAPS);
>> 87:   static Handle to_handle(StringWrapper wrapped_str, int len, TRAPS);
>> 88:   static void print_string(StringWrapper wrapped_str, int len, TRAPS);
> 
> What is `len` supposed to represent in all of these methods? The code only makes sense to me if `len` here is actually "number of unicode characters" (which need not be the same as the length of any wrapped UTF8 sequence).

Actually that in itself is not enough to make the code correct AFAICS. Consider this example from Table 2-11, page 71, in https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf: the Unicode character 00010384 in UTF-32 consists of the surrogate pair D800, DF84 in UTF-16, and the four-byte sequence F0, 90, 8E, 84 in UTF-8. So if we have a java.lang.String that represents this single Unicode character, the String will consist of an array of 2 char values, and the UTF-8 representation would consist of 4 byte values. So if you were doing an equals comparison between the String's value array and the utf_str, what length would you pass to the equals method?
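
To make the mismatch concrete, here is a minimal standalone sketch, not the code in the PR. All function names are hypothetical, and it assumes well-formed standard UTF-8 input as in the Unicode table example above, with no validation. It decodes the UTF-8 bytes one code point at a time and compares them against UTF-16 code units, which seems to require both lengths rather than a single `len`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a HotSpot API). Decodes one code point from
// well-formed standard UTF-8 starting at bytes[i] and advances i past it.
// Assumption: the input is valid; no error checking is done.
static uint32_t next_code_point(const uint8_t* bytes, size_t& i) {
  uint8_t b0 = bytes[i++];
  if (b0 < 0x80) return b0;                       // 1-byte sequence
  if ((b0 & 0xE0) == 0xC0) {                      // 2-byte sequence
    uint32_t cp = (b0 & 0x1F) << 6;
    cp |= (bytes[i++] & 0x3F);
    return cp;
  }
  if ((b0 & 0xF0) == 0xE0) {                      // 3-byte sequence
    uint32_t cp = (b0 & 0x0F) << 12;
    cp |= (bytes[i++] & 0x3F) << 6;
    cp |= (bytes[i++] & 0x3F);
    return cp;
  }
  uint32_t cp = (b0 & 0x07) << 18;                // 4-byte sequence
  cp |= (bytes[i++] & 0x3F) << 12;
  cp |= (bytes[i++] & 0x3F) << 6;
  cp |= (bytes[i++] & 0x3F);
  return cp;
}

// Hypothetical: number of UTF-16 code units the UTF-8 bytes decode to.
static size_t utf16_length(const uint8_t* bytes, size_t byte_len) {
  size_t units = 0;
  for (size_t i = 0; i < byte_len; ) {
    // Code points above the BMP need a surrogate pair (2 units).
    units += (next_code_point(bytes, i) >= 0x10000) ? 2 : 1;
  }
  return units;
}

// Hypothetical: allocation-free comparison of UTF-16 units with UTF-8 bytes.
// Note that it takes two lengths: char_len in UTF-16 units, byte_len in bytes.
static bool utf16_equals_utf8(const uint16_t* chars, size_t char_len,
                              const uint8_t* bytes, size_t byte_len) {
  size_t c = 0, b = 0;
  while (c < char_len && b < byte_len) {
    uint32_t cp = next_code_point(bytes, b);
    if (cp < 0x10000) {
      if (chars[c++] != cp) return false;
    } else {
      if (c + 1 >= char_len) return false;        // no room for a pair
      cp -= 0x10000;
      if (chars[c++] != 0xD800 + (cp >> 10)) return false;    // high surrogate
      if (chars[c++] != 0xDC00 + (cp & 0x3FF)) return false;  // low surrogate
    }
  }
  return c == char_len && b == byte_len;          // both fully consumed
}
```

For U+10384, calling `utf16_length` on the bytes F0 90 8E 84 with a byte length of 4 yields 2, and `utf16_equals_utf8` needs 2 for the char array and 4 for the byte array. Neither value alone, nor the code point count of 1, describes both inputs.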

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21325#discussion_r1802150535