RFR: 8352075: Perf regression accessing fields [v28]

Tue Jun 10 07:58:38 UTC 2025

On Mon, 9 Jun 2025 21:42:03 GMT, Ioi Lam <iklam at openjdk.org> wrote:

>> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Revert removing FieldInfoReader::next_uint()
>
> src/hotspot/share/utilities/packedTable.cpp line 83:
> 
>> 81:   assert((element & ~((uint64_t) _key_mask | ((uint64_t) _value_mask << _value_shift))) == 0, "read too much");
>> 82:   return element;
>> 83: }
> 
> Since `_element_bytes` can be smaller than 8, memcpy will not work on big endian.
> 
> Why are you trying to optimize this? This PR already cuts down the iterations from O(n) to O(log(n)). You are already doing a lot at each iteration -- decoding the name_index and signature_index from the unsigned5 stream, looking up the symbols from the constant pool, etc.  So a few bit operations in read_element isn't going to make any substantial difference:
> 
> 
> uint64_t element = 0;
> for (int i = 0; i < _elements_bytes; i++) {
>    element <<= 8;
>    element |= _table[offset++]; // Need to rewrite fill() accordingly.
> }

The idea comes from this comment: https://github.com/openjdk/jdk/pull/24847#discussion_r2106110163

> you can load 1..8 bytes in a single (misaligned) memory operation, loading garbage into unused bytes, and then using shift or mask to clear the garbage. That may be faster than asking C++ to do a bunch of branchy logic and byte assembly on every access.

I like that idea, though it appears that the plethora of platforms that JDK supports (but one cannot simply test) makes it difficult to express. Let's rely on the compiler to get the idea from for cycle and do the right thing on each platform, then...

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24847#discussion_r2137178186