RFR: 8352075: Perf regression accessing fields [v28]

Johan Sjölen jsjolen at openjdk.org
Tue Jun 10 10:17:40 UTC 2025


On Mon, 9 Jun 2025 15:34:44 GMT, Radim Vansa <rvansa at openjdk.org> wrote:

>> This optimization is a followup to https://github.com/openjdk/jdk/pull/24290 trying to reduce the performance regression in some scenarios introduced in https://bugs.openjdk.org/browse/JDK-8292818 . Based both on performance and memory consumption it is a (better) alternative to https://github.com/openjdk/jdk/pull/24713 .
>> 
>> This PR optimizes local field lookup in classes with more than 16 fields; rather than sequentially iterating through all fields during lookup we sort the fields based on the field name. The stream includes extra table after the field information: for field at position 16, 32 ... we record the (variable-length-encoded) offset of the field info in this stream. On field lookup, rather than iterating through all fields, we iterate through this table, resolve names for given fields and continue field-by-field iteration only after the last record (hence at most 16 fields).
>> 
>> In classes with <= 16 fields this PR reduces the memory consumption by 1 byte that was left with value 0 at the end of stream. In classes with > 16 fields we add extra 4 bytes with offset of the table, and the table contains one varint for each 16 fields. The terminal byte is not used either.
>> 
>> My measurements on the attached reproducer
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk-17/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk-17/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      51.3 ms ±   2.8 ms    [User: 44.7 ms, System: 13.7 ms]
>>   Range (min … max):    45.1 ms …  53.9 ms    100 runs
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-master/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-master/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      78.2 ms ±   1.0 ms    [User: 74.6 ms, System: 17.3 ms]
>>   Range (min … max):    73.8 ms …  79.7 ms    100 runs
>> 
>> (the jdk25-master above already contains JDK-8353175)
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-this-pr/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-this-pr/jdk/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      38.5 ms ±   0.5 ms    [User: 34.4 ms, System: 17.3 ms]
>>   Range (min … max):    37.7 ms …  42.1 ms    100 runs
>> 
>> While https://github.com/openjdk/jdk/pull/24713 returned the performance to previous levels, this PR improves it by 25% compared to JDK 17 (which does not contain the regression)! This time, the undisclosed production-grade reproducer shows even higher improvement:
>> 
>> JDK 17: 1.6 s
>> JDK 21 (no patches): 22 s
>> JDK25-master: 12.3 s
>> JDK25-this-pr: 0.5 s
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Revert removing FieldInfoReader::next_uint()

src/hotspot/share/utilities/packedTable.hpp line 56:

> 54:     // Packed table does NOT support duplicate keys.
> 55:     virtual bool next(uint32_t* key, uint32_t* value) = 0;
> 56:   };

Does it make sense to take the cost of an indirect call for each kv pair? You can't inline it, so the stack frame needs to be popped and pushed, and you're taking 2 registers (16 bytes) to give 8 bytes and 1 bit of information.

We can amortize the cost by implementing this signature instead:


virtual uint32_t next(Pair<uint32_t, uint32_t>* kvs, uint32_t kvs_size);
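
A minimal sketch of how a batched signature like this might look in use, assuming the return value is the number of pairs written and that 0 signals exhaustion (the comment above does not spell this out). The names KV, BatchSupplier, VectorSupplier, drain and kBatch are hypothetical and not part of the PR; std::pair stands in for HotSpot's Pair so that the example compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for Pair<uint32_t, uint32_t>; assumed for this sketch only.
using KV = std::pair<uint32_t, uint32_t>;

// Hypothetical batched supplier: one virtual call fills up to kvs_size
// pairs and returns how many were written (assumed: 0 means exhausted).
class BatchSupplier {
public:
  virtual ~BatchSupplier() = default;
  virtual uint32_t next(KV* kvs, uint32_t kvs_size) = 0;
};

// Illustrative supplier backed by a vector of key/value pairs.
class VectorSupplier : public BatchSupplier {
  const std::vector<KV>& _data;
  size_t _pos = 0;
public:
  explicit VectorSupplier(const std::vector<KV>& data) : _data(data) {}
  uint32_t next(KV* kvs, uint32_t kvs_size) override {
    uint32_t n = 0;
    while (n < kvs_size && _pos < _data.size()) {
      kvs[n++] = _data[_pos++];
    }
    return n;
  }
};

struct DrainResult {
  uint64_t value_sum;  // sum of all values seen
  uint32_t pairs;      // number of pairs consumed
  uint32_t calls;      // number of virtual next() calls made
};

// Consumer: drains the supplier in batches of kBatch pairs, so the
// indirect call is paid once per batch rather than once per pair.
inline DrainResult drain(BatchSupplier& supplier) {
  constexpr uint32_t kBatch = 16;
  KV buf[kBatch];
  DrainResult r{0, 0, 0};
  for (;;) {
    uint32_t n = supplier.next(buf, kBatch);
    r.calls++;
    if (n == 0) break;
    for (uint32_t i = 0; i < n; i++) {
      r.value_sum += buf[i].second;
    }
    r.pairs += n;
  }
  return r;
}
```

With a batch size of 16, draining 40 pairs takes 4 virtual calls (16, 16, 8, and a final empty one) instead of 41 with the one-pair-per-call interface. The batch size here is arbitrary; a caller-provided buffer also avoids the separate bool return.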

src/hotspot/share/utilities/packedTable.hpp line 69:

> 67:   // by the supplier (when Supplier::next() returns false the whole array should
> 68:   // be filled).
> 69:   void fill(u1* table, size_t table_length, Supplier &supplier) const;

Let the ampersand hug the type: Supplier& supplier rather than Supplier &supplier.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24847#discussion_r2137289775
PR Review Comment: https://git.openjdk.org/jdk/pull/24847#discussion_r2137269715

