RFR: 8352075: Perf regression accessing fields [v15]

Fri May 30 21:17:56 UTC 2025

On Fri, 30 May 2025 14:34:45 GMT, Radim Vansa <rvansa at openjdk.org> wrote:

>> This optimization is a followup to https://github.com/openjdk/jdk/pull/24290 trying to reduce the performance regression in some scenarios introduced in https://bugs.openjdk.org/browse/JDK-8292818 . Based both on performance and memory consumption it is a (better) alternative to https://github.com/openjdk/jdk/pull/24713 .
>> 
>> This PR optimizes local field lookup in classes with more than 16 fields; rather than sequentially iterating through all fields during lookup we sort the fields based on the field name. The stream includes extra table after the field information: for field at position 16, 32 ... we record the (variable-length-encoded) offset of the field info in this stream. On field lookup, rather than iterating through all fields, we iterate through this table, resolve names for given fields and continue field-by-field iteration only after the last record (hence at most 16 fields).
>> 
>> In classes with <= 16 fields this PR reduces the memory consumption by 1 byte that was left with value 0 at the end of stream. In classes with > 16 fields we add extra 4 bytes with offset of the table, and the table contains one varint for each 16 fields. The terminal byte is not used either.
>> 
>> My measurements on the attached reproducer
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk-17/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk-17/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      51.3 ms ±   2.8 ms    [User: 44.7 ms, System: 13.7 ms]
>>   Range (min … max):    45.1 ms …  53.9 ms    100 runs
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-master/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-master/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      78.2 ms ±   1.0 ms    [User: 74.6 ms, System: 17.3 ms]
>>   Range (min … max):    73.8 ms …  79.7 ms    100 runs
>> 
>> (the jdk25-master above already contains JDK-8353175)
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-this-pr/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-this-pr/jdk/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      38.5 ms ±   0.5 ms    [User: 34.4 ms, System: 17.3 ms]
>>   Range (min … max):    37.7 ms …  42.1 ms    100 runs
>> 
>> While https://github.com/openjdk/jdk/pull/24713 returned the performance to previous levels, this PR improves it by 25% compared to JDK 17 (which does not contain the regression)! This time, the undisclosed production-grade reproducer shows even higher improvement:
>> 
>> JDK 17: 1.6 s
>> JDK 21 (no patches): 22 s
>> JDK25-master: 12.3 s
>> JDK25-this-pr: 0.5 s
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   More debug logs

> I like the idea of mapping each element in the table as raw bits, though handling of access to the end of the array would be a bit inconvenient (or we would have to allocate a few extra bytes).

The code snippet I shared above shows a better way:  You load a full 8 (or 4) bytes where the END (not the START) of the word lines up with the LAST (not FIRST) byte.  Then you will never run past the end of the array!  So, fine, but what about the start of the array?  Well, it's inside an `Array<u1>` object, which has a length header, which is guaranteed to be safe to load (under a cast or bytewise or whatever).  Problem solved.  The only thing to avoid is to load an 8-byte word when the packed word size is 1..5 bytes; then you load a 4-byte word.  You can load both components at once, and then use a configurable shift (from one machine word) to separate them.  This is why I say it saves a half-byte on average.

These tweaky ideas have three effects:  They probably make the code a little simpler (or at least no worse), they reduce the number of memory operations to query a packed array, and they probably use fewer ALU instructions overall.  They are certainly worth considering for the general-purpose "searchable packed array" I am envisioning; they are optional for this particular bug, viewed in isolation.

> I've changed the algorithm to use unsigned integers; in fact I find a bit annoying that most of the indices used throughout the related code are signed.

Yes, it annoys me also.  It's playing with fire (or walking the firepit).

> I've also added a test generating class with a different number of fields, though running it through the full range of fields (0-65535, though in practice the upper bound is rather 26k) would be excessive; even now it takes more than a minute on my machine. Also, I realize that varying the number of fields does not result in full coverage of possible stream sizes; per-field records have probably rather uniform lengths.

Yeah, a gtest on the binary search would cover most of those issues, faster and cleaner.  Then loading many gigantic classfiles will be unnecessary.  Just a few classfiles at several scales, probably, and thorough gtest-level unit testing, gets a better result in less time.  As I said above, I'm willing to put off some of the refactoring, given that it should cover other, prior occurrences of binary search (so it's got a larger scope than this bug).

> @rose00 OK, so I have refactored out the PackedTable that now hosts the logic for packing two uint32_t values with arbitrary number of bits into a record, and binary search using custom comparator. Haven't added any gtests to test the functionality independently, though.

Thank you.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2923513912