RFR: 8352075: Perf regression accessing fields [v5]

Fri May 16 23:53:53 UTC 2025

On Wed, 14 May 2025 12:46:41 GMT, Radim Vansa <rvansa at openjdk.org> wrote:

>> This optimization is a followup to https://github.com/openjdk/jdk/pull/24290 trying to reduce the performance regression in some scenarios introduced in https://bugs.openjdk.org/browse/JDK-8292818 . Based both on performance and memory consumption it is a (better) alternative to https://github.com/openjdk/jdk/pull/24713 .
>> 
>> This PR optimizes local field lookup in classes with more than 16 fields; rather than sequentially iterating through all fields during lookup we sort the fields based on the field name. The stream includes extra table after the field information: for field at position 16, 32 ... we record the (variable-length-encoded) offset of the field info in this stream. On field lookup, rather than iterating through all fields, we iterate through this table, resolve names for given fields and continue field-by-field iteration only after the last record (hence at most 16 fields).
>> 
>> In classes with <= 16 fields this PR reduces the memory consumption by 1 byte that was left with value 0 at the end of stream. In classes with > 16 fields we add extra 4 bytes with offset of the table, and the table contains one varint for each 16 fields. The terminal byte is not used either.
>> 
>> My measurements on the attached reproducer
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk-17/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk-17/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      51.3 ms ±   2.8 ms    [User: 44.7 ms, System: 13.7 ms]
>>   Range (min … max):    45.1 ms …  53.9 ms    100 runs
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-master/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-master/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      78.2 ms ±   1.0 ms    [User: 74.6 ms, System: 17.3 ms]
>>   Range (min … max):    73.8 ms …  79.7 ms    100 runs
>> 
>> (the jdk25-master above already contains JDK-8353175)
>> 
>> hyperfine -w 50 -r 100 '/path/to/jdk25-this-pr/bin/java -cp /tmp CCC'
>> Benchmark 1: /path/to/jdk25-this-pr/jdk/bin/java -cp /tmp CCC
>>   Time (mean ± σ):      38.5 ms ±   0.5 ms    [User: 34.4 ms, System: 17.3 ms]
>>   Range (min … max):    37.7 ms …  42.1 ms    100 runs
>> 
>> While https://github.com/openjdk/jdk/pull/24713 returned the performance to previous levels, this PR improves it by 25% compared to JDK 17 (which does not contain the regression)! This time, the undisclosed production-grade reproducer shows even higher improvement:
>> 
>> JDK 17: 1.6 s
>> JDK 21 (no patches): 22 s
>> JDK25-master: 12.3 s
>> JDK25-this-pr: 0.5 s
>
> Radim Vansa has updated the pull request incrementally with five additional commits since the last revision:
> 
>  - Revert change in array.hpp
>  - Revert changes in VerifyRawIndexesTest
>  - Improve FieldInfo::print
>  - Load constant for SORTED_FIELD_TABLE_THRESHOLD
>  - Replace JumpTable with SortedFieldTable

This change adds smaller "fast-forward" table to accelerate random-access queries within a larger stream.

I am not against this change, but I think it should be refactored (either now or later) into a library that can be used with other streamy data.  I'm thinking of debug-info, dependency lists, line number tables, reloc-info, or maybe PC descs as future clients of such a thing.

It could be a query accelerator (index), for any kind of "streamy data" that can get long, and needs occasional random access.

It is good to switch to binary search for big tables, while preserving the speed of linear search for small ones.  A similar technique can also be seen in `PcDescContainer::find_pc_desc_internal`.  Both binary-search algorithms are complicated and difficult to validate.   If we had a suitable library, we could tune it and test it carefully, and use it to simplify field streams, PC descs, and whatever else we need to work with.

BTW, it's sometimes useful to add a front-side cache for such searches.  This is a bookmark of the last query point.  This helps with queries which are correlated; you don't re-search from scratch on every step.  This is the function of `last_pc_desc` in the PC desc query; it can accelerate the binary search.

The fast-forward table here uses an ad hoc var-int scheme which is delicate and possibly buggy.

One way to do without the specialized table is to inject synchronization markers into the streamy data itself.  Note that the null byte is never used to encode UNSIGNED5 data.  Therefore, if you have a stream of UNSIGNED5 tokens, you can add boundaries encoded null bytes.   In the present application, you could perform binary search on the raw stream pointer, with each probe first re-synchronizing to the first null byte, and then decoding with a short sprint of linear access.  The placement of null bytes is a tuning tactic:  More null bytes add footprint but decrease the length of the short sprints.

I'm not asking for any changes at this point because I have not read the PR carefully enough.  And I don't intend to hold up reviews, but I do want us to put refactoring on our roadmap.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2887857145