RFR: 8352075: Perf regression accessing fields [v8]

Tue May 27 12:14:56 UTC 2025

On Sun, 25 May 2025 07:18:59 GMT, John R Rose <jrose at openjdk.org> wrote:

>> src/hotspot/share/oops/fieldInfo.cpp line 252:
>> 
>>> 250: int FieldInfoReader::search_table_lookup(const Array<u1> *search_table, const Symbol *name, const Symbol *signature, ConstantPool *cp, int java_fields) {
>>> 251:   UNSIGNED5::Reader<const u1*, int> r2(_r.array());
>>> 252:   int low = 0, high = java_fields - 1;
>> 
>> This is probably correct, but I recommend a few changes:
>> A. add an assert `low <= high` at the beginning (defend against surprise `java_fields` number).
>> B. repeat the assert after each new calculation of high or low.
>> C. declare high and low and mid as `uint32_t` to prove overflow (UB) impossible without the need to reason about integral range
>
> I looked over the code base for other occurrences of binary search, and I see your code is not an outlier.
> 
> The tendency is to use int for high/mid/low, and there are cases where UB would be possible in the absence of extra knowledge about the possible ranges of the ints.  `InstanceKlass::quick_search` is especially scary, until one realizes that method lists are probably on the order of 2^16.  `vmSymbols::find_sid` is the same story; the number of symbols is much less than 2^30..
> 
> `GrowableArray::find_sorted` also uses int indexes, but uses a `uint` cast at just the right place when computing mid, to avoid UB, so can handle arrays sized near 2^31.
> 
> With those precedents, I'm not going to pick on this code here.  The recommended changes A/B/C are optional.
> 
> But I am going to think about what it would take to reduce the number of occurences of binary switch, and build a template that can re-implement at least some of them (probably including yours) with bullet-proof, unit-tested code.
> 
> Binary search over 3/4/5-byte packed records is easy, which is what you are doing here.  The use of shortened integers is clever.  I think that can be made a template also, and then used for more than one index, as I've observed before.
> 
> One good way to think about these variable-sized packed records is to treat them as physical `uint64_t` values, with high order zero bytes removed.  If you have a word or so of slop area at the bottom of the packed array, you can load 1..8 bytes in a single (misaligned) memory operation, loading garbage into unused bytes, and then using shift or mask to clear the garbage.  That may be faster than asking C++ to do a bunch of branchy logic and byte assembly on every access.  Loading an N-byte word is simply `w=*(uint64_t*)(p+N-8)>>(N*8)`.  Your packed structs have two fields x,y, but they can be reimagined as bitfields `x=w&((1<<xlen)-1)`, `y=w>>xlen`.  This would save half a byte per entry in your tables, on average, since you can often share a byte between the two fields.  And the decode logic would be simpler than the closure-based stuff you have now.
> 
> These thoughts are just another observation about the design space you are working in, not a request for changes, although I do think they would simplify your code (and make it more efficient).  This sort of careful design makes the most sense when packaged in a separate class or template,  that can be carefully reviewed and unit-tested in isolation.  If we were to refactor some old searchable...

I like the idea of mapping each element in the table as raw bits, though handling of access to the end of the array would be a bit inconvenient (or we would have to allocate a few extra bytes).

I've changed the algorithm to use unsigned integers; in fact I find a bit annoying that most of the indices used throughout the related code are signed.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24847#discussion_r2109021670