RFR: 8352075: Perf regression accessing fields [v5]

Wed May 21 14:33:56 UTC 2025

On Tue, 20 May 2025 23:01:08 GMT, John R Rose <jrose at openjdk.org> wrote:

>> This change adds smaller "fast-forward" table to accelerate random-access queries within a larger stream.
>> 
>> I am not against this change, but I think it should be refactored (either now or later) into a library that can be used with other streamy data.  I'm thinking of debug-info, dependency lists, line number tables, reloc-info, or maybe PC descs as future clients of such a thing.
>> 
>> It could be a query accelerator (index), for any kind of "streamy data" that can get long, and needs occasional random access.
>> 
>> It is good to switch to binary search for big tables, while preserving the speed of linear search for small ones.  A similar technique can also be seen in `PcDescContainer::find_pc_desc_internal`.  Both binary-search algorithms are complicated and difficult to validate.   If we had a suitable library, we could tune it and test it carefully, and use it to simplify field streams, PC descs, and whatever else we need to work with.
>> 
>> BTW, it's sometimes useful to add a front-side cache for such searches.  This is a bookmark of the last query point.  This helps with queries which are correlated; you don't re-search from scratch on every step.  This is the function of `last_pc_desc` in the PC desc query; it can accelerate the binary search.
>> 
>> The fast-forward table here uses an ad hoc var-int scheme which is delicate and possibly buggy.
>> 
>> One way to do without the specialized table is to inject synchronization markers into the streamy data itself.  Note that the null byte is never used to encode UNSIGNED5 data.  Therefore, if you have a stream of UNSIGNED5 tokens, you can add boundaries encoded null bytes.   In the present application, you could perform binary search on the raw stream pointer, with each probe first re-synchronizing to the first null byte, and then decoding with a short sprint of linear access.  The placement of null bytes is a tuning tactic:  More null bytes add footprint but decrease the length of the short sprints.
>> 
>> I'm not asking for any changes at this point because I have not read the PR carefully enough.  And I don't intend to hold up reviews, but I do want us to put refactoring on our roadmap.
>
>> @rose00 While refactoring might be useful, this PR is trying to address a performance regression (in fact by improving the scenario significantly), and I'd prefer that to land in 25
> 
> It will land when it's ready.  It's not quite ready yet.  I'm not asking for reusability, but I am saying that doing a one-off index hack here is exchanging one kind of technical debt (maintenance complexity) for another (performance pothole).  If we hustle to make 25, we need a plan to fix it after 25.
> 
>> rather than to delay just for the sake of better 'reusability'.
> 
> Yes, 'reusability'.  That's how big systems are maintained.
> 
>> And TBH I don't consider binary search algorithm too difficult to validate.
> 
> TBH I do.  I've seen enough OBOEs for one lifetime.  I don't consider even that simple an algorithm to be fully tested unless it is either unit tested (gtest) or stress tested (threshold set small), and preferably both.  Testing it for just a few classes that have many fields is not full coverage.  (Normally for this kind of thing, your hard-coded constant 16 would be made a debug-only globals, and we'd stress-test with it set to 1 or 2.)   Writing a unit test requires enough factorization to write the unit test.  Which also tends towards virtuous reusability.
> 
>> While a simple cache might be useful, I don't have a good example to validate its usefulness. And turning a read-only scenario into mutating scenario can turn out badly for multithreaded code, as in https://bugs.openjdk.org/browse/JDK-8180450 .
> 
> No worries about this one; I'm not asking for such a thing, just noting in passing that it sometimes helps.  Probably not in this case, since you didn't resort the original array, and the usual sequential encounter order is direct, not requiring a cache.  If we were re-engineering PC desc stuff on top of CompressedStream, we'd want such a cache.  Maybe some day.  Some one-element caches have the problem you point out, but if you look at the one-element cache for PC descs you'll see one with no such reported problems.  (IIRC I wrote both.)
> 
>> While you suggest 0 byte as synchronization markers, I've tried to synchronize the stream in #24713 and experiments have shown that while it improves things, it was not sufficient in my case. That's why I've abandoned that approach.
> 
> I looked at that code; thanks.  That wasn't what I meant.  The point of a null resynch byte is to allow *approximate* entry into the main stream.  It is an optional technique that would work like this:
> 
>  ...

@rose00 I've attempted to extract the binary search in https://github.com/openjdk/jdk/commit/62a40f6b8aafb4a986add0bdac0425dfaf197d62 , but I am not convinced this is better.
The `SearchTable::lookup` is static now - I am not sure if I could have `class SearchTable: private Array<u1> { ... }` and have that allocated in metaspace...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2898186528