RFR: 8372285: G1: Micro-optimize x86 barrier code [v4]
Vladimir Kozlov
kvn at openjdk.org
Fri Nov 21 18:33:00 UTC 2025
On Fri, 21 Nov 2025 16:09:17 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> We know from [JDK-8372284](https://bugs.openjdk.org/browse/JDK-8372284) that G1 C2 stubs can take ~10% of total instructions. So minor optimizations in hand-written assembly pay off for code density. This PR does a little x86-specific polishing: `testptr` where possible, short forward branches where possible. I rewired some code to make it abundantly clear the branches in question are short. It also makes clear that lots of the affected methods are essentially fall-through.
>>
>> The patch is deliberately on the simpler side, so we can backport it to 25u if the need arises.
>>
>> Additional testing:
>> - [x] Linux x86_64 server fastdebug, `tier1`
>> - [ ] Linux x86_64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
>
> - Adjust label name
> - Merge branch 'master' into JDK-8372285-g1-barrier-micro
> - Make some backward branches explicitly short
> - Comment
> - Shorten a few more branches
> - Also reflow generate_pre_barrier_slow_path, so it is obvious the branches are short
> - More touchups
> - Also optimize queue insertion
> - Touchups
> - WIP
Comments.
src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 92:
> 90: void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,
> 91: Register addr, Register count, Register tmp) {
> 92: Label done;
Since you are touching this code, can you add the `L_` prefix to the labels here?
This is our usual practice, so that labels are clearly visible.
src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 193:
> 191: // Is the previous value null?
> 192: __ testptr(pre_val, pre_val);
> 193: __ jccb(Assembler::equal, L_null);
I know that this short jump will be fused with `testptr` into one instruction on modern x86. But you will have a jump-to-jump sequence. So you may win size-wise, but "throughput" could be worse, especially if it is the "fast" path.
Can you check the performance of these changes vs. using `jcc(Assembler::equal, L_done);` here?
src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp line 282:
> 280: Register thread = r15_thread;
> 281:
> 282: Label done;
Please use `L_done`.
-------------
PR Review: https://git.openjdk.org/jdk/pull/28446#pullrequestreview-3493923593
PR Review Comment: https://git.openjdk.org/jdk/pull/28446#discussion_r2550629465
PR Review Comment: https://git.openjdk.org/jdk/pull/28446#discussion_r2550627841
PR Review Comment: https://git.openjdk.org/jdk/pull/28446#discussion_r2550630291