RFR: 8338727: RISC-V: Avoid synthetic data dependency in nmethod barrier on Ztso

Sun Aug 25 12:23:07 UTC 2024

On Sat, 24 Aug 2024 14:50:17 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> Hi please consider,
>> 
>> On TSO we don't need the synthetic data dependency in between the loads.
>> Also added some comment about this.
>> 
>> Sanity tested
>
> src/hotspot/cpu/riscv/gc/shared/barrierSetAssembler_riscv.cpp line 281:
> 
>> 279:           // Embed an synthetic data dependency to order the guard load
>> 280:           // before the epoch load. (xor + add is standard way)
>> 281:           // Note: This may be slower than using a membar(load|load) (fence r,r).
> 
> But the RV ISA spec says that this is lightweight ordering mechanism compared with a FENCE R, R.
> Here is what I read from the spec:
> 
> Like other modern memory models, the RVWMO memory model uses syntactic rather than semantic dependencies.
> In other words, this definition depends on the identities of the registers being accessed by different instructions,
> not the actual contents of those registers. This means that an address, control, or data dependency must be enforced
> even if the calculation could seemingly be “optimized away”. This choice ensures that RVWMO remains compatible
> with code that uses these false syntactic dependencies as a lightweight ordering mechanism.
> 
>     ld a1,0(s0)
>     xor a2,a1,a1
>     add s1,s1,a2
>     ld a5,0(s1)
> 
> Figure A.10: A syntactic address dependency
> 
> For example, there is a syntactic address dependency from the memory operation generated by the
> first instruction to the memory operation generated by the last instruction in Figure A.10, even
> though a1 XOR a1 is zero and hence has no effect on the address accessed by the second load.
> The benefit of using dependencies as a lightweight synchronization mechanism is that the ordering
> enforcement requirement is limited only to the specific two instructions in question.
> Other non-dependent instructions may be freely reordered by aggressive implementations.
> One alternative would be to use a load-acquire, but this would enforce ordering for the first load
> with respect to all subsequent instructions. Another would be to use a FENCE R,R, but this would
> include all previous and all subsequent loads, making this option more expensive

Not sure what you mean, but there is no contradiction here.
The manual says:

load guard
<guard => epoch data dep>
load epoch
load thread_guard_epoch // unaffected by data dep.

In RVWMO the load of thread_guard_epoch is unaffected, yes, and it can be loaded earlier, yes.

But we branch on the value of epoch (plus guard), so delaying the load of epoch more than necessary means we delay the branch instruction. As that branch has a control dependency, it holds back all the following instructions:

Control dependencies behave differently from address and data dependencies in the sense that a
control dependency always extends to all instructions following the original target in program order.

Which means the main goal is to get through the branch as quickly as possible.

My comment says that delaying the load of epoch in favour of loading thread_guard_epoch earlier may be slower.
I have not looked too deeply, but it seems like we could also move the load of thread_guard_epoch before the data dep? (i.e. before any such fence r,r)

As the load of guard and the load of epoch cannot overlap, they happen sequentially (due to the data dep).
`fence r,r` only says the loads will appear in that order in the global memory order; it does not force them to execute sequentially.
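
To make the fence variant concrete, here is a rough C++ sketch, not the HotSpot code. `BarrierState`, `is_disarmed_fence` and the way epoch and guard are combined into one 64-bit value are assumptions for illustration only. As far as I know, compilers typically lower an acquire fence on RISC-V to `fence r,rw`, which is at least as strong as `fence r,r`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the three values discussed above.
struct BarrierState {
  std::atomic<uint32_t> guard{0};   // per-nmethod guard value
  std::atomic<uint32_t> epoch{0};   // global patching epoch
  uint64_t thread_guard_epoch{0};   // thread-local expected value
};

// Fence variant: the guard load is ordered before the epoch load by a
// fence rather than by a synthetic data dependency (xor + add).
inline bool is_disarmed_fence(const BarrierState& s) {
  uint32_t g = s.guard.load(std::memory_order_relaxed);
  // Orders the load above before the load below in the memory order;
  // it does not require the two loads to execute back to back.
  std::atomic_thread_fence(std::memory_order_acquire);
  uint32_t e = s.epoch.load(std::memory_order_relaxed);
  // Assumed layout: epoch in the high half, guard in the low half.
  uint64_t combined = (static_cast<uint64_t>(e) << 32) | g;
  // The branch taken on this result carries the control dependency.
  return combined == s.thread_guard_epoch;
}
```

In this shape the thread-local load has no ordering requirement against the other two, so it could be issued before the fence.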

As we are getting very close to the CPU implementation here, there may be differences.
So the point of the comment was: maybe revisit this in a few years.

Thanks! I'll wait until we can agree on whether this is a good comment or whether it should be worded differently.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20661#discussion_r1730325122