[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Thu Aug 15 15:02:34 UTC 2024

 > Ashu is proposing adding support for these latter instructions to C1 so
 > the barrier code does not have to pin the shift count to ECX. Do we
 > care? Or can we live with the effect of pinning?

We only care about correctness currently. Performance (registers 
spilling) is not issue. And I am concern about size and complexity of 
Leyden specific changes which eventually will be pushed into mainline.
Unless you see benefits for new C1 shift instructions in mainline I 
would go with using ECX. And again, do this if code for using ECX is 
simpler than adding new instructions. If it is not - use new instructions.

I was hopping that "Late Barrier Expansion for G1" JEP may help here too 
but it has only C2 changes :(

I agree with updating this patch and may be even pushing it into premain 
(for testing) regardless "Late Barrier Expansion for G1" JEP status.

Thanks,
Vladimir K

On 8/15/24 3:18 AM, Andrew Dinn wrote:
> On 15/08/2024 00:50, Vladimir Kozlov wrote:
>> Do we really need AOTRuntimeConstants instance be a stub in CodeCache?
>> Looking on this code and I think this complicate implementation if use 
>> ExternalAddress relocation anyway. Can it be static instance in 
>> SCCache.cpp which we initialize the same way as in current changes?
> 
> Yes, you are right: we don't need the AOTRuntimeConstants struct to be 
> embedded in the cache if we are using an ExternalRelocation. The initial 
> idea for doing that was because it provides a 32-bit bound on the offset 
> from any struct field to a corresponding field load (lea). That would 
> allow us to use PC-relative loads of the field addresses -- which saves 
> instruction footprint on AArch64, possibly also on x86? However, we can 
> only do that if we also implement a new relocation (which seems 
> unnecessary).
> 
> So, we will redefine the AOTRuntimeConstants struct as a C++ static field.
> 
>> I do not understand why you have concern about having separate 
>> relocation for each constant in AOTRuntimeConstants. We have only 2 
>> now. Even they grow I don't think we will have a lot of them. Actually 
>> the question: will it simplify code if each constant's address is 
>> treated as separate and not offset in AOTRuntimeConstants? Instead of 
>> loading AOTRuntimeConstants address and then adding offset we just 
>> load address of constant.
> 
> Yes, you are right again: if we register the address of every field in 
> the struct in the SCC address table then that will simplify the back end 
> code. I was not sure how much we wanted to rely on address table 
> registration but given that we *are* reliant on it just now then we 
> might as well have the benefit.
> 
> So, we will do the extra registration and limit the back end changes to 
> detect ConP occurrences in the AOTRuntimeConstants range and load them 
> using an ExternalAddress reloc.
> 
>> I don't understand question about C1 (I am not familiar with it). What 
>> it generates for shifts now? Why you need 3-operand shift?
> 
> I'm not sure '3-operand' is the right term to use to describe what is at 
> stake here. It's really about the kind of operand rather than their 
> number. The current region/card shift operations employ an immediate 
> shift count. The AOT region/card shift operations need to be fed the 
> shift count via a register (the target of the constant load).
> 
> C1 does currently generate shifts where the shift count is in a register 
> but it only does so using the flavours of shift (sar/sal/shr/shl) that 
> require the register to be ECX. It doesn't currently generate the 
> alternative instructions (sarx/shlx/shrx) that accept a shift count in 
> any register. I'm not 100% sure why but I think this is because not all 
> CPUs implement them (they require CPUID feature BM2) or it may be to do 
> with them setting/not setting flags.
> 
> Ashu is proposing adding support for these latter instructions to C1 so 
> the barrier code does not have to pin the shift count to ECX. Do we 
> care? Or can we live with the effect of pinning?
> 
>> Code style: I noticed and aarch64.ad still using old stile of 
>> instructions definition using `enc_class`. We don't do this any more 
>> (at least in x86*.ad) when we adding new instructions with one 
>> implementation. Please use `ins_encode %{ %}` in such cases.
> 
> Sure, we will fix that.
> 
>> I think, if you need additional register you should push/pop to 
>> guarantee we don't stomp over some used value.
> Ok, we will play safe and do that.
> 
> One final comment: Much of this discussion is going to be side-stepped 
> by the G1 late barrier implementation that is currently working its way 
> into mainline. That will remove the need to inject barrier IR subgraphs 
> into the C1 and C2 IR graphs and eliminate most (all?) of the associated 
> C2 reduction rules from the ad files. I think we should still update the 
> patch to accommodate the feedback above. We can simplify it further once 
> the G1 changes come in from mainline and pursue in parallel the idea of 
> accommodating changes to oop encodings using this model.
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
>