<div dir="ltr">Hi Vladimir,<div>Here is the updated patch. We have refactored the code as per your suggestions.</div><div><a href="https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1">https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1</a><br></div><div><br></div><div>Let us know if you have any more suggestions.</div><div><br></div><div>Thanks,</div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">- Ashutosh Mehra</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 15, 2024 at 11:02 AM Vladimir Kozlov <<a href="mailto:vladimir.kozlov@oracle.com" target="_blank">vladimir.kozlov@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > Ashu is proposing adding support for these latter instructions to C1 so<br>
> the barrier code does not have to pin the shift count to ECX. Do we<br>
> care? Or can we live with the effect of pinning?<br>
<br>
We only care about correctness currently. Performance (registers <br>
spilling) is not issue. And I am concern about size and complexity of <br>
Leyden specific changes which eventually will be pushed into mainline.<br>
Unless you see benefits for new C1 shift instructions in mainline I <br>
would go with using ECX. And again, do this if code for using ECX is <br>
simpler than adding new instructions. If it is not - use new instructions.<br>
<br>
I was hopping that "Late Barrier Expansion for G1" JEP may help here too <br>
but it has only C2 changes :(<br>
<br>
I agree with updating this patch and may be even pushing it into premain <br>
(for testing) regardless "Late Barrier Expansion for G1" JEP status.<br>
<br>
Thanks,<br>
Vladimir K<br>
<br>
<br>
On 8/15/24 3:18 AM, Andrew Dinn wrote:<br>
> On 15/08/2024 00:50, Vladimir Kozlov wrote:<br>
>> Do we really need AOTRuntimeConstants instance be a stub in CodeCache?<br>
>> Looking on this code and I think this complicate implementation if use <br>
>> ExternalAddress relocation anyway. Can it be static instance in <br>
>> SCCache.cpp which we initialize the same way as in current changes?<br>
> <br>
> Yes, you are right: we don't need the AOTRuntimeConstants struct to be <br>
> embedded in the cache if we are using an ExternalRelocation. The initial <br>
> idea for doing that was because it provides a 32-bit bound on the offset <br>
> from any struct field to a corresponding field load (lea). That would <br>
> allow us to use PC-relative loads of the field addresses -- which saves <br>
> instruction footprint on AArch64, possibly also on x86? However, we can <br>
> only do that if we also implement a new relocation (which seems <br>
> unnecessary).<br>
> <br>
> So, we will redefine the AOTRuntimeConstants struct as a C++ static field.<br>
> <br>
>> I do not understand why you have concern about having separate <br>
>> relocation for each constant in AOTRuntimeConstants. We have only 2 <br>
>> now. Even they grow I don't think we will have a lot of them. Actually <br>
>> the question: will it simplify code if each constant's address is <br>
>> treated as separate and not offset in AOTRuntimeConstants? Instead of <br>
>> loading AOTRuntimeConstants address and then adding offset we just <br>
>> load address of constant.<br>
> <br>
> Yes, you are right again: if we register the address of every field in <br>
> the struct in the SCC address table then that will simplify the back end <br>
> code. I was not sure how much we wanted to rely on address table <br>
> registration but given that we *are* reliant on it just now then we <br>
> might as well have the benefit.<br>
> <br>
> So, we will do the extra registration and limit the back end changes to <br>
> detect ConP occurrences in the AOTRuntimeConstants range and load them <br>
> using an ExternalAddress reloc.<br>
> <br>
>> I don't understand question about C1 (I am not familiar with it). What <br>
>> it generates for shifts now? Why you need 3-operand shift?<br>
> <br>
> I'm not sure '3-operand' is the right term to use to describe what is at <br>
> stake here. It's really about the kind of operand rather than their <br>
> number. The current region/card shift operations employ an immediate <br>
> shift count. The AOT region/card shift operations need to be fed the <br>
> shift count via a register (the target of the constant load).<br>
> <br>
> C1 does currently generate shifts where the shift count is in a register <br>
> but it only does so using the flavours of shift (sar/sal/shr/shl) that <br>
> require the register to be ECX. It doesn't currently generate the <br>
> alternative instructions (sarx/shlx/shrx) that accept a shift count in <br>
> any register. I'm not 100% sure why but I think this is because not all <br>
> CPUs implement them (they require CPUID feature BM2) or it may be to do <br>
> with them setting/not setting flags.<br>
> <br>
> Ashu is proposing adding support for these latter instructions to C1 so <br>
> the barrier code does not have to pin the shift count to ECX. Do we <br>
> care? Or can we live with the effect of pinning?<br>
> <br>
>> Code style: I noticed and <a href="http://aarch64.ad" rel="noreferrer" target="_blank">aarch64.ad</a> still using old stile of <br>
>> instructions definition using `enc_class`. We don't do this any more <br>
>> (at least in x86*.ad) when we adding new instructions with one <br>
>> implementation. Please use `ins_encode %{ %}` in such cases.<br>
> <br>
> Sure, we will fix that.<br>
> <br>
>> I think, if you need additional register you should push/pop to <br>
>> guarantee we don't stomp over some used value.<br>
> Ok, we will play safe and do that.<br>
> <br>
> One final comment: Much of this discussion is going to be side-stepped <br>
> by the G1 late barrier implementation that is currently working its way <br>
> into mainline. That will remove the need to inject barrier IR subgraphs <br>
> into the C1 and C2 IR graphs and eliminate most (all?) of the associated <br>
> C2 reduction rules from the ad files. I think we should still update the <br>
> patch to accommodate the feedback above. We can simplify it further once <br>
> the G1 changes come in from mainline and pursue in parallel the idea of <br>
> accommodating changes to oop encodings using this model.<br>
> <br>
> regards,<br>
> <br>
> <br>
> Andrew Dinn<br>
> -----------<br>
> <br>
<br>
</blockquote></div>