<div dir="ltr">Hi Vladimir,<div>Here is the updated patch. We have refactored the code as per your suggestions.</div><div><a href="https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1">https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1</a><br></div><div><br></div><div>Let us know if you have any more suggestions.</div><div><br></div><div>Thanks,</div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">- Ashutosh Mehra</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 15, 2024 at 11:02 AM Vladimir Kozlov <<a href="mailto:vladimir.kozlov@oracle.com" target="_blank">vladimir.kozlov@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> > Ashu is proposing adding support for these latter instructions to C1 so<br>

 > the barrier code does not have to pin the shift count to ECX. Do we<br>

 > care? Or can we live with the effect of pinning?<br>

<br>

We only care about correctness currently. Performance (registers <br>

spilling) is not issue. And I am concern about size and complexity of <br>

Leyden specific changes which eventually will be pushed into mainline.<br>

Unless you see benefits for new C1 shift instructions in mainline I <br>

would go with using ECX. And again, do this if code for using ECX is <br>

simpler than adding new instructions. If it is not - use new instructions.<br>

<br>

I was hopping that "Late Barrier Expansion for G1" JEP may help here too <br>

but it has only C2 changes :(<br>

<br>

I agree with updating this patch and may be even pushing it into premain <br>

(for testing) regardless "Late Barrier Expansion for G1" JEP status.<br>

<br>

Thanks,<br>

Vladimir K<br>

<br>

<br>

On 8/15/24 3:18 AM, Andrew Dinn wrote:<br>

> On 15/08/2024 00:50, Vladimir Kozlov wrote:<br>

>> Do we really need AOTRuntimeConstants instance be a stub in CodeCache?<br>

>> Looking on this code and I think this complicate implementation if use <br>

>> ExternalAddress relocation anyway. Can it be static instance in <br>

>> SCCache.cpp which we initialize the same way as in current changes?<br>

> <br>

> Yes, you are right: we don't need the AOTRuntimeConstants struct to be <br>

> embedded in the cache if we are using an ExternalRelocation. The initial <br>

> idea for doing that was because it provides a 32-bit bound on the offset <br>

> from any struct field to a corresponding field load (lea). That would <br>

> allow us to use PC-relative loads of the field addresses -- which saves <br>

> instruction footprint on AArch64, possibly also on x86? However, we can <br>

> only do that if we also implement a new relocation (which seems <br>

> unnecessary).<br>

> <br>

> So, we will redefine the AOTRuntimeConstants struct as a C++ static field.<br>

> <br>

>> I do not understand why you have concern about having separate <br>

>> relocation for each constant in AOTRuntimeConstants. We have only 2 <br>

>> now. Even they grow I don't think we will have a lot of them. Actually <br>

>> the question: will it simplify code if each constant's address is <br>

>> treated as separate and not offset in AOTRuntimeConstants? Instead of <br>

>> loading AOTRuntimeConstants address and then adding offset we just <br>

>> load address of constant.<br>

> <br>

> Yes, you are right again: if we register the address of every field in <br>

> the struct in the SCC address table then that will simplify the back end <br>

> code. I was not sure how much we wanted to rely on address table <br>

> registration but given that we *are* reliant on it just now then we <br>

> might as well have the benefit.<br>

> <br>

> So, we will do the extra registration and limit the back end changes to <br>

> detect ConP occurrences in the AOTRuntimeConstants range and load them <br>

> using an ExternalAddress reloc.<br>

> <br>

>> I don't understand question about C1 (I am not familiar with it). What <br>

>> it generates for shifts now? Why you need 3-operand shift?<br>

> <br>

> I'm not sure '3-operand' is the right term to use to describe what is at <br>

> stake here. It's really about the kind of operand rather than their <br>

> number. The current region/card shift operations employ an immediate <br>

> shift count. The AOT region/card shift operations need to be fed the <br>

> shift count via a register (the target of the constant load).<br>

> <br>

> C1 does currently generate shifts where the shift count is in a register <br>

> but it only does so using the flavours of shift (sar/sal/shr/shl) that <br>

> require the register to be ECX. It doesn't currently generate the <br>

> alternative instructions (sarx/shlx/shrx) that accept a shift count in <br>

> any register. I'm not 100% sure why but I think this is because not all <br>

> CPUs implement them (they require CPUID feature BM2) or it may be to do <br>

> with them setting/not setting flags.<br>

> <br>

> Ashu is proposing adding support for these latter instructions to C1 so <br>

> the barrier code does not have to pin the shift count to ECX. Do we <br>

> care? Or can we live with the effect of pinning?<br>

> <br>

>> Code style: I noticed and <a href="http://aarch64.ad" rel="noreferrer" target="_blank">aarch64.ad</a> still using old stile of <br>

>> instructions definition using `enc_class`. We don't do this any more <br>

>> (at least in x86*.ad) when we adding new instructions with one <br>

>> implementation. Please use `ins_encode %{ %}` in such cases.<br>

> <br>

> Sure, we will fix that.<br>

> <br>

>> I think, if you need additional register you should push/pop to <br>

>> guarantee we don't stomp over some used value.<br>

> Ok, we will play safe and do that.<br>

> <br>

> One final comment: Much of this discussion is going to be side-stepped <br>

> by the G1 late barrier implementation that is currently working its way <br>

> into mainline. That will remove the need to inject barrier IR subgraphs <br>

> into the C1 and C2 IR graphs and eliminate most (all?) of the associated <br>

> C2 reduction rules from the ad files. I think we should still update the <br>

> patch to accommodate the feedback above. We can simplify it further once <br>

> the G1 changes come in from mainline and pursue in parallel the idea of <br>

> accommodating changes to oop encodings using this model.<br>

> <br>

> regards,<br>

> <br>

> <br>

> Andrew Dinn<br>

> -----------<br>

> <br>

<br>

</blockquote></div>