[aarch64-port-dev ] RFD: scratch registers cleanup [Long post]

Tue Nov 5 14:58:33 UTC 2019

On 11/5/19 12:24 PM, Andrew Dinn wrote:

> I think this is a very good start to remedying a nasty problem. However,
> I'm not yet convinced the model for how you manage and use scratch
> registers is the best way to do this. I'll explain why below and suggest
> a variation that might or might not work. If it doesn't then correct me
> and that will at least help understand why you ended up with the current
> design. If it does work or, at least, suggests how to move towards
> something better then let's try another round.
> 
> What troubles me most of all is that the mechanism you have provided can
> be quite opaque about ownership/liveness of registers. I think that
> happens because at root it does not require that declaration +
> allocation, release and subsequent re-allocation are co-located in the
> same program scope (or,at least, in the same method).
>
> Let's work through some examples to clarify that view. As it stands it
> is possible to allocate some scratch registers in a caller method and
> then release those registers in a called method (possibly more than one
> call down the stack) -- indeed, the latter operation occurs in this code
> from the patch:

Absolutely so, yes. Note that at this point I am not trying to reorganize
code but to make is clear(er) what is going on, and when a programmer does
something random to force that programmer to declare that randomness.

Abuses are possible, I agree.

> c1_LIRAssembler_aarch64.cpp:
> 
>  960 void LIR_Assembler::mem2reg(LIR_Opr src, LIR_Opr dest, BasicType
> type, LIR_PatchCode patch_code, CodeEmitInfo* info, bool wide, bool /*
> unaligned */) {
>  961   LIR_Address* addr = src->as_address_ptr();
>        . . .
>  976   int null_check_here = code_offset();
>  977   FreeScratchRegs dummy(__ as());
>        . . .
> 
> ote that we are not in the scope of a visible ScratchRegister
> declaration here. So, the use fo a FreeScratchRegs declaration implies
> that we may be in the scope of a declaration up the call chain. If that
> may be the case then how can the callee safely cancel that allocation?
> How can it be sure that the caller is not relying on a scratch register
> retaining its value across the call.

It can't. "Naked" FreeScratchRegs are an abomination to be be avoided
wherever possible. The only really justified use of them is when we're
making a callout to runtime code, at which point the programmer is
expected to know that the native ABI will clobber the scratch regs.

I don't want to clutter every single call to the runtime with
FreeScratchRegs. I don't think it would help anyone.

> Likewise from the POV of the caller which might have declared a
> ScratchRegister there is nothing in the code (except perhaps for
> comments) to indicate that said scratch register might be overwritten
> under the call to this method (or perhaps even a call to an indirect
> caller). To those who do not know the details of the callee a
> ScratchRegister declared in the caller will appear to remain valid
> across the call even though it may be overwritten.
> 
> Another example occurs later in the same file
> 
> 1464 void LIR_Assembler::emit_opTypeCheck(LIR_OpTypeCheck* op) {
> 1465   const bool should_profile = op->should_profile();
> 1466
> 1467   LIR_Code code = op->code();
> 1468   if (code == lir_store_check) {
>        . . .
> 1495     if (should_profile) {
> 1496       Label not_null;
> 1497       __ cbnz(value, not_null);
> 1498       // Object is null; update MDO and exit
> 1499       ScratchRegister rscratch1(__ as(), r8);
> 1500       ScratchRegister rscratch2(__ as(), r9);       . . .
>        . . .
> 1554   } else if (code == lir_checkcast) {
> 1555     FreeScratchRegs dummy(__ as());
>        . . .
> 1564   } else if (code == lir_instanceof) {
> 1565     FreeScratchRegs dummy(__ as());
>        . . .
> 
> Once again the release operations occur in a scope where there is no
> prior declaration in the same scope. So, this use of release
> declarations in the else-if clauses implies that a caller may have
> allocated a scratch register yet they presume it is safe to release it.

That one looks like a mistake. It's perfectly possible to FreeScratchRegs
where none are allocated. I don't believe that any registers will be
allocated before the start of emit_opTypeCheck().

> Meanwhile, to make matters worse, the ScratchRegister declarations in
> the initial if clause imply that a caller which exercises this path will
> /not/ have declared a scratch register. That further implies that the
> validity of these combined assumptions about what declarations might be
> in scope depends on the callee's conditional logic. Now that's starting
> to get a tad too opaque and I fear we are heading down a garden path.

Indeed.

> Of course, this may just be down to the fact that the inclusion of those
> release operations was superfluous; that they could be removed from
> these examples. However, I don't like the fact that one can construct
> such examples. I would much prefer it if this usage was avoided where
> possible by requiring the release to occur in the scope where the
> scratch register was declared. i.e. the caller must arrange to free
> scratch registers if a caller might need them. That would make it much
> easier to detect circumstances where the caller was trying to have its
> cake and eat it.

Sure, I'm happy with that in all cases except in the special case of
the runtime calls.

> This model of usage is indeed what happens much of the time e.g. also
> from the same file:
> 
> 1280 void LIR_Assembler::type_profile_helper(Register mdo,
> 1281                                         ciMethodData *md,
> ciProfileData *data,
> 1282                                         Register recv, Label*
> update_done) {
> 1283   ScratchRegister rscratch1(__ as(), r8);
> 1284   ScratchRegister rscratch2(__ as(), r9);
>        . . .
> 1293     {
> 1294       FreeScratchRegs dummy(__ as());
>        . . .
> 1298     }
>        . . .
> 
> With this usage it is very clear that any scratch value established
> before line 1294 cannot be relied on in code after line 1298.

That's what it's supposed to look like!

I don't really believe that it makes sense for us to prevent "Naked"
FreeScratchRegs automagically, but it would be good to reject such
things in code review

> I suspect that there are circumstances where a caller can or even
> must free scratch registers down the call tree from a allocation so
> perhaps we cannot afford always to dispense with this current usage
> for FreeScratchRegs altogether. However, I'd prefer it if an
> explicit scope-local release was used for wherever possible to avoid
> this 2nd-guessing of callers' intentions. In which case, it would be
> much better if we could use the grammar of language definitions to
> support that .. .

Now that's a good idea. I'm not sure how you'd actually enforce it in
C++, but you could have something like CalloutFreeScratchRegs for
callouts. Actually, maybe I *can* think of a way to do it with macro magic.

> . . . which also provides the opportunity to remedy another shortcoming.
> The current release operation does not specify which scratch register(s)
> is (are) to be released. The only option is free all registers. However,
> if a release operation must always be located in the scope of a
> ScratchRegister then it is perfectly possible to do so by passing the
> relevant ScratchRegister as a parameter. So, perhaps we could use macros
> like:
> 
>   FreeScratchRegister dummy(rscratch1)
> 
>   FreeScratchRegisters dummy(rscratch1, rscratch2);

Sure, it is, but IMO it's an over-elaboration. When calling down to a
sub-macro it's easer to follow what's going on if you just say "this
macro clobbers scratch", but I won't fight you if you're really keen
to do that.

> That makes it harder for scratch registers to be freed down the stack in
> a caller because the caller will not have access to rscratch1 and
> rscratch2 unless they are passed in by reerence or pointer. I'm assuming
> code won't pass a ScratchRegister as a call argument, merely the
> encapsulated Register (we might perhaps be able to knobble that by
> ensuring that copying/dereferencing reset the register to noreg).

Yeah. Passing scratch registers by other names is one of the worst
abuses I've had to deal with, and after some version of this patch
goes in such things may be fixed.

> If this revised model works then the current FreeScratchRegs really
> needs renaming to something less appealing like
> DeleteCallerScratchBindings that won't tempt anyone to use it.

That's a nice idea.

> Also, it would perhaps be useful to provide a reverse operation for
> re-allocating a previouly declared and freed scratch register e.g.
> 
> public void Foo::bar(...)
>   ScratchRegister rscratch1(...);
>   ScratchRegister rscratch2(...);
>   . . .
>   <use rscratch1 and rscratch2>
>   . . .
>   FreeScratchRegister dummy(rscratch1);
>   call_user_of_rscratch1(...);
>   UseScratchRegister dummy2(rscratch1);
>   ...

No: too complicated, excessive API service. Using nested scopes for
allocation and release corresponds will with the usage of C++, and
that's a good thing.

> This explicitly acknowledges that the call may want to use  rsctratch1
> and explicitly signals in the caller code that rscratch1 cannot be
> expected to remain constant across the call. Of course, one can always
> achieve the same effect with block nesting but I think this is neater.

I don't. That's what nesting is for!

> There is a bit more to this to do with conditional allocation of a
> scratch register that I will discuss in the detailed comments (for
> c1_LIRAssembler) but that needs to be done in context so ...
> 
> ... on to the specific comments:
> 
> aarch64.ad:
> 
> Firstly, just a humdrum error. You missed a conversion for a specific
> use of rscratch1
> 
> 2851   enc_class aarch64_enc_stlrb(iRegI src, memory mem) %{
> 2852     MOV_VOLATILE(as_Register($src$$reg), $mem$$base, $mem$$index,
> $mem$$scale, $mem$$disp,
> 2853                  rscratch1, stlrb);
> 2854   %}
> 
> This is still passing rscratch1 in the MOV_VOLATILE macro call where
> subsequent encoding classes use r8_scratch. It doesn't actually matter
> (i.e. it doesn't fail to compile) because MOV_VOLATILE ignores the
> SCRATCH argument. I know you are trying to minimize changes but dropping
> that argument in all uses would be a whole lot better . . . yes, maybe
> in the next patch.

Yes.

> This story continues . . .
> 
> 2930   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
> 2931     MOV_VOLATILE(r8_scratch, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
> 2932              r8_scratch, ldarw);
> 2933     __ fmovs(as_FloatRegister($dst$$reg), r8_scratch);
> 2934   %}
> 
> I'm not sure why you invented the name r8_scratch as an alias for r8.
> Does that really help? I couldn't discern why you sometimes used one and
> why sometimes the other. If there is a rationale it probaly ought to be
> documented somewhere (at least at the point where r8_scratch aand
> r9_scratch are declared).

I'm not sure, which is perhaps why it wasn't used consistently. Just a
reminder, I guess.

> Anyway, what I don't follow is why are these uses just employ a bare
> register name. Why are they exempt from the need to declare a scratch
> register for r8 before using it?
> Why not:
> 
>    enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
>      ScratchRegister rscratch1(__ as(), r8);
>      MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
>               rscratch1, ldarw);
>      __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
>      . . .
> 
> or, if need be
> 
>    enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
>      {
>        ScratchRegister rscratch1(__ as(), r8);
>        MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
>                 rscratch1, ldarw);
>        __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
>      }
>      . . .

I guess I could live with that. I don't want to introduce extra noise
where it can be avoided.

> This is not the only place where your usage is confusing. You do
> (although not consistently) use declarations in some of the later
> encodings. For example, further down the file:
> 
> 3526   enc_class aarch64_enc_fast_lock(iRegP object, iRegP box, iRegP
> tmp, iRegP tmp2) %{
>        . . .
> 3570     // markWord of object (disp_hdr) with the stack pointer.
> 3571     __ mov(r8_scratch, sp);
> 3572     __ sub(disp_hdr, disp_hdr, r8_scratch);
>        . . .
> 
> has no declaration ... but is immediately followed by
> 
> 3604   enc_class aarch64_enc_fast_unlock(iRegP object, iRegP box, iRegP
> tmp, iRegP tmp2) %{
>        . . .
> 3644     ScratchRegister rscratch1(__ as(), r8);
>        . . .

Fair enough. It's PoC, WIP. :-)

> Declarations are also used in some of the instruction definitions:
> 
> 12103 instruct rolI_rReg(iRegINoSp dst, iRegI src, iRegI shift,
> rFlagsReg cr)
>        . . .
> 12109   ins_encode %{
> 12110     ScratchRegister rscratch1(__ as(), r8);
> 12111     __ subw(rscratch1, zr, as_Register($shift$$reg));
>        . . .
> 
> Is there a clear rationale for whether or not to declare a
> ScratchRegister that I am missing? I'd be happier with them being used
> everywhere, avoiding explicit mention of Register names at points of use.

No, I might have just changed my mind partway through. In principle,
for the sake of the sanity of the maintenance programmer, it might be
simplest to insist that scratch register declarations are always used
in AD files. It would mean a change which moves the register tracker
from Assembler to CodeBuffer, because C2 creates Assemblers on the fly,
but such a change is not hard to do.

> Also, one other thing I don't understand but I think is just an error:
> 
> 3008   enc_class aarch64_enc_stlxr(iRegLNoSp src, memory mem) %{
> 3009     MacroAssembler _masm(&cbuf);
> 3010     Register src_reg = as_Register($src$$reg);
> 3011     Register base = as_Register($mem$$base);
> 3012     ScratchRegister rscratch2(__ as(), r9);
>          . . .
> 3017        if (disp != 0) {
> 3018         __ lea(r9_scratch, Address(base, disp));
> 3019         __ stlxr(r8_scratch, src_reg, r9_scratch);
>          . . .
> 
> Why are you declaring r9 as temp rscratch2, not declaring r8 as
> rscratch1 and then using both as r9_scratch and r8_scratch? Note also
> that in the previous encoding (aarch64_enc_ldaxr) you use r9_scratch and
> r8_scratch and don't declare any ScratchRegister.

I'm sure there was a reason, but...

> Why name the scratch register r8_scratch in this decl (shadowing an
> existing name) rather than calling it rscratch1?

That too.

> c1_LIRAssembler_aarch64.cpp:
> 
> I found the comment here to be misleading
> 
> 1581 void LIR_Assembler::casw(Register addr, Register newval, Register
> cmpval) {
> 1582   // r8 is used to pass an argument here, not as scratch. See
> 1583   // LIRGenerator::atomic_cmpxchg.
> 1584   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/
> true, /* release*/ true, /* weak*/ false, r8);
> 1585   __ cset(r8, Assembler::NE);
> 1586   __ membar(__ AnyAny);
> 1587 }
> 
> Firstly, to be more strict/precise, r8 is used to return a result from
> cmpxchg in order to then pass it on to cset (it read to me like you were
> saying that it passes a value into cmpxchg).

Passing scratch registers to macros as arguments is really awkward. It's
just about the most confusing and error-prone thing you can do. In a
later patch I'd like to get rid of all of it.

Probably the greatest contribution of this work is that it detects and
forces us to do something about all such usages.

> More importantly, what the comment does not clarify is that the reason
> you cannot allocate r8 as a scratch register here at the point of call
> with a ScratchRegister decl is because cmpxchg itself conditionally
> reserves r8 as a scratch register in the case where a client passes
> noreg. So using a ScratchRegister here would break cmpxchg and not using
> a ScratchRegister in cmpxchg would disallow other callers from passing
> in noreg or require it to provide two paths to handle the two cases,
> noreg or a scratch reg, allocating a ScratchRegister in only one of
> those paths. This is all really a bit clumsy and certainly unclear.

Right. The problem here is that the way the registers are used is
confusing and clumsy; the declarations (and comments) reflect that
clumsiness.

> This sort of case where a called method conditionally allocates a
> scratch register is an important one to be able to handle. I think a
> way to deal with this might be to allow methods like cmpxchg to
> declare a local scratch register which /uses/ a supplied register if
> provided and only /allocates/ a specific scratch register if noreg
> is supplied. That would allow the caller to allocate a scratch
> register and pass it in or not allocate the register and pass in
> noreg. Of course it also allows a caller to pass in a non-scratch
> register.

Better still, I think, for the caller to allocate the scratch register
and pass it down under the name rscratch1; ownership remains with the
caller.

> So, given the above definition and your current definition for cmpxchg, viz:
> 
> 2497 void MacroAssembler::cmpxchg(Register addr, Register expected,
> 2498                              Register new_val,
> 2499                              enum operand_size size,
> 2500                              bool acquire, bool release,
> 2501                              bool weak,
> 2502                              Register result) {
> 2503   ScratchRegister rscratch1(as(), r8);
> 2504   if (result == noreg)  result = rscratch1;
>        <blithely use result>
> 
> we could replace them with
> 
>  void LIR_Assembler::casw(Register addr, Register newval, Register cmpval) {
>    // allocate scratch register to return result
>    // and pass it on to cset
>    ScratchRegister rscratch1( __ as(), r8);
>    __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/ true,
> /* release*/ true, /* weak*/ false, rscratch1);
>    __ cset(rscratch1, Assembler::NE);
>    __ membar(__ AnyAny);
>  }
> 
> and
> 
>  void MacroAssembler::cmpxchg(Register addr, Register expected,
>                               Register new_val,
>                               enum operand_size size,
>                               bool acquire, bool release,
>                               bool weak,
>                               Register result) {
>    ScratchRegister rscratch1(as(),
>                         /*use without alloc if reg*/ result,
>                         /*alloc if noreg*/ r8);
>    <blithely use rscratch1 instead of result>
> 
> This doesn't really work quite right because the caller has to inhibit
> allocation in the called method either by passing 1) an explicit
> non-scratch register or 2) an allocated scratch register i.e. it is not
> quite uniform. However, in cases where noreg is passed the implication
> is clear that that rscratch1 (possibly also rscratch2 if two noreg args
> may be passed) will be allocated by the callee as an alternative.

Hmm, maybe. I suggest that we put up with the ugliness of nakedly
using r8 for now (accompanied by suitable comments) and then fix it up
later.

> n.b. the comment in the original version of LIR_Assembler::casw refers
> to LIRGenerator::atomic_cmpxchg. Do you not actually mean
> MacroAssembler::cmpxchg?

I can't remember.

> c1_Runtime_aarch64.cpp
> 
> I don't really like this:
> 
>   51 int StubAssembler::call_RT(Register oop_result1, Register
> metadata_result, address entry, int args_size) {
>   52   // setup registers
>        . . .
>   58   FreeScratchRegs dummy(as());
>   59   ScratchRegister rscratch1(as(), r8);
>   60   ScratchRegister rscratch2(as(), r9);
> 
> Why are the regs freed here rather than in the caller? I realise this
> case is special because the callee can know that scratch regs are now
> invalid (since we are about to plant a blr all bets are off). However,
> this is a disparity with other use cases where the caller needs to make
> the decision to free temps. Intentions and expectations would be clearer
> if the caller were required to explicitly release scratch vars before a
> call to call_RT (and then maybe reallocate them again afterwards).

No, I don't want to do that, as I said above. It's a callout, which
nukes more than just the scratch registers.

> interp_masm_aarch64.cpp:
> 
> You have these top level declarations:
> 
>   47 static const Register rscratch1 = r8;
>   48 static const Register rscratch2 = r9;
> 
> So, we don't (now or in future) use ScratchRegister here? Why not? Or is
> this just an expedient hack? Once again, I'd prefer scratch uses to be
> uniform across all the code.

It's very expedient, and in particular the interpreter has its own
convention for register usage. I don't believe that rewriting all of
the templates etc. to use scratch register declarations would reduce
errors but it might even introduce them.

I could live with the interpreter being changed later, but there's no
need for it now.

> interp_masm_aarch64.hpp:
> 
>  293     set_last_Java_frame(esp, rfp, (address) pc(), r8);
> 
> So, here you are using r8 rather than r8_scratch. Was there a rationale
> for that?

It's because of declaration is in the wrong place. If we moved it, it
could just use the same name as the rest of the interpreter.

> interpreterRT_aarch64.cpp:
> 
>   41 static const Register rscratch1 = r8;
>   42 static const Register rscratch2 = r9;
> 
> Same comment as above for interp_masm_aarch64.cpp.
> 
> 
> macroAssembler_aarch64.cpp:
> 
> Several more occurrences of callee freeing that make sense given that a
> call is being planted but which I think would be clearer if done in the
> relevant callers:

See above.

> Also, I don't follow why you sometimes use r8 and other times
> r9_scratch in this file.

Some cleanups required.

> methodHandles_aarch64.cpp
> 
> Once again free in a callee that I'd prefer to see done in the callers:
> 
>   97 void MethodHandles::jump_from_method_handle(MacroAssembler* _masm,
> Register method, Register temp,
>   98                                             bool for_compiler_entry) {
>   99   FreeScratchRegs dummy(__ as());
>  100   ScratchRegister rscratch1(__ as(), r8);

There's even less justification for putting FreeScratchRegs in the
caller at this point: jump_from_method_handle is not going to return.

> stubGenerator_aarch64.cpp
> 
> Two cases where it is clear a more selective release would be useful
> 
> 2104     {
> 2105       FreeScratchRegs dummy(__ as());
> 2106       ScratchRegister rscratch2(__ as(), r9);
>          . . .
> 2188     {
> 2189       FreeScratchRegs dummy(__ as());
> 2190       ScratchRegister rscratch2(__ as(), r9);

Maybe, but it would make for a more complicated API, which I'm not
convinced would really carry its weight. I can go with a more
complicated FreeScratchRegs which says what you've freed rather than
what you've reserved if you're really keen. NBut it's not obvious to
me that

FreeScratchRegs dummy(r8, __ as());

is really better than

FreeScratchRegs dummy(__ as());
ScratchRegister rscratch2(__ as(), r9);

... especially since the latter actually relies on the programmer
reading upwards to see that r9 is allocated.

> templateTable_aarch64.cpp
> 
>   46 static const Register rscratch1 = r8;
>   47 static const Register rscratch2 = r9;
> 
> Same comment as above for interp_masm_aarch64.cpp.

Same reply.  :-)

Thanks.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671