[aarch64-port-dev ] RFD: scratch registers cleanup [Long post]

Tue Nov 5 12:24:02 UTC 2019

Hi Andrew,

This is a good start and it could be used as is. However,I think it
needs some polishing to improve the way it works. That could involve
upgrading the model for how scratch registers are declared, allocated,
released and re-allocated. Alternatively, it could just involve
improving the way the current definitions are employed.

I'll start with general comments which address that broader point before
going on to mention specific issues to do with the current patch. These
latter comments will also help to motivate the general point but I think
it is better to make them in context. See after the sig for these comments.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

----- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---

I think this is a very good start to remedying a nasty problem. However,
I'm not yet convinced the model for how you manage and use scratch
registers is the best way to do this. I'll explain why below and suggest
a variation that might or might not work. If it doesn't then correct me
and that will at least help understand why you ended up with the current
design. If it does work or, at least, suggests how to move towards
something better then let's try another round.

What troubles me most of all is that the mechanism you have provided can
be quite opaque about ownership/liveness of registers. I think that
happens because at root it does not require that declaration +
allocation, release and subsequent re-allocation are co-located in the
same program scope (or,at least, in the same method).

Let's work through some examples to clarify that view. As it stands it
is possible to allocate some scratch registers in a caller method and
then release those registers in a called method (possibly more than one
call down the stack) -- indeed, the latter operation occurs in this code
from the patch:

c1_LIRAssembler_aarch64.cpp:

 960 void LIR_Assembler::mem2reg(LIR_Opr src, LIR_Opr dest, BasicType
type, LIR_PatchCode patch_code, CodeEmitInfo* info, bool wide, bool /*
unaligned */) {
 961   LIR_Address* addr = src->as_address_ptr();
       . . .
 976   int null_check_here = code_offset();
 977   FreeScratchRegs dummy(__ as());
       . . .

ote that we are not in the scope of a visible ScratchRegister
declaration here. So, the use fo a FreeScratchRegs declaration implies
that we may be in the scope of a declaration up the call chain. If that
may be the case then how can the callee safely cancel that allocation?
How can it be sure that the caller is not relying on a scratch register
retaining its value across the call.

Likewise from the POV of the caller which might have declared a
ScratchRegister there is nothing in the code (except perhaps for
comments) to indicate that said scratch register might be overwritten
under the call to this method (or perhaps even a call to an indirect
caller). To those who do not know the details of the callee a
ScratchRegister declared in the caller will appear to remain valid
across the call even though it may be overwritten.

Another example occurs later in the same file

1464 void LIR_Assembler::emit_opTypeCheck(LIR_OpTypeCheck* op) {
1465   const bool should_profile = op->should_profile();
1466
1467   LIR_Code code = op->code();
1468   if (code == lir_store_check) {
       . . .
1495     if (should_profile) {
1496       Label not_null;
1497       __ cbnz(value, not_null);
1498       // Object is null; update MDO and exit
1499       ScratchRegister rscratch1(__ as(), r8);
1500       ScratchRegister rscratch2(__ as(), r9);       . . .
       . . .
1554   } else if (code == lir_checkcast) {
1555     FreeScratchRegs dummy(__ as());
       . . .
1564   } else if (code == lir_instanceof) {
1565     FreeScratchRegs dummy(__ as());
       . . .

Once again the release operations occur in a scope where there is no
prior declaration in the same scope. So, this use of release
declarations in the else-if clauses implies that a caller may have
allocated a scratch register yet they presume it is safe to release it.

Meanwhile, to make matters worse, the ScratchRegister declarations in
the initial if clause imply that a caller which exercises this path will
/not/ have declared a scratch register. That further implies that the
validity of these combined assumptions about what declarations might be
in scope depends on the callee's conditional logic. Now that's starting
to get a tad too opaque and I fear we are heading down a garden path.

Of course, this may just be down to the fact that the inclusion of those
release operations was superfluous; that they could be removed from
these examples. However, I don't like the fact that one can construct
such examples. I would much prefer it if this usage was avoided where
possible by requiring the release to occur in the scope where the
scratch register was declared. i.e. the caller must arrange to free
scratch registers if a caller might need them. That would make it much
easier to detect circumstances where the caller was trying to have its
cake and eat it.

This model of usage is indeed what happens much of the time e.g. also
from the same file:

1280 void LIR_Assembler::type_profile_helper(Register mdo,
1281                                         ciMethodData *md,
ciProfileData *data,
1282                                         Register recv, Label*
update_done) {
1283   ScratchRegister rscratch1(__ as(), r8);
1284   ScratchRegister rscratch2(__ as(), r9);
       . . .
1293     {
1294       FreeScratchRegs dummy(__ as());
       . . .
1298     }
       . . .

With this usage it is very clear that any scratch value established
before line 1294 cannot be relied on in code after line 1298.

I suspect that there are circumstances where a caller can or even must
free scratch registers down the call tree from a allocation so perhaps
we cannot afford always to dispense with this current usage for
FreeScratchRegs altogether. However, I'd prefer it if an explicit
scope-local release was used for wherever possible to avoid this
2nd-guessing of callers' intentions. In which case, it would be much
better if we could use the grammar of language definitions to support
that .. .

. . . which also provides the opportunity to remedy another shortcoming.
The current release operation does not specify which scratch register(s)
is (are) to be released. The only option is free all registers. However,
if a release operation must always be located in the scope of a
ScratchRegister then it is perfectly possible to do so by passing the
relevant ScratchRegister as a parameter. So, perhaps we could use macros
like:

  FreeScratchRegister dummy(rscratch1)

  FreeScratchRegisters dummy(rscratch1, rscratch2);

That makes it harder for scratch registers to be freed down the stack in
a caller because the caller will not have access to rscratch1 and
rscratch2 unless they are passed in by reerence or pointer. I'm assuming
code won't pass a ScratchRegister as a call argument, merely the
encapsulated Register (we might perhaps be able to knobble that by
ensuring that copying/dereferencing reset the register to noreg).

If this revised model works then the current FreeScratchRegs really
needs renaming to something less appealing like
DeleteCallerScratchBindings that won't tempt anyone to use it.

Also, it would perhaps be useful to provide a reverse operation for
re-allocating a previouly declared and freed scratch register e.g.

public void Foo::bar(...)
  ScratchRegister rscratch1(...);
  ScratchRegister rscratch2(...);
  . . .
  <use rscratch1 and rscratch2>
  . . .
  FreeScratchRegister dummy(rscratch1);
  call_user_of_rscratch1(...);
  UseScratchRegister dummy2(rscratch1);
  ...

This explicitly acknowledges that the call may want to use  rsctratch1
and explicitly signals in the caller code that rscratch1 cannot be
expected to remain constant across the call. Of course, one can always
achieve the same effect with block nesting but I think this is neater.

There is a bit more to this to do with conditional allocation of a
scratch register that I will discuss in the detailed comments (for
c1_LIRAssembler) but that needs to be done in context so ...

... on to the specific comments:

aarch64.ad:

Firstly, just a humdrum error. You missed a conversion for a specific
use of rscratch1

2851   enc_class aarch64_enc_stlrb(iRegI src, memory mem) %{
2852     MOV_VOLATILE(as_Register($src$$reg), $mem$$base, $mem$$index,
$mem$$scale, $mem$$disp,
2853                  rscratch1, stlrb);
2854   %}

This is still passing rscratch1 in the MOV_VOLATILE macro call where
subsequent encoding classes use r8_scratch. It doesn't actually matter
(i.e. it doesn't fail to compile) because MOV_VOLATILE ignores the
SCRATCH argument. I know you are trying to minimize changes but dropping
that argument in all uses would be a whole lot better . . . yes, maybe
in the next patch.

This story continues . . .

2930   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
2931     MOV_VOLATILE(r8_scratch, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
2932              r8_scratch, ldarw);
2933     __ fmovs(as_FloatRegister($dst$$reg), r8_scratch);
2934   %}

I'm not sure why you invented the name r8_scratch as an alias for r8.
Does that really help? I couldn't discern why you sometimes used one and
why sometimes the other. If there is a rationale it probaly ought to be
documented somewhere (at least at the point where r8_scratch aand
r9_scratch are declared).

Anyway, what I don't follow is why are these uses just employ a bare
register name. Why are they exempt from the need to declare a scratch
register for r8 before using it?
Why not:

   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
     ScratchRegister rscratch1(__ as(), r8);
     MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
              rscratch1, ldarw);
     __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
     . . .

or, if need be

   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
     {
       ScratchRegister rscratch1(__ as(), r8);
       MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
                rscratch1, ldarw);
       __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
     }
     . . .

I realize this is extra 'protocol' for cases where register use is
managed by other means and where the invoked assembler methods are not
going to risk reuse of the register. However, the bare use of r8 (or
even under the name r8_scratch) is actually quite confusing. Using the
same protocol for acquiring a scratch register has the important virtue
of consistency. It also allows us to find a way later to avoid having to
explicitly name r8 and r9 in the declaration and, equally, having to
explicitly name them in these encodings e.g. we might just pass an index
in the constructor:

   ScratchRegister rscratch1(__ as(), 1);
   ScratchRegister rscratch2(__ as(), 2);

This is not the only place where your usage is confusing. You do
(although not consistently) use declarations in some of the later
encodings. For example, further down the file:

3526   enc_class aarch64_enc_fast_lock(iRegP object, iRegP box, iRegP
tmp, iRegP tmp2) %{
       . . .
3570     // markWord of object (disp_hdr) with the stack pointer.
3571     __ mov(r8_scratch, sp);
3572     __ sub(disp_hdr, disp_hdr, r8_scratch);
       . . .

has no declaration ... but is immediately followed by

3604   enc_class aarch64_enc_fast_unlock(iRegP object, iRegP box, iRegP
tmp, iRegP tmp2) %{
       . . .
3644     ScratchRegister rscratch1(__ as(), r8);
       . . .

Declarations are also used in some of the instruction definitions:

12103 instruct rolI_rReg(iRegINoSp dst, iRegI src, iRegI shift,
rFlagsReg cr)
       . . .
12109   ins_encode %{
12110     ScratchRegister rscratch1(__ as(), r8);
12111     __ subw(rscratch1, zr, as_Register($shift$$reg));
       . . .

Is there a clear rationale for whether or not to declare a
ScratchRegister that I am missing? I'd be happier with them being used
everywhere, avoiding explicit mention of Register names at points of use.

Also, one other thing I don't understand but I think is just an error:

3008   enc_class aarch64_enc_stlxr(iRegLNoSp src, memory mem) %{
3009     MacroAssembler _masm(&cbuf);
3010     Register src_reg = as_Register($src$$reg);
3011     Register base = as_Register($mem$$base);
3012     ScratchRegister rscratch2(__ as(), r9);
         . . .
3017        if (disp != 0) {
3018         __ lea(r9_scratch, Address(base, disp));
3019         __ stlxr(r8_scratch, src_reg, r9_scratch);
         . . .

Why are you declaring r9 as temp rscratch2, not declaring r8 as
rscratch1 and then using both as r9_scratch and r8_scratch? Note also
that in the previous encoding (aarch64_enc_ldaxr) you use r9_scratch and
r8_scratch and don't declare any ScratchRegister.

Oh and

3117   enc_class aarch64_enc_prefetchw(memory mem) %{
       . . .
3130         ScratchRegister r8_scratch(__ as(), r8);
3131         __ lea(r8_scratch, Address(base, disp));
3132         __ prfm(Address(r8_scratch, index_reg,
Address::lsl(scale)), PSTL1KEEP);

Why name the scratch register r8_scratch in this decl (shadowing an
existing name) rather than calling it rscratch1?

c1_LIRAssembler_aarch64.cpp:

I found the comment here to be misleading

1581 void LIR_Assembler::casw(Register addr, Register newval, Register
cmpval) {
1582   // r8 is used to pass an argument here, not as scratch. See
1583   // LIRGenerator::atomic_cmpxchg.
1584   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/
true, /* release*/ true, /* weak*/ false, r8);
1585   __ cset(r8, Assembler::NE);
1586   __ membar(__ AnyAny);
1587 }

Firstly, to be more strict/precise, r8 is used to return a result from
cmpxchg in order to then pass it on to cset (it read to me like you were
saying that it passes a value into cmpxchg).

More importantly, what the comment does not clarify is that the reason
you cannot allocate r8 as a scratch register here at the point of call
with a ScratchRegister decl is because cmpxchg itself conditionally
reserves r8 as a scratch register in the case where a client passes
noreg. So using a ScratchRegister here would break cmpxchg and not using
a ScratchRegister in cmpxchg would disallow other callers from passing
in noreg or require it to provide two paths to handle the two cases,
noreg or a scratch reg, allocating a ScratchRegister in only one of
those paths. This is all really a bit clumsy and certainly unclear.

This sort of case where a called method conditionally allocates a
scratch register is an important one to be able to handle. I think a way
to deal with this might be to allow methods like cmpxchg to declare a
local scratch register which /uses/ a supplied register if provided and
only /allocates/ a specific scratch register if noreg is supplied. That
would allow the caller to allocate a scratch register and pass it in or
not allocate the register and pass in noreg. Of course it also allows a
caller to pass in a non-scratch register.

So, given the above definition and your current definition for cmpxchg, viz:

2497 void MacroAssembler::cmpxchg(Register addr, Register expected,
2498                              Register new_val,
2499                              enum operand_size size,
2500                              bool acquire, bool release,
2501                              bool weak,
2502                              Register result) {
2503   ScratchRegister rscratch1(as(), r8);
2504   if (result == noreg)  result = rscratch1;
       <blithely use result>

we could replace them with

 void LIR_Assembler::casw(Register addr, Register newval, Register cmpval) {
   // allocate scratch register to return result
   // and pass it on to cset
   ScratchRegister rscratch1( __ as(), r8);
   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/ true,
/* release*/ true, /* weak*/ false, rscratch1);
   __ cset(rscratch1, Assembler::NE);
   __ membar(__ AnyAny);
 }

and

 void MacroAssembler::cmpxchg(Register addr, Register expected,
                              Register new_val,
                              enum operand_size size,
                              bool acquire, bool release,
                              bool weak,
                              Register result) {
   ScratchRegister rscratch1(as(),
                        /*use without alloc if reg*/ result,
                        /*alloc if noreg*/ r8);
   <blithely use rscratch1 instead of result>

This doesn't really work quite right because the caller has to inhibit
allocation in the called method either by passing 1) an explicit
non-scratch register or 2) an allocated scratch register i.e. it is not
quite uniform. However, in cases where noreg is passed the implication
is clear that that rscratch1 (possibly also rscratch2 if two noreg args
may be passed) will be allocated by the callee as an alternative.

n.b. the comment in the original version of LIR_Assembler::casw refers
to LIRGenerator::atomic_cmpxchg. Do you not actually mean
MacroAssembler::cmpxchg?

c1_Runtime_aarch64.cpp

I don't really like this:

  51 int StubAssembler::call_RT(Register oop_result1, Register
metadata_result, address entry, int args_size) {
  52   // setup registers
       . . .
  58   FreeScratchRegs dummy(as());
  59   ScratchRegister rscratch1(as(), r8);
  60   ScratchRegister rscratch2(as(), r9);

Why are the regs freed here rather than in the caller? I realise this
case is special because the callee can know that scratch regs are now
invalid (since we are about to plant a blr all bets are off). However,
this is a disparity with other use cases where the caller needs to make
the decision to free temps. Intentions and expectations would be clearer
if the caller were required to explicitly release scratch vars before a
call to call_RT (and then maybe reallocate them again afterwards).

interp_masm_aarch64.cpp:

You have these top level declarations:

  47 static const Register rscratch1 = r8;
  48 static const Register rscratch2 = r9;

So, we don't (now or in future) use ScratchRegister here? Why not? Or is
this just an expedient hack? Once again, I'd prefer scratch uses to be
uniform across all the code.

interp_masm_aarch64.hpp:

 293     set_last_Java_frame(esp, rfp, (address) pc(), r8);

So, here you are using r8 rather than r8_scratch. Was there a rationale
for that?

interpreterRT_aarch64.cpp:

  41 static const Register rscratch1 = r8;
  42 static const Register rscratch2 = r9;

Same comment as above for interp_masm_aarch64.cpp.

macroAssembler_aarch64.cpp:

Several more occurrences of callee freeing that make sense given that a
call is being planted but which I think would be clearer if done in the
relevant callers:

 679 void MacroAssembler::call_VM_base(Register oop_result,
 680                                   Register java_thread,
 681                                   Register last_java_sp,
 682                                   address  entry_point,
 683                                   int      number_of_arguments,
 684                                   bool     check_exceptions) {
 685   FreeScratchRegs regs(as());
       . . .

 804 address MacroAssembler::emit_trampoline_stub(int
insts_call_instruction_offset,
 805                                              address dest) {
       . . .
 821   FreeScratchRegs dummy(as());
       . . .

1456 void MacroAssembler::call_VM_leaf_base(address entry_point,
1457                                        int number_of_arguments,
1458                                        Label *retaddr) {
1459   Label E, L;
1460
1461   FreeScratchRegs dummy(as());  // VM calls clobber all registers but
1462                                 // we preserve rscratch1.
1463   ScratchRegister rscratch1(as(), r8);
       . . .

Also, I don't follow why you sometimes use r8 and other times
r9_scratch in this file.

methodHandles_aarch64.cpp

Once again free in a callee that I'd prefer to see done in the callers:

  97 void MethodHandles::jump_from_method_handle(MacroAssembler* _masm,
Register method, Register temp,
  98                                             bool for_compiler_entry) {
  99   FreeScratchRegs dummy(__ as());
 100   ScratchRegister rscratch1(__ as(), r8);

 128 void MethodHandles::jump_to_lambda_form(MacroAssembler* _masm,
 129                                         Register recv, Register
method_temp,
 130                                         Register temp2,
 131                                         bool for_compiler_entry) {
 132   FreeScratchRegs(__ as());

stubGenerator_aarch64.cpp

Two cases where it is clear a more selective release would be useful

2104     {
2105       FreeScratchRegs dummy(__ as());
2106       ScratchRegister rscratch2(__ as(), r9);
         . . .
2188     {
2189       FreeScratchRegs dummy(__ as());
2190       ScratchRegister rscratch2(__ as(), r9);

templateTable_aarch64.cpp

  46 static const Register rscratch1 = r8;
  47 static const Register rscratch2 = r9;

Same comment as above for interp_masm_aarch64.cpp.