[aarch64-port-dev ] RFR: 8144498: aarch64: large code cache generates SEGV

Mon Dec 7 12:22:14 UTC 2015

On Fri, 2015-12-04 at 17:38 +0000, Andrew Haley wrote:
> On 12/04/2015 04:14 PM, Andrew Haley wrote:
> I'm going to suggest this as a simpler fix:
> 
> address Relocation::pd_call_destination(address orig_addr) {
>   assert(is_call(), "should be a call here");
>   if (NativeCall::is_call_at(addr())) {  // is a BL instruction
>     address trampoline = nativeCall_at(addr())->get_trampoline();
>     if (trampoline) {
>       return nativeCallTrampolineStub_at(trampoline)->destination();
>     }
>   }
>   if (orig_addr != NULL) {
>     return MacroAssembler::pd_call_destination(orig_addr);
>   }
>   return MacroAssembler::pd_call_destination(addr());
> }
> 
> I think it's right because this way we only follow real BL
> instructions, and if these point to trampolines they must be within
> the blob which is being relocated.  I think this will fix your problem
> because such BL instructions cannot point to anywhere wild.

I am not sure this works.

Firstly, when far_branches is not enabled (i.e. the code cache is <= 128MB), there can be BL instructions to addresses outside the current code blob. These are generated by far_call as follows.

  if (far_branches()) {
    unsigned long offset;
    // We can use ADRP here because we know that the total size of
    // the code cache cannot exceed 2Gb.
    adrp(tmp, entry, offset);
    add(tmp, tmp, offset);
    if (cbuf) cbuf->set_insts_mark();
    blr(tmp);
  } else {
    if (cbuf) cbuf->set_insts_mark();
    bl(entry);
  }

I cannot see what prevents one of these BLs from being followed. Since they may have been copied but not yet relocated, they may end up pointing somewhere random in the code buffer which just happens to look like a trampoline. Admittedly, the probability of failure is vastly reduced because there are no genuine trampolines for it to latch on to.
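
To illustrate why a copied but unrelocated BL is dangerous, here is a minimal standalone sketch of the A64 BL encoding (opcode 0b100101 in the top six bits, a signed 26-bit word offset in the rest). The helper names is_bl, encode_bl and bl_target are made up for this note and are not HotSpot functions. Because the offset is PC-relative, the same instruction word decodes to a different target once it sits at a different address.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not HotSpot's NativeCall API.
constexpr uint32_t kBlOpcode     = 0x94000000u;  // bits [31:26] = 0b100101
constexpr uint32_t kBlOpcodeMask = 0xFC000000u;
constexpr uint32_t kImm26Mask    = 0x03FFFFFFu;  // signed word offset

// True if the instruction word is a BL.
constexpr bool is_bl(uint32_t insn) {
  return (insn & kBlOpcodeMask) == kBlOpcode;
}

// Encode a BL at `pc` branching to `target` (caller guarantees it reaches).
constexpr uint32_t encode_bl(uint64_t pc, uint64_t target) {
  return kBlOpcode | (static_cast<uint32_t>((target - pc) >> 2) & kImm26Mask);
}

// Decode the target of a BL instruction word located at `pc`.
constexpr uint64_t bl_target(uint64_t pc, uint32_t insn) {
  // Sign-extend imm26 using unsigned arithmetic, then scale to bytes.
  return pc + ((((insn & 0x03FFFFFFull) ^ 0x02000000ull) - 0x02000000ull) << 2);
}

// A BL emitted at 0x10000 that calls 0x10400.
constexpr uint32_t kInsn = encode_bl(0x10000, 0x10400);
static_assert(kInsn == 0x94000100u, "forward offset of 0x100 words");
static_assert(bl_target(0x10000, kInsn) == 0x10400, "correct at its original address");

// The same word copied to 0x50000 without relocation now points 0x40000 bytes
// further on, at whatever happens to live there.
static_assert(bl_target(0x50000, kInsn) == 0x50400, "stale target after copy");
```

The decoded target moves by exactly the copy distance, so anything that follows such a BL before relocation is reading an arbitrary location.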

This case can be avoided by adding a far_branches() predicate to pd_call_destination as follows.

  if (far_branches() && NativeCall::is_call_at(addr())) {  // is a BL instruction

Secondly, I am not sure that your assertion

> (When a trampoline call is first created it is a call to self; the
> reloc is the only way to find the trampoline.  For this reason, you
> must use nativeCall_at(addr())->get_trampoline().)

is correct. In MacroAssembler::trampoline_call I see

  if (Assembler::reachable_from_branch_at(pc(), entry.target())) {
    bl(entry.target());
  } else {
    bl(pc());
  }

so it only creates a call to self if the branch does not reach and, as before, you could have a dangling BL when this is copied.

I believe it would be possible to replace the above code section with simply

  bl(pc());

since it will always be relocated and therefore you can always generate the call to self.
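
One property that may be relevant here, shown as a standalone sketch rather than HotSpot code: a BL to self has a zero offset, so it decodes to its own address wherever it is copied, whereas a directly encoded BL is only valid within +/-128MB of the address it was emitted at. The names bl_reaches, bl_target and kCallToSelf are assumptions made up for this illustration.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not HotSpot's Assembler API.
constexpr int64_t kBlRangeBytes = int64_t{128} * 1024 * 1024;  // 128MB

// True if a BL at `branch` can encode `target` directly
// (signed 26-bit word offset, i.e. signed 28-bit byte offset).
constexpr bool bl_reaches(uint64_t branch, uint64_t target) {
  return static_cast<int64_t>(target - branch) >= -kBlRangeBytes &&
         static_cast<int64_t>(target - branch) <  kBlRangeBytes;
}

// BL with imm26 == 0: a call to self.
constexpr uint32_t kCallToSelf = 0x94000000u;

// Decode the target of a BL instruction word located at `pc`.
constexpr uint64_t bl_target(uint64_t pc, uint32_t insn) {
  return pc + ((((insn & 0x03FFFFFFull) ^ 0x02000000ull) - 0x02000000ull) << 2);
}

static_assert(bl_reaches(0x40000000, 0x40000000 + 0x07FFFFFC), "+128MB-4 reaches");
static_assert(!bl_reaches(0x40000000, 0x40000000 + 0x08000000), "+128MB does not");
static_assert(bl_reaches(0x40000000, 0x40000000 - 0x08000000), "-128MB reaches");

// A call to self stays a call to self at any address.
static_assert(bl_target(0x10000, kCallToSelf) == 0x10000, "original location");
static_assert(bl_target(0x50000, kCallToSelf) == 0x50000, "after an unrelocated copy");
```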

All of this seems very fragile and I am wondering about the value of trampolines. The alternative to using trampolines would be to always generate

  adrp Xn, target & ~0xfff
  add  Xn, Xn, target & 0xfff
  blr  Xn
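
For reference, the arithmetic behind that sequence can be sketched as below. This is a standalone illustration, not assembler code from the port; AdrpAdd, split_target, materialize and adrp_reaches are hypothetical names. ADRP produces the 4KB page of the target relative to the page of the PC (a signed 21-bit page count, so +/-4GB), and the ADD supplies the low 12 bits.

```cpp
#include <cstdint>

// Hypothetical types and helpers for illustration only.
struct AdrpAdd {
  int64_t  page_delta;  // ADRP immediate: target page minus PC page
  uint32_t lo12;        // ADD immediate: target & 0xfff
};

// Split `target` into the two immediates for an ADRP/ADD pair at `pc`.
constexpr AdrpAdd split_target(uint64_t pc, uint64_t target) {
  return { static_cast<int64_t>((target >> 12) - (pc >> 12)),
           static_cast<uint32_t>(target & 0xfff) };
}

// ADRP takes a signed 21-bit page count.
constexpr bool adrp_reaches(int64_t page_delta) {
  return page_delta >= -(int64_t{1} << 20) && page_delta < (int64_t{1} << 20);
}

// What the register holds after ADRP followed by ADD, executed at `pc`.
constexpr uint64_t materialize(uint64_t pc, AdrpAdd imm) {
  return (pc & ~uint64_t{0xfff})
       + (static_cast<uint64_t>(imm.page_delta) << 12)
       + imm.lo12;
}

constexpr uint64_t kPc     = 0x7f0000001234;
constexpr uint64_t kTarget = 0x7f0012345678;

static_assert(split_target(kPc, kTarget).page_delta == 0x12344, "page distance");
static_assert(split_target(kPc, kTarget).lo12 == 0x678, "offset within page");
static_assert(adrp_reaches(split_target(kPc, kTarget).page_delta), "within +/-4GB");
static_assert(materialize(kPc, split_target(kPc, kTarget)) == kTarget, "round trip");
```

Like BL, the ADRP immediate is PC-relative, so this sequence still needs relocating when the code moves.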

On most modern out-of-order, dual-issue implementations, the ADRP and ADD will be folded into a single micro-op, which will then be dual-issued with the BLR, so it doesn't end up costing us anything.

I did some experiments on 2 different implementations comparing the following 3 code fragments (where 'tramp_dest' is the final destination to be called).

1) Straight BL

tramp_test:
        mov     x2, x30
tramp1: 
        bl      tramp_dest
        subs    x0, x0, #1
        bne     tramp1
        ret     x2

2) Straight ADRP/ADD

tramp_test:
        mov     x2, x30
tramp1: 
        adr     x3, tramp_dest
        add     x3, x3, #0x0
        blr     x3
        subs    x0, x0, #1
        bne     tramp1
        ret     x2

3) Trampoline

tramp_test:
        mov     x2, x30
tramp1: 
        bl      tramp
        subs    x0, x0, #1
        bne     tramp1
        ret     x2

tramp:  
        ldr     x1, tramp_adcon
        br      x1
tramp_adcon:
        .dword  tramp_dest

I ran the above tests on 2 different implementations for 1E9 iterations each. The results were:

Imp 1: Straight BL = 4.50157 sec, ADRP/ADD = 4.50157 sec, trampoline = 6.00209 sec
Imp 2: Straight BL = 3.00107 sec, ADRP/ADD = 3.00106 sec, trampoline = 4.16815 sec
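
Put in per-iteration terms, the trampoline loop is roughly 1.33x the straight BL loop on Imp 1 (about 1.5 ns extra per iteration) and roughly 1.39x on Imp 2 (about 1.17 ns extra). The sketch below only reproduces that arithmetic from the figures above; the function and constant names are made up, and the times cover the whole loop body, not the call alone.

```cpp
// Arithmetic on the measured times above; names are hypothetical.
constexpr double kIterations = 1e9;

// Nanoseconds per loop iteration for a run that took `seconds`.
constexpr double ns_per_iteration(double seconds) {
  return seconds * 1e9 / kIterations;
}

// Run time of a variant relative to a baseline.
constexpr double relative_cost(double variant_s, double baseline_s) {
  return variant_s / baseline_s;
}

// Measured times in seconds, copied from the results above.
constexpr double kImp1Bl = 4.50157, kImp1AdrpAdd = 4.50157, kImp1Tramp = 6.00209;
constexpr double kImp2Bl = 3.00107, kImp2AdrpAdd = 3.00106, kImp2Tramp = 4.16815;

static_assert(relative_cost(kImp1Tramp, kImp1Bl) > 1.33 &&
              relative_cost(kImp1Tramp, kImp1Bl) < 1.34, "Imp 1: about 1.33x");
static_assert(relative_cost(kImp2Tramp, kImp2Bl) > 1.38 &&
              relative_cost(kImp2Tramp, kImp2Bl) < 1.39, "Imp 2: about 1.39x");
static_assert(relative_cost(kImp1AdrpAdd, kImp1Bl) < 1.001, "ADRP/ADD matches BL");
static_assert(relative_cost(kImp2AdrpAdd, kImp2Bl) < 1.001, "ADRP/ADD matches BL");
```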

Maybe we could just get rid of trampolines?

All the best,
Ed.