ARM: Reduce size of safepoint code, etc.

Wed May 30 04:31:25 PDT 2012

On 29/05/12 16:31, Andrew Haley wrote:
> A few miscellaneous improvements.  These lead to reduced code size and
> less memory traffic in the normal (i.e. non-safepoint) case.
> 
> The result is like this:
> 
>     0 : 10 24          bipush
> 0x4085ec58:	movs	r3, #36	; 0x24
>     2 : ac             ireturn
> 0x4085ec5a:	movw	r0, #0
> 0x4085ec5e:	movt	r0, #16670	; 0x411e
> 0x4085ec62:	ldr	r0, [r0, #8]
> 0x4085ec64:	b.n	0x4085ec78
> 0x4085ec66:			; <UNDEFINED> instruction: 0xdead
> 0x4085ec68:	str.w	r3, [r4, #-4]!
> 0x4085ec6c:	movs	r1, #50	; 0x32
> 0x4085ec6e:	adds	r2, r4, #4
> 0x4085ec70:	bl	0x4072f782
> 0x4085ec74:	ldr.w	r3, [r4], #4
> 0x4085ec78:
> 
> Note that r3 is only pushed to memory if we take a safepoint.

Nice.

The only thing I found mildly (very mildly) out of kilter was your
definition of the SAVE_STACK and RESTORE_STACK macros. SAVE_STACK caches
the current register set locally but expects an immediately succeeding
call to Thumb2_Flush to generate the stack push instruction(s)
(basically Thumb2_Flush just calls Thumb2_Push_Multiple). RESTORE_STACK
explicitly calls Thumb2_Pop_Multiple to generate the stack pop(s) then
repopulates the register set from the local cache. This asymmetry
slightly obscured what the macros were doing until I reconciled the
definitions with their point of use. Bundling the call to
Thumb2_Push_Multiple into the definition of SAVE_STACK might be clearer.
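
A minimal sketch of the bundled form, to make the suggestion concrete.
Everything here is a stand-in: RegSet, push_multiple, pop_multiple and
the textual "code buffer" are hypothetical and only model the shape of
the real macros, not their actual definitions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the registers currently caching Java stack slots.
struct RegSet {
    std::vector<int> regs;  // ARM register numbers
};

// Records emitted "instructions" as text in place of a real code buffer.
static std::vector<std::string> code;

// Stand-in for Thumb2_Push_Multiple: one push per cached register.
static void push_multiple(const std::vector<int>& regs) {
    for (int r : regs) code.push_back("push r" + std::to_string(r));
}

// Stand-in for Thumb2_Pop_Multiple: pops in the reverse order of the pushes.
static void pop_multiple(const std::vector<int>& regs) {
    for (auto it = regs.rbegin(); it != regs.rend(); ++it)
        code.push_back("pop r" + std::to_string(*it));
}

// SAVE_STACK caches the register set AND emits the pushes itself, so it
// mirrors RESTORE_STACK and no separate flush call is needed at the use site.
#define SAVE_STACK(rs)            \
    RegSet saved_regset_ = (rs);  \
    push_multiple((rs).regs);     \
    (rs).regs.clear()

// RESTORE_STACK emits the pops, then repopulates the register set.
#define RESTORE_STACK(rs)              \
    pop_multiple(saved_regset_.regs);  \
    (rs) = saved_regset_

// Example use: the out-of-line part of a safepoint.
static std::vector<std::string> emit_safepoint_slow_path(RegSet& rs) {
    code.clear();
    SAVE_STACK(rs);
    assert(rs.regs.empty());  // nothing is cached in registers across the call
    code.push_back("bl safepoint_handler");
    RESTORE_STACK(rs);
    return code;
}
```

With only r3 cached this yields a push, the call and a pop, the same
shape as the str.w / bl / ldr.w sequence in the listing above. Because
SAVE_STACK declares a local, it can be used only once per scope.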

> Also, the code to tear down the stack frame and place the return
> value is only generated once, and all the returns jump to it.

Also very nice. It might pay to try the same optimisation for lreturn
and dreturn. With 132 combinations of registers there is maybe not a
high chance of reusing a return segment, but it would cost little at
compile time to save and check a compiled_2word_return array to see if
an lreturn/dreturn with the same register pair has already been
generated. The saved code space might help with cache pressure.
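
A rough sketch of the lookup, under some assumptions: 132 is read as
12 * 11 ordered pairs of distinct registers, code addresses are modelled
as plain integers, and return_segment_for is a hypothetical helper, not
an existing function in the JIT.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 12 candidate registers, giving 12 * 11 = 132 ordered pairs.
constexpr int kNumRegs = 12;

// Address of the return segment already generated for (lo, hi), or 0 if
// none. In a real JIT this would have to be cleared for each method compiled.
static uint32_t compiled_2word_return[kNumRegs][kNumRegs];

// Returns the address of the return segment to use for this register pair.
// If none exists yet, records 'here' (where the caller is about to emit one)
// and reports reused == false; otherwise the caller can plant a branch.
static uint32_t return_segment_for(int r_lo, int r_hi, uint32_t here,
                                   bool* reused) {
    assert(r_lo >= 0 && r_lo < kNumRegs);
    assert(r_hi >= 0 && r_hi < kNumRegs);
    assert(r_lo != r_hi);
    uint32_t& slot = compiled_2word_return[r_lo][r_hi];
    *reused = (slot != 0);
    if (!*reused) slot = here;
    return slot;
}
```

The cost per lreturn/dreturn is one array read and possibly one write,
which is the sense in which the check is cheap at compile time.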

I think there is one further special-case optimisation possible, although
it is not critical to implement it. When reusing a generated return
segment for a simple return the code ends up looking like

  load safepoint page address
  read safepoint page
  branch forward
  <
   safepoint code
  >
  branch back to generated return code

where everything up to the last branch is planted by Thumb2_Safepoint
and the backwards branch is planted by Thumb2_Return. If the backwards
branch target were passed to Thumb2_Safepoint in this special case then
the skip round could be avoided. N.B. this does not apply to ireturn etc.,
as they need to Fill and POP before taking the backward branch.
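
One way to picture the change, as a sketch only: emit_safepoint and
emit_simple_return below are hypothetical stand-ins for Thumb2_Safepoint
and Thumb2_Return, instructions are modelled as text, and the layout is
my reading of the description above rather than the generated code.

```cpp
#include <string>
#include <vector>

struct Insn {
    std::string text;
    bool on_fast_path;  // executed when no safepoint is pending
};

// Hypothetical stand-in for Thumb2_Safepoint. With an empty exit_label it
// skips forward over the safepoint code; otherwise both the normal path and
// the safepoint code branch straight to exit_label.
static void emit_safepoint(std::vector<Insn>& code,
                           const std::string& exit_label) {
    code.push_back({"load safepoint page address", true});
    code.push_back({"read safepoint page", true});
    code.push_back({exit_label.empty() ? std::string("b skip")
                                       : "b " + exit_label, true});
    code.push_back({"<safepoint code>", false});
    if (!exit_label.empty()) code.push_back({"b " + exit_label, false});
}

// Hypothetical stand-in for a simple return that reuses existing return code.
static std::vector<Insn> emit_simple_return(bool pass_target) {
    std::vector<Insn> code;
    emit_safepoint(code, pass_target ? "return_code" : "");
    // Without the target, the skip lands here and a second branch is needed.
    if (!pass_target) code.push_back({"b return_code", true});
    return code;
}

// Counts the branches executed when no safepoint is taken.
static int fast_path_branches(const std::vector<Insn>& code) {
    int n = 0;
    for (const Insn& i : code)
        if (i.on_fast_path && i.text.rfind("b ", 0) == 0) ++n;
    return n;
}
```

In this model the code size is unchanged; the gain is one branch fewer on
the normal path.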

I have built with this patch and tested it with a variety of simple
programs employing loops and returns and also with the SpecJVM
compiler.compiler. I also eyeballed the code generated for the test
programs and verified that it is doing the more conservative
save/restore and reusing generated return code where possible. All works
and looks good.

regards,

Andrew Dinn
-----------