[aarch64-port-dev ] AArch64 register usage questions

Mon Mar 20 22:39:43 UTC 2017

Sorry, the mailing list ate my formatting. This table should be readable in a fixed-width font:

---------------------------------------------------------------------------------------------
                                          ORACLE     OPENJDK
                                        (ARM32/64)  (AArch64)
  pd_nof_cpu_regs_frame_map                  33          32    # number of registers used during code emission
  pd_nof_caller_save_cpu_regs_frame_map      27          17    # number of registers killed by calls
  pd_nof_cpu_regs_reg_alloc                  27          17    # number of registers that are visible to register allocator ()
  pd_nof_cpu_regs_linearscan                 33          32    # number of registers visible to linear scan
  pd_nof_cpu_regs_processed_in_linearscan    28           -    # number of registers processed in linear scan; includes LR (in arm port)
  pd_first_cpu_reg                            0           0
  pd_last_cpu_reg                            32          16

  pd_first_callee_saved_reg                   -          17
  pd_last_callee_saved_reg                    -          24
  pd_last_allocatable_cpu_reg                 -          16
  pd_first_byte_reg                           -           0    # unused!
  pd_last_byte_reg                            -          16    # unused, except by unused last_byte_reg().

  pd_nof_fpu_regs_frame_map                  32          32    # number of float registers used during code emission
  pd_nof_caller_save_fpu_regs_frame_map      32          32    # number of float registers killed by calls
  pd_nof_fpu_regs_reg_alloc                  32           8    # number of float registers that are visible to register allocator
  pd_nof_fpu_regs_linearscan                 32          32    # number of float registers visible to linear scan
  pd_first_fpu_reg                           33          32    # = pd_nof_cpu_regs_frame_map
  pd_last_fpu_reg                            64          63

  pd_first_callee_saved_fpu_reg               -          40
  pd_last_callee_saved_fpu_reg                -          47

  pd_nof_xmm_regs_linearscan                  0           0
  pd_nof_caller_save_xmm_regs                 0           -
  pd_first_xmm_reg                           -1           -
  pd_last_xmm_reg                            -1           -
---------------------------------------------------------------------------------------------
I don't expect these values to match, but some items stand out:

- AArch64 has fewer caller-saved registers, but more callee-saved registers defined above.
  - But the AArch64 code has comments like: // FIXME: There are no callee-saved.
  - And C2 does not define any SOE registers.
  - Is C1 using fewer registers than it could? Than it should?
  - And/or is C2?
  - There are other comments, such as the one above generate_call_stub(), that say:
    - // we don't need to save r16-18 because Java does not use them
    - The comment says r16, but that doesn't seem to match pd_first_callee_saved_reg.
    - I don't see where r18 came from.
- pd_first_byte_reg, pd_last_byte_reg and last_byte_reg() seem unused.
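
To make the arithmetic behind the AArch64 column explicit, the relationships between the constants can be written as compile-time checks. This is only a sketch: the enum values are transcribed by hand from the table above rather than copied from the header, and regs_in_range() and kUnclassifiedCpuRegs are hypothetical names introduced for illustration.

```cpp
// Values transcribed by hand from the AArch64 (OpenJDK) column of the table.
// Illustration only; this is not a copy of the real header.
enum {
  pd_nof_cpu_regs_frame_map             = 32,
  pd_nof_caller_save_cpu_regs_frame_map = 17,
  pd_nof_cpu_regs_reg_alloc             = 17,
  pd_first_cpu_reg                      = 0,
  pd_last_cpu_reg                       = 16,
  pd_first_callee_saved_reg             = 17,
  pd_last_callee_saved_reg              = 24,
  pd_nof_fpu_regs_frame_map             = 32,
  pd_first_fpu_reg                      = pd_nof_cpu_regs_frame_map,
  pd_last_fpu_reg                       = 63,
  pd_first_callee_saved_fpu_reg         = 40,
  pd_last_callee_saved_fpu_reg          = 47
};

// Hypothetical helper: number of registers in an inclusive index range.
constexpr int regs_in_range(int first, int last) { return last - first + 1; }

// The allocatable range 0..16 holds exactly the 17 caller-saved registers.
static_assert(regs_in_range(pd_first_cpu_reg, pd_last_cpu_reg) == pd_nof_cpu_regs_reg_alloc,
              "allocatable range should match pd_nof_cpu_regs_reg_alloc");
static_assert(pd_nof_caller_save_cpu_regs_frame_map == pd_nof_cpu_regs_reg_alloc,
              "every allocatable register is listed as caller-saved");

// The callee-saved range starts right after the allocatable range.
static_assert(pd_first_callee_saved_reg == pd_last_cpu_reg + 1,
              "callee-saved range should follow the allocatable range");
static_assert(regs_in_range(pd_first_callee_saved_reg, pd_last_callee_saved_reg) == 8,
              "8 general registers are marked callee-saved");

// FPU numbering continues after the CPU registers.
static_assert(pd_last_fpu_reg == pd_first_fpu_reg + pd_nof_fpu_regs_frame_map - 1,
              "FPU range should cover pd_nof_fpu_regs_frame_map entries");
static_assert(regs_in_range(pd_first_callee_saved_fpu_reg, pd_last_callee_saved_fpu_reg) == 8,
              "8 float registers are marked callee-saved");

// Frame-map entries that fall in neither the caller-saved nor the callee-saved range.
constexpr int kUnclassifiedCpuRegs =
    pd_nof_cpu_regs_frame_map - pd_nof_caller_save_cpu_regs_frame_map -
    regs_in_range(pd_first_callee_saved_reg, pd_last_callee_saved_reg);
```

Read this way, indices 0-16 are allocatable and caller-saved, 17-24 are marked callee-saved, and the remaining entries (25-31) belong to neither range.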

Looking at the register definitions in aarch64.ad and other places:
- AArch64 always allocates r27 for compressed oops (rheapbase). ARM32/64 only allocates the register if CompressedOops is enabled.
  - In one sense the AArch64 approach seems reasonable. I think the default setting for CompressedOops will be true until heap sizes get huge (e.g. somewhere past 256GB), so there may not be much reason to optimize the non-CompressedOops path.
  - If we really wanted to use r27 for compiled code, we could probably allocate r27 for rheapbase only if (Universe::narrow_oop_base() != NULL).
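
The last suggestion can be sketched as a simple predicate. This is only an illustration under stated assumptions: HeapConfig, reserve_r27_for_heapbase() and reserve_r27_always() are hypothetical names, not HotSpot code, and the narrow_oop_base field merely models the result of Universe::narrow_oop_base(), with 0 standing in for NULL.

```cpp
#include <cstdint>

// Hypothetical stand-in for the VM state the real port would consult.
struct HeapConfig {
  bool           use_compressed_oops;  // models the CompressedOops setting
  std::uintptr_t narrow_oop_base;      // models Universe::narrow_oop_base(); 0 == NULL
};

// AArch64 behaviour as described above: r27 is always set aside for rheapbase.
constexpr bool reserve_r27_always(HeapConfig) { return true; }

// Suggested policy: set r27 aside only when compressed oops are enabled
// and there is a non-NULL heap base to keep in the register.
constexpr bool reserve_r27_for_heapbase(HeapConfig cfg) {
  return cfg.use_compressed_oops && cfg.narrow_oop_base != 0;
}
```

Under this policy a configuration with a NULL base would leave r27 free for compiled code. Whether anything else in the port assumes r27 is always reserved is not addressed by the sketch.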

Thanks for any thoughts you might have...

-        Derek

-----Original Message-----
From: aarch64-port-dev [mailto:aarch64-port-dev-bounces at openjdk.java.net] On Behalf Of White, Derek
Sent: Monday, March 20, 2017 5:57 PM
To: aarch64-port-dev at openjdk.java.net
Subject: [aarch64-port-dev ] AArch64 register usage questions

Hi,

I've been looking at the aarch64 port's register usage, comparing it against Oracle's arm port, and have a few questions and observations:

Comparing the enum in c1_Defs_aarch64.hpp vs. the enum in the arm code (and doing some constant folding by hand), you get:

                                        ARM32/64     AArch64   Notes
  pd_nof_cpu_regs_frame_map                  33          32    number of registers used during code emission
  pd_nof_caller_save_cpu_regs_frame_map      27          17    number of registers killed by calls
  pd_nof_cpu_regs_reg_alloc                  27          17    number of registers that are visible to register allocator ()
  pd_nof_cpu_regs_linearscan                 33          32    number of registers visible to linear scan
  pd_nof_cpu_regs_processed_in_linearscan    28           -    number of registers processed in linear scan; includes LR (in arm port)
  pd_first_cpu_reg                            0           0
  pd_last_cpu_reg                            32          16

  pd_first_callee_saved_reg                   -          17
  pd_last_callee_saved_reg                    -          24
  pd_last_allocatable_cpu_reg                 -          16
  pd_first_byte_reg                           -           0    unused!
  pd_last_byte_reg                            -          16    unused, except by unused last_byte_reg().

  pd_nof_fpu_regs_frame_map                  32          32    number of float registers used during code emission
  pd_nof_caller_save_fpu_regs_frame_map      32          32    number of float registers killed by calls
  pd_nof_fpu_regs_reg_alloc                  32           8    number of float registers that are visible to register allocator
  pd_nof_fpu_regs_linearscan                 32          32    number of float registers visible to linear scan
  pd_first_fpu_reg                           33          32    = pd_nof_cpu_regs_frame_map
  pd_last_fpu_reg                            64          63

  pd_first_callee_saved_fpu_reg               -          40
  pd_last_callee_saved_fpu_reg                -          47

  pd_nof_xmm_regs_linearscan                  0           0
  pd_nof_caller_save_xmm_regs                 0           -
  pd_first_xmm_reg                           -1           -
  pd_last_xmm_reg                            -1           -

I don't expect these values to match, but some items stand out:

- AArch64 has fewer caller-saved registers, but more callee-saved registers defined above.
  - But the AArch64 code has comments like: // FIXME: There are no callee-saved.
  - And C2 does not define any SOE registers.
  - Is C1 using fewer registers than it could? Than it should?
  - And/or is C2?
  - There are other comments, such as the one above generate_call_stub(), that say:
    - // we don't need to save r16-18 because Java does not use them
    - The comment says r16, but that doesn't seem to match pd_first_callee_saved_reg.
    - I don't see where r18 came from.
- pd_first_byte_reg, pd_last_byte_reg and last_byte_reg() seem unused.

Looking at the register definitions in aarch64.ad and other places:

- AArch64 always allocates r27 for compressed oops (rheapbase). ARM32/64 only allocates the register if CompressedOops is enabled.
  - In one sense the AArch64 approach seems reasonable. I think the default setting for CompressedOops will be true until heap sizes get huge (e.g. somewhere past 256GB), so there may not be much reason to optimize the non-CompressedOops path.
  - If we really wanted to use r27 for compiled code, we could probably allocate r27 for rheapbase only if (Universe::narrow_oop_base() != NULL).

Thanks for any thoughts you might have...

-        Derek