AArch64 register usage questions
Hi,

I've been looking at the aarch64 port's register usage, and compared against Oracle's arm port, and have a few questions and observations:

Comparing the enum in c1_defs_aarch64.hpp vs. the enum in the arm code (and doing some constant folding by hand), you get:

                                         ARM32/64  AArch64  Notes
pd_nof_cpu_regs_frame_map                      33       32  number of registers used during code emission
pd_nof_caller_save_cpu_regs_frame_map          27       17  number of registers killed by calls
pd_nof_cpu_regs_reg_alloc                      27       17  number of registers that are visible to register allocator ()
pd_nof_cpu_regs_linearscan                     33       32  number of registers visible to linear scan
pd_nof_cpu_regs_processed_in_linearscan        28        -  number of registers processed in linear scan; includes LR (in arm port)
pd_first_cpu_reg                                0        0
pd_last_cpu_reg                                32       16
pd_first_callee_saved_reg                       -       17
pd_last_callee_saved_reg                        -       24
pd_last_allocatable_cpu_reg                     -       16
pd_first_byte_reg                               -        0  unused!
pd_last_byte_reg                                -       16  unused, except by unused last_byte_reg().
pd_nof_fpu_regs_frame_map                      32       32  number of float registers used during code emission
pd_nof_caller_save_fpu_regs_frame_map          32       32  number of float registers killed by calls
pd_nof_fpu_regs_reg_alloc                      32        8  number of float registers that are visible to register allocator
pd_nof_fpu_regs_linearscan                     32       32  number of float registers visible to linear scan
pd_first_fpu_reg                               33       32  = pd_nof_cpu_regs_frame_map
pd_last_fpu_reg                                64       63
pd_first_callee_saved_fpu_reg                   -       40
pd_last_callee_saved_fpu_reg                    -       47
pd_nof_xmm_regs_linearscan                      0        0
pd_nof_caller_save_xmm_regs                     0        -
pd_first_xmm_reg                               -1        -
pd_last_xmm_reg                                -1        -

I don't expect these values to match, but some items stand out:

- AArch64 has fewer caller-saves registers, but more callee-saves registers defined above.
  o But the AArch64 code has comments like: // FIXME: There are no callee-saved.
  o And C2 does not define any SOE registers.
  o Is C1 using fewer registers than it could? Than it should?
  o And/or is C2?
  o There are other comments, such as above generate_call_stub(), that say:
    - // we don't need to save r16-18 because Java does not use them
    - The comment says r16, but doesn't seem to match pd_first_callee_saved_reg.
    - I don't see where r18 came from.
- pd_first_byte_reg, pd_last_byte_reg and last_byte_reg() seem unused.

Looking at the register definitions in aarch64.ad and other places:

- AArch64 always allocates r27 to use for compressed oops (rheapbase). Arm32/64 only allocates the register if CompressedOops is enabled.
  o In one sense the AArch64 approach seems reasonable. I think the default setting for CompressedOops will be true until heap sizes get huge (e.g. somewhere past 256GB), so there may not be much reason to optimize the non-CompressedOops path.
  o If we really wanted to use r27 for compiled code, we could probably only allocate r27 for rheapbase if (Universe::narrow_oop_base() != NULL).

Thanks for any thoughts you might have...

- Derek
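As a cross-check on the hand constant folding, the AArch64 column can be restated as compile-time assertions. This is only a sketch: the enum below copies values from the table rather than including the real HotSpot header, and range_size(), callee_saved_cpu_count() and callee_saved_fpu_count() are hypothetical helpers introduced for illustration.

```cpp
// Values copied by hand from the AArch64 column of the table in this thread.
// Illustrative stand-in only; this is not the real HotSpot header.
enum {
  pd_nof_cpu_regs_frame_map     = 32,
  pd_nof_cpu_regs_reg_alloc     = 17,
  pd_first_cpu_reg              = 0,
  pd_last_cpu_reg               = 16,
  pd_first_callee_saved_reg     = 17,
  pd_last_callee_saved_reg      = 24,
  pd_nof_fpu_regs_frame_map     = 32,
  pd_first_fpu_reg              = 32,
  pd_last_fpu_reg               = 63,
  pd_first_callee_saved_fpu_reg = 40,
  pd_last_callee_saved_fpu_reg  = 47
};

// Hypothetical helper: number of indices in the inclusive range [first, last].
constexpr int range_size(int first, int last) { return last - first + 1; }

// Hypothetical helpers: sizes of the two callee-saved index ranges.
constexpr int callee_saved_cpu_count() {
  return range_size(pd_first_callee_saved_reg, pd_last_callee_saved_reg);
}
constexpr int callee_saved_fpu_count() {
  return range_size(pd_first_callee_saved_fpu_reg, pd_last_callee_saved_fpu_reg);
}

// The allocatable CPU range 0..16 matches the reg_alloc count of 17.
static_assert(range_size(pd_first_cpu_reg, pd_last_cpu_reg) ==
                  pd_nof_cpu_regs_reg_alloc,
              "cpu range disagrees with reg_alloc count");

// The callee-saved CPU range starts right after the last allocatable index.
static_assert(pd_first_callee_saved_reg == pd_last_cpu_reg + 1,
              "callee-saved range is not adjacent to allocatable range");

// FPU indices follow the CPU indices and span the whole FPU frame map.
static_assert(pd_first_fpu_reg == pd_nof_cpu_regs_frame_map,
              "fpu indices do not start after cpu indices");
static_assert(range_size(pd_first_fpu_reg, pd_last_fpu_reg) ==
                  pd_nof_fpu_regs_frame_map,
              "fpu range disagrees with frame map count");
```

If the file compiles, the AArch64 numbers in the table are at least internally consistent; it says nothing about whether the callee-saved ranges are actually used.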
Sorry, mailing list ate my formatting. This table should be readable in a fixed-width font:

---------------------------------------------------------------------------------------------
                                           ORACLE  OPENJDK
pd_nof_cpu_regs_frame_map                      33       32  # number of registers used during code emission
pd_nof_caller_save_cpu_regs_frame_map          27       17  # number of registers killed by calls
pd_nof_cpu_regs_reg_alloc                      27       17  # number of registers that are visible to register allocator ()
pd_nof_cpu_regs_linearscan                     33       32  # number of registers visible to linear scan
pd_nof_cpu_regs_processed_in_linearscan        28        -  # number of registers processed in linear scan; includes LR (in arm port)
pd_first_cpu_reg                                0        0
pd_last_cpu_reg                                32       16
pd_first_callee_saved_reg                       -       17
pd_last_callee_saved_reg                        -       24
pd_last_allocatable_cpu_reg                     -       16
pd_first_byte_reg                               -        0  # unused!
pd_last_byte_reg                                -       16  # unused, except by unused last_byte_reg().
pd_nof_fpu_regs_frame_map                      32       32  # number of float registers used during code emission
pd_nof_caller_save_fpu_regs_frame_map          32       32  # number of float registers killed by calls
pd_nof_fpu_regs_reg_alloc                      32        8  # number of float registers that are visible to register allocator
pd_nof_fpu_regs_linearscan                     32       32  # number of float registers visible to linear scan
pd_first_fpu_reg                               33       32  = pd_nof_cpu_regs_frame_map
pd_last_fpu_reg                                64       63
pd_first_callee_saved_fpu_reg                   -       40
pd_last_callee_saved_fpu_reg                    -       47
pd_nof_xmm_regs_linearscan                      0        0
pd_nof_caller_save_xmm_regs                     0        -
pd_first_xmm_reg                               -1        -
pd_last_xmm_reg                                -1        -
---------------------------------------------------------------------------------------------

I don't expect these values to match, but some items stand out:

- AArch64 has fewer caller-saves registers, but more callee-saves registers defined above.
  - But the AArch64 code has comments like: // FIXME: There are no callee-saved.
  - And C2 does not define any SOE registers.
  - Is C1 using fewer registers than it could? Than it should?
  - And/or is C2?
  - There are other comments, such as above generate_call_stub(), that say:
    - // we don't need to save r16-18 because Java does not use them
    - The comment says r16, but doesn't seem to match pd_first_callee_saved_reg.
    - I don't see where r18 came from.
- pd_first_byte_reg, pd_last_byte_reg and last_byte_reg() seem unused.

Looking at the register definitions in aarch64.ad and other places:

- AArch64 always allocates r27 to use for compressed oops (rheapbase). Arm32/64 only allocates the register if CompressedOops is enabled.
  - In one sense the AArch64 approach seems reasonable. I think the default setting for CompressedOops will be true until heap sizes get huge (e.g. somewhere past 256GB), so there may not be much reason to optimize the non-CompressedOops path.
  - If we really wanted to use r27 for compiled code, we could probably only allocate r27 for rheapbase if (Universe::narrow_oop_base() != NULL).

Thanks for any thoughts you might have...

- Derek

-----Original Message-----
From: aarch64-port-dev [mailto:aarch64-port-dev-bounces@openjdk.java.net] On Behalf Of White, Derek
Sent: Monday, March 20, 2017 5:57 PM
To: aarch64-port-dev@openjdk.java.net
Subject: [aarch64-port-dev ] AArch64 register usage questions
On 20/03/17 21:56, White, Derek wrote:
> I've been looking at the aarch64 port's register usage, and compared against Oracle's arm port, and have a few questions and observations:
>
> Comparing the enum in c1_defs_aarch64.hpp vs. the enum in the arm code (and doing some constant folding by hand), you get:
>
> I don't expect these values to match, but some items stand out:
>
> - AArch64 has fewer caller-saves registers, but more callee-saves registers defined above.

You have to distinguish between the Java calling convention and the C calling convention. Java saves everything; C only saves a subset. There's no point saving registers that are not callee-clobbered.

>   o But the AArch64 code has comments like: // FIXME: There are no callee-saved.

That's right: there are none in the Java convention. Everything (except FP) gets clobbered on a call in the Java calling convention.

>   o And C2 does not define any SOE registers.
>   o Is C1 using fewer registers than it could? Than it should?
>   o And/or is C2?

Just one: the frame pointer. It might also be that C1 doesn't use the compressed OOPs base, so we could use that. We haven't done much work optimizing C1 because it is C1: there is very little reward.

>   o There are other comments, such as above generate_call_stub(), that say:
>     - // we don't need to save r16-18 because Java does not use them

That's not a correct comment. We don't need to save r16-18 because the APCS doesn't need us to.

>     - The comment says r16, but doesn't seem to match pd_first_callee_saved_reg.

r16 is not callee-saved, as inspection of the APCS will show.

>     - I don't see where r18 came from.
> - pd_first_byte_reg, pd_last_byte_reg and last_byte_reg() seem unused.
>
> Looking at the register definitions in aarch64.ad and other places:
>
> - AArch64 always allocates r27 to use for compressed oops (rheapbase). Arm32/64 only allocates the register if CompressedOops is enabled.
>   o In one sense the AArch64 approach seems reasonable. I think the default setting for CompressedOops will be true until heap sizes get huge (e.g. somewhere past 256GB), so there may not be much reason to optimize the non-CompressedOops path.
>   o If we really wanted to use r27 for compiled code, we could probably only allocate r27 for rheapbase if (Universe::narrow_oop_base() != NULL).

That's correct. We should do that. We should also use FP: this would be advantageous because it's the only callee-saved register in the Java calling convention.

Andrew.
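The r27 suggestion can be modelled in a few lines. This is a minimal sketch of the idea only, not HotSpot code: kHeapBaseReg, heap_base_reg_reserved() and allocatable_regs() are hypothetical names, and it assumes that a zero narrow-oop base means no base register is needed (which would also cover CompressedOops being off).

```cpp
#include <cstdint>
#include <vector>

// Illustrative model only; none of these names are HotSpot APIs.
constexpr int kHeapBaseReg = 27;  // r27 (rheapbase), as named in the thread

// Assumption: a zero base means decoding a compressed oop needs no base
// register, so r27 does not have to be kept aside.
bool heap_base_reg_reserved(std::uintptr_t narrow_oop_base) {
  return narrow_oop_base != 0;
}

// Return the registers from `candidates` that the allocator may hand out.
std::vector<int> allocatable_regs(const std::vector<int>& candidates,
                                  std::uintptr_t narrow_oop_base) {
  std::vector<int> out;
  for (int r : candidates) {
    if (r == kHeapBaseReg && heap_base_reg_reserved(narrow_oop_base)) {
      continue;  // r27 must keep holding the heap base
    }
    out.push_back(r);
  }
  return out;
}
```

With a zero base, allocatable_regs() passes r27 through like any other candidate; with a non-zero base it filters r27 out. A real change would also have to make sure nothing else in the port assumes r27 always holds the heap base.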
participants (2): Andrew Haley; White, Derek