RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper
Claes Redestad
claes.redestad at oracle.com
Wed Mar 11 13:20:21 UTC 2020
New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/
Rather than *_slow I went with *_uncached.
Reworked initialization, and discovered a bug in open.00:
vzeroupper is speculatively emitted in the VM_Version stub with the CPU
feature flag explicitly set. This meant we pre-computed the vzeroupper
as always enabled, before CPU capabilities had been determined. This
caused an intermittent test issue.
I now expose an uncached version of vzeroupper that the VM_Version stub
can use, then trigger the pre-compute right after running the CPU
feature checks.
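
To illustrate the ordering issue, here is a minimal standalone sketch. It is not the webrev code: CpuFeatures, Emitter, StartupResult and simulate_startup are hypothetical stand-ins, and only the names vzeroupper, vzeroupper_uncached and precompute_instructions come from this thread. The detection stub uses the uncached path with the flag forced on, and the cache is filled only once the real feature set is known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the VM's CPU feature state.
struct CpuFeatures {
  bool detected = false;      // true once the feature checks have run
  bool supports_avx = false;  // gates vzeroupper in this sketch
};

// Hypothetical stand-in for the assembler; emits raw bytes into a vector.
class Emitter {
 public:
  explicit Emitter(const CpuFeatures& cpu) : cpu_(cpu) {}

  // Uncached path: consults the feature flag on every call.
  void vzeroupper_uncached(std::vector<std::uint8_t>& out) const {
    static const std::uint8_t kVzeroupper[] = {0xC5, 0xF8, 0x77};
    if (cpu_.supports_avx) {
      out.insert(out.end(), std::begin(kVzeroupper), std::end(kVzeroupper));
    }
  }

  // Fills the cache; only valid after feature detection has finished.
  void precompute_instructions() {
    assert(cpu_.detected && "CPU features must be known before precomputing");
    cached_.clear();
    vzeroupper_uncached(cached_);
    precomputed_ = true;
  }

  // Cached path: copies the bytes computed by precompute_instructions().
  void vzeroupper(std::vector<std::uint8_t>& out) const {
    assert(precomputed_ && "precompute_instructions() has not run yet");
    out.insert(out.end(), cached_.begin(), cached_.end());
  }

 private:
  const CpuFeatures& cpu_;
  std::vector<std::uint8_t> cached_;
  bool precomputed_ = false;
};

struct StartupResult {
  std::size_t stub_bytes;    // bytes emitted into the detection stub
  std::size_t cached_bytes;  // bytes emitted through the cached path
};

StartupResult simulate_startup(bool cpu_has_avx) {
  CpuFeatures cpu;
  Emitter emitter(cpu);

  // Detection stub: flag forced on, uncached path, cache left untouched.
  std::vector<std::uint8_t> stub;
  cpu.supports_avx = true;
  emitter.vzeroupper_uncached(stub);

  // Feature checks have run; record the real capability, then precompute.
  cpu.supports_avx = cpu_has_avx;
  cpu.detected = true;
  emitter.precompute_instructions();

  std::vector<std::uint8_t> code;
  emitter.vzeroupper(code);
  return {stub.size(), code.size()};
}
```

With this ordering, simulate_startup(false) emits the three-byte vzeroupper into the stub but nothing through the cached path. Precomputing while the flag was still forced on would have cached the instruction for a CPU that lacks it, which is the open.00 bug.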
Testing: re-running tier1-3, verified 32-bit builds locally
/Claes
On 2020-03-10 20:19, Claes Redestad wrote:
> Hi Ioi,
>
> good suggestions. I will rework this tomorrow, along with Vladimir's
> suggestion to add an explicit call to precompute_instructions from
> the stubGenerator.
>
> Thanks!
>
> /Claes
>
> On 2020-03-10 19:40, Ioi Lam wrote:
>> Hi Claes,
>>
>> This is a really good optimization! Big bang for small bucks!
>>
>> I have a suggestion on coding style:
>>
>> Rename Assembler::popa to Assembler::popa_slow();
>>
>> void Assembler::popa() { // 64bit
>>   if (!precomputed) {
>>     precompute_instructions();
>>   }
>>   copy_precomputed_instructions(popa_code, popa_len);
>> }
>>
>> static void precompute_instructions() {
>>   ...
>>   MacroAssembler masm(&buffer);
>>
>>   address begin_popa = masm.code_section()->end();
>>   masm.popa_slow();
>>   address end_popa = masm.code_section()->end();
>>   ...
>> }
>>
>> ----
>>
>> Also, maybe you can add this assert after generating the code for all
>> 3 macros:
>>
>> assert(masm.code()->total_relocation_size() == 0 &&
>>        masm.code()->total_oop_size() == 0 &&
>>        masm.code()->total_metadata_size() == 0,
>>        "precomputed code cannot have any of these");
>>
>>
>> Thanks!
>> - Ioi
>>
>>
>>
>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> calculate some invariant Assembler routines at bootstrap, copy on
>>> subsequent invocations.
>>>
>>> For popa and pusha this means an overhead reduction of around 98% (from
>>> ~2500 instructions to emit a pusha to ~50). For vzeroupper an overhead
>>> reduction of ~65% (117 -> 42). Together these add up to about a 1%
>>> reduction of instructions executed on a Hello World - with some
>>> (smaller) scaling impact on larger applications.
>>>
>>> The initialization is very simple/naive, i.e., lacks any kind of
>>> synchronization protocol. But as this setup is guaranteed to happen very
>>> early during bootstrap this should be fine. Thanks Ioi for some helpful
>>> suggestions here!
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240772
>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>
>>> Testing: tier1-3
>>>
>>> Thanks!
>>>
>>> /Claes
>>