RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper

Wed Mar 11 13:20:21 UTC 2020

New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/

Rather than *_slow I went with *_uncached.

Reworked initialization, and discovered a bug in open.00:

vzeroupper is speculatively emitted in the VM_Version stub with the CPU
feature flag explicitly set. This meant we pre-computed the vzeroupper
as always enabled, before CPU capabilities had been determined. This
caused an intermittent test issue.

I now expose an uncached version of vzeroupper that the VM_Version stub
can use, then trigger the pre-compute right after running the CPU
feature checks.

Testing: re-running tier1-3, verified 32-bit builds locally

/Claes

On 2020-03-10 20:19, Claes Redestad wrote:
> Hi Ioi,
> 
> good suggestions. I will rework this tomorrow, along with Vladimir's
> suggestion to add an explicit call to precompute_instructions from
> the stubGenerator.
> 
> Thanks!
> 
> /Claes
> 
> On 2020-03-10 19:40, Ioi Lam wrote:
>> Hi Claes,
>>
>> This is a really good optimization! Small bang for big bucks!
>>
>> I have a suggestion code coding style:
>>
>> Rename Assembler::popa to Assembler::popa_slow();
>>
>> void Assembler::popa() { // 64bit
>>    if (!precomputed) {
>>      precompute_instructions();
>>    }
>>    copy_precomputed_instructions(popa_code, popa_len);
>> }
>>
>> static void precompute_instructions() {
>>    ...
>>    MacroAssembler masm(&buffer);
>>
>>    address begin_popa  = masm.code_section()->end();
>>    masm.popa_slow();
>>    address end_popa    = masm.code_section()->end();
>>    ...
>> }
>>
>> ----
>>
>> Also, maybe you can add this assert after generating the code for all 
>> 3 macros:
>>
>>    assert(masm->code()->total_relocation_size() == 0 &&
>> masm->code()->total_oop_size() == 0 &&
>> masm->code()->total_metadata_size() == 0,
>>           "precomputed code cannot have any of these");
>>
>>
>> Thanks!
>> - Ioi
>>
>>
>>
>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> calculate some invariant Assembler routines at bootstrap, copy on
>>> subsequent invocations.
>>>
>>> For popa and pusha this means an overhead reduction of around 98% (from
>>> ~2500 instructions to emit a pusha to ~50). For vzeroupper an overhead
>>> reduction of ~65% (117 -> 42). Together these add up to about a 1%
>>> reduction of instructions executed on a Hello World - with some
>>> (smaller) scaling impact on larger applications.
>>>
>>> The initialization is very simple/naive, i.e., lacks any kind of 
>>> synchronization protocol. But as this setup is guaranteed to happen very
>>> early during bootstrap this should be fine. Thanks Ioi for some helpful
>>> suggestions here!
>>>
>>> Bug:    https://bugs.openjdk.java.net/browse/JDK-8240772
>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>
>>> Testing: tier1-3
>>>
>>> Thanks!
>>>
>>> /Claes
>>