RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper

Wed Mar 11 18:29:07 UTC 2020

Yes, I also thought about factoring this code into one function. The only drawback is yuo would need call stack to see 
which instruction failed. But I doubt we will see such failure ;)

That leads to question. Claes, what was the reason you used loop instead of memcpy for vzeroupper?

The logic of changes is good.

Thanks,
Vladimir

On 3/11/20 11:04 AM, Ioi Lam wrote:
> Hi Claes,
> 
> Looks good. I would suggest moving these into an inline function to avoid repetition.
> 
>    assert(pusha_code != NULL, "must be pregenerated");
>    assert(code_section()->limit() - code_section()->end() > pusha_len, "code buffer not large enough");
>    address end = code_section()->end();
>    memcpy((char*)end, pusha_code, pusha_len);
>    code_section()->set_end(end + pusha_len);
> 
> 
> Thanks
> - Ioi
> 
> On 3/11/20 6:20 AM, Claes Redestad wrote:
>> New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/
>>
>> Rather than *_slow I went with *_uncached.
>>
>> Reworked initialization, and discovered a bug in open.00:
>>
>> vzeroupper is speculatively emitted in the VM_Version stub with the CPU
>> feature flag explicitly set. This meant we pre-computed the vzeroupper
>> as always enabled, before CPU capabilities had been determined. This
>> caused an intermittent test issue.
>>
>> I now expose an uncached version of vzeroupper that the VM_Version stub
>> can use, then trigger the pre-compute right after running the CPU
>> feature checks.
>>
>> Testing: re-running tier1-3, verified 32-bit builds locally
>>
>> /Claes
>>
>> On 2020-03-10 20:19, Claes Redestad wrote:
>>> Hi Ioi,
>>>
>>> good suggestions. I will rework this tomorrow, along with Vladimir's
>>> suggestion to add an explicit call to precompute_instructions from
>>> the stubGenerator.
>>>
>>> Thanks!
>>>
>>> /Claes
>>>
>>> On 2020-03-10 19:40, Ioi Lam wrote:
>>>> Hi Claes,
>>>>
>>>> This is a really good optimization! Small bang for big bucks!
>>>>
>>>> I have a suggestion code coding style:
>>>>
>>>> Rename Assembler::popa to Assembler::popa_slow();
>>>>
>>>> void Assembler::popa() { // 64bit
>>>>    if (!precomputed) {
>>>>      precompute_instructions();
>>>>    }
>>>>    copy_precomputed_instructions(popa_code, popa_len);
>>>> }
>>>>
>>>> static void precompute_instructions() {
>>>>    ...
>>>>    MacroAssembler masm(&buffer);
>>>>
>>>>    address begin_popa  = masm.code_section()->end();
>>>>    masm.popa_slow();
>>>>    address end_popa    = masm.code_section()->end();
>>>>    ...
>>>> }
>>>>
>>>> ----
>>>>
>>>> Also, maybe you can add this assert after generating the code for all 3 macros:
>>>>
>>>>    assert(masm->code()->total_relocation_size() == 0 &&
>>>> masm->code()->total_oop_size() == 0 &&
>>>> masm->code()->total_metadata_size() == 0,
>>>>           "precomputed code cannot have any of these");
>>>>
>>>>
>>>> Thanks!
>>>> - Ioi
>>>>
>>>>
>>>>
>>>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>>>> Hi,
>>>>>
>>>>> calculate some invariant Assembler routines at bootstrap, copy on
>>>>> subsequent invocations.
>>>>>
>>>>> For popa and pusha this means an overhead reduction of around 98% (from
>>>>> ~2500 instructions to emit a pusha to ~50). For vzeroupper an overhead
>>>>> reduction of ~65% (117 -> 42). Together these add up to about a 1%
>>>>> reduction of instructions executed on a Hello World - with some
>>>>> (smaller) scaling impact on larger applications.
>>>>>
>>>>> The initialization is very simple/naive, i.e., lacks any kind of synchronization protocol. But as this setup is 
>>>>> guaranteed to happen very
>>>>> early during bootstrap this should be fine. Thanks Ioi for some helpful
>>>>> suggestions here!
>>>>>
>>>>> Bug:    https://bugs.openjdk.java.net/browse/JDK-8240772
>>>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>>>
>>>>> Testing: tier1-3
>>>>>
>>>>> Thanks!
>>>>>
>>>>> /Claes
>>>>
>