RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper

Ioi Lam ioi.lam at oracle.com
Wed Mar 11 18:04:53 UTC 2020


Hi Claes,

Looks good. I would suggest moving these into an inline function to 
avoid repetition.

   assert(pusha_code != NULL, "must be pregenerated");
   assert(code_section()->limit() - code_section()->end() > pusha_len,
          "code buffer not large enough");
   address end = code_section()->end();
   memcpy((char*)end, pusha_code, pusha_len);
   code_section()->set_end(end + pusha_len);


Thanks
- Ioi

On 3/11/20 6:20 AM, Claes Redestad wrote:
> New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/
>
> Rather than *_slow I went with *_uncached.
>
> Reworked initialization, and discovered a bug in open.00:
>
> vzeroupper is speculatively emitted in the VM_Version stub with the CPU
> feature flag explicitly set. This meant we pre-computed the vzeroupper
> as always enabled, before CPU capabilities had been determined. This
> caused an intermittent test issue.
>
> I now expose an uncached version of vzeroupper that the VM_Version stub
> can use, then trigger the pre-compute right after running the CPU
> feature checks.
>
> Testing: re-running tier1-3, verified 32-bit builds locally
>
> /Claes
>
> On 2020-03-10 20:19, Claes Redestad wrote:
>> Hi Ioi,
>>
>> good suggestions. I will rework this tomorrow, along with Vladimir's
>> suggestion to add an explicit call to precompute_instructions from
>> the stubGenerator.
>>
>> Thanks!
>>
>> /Claes
>>
>> On 2020-03-10 19:40, Ioi Lam wrote:
>>> Hi Claes,
>>>
>>> This is a really good optimization! Big bang for small bucks!
>>>
>>> I have a suggestion on coding style:
>>>
>>> Rename Assembler::popa() to Assembler::popa_slow():
>>>
>>> void Assembler::popa() { // 64bit
>>>    if (!precomputed) {
>>>      precompute_instructions();
>>>    }
>>>    copy_precomputed_instructions(popa_code, popa_len);
>>> }
>>>
>>> static void precompute_instructions() {
>>>    ...
>>>    MacroAssembler masm(&buffer);
>>>
>>>    address begin_popa  = masm.code_section()->end();
>>>    masm.popa_slow();
>>>    address end_popa    = masm.code_section()->end();
>>>    ...
>>> }
>>>
>>> ----
>>>
>>> Also, maybe you can add this assert after generating the code for 
>>> all 3 macros:
>>>
>>>    assert(masm->code()->total_relocation_size() == 0 &&
>>>           masm->code()->total_oop_size() == 0 &&
>>>           masm->code()->total_metadata_size() == 0,
>>>           "precomputed code cannot have any of these");
>>>
>>>
>>> Thanks!
>>> - Ioi
>>>
>>>
>>>
>>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>>> Hi,
>>>>
>>>> this change pre-generates some invariant Assembler routines at
>>>> bootstrap and copies the cached bytes on subsequent invocations.
>>>>
>>>> For popa and pusha this means an overhead reduction of around 98%
>>>> (from ~2500 instructions to emit a pusha to ~50). For vzeroupper, the
>>>> overhead reduction is ~65% (117 -> 42). Together these add up to about
>>>> a 1% reduction of instructions executed on a Hello World - with some
>>>> (smaller) scaling impact on larger applications.
>>>>
>>>> The initialization is very simple/naive, i.e., it lacks any kind of
>>>> synchronization protocol. But as this setup is guaranteed to happen
>>>> very early during bootstrap, this should be fine. Thanks Ioi for some
>>>> helpful suggestions here!
>>>>
>>>> Bug:    https://bugs.openjdk.java.net/browse/JDK-8240772
>>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>>
>>>> Testing: tier1-3
>>>>
>>>> Thanks!
>>>>
>>>> /Claes
>>>



More information about the hotspot-compiler-dev mailing list