RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper
Vladimir Kozlov
vladimir.kozlov at oracle.com
Wed Mar 11 18:29:07 UTC 2020
Yes, I also thought about factoring this code into one function. The only drawback is yuo would need call stack to see
which instruction failed. But I doubt we will see such failure ;)
That leads to question. Claes, what was the reason you used loop instead of memcpy for vzeroupper?
The logic of changes is good.
Thanks,
Vladimir
On 3/11/20 11:04 AM, Ioi Lam wrote:
> Hi Claes,
>
> Looks good. I would suggest moving these into an inline function to avoid repetition.
>
> assert(pusha_code != NULL, "must be pregenerated");
> assert(code_section()->limit() - code_section()->end() > pusha_len, "code buffer not large enough");
> address end = code_section()->end();
> memcpy((char*)end, pusha_code, pusha_len);
> code_section()->set_end(end + pusha_len);
>
>
> Thanks
> - Ioi
>
> On 3/11/20 6:20 AM, Claes Redestad wrote:
>> New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/
>>
>> Rather than *_slow I went with *_uncached.
>>
>> Reworked initialization, and discovered a bug in open.00:
>>
>> vzeroupper is speculatively emitted in the VM_Version stub with the CPU
>> feature flag explicitly set. This meant we pre-computed the vzeroupper
>> as always enabled, before CPU capabilities had been determined. This
>> caused an intermittent test issue.
>>
>> I now expose an uncached version of vzeroupper that the VM_Version stub
>> can use, then trigger the pre-compute right after running the CPU
>> feature checks.
>>
>> Testing: re-running tier1-3, verified 32-bit builds locally
>>
>> /Claes
>>
>> On 2020-03-10 20:19, Claes Redestad wrote:
>>> Hi Ioi,
>>>
>>> good suggestions. I will rework this tomorrow, along with Vladimir's
>>> suggestion to add an explicit call to precompute_instructions from
>>> the stubGenerator.
>>>
>>> Thanks!
>>>
>>> /Claes
>>>
>>> On 2020-03-10 19:40, Ioi Lam wrote:
>>>> Hi Claes,
>>>>
>>>> This is a really good optimization! Small bang for big bucks!
>>>>
>>>> I have a suggestion code coding style:
>>>>
>>>> Rename Assembler::popa to Assembler::popa_slow();
>>>>
>>>> void Assembler::popa() { // 64bit
>>>> if (!precomputed) {
>>>> precompute_instructions();
>>>> }
>>>> copy_precomputed_instructions(popa_code, popa_len);
>>>> }
>>>>
>>>> static void precompute_instructions() {
>>>> ...
>>>> MacroAssembler masm(&buffer);
>>>>
>>>> address begin_popa = masm.code_section()->end();
>>>> masm.popa_slow();
>>>> address end_popa = masm.code_section()->end();
>>>> ...
>>>> }
>>>>
>>>> ----
>>>>
>>>> Also, maybe you can add this assert after generating the code for all 3 macros:
>>>>
>>>> assert(masm->code()->total_relocation_size() == 0 &&
>>>> masm->code()->total_oop_size() == 0 &&
>>>> masm->code()->total_metadata_size() == 0,
>>>> "precomputed code cannot have any of these");
>>>>
>>>>
>>>> Thanks!
>>>> - Ioi
>>>>
>>>>
>>>>
>>>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>>>> Hi,
>>>>>
>>>>> calculate some invariant Assembler routines at bootstrap, copy on
>>>>> subsequent invocations.
>>>>>
>>>>> For popa and pusha this means an overhead reduction of around 98% (from
>>>>> ~2500 instructions to emit a pusha to ~50). For vzeroupper an overhead
>>>>> reduction of ~65% (117 -> 42). Together these add up to about a 1%
>>>>> reduction of instructions executed on a Hello World - with some
>>>>> (smaller) scaling impact on larger applications.
>>>>>
>>>>> The initialization is very simple/naive, i.e., lacks any kind of synchronization protocol. But as this setup is
>>>>> guaranteed to happen very
>>>>> early during bootstrap this should be fine. Thanks Ioi for some helpful
>>>>> suggestions here!
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240772
>>>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>>>
>>>>> Testing: tier1-3
>>>>>
>>>>> Thanks!
>>>>>
>>>>> /Claes
>>>>
>
More information about the hotspot-compiler-dev
mailing list