RFR: 8240772: x86_64: Pre-generate Assembler::popa, pusha and vzeroupper
Ioi Lam
ioi.lam at oracle.com
Wed Mar 11 18:04:53 UTC 2020
Hi Claes,
Looks good. I would suggest moving these into an inline function to
avoid repetition.
assert(pusha_code != NULL, "must be pregenerated");
assert(code_section()->limit() - code_section()->end() > pusha_len,
"code buffer not large enough");
address end = code_section()->end();
memcpy((char*)end, pusha_code, pusha_len);
code_section()->set_end(end + pusha_len);
Thanks
- Ioi
On 3/11/20 6:20 AM, Claes Redestad wrote:
> New webrev: http://cr.openjdk.java.net/~redestad/8240772/open.01/
>
> Rather than *_slow I went with *_uncached.
>
> Reworked initialization, and discovered a bug in open.00:
>
> vzeroupper is speculatively emitted in the VM_Version stub with the CPU
> feature flag explicitly set. This meant we pre-computed the vzeroupper
> as always enabled, before CPU capabilities had been determined. This
> caused an intermittent test issue.
>
> I now expose an uncached version of vzeroupper that the VM_Version stub
> can use, then trigger the pre-compute right after running the CPU
> feature checks.
>
> Testing: re-running tier1-3, verified 32-bit builds locally
>
> /Claes
>
> On 2020-03-10 20:19, Claes Redestad wrote:
>> Hi Ioi,
>>
>> good suggestions. I will rework this tomorrow, along with Vladimir's
>> suggestion to add an explicit call to precompute_instructions from
>> the stubGenerator.
>>
>> Thanks!
>>
>> /Claes
>>
>> On 2020-03-10 19:40, Ioi Lam wrote:
>>> Hi Claes,
>>>
>>> This is a really good optimization! Small bang for big bucks!
>>>
>>> I have a suggestion code coding style:
>>>
>>> Rename Assembler::popa to Assembler::popa_slow();
>>>
>>> void Assembler::popa() { // 64bit
>>> if (!precomputed) {
>>> precompute_instructions();
>>> }
>>> copy_precomputed_instructions(popa_code, popa_len);
>>> }
>>>
>>> static void precompute_instructions() {
>>> ...
>>> MacroAssembler masm(&buffer);
>>>
>>> address begin_popa = masm.code_section()->end();
>>> masm.popa_slow();
>>> address end_popa = masm.code_section()->end();
>>> ...
>>> }
>>>
>>> ----
>>>
>>> Also, maybe you can add this assert after generating the code for
>>> all 3 macros:
>>>
>>> assert(masm->code()->total_relocation_size() == 0 &&
>>> masm->code()->total_oop_size() == 0 &&
>>> masm->code()->total_metadata_size() == 0,
>>> "precomputed code cannot have any of these");
>>>
>>>
>>> Thanks!
>>> - Ioi
>>>
>>>
>>>
>>> On 3/10/20 6:46 AM, Claes Redestad wrote:
>>>> Hi,
>>>>
>>>> calculate some invariant Assembler routines at bootstrap, copy on
>>>> subsequent invocations.
>>>>
>>>> For popa and pusha this means an overhead reduction of around 98%
>>>> (from
>>>> ~2500 instructions to emit a pusha to ~50). For vzeroupper an overhead
>>>> reduction of ~65% (117 -> 42). Together these add up to about a 1%
>>>> reduction of instructions executed on a Hello World - with some
>>>> (smaller) scaling impact on larger applications.
>>>>
>>>> The initialization is very simple/naive, i.e., lacks any kind of
>>>> synchronization protocol. But as this setup is guaranteed to happen
>>>> very
>>>> early during bootstrap this should be fine. Thanks Ioi for some
>>>> helpful
>>>> suggestions here!
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240772
>>>> Webrev: http://cr.openjdk.java.net/~redestad/8240772/open.00/
>>>>
>>>> Testing: tier1-3
>>>>
>>>> Thanks!
>>>>
>>>> /Claes
>>>
More information about the hotspot-compiler-dev
mailing list