CR for RFR 8149421
Vladimir Kozlov
vladimir.kozlov at oracle.com
Sat Feb 13 00:05:30 UTC 2016
Looks good. I will sponsor it.
Thanks,
Vladimir
On 2/12/16 2:11 PM, Berg, Michael C wrote:
> Vladimir, the change below is now logged on the jbs entry at:
>
> http://cr.openjdk.java.net/~mcberg/8149421/webrev.02/
>
> Thanks,
> Michael
>
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Thursday, February 11, 2016 11:26 AM
> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: CR for RFR 8149421
>
> yes, this is good
>
> Thanks,
> Vladimir
>
> On 2/11/16 10:36 AM, Berg, Michael C wrote:
>> Vladimir, how about this (from assembler_x86.hpp):
>>
>> InstructionAttr(
>>   int vector_len,    // Vector length to be applied in the encoding - for both AVX and EVEX
>>   bool rex_vex_w,    // Data width: false for 32 bits or less, true for 64-bit or specially defined widths
>>   bool legacy_mode,  // If true, this instruction is conditionally encoded as AVX or earlier; otherwise EVEX encoding may be used
>>   bool no_reg_mask,  // If true, k0 is used when EVEX encoding is chosen; otherwise k1 is used under the same condition
>>   bool uses_vl)      // This instruction may have legacy constraints based on vector length for EVEX
>>
>> For documentation?
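>>
>> A hypothetical call site, just to show how the flags read together (the
>> values are illustrative, not taken from the webrev): a 512-bit EVEX-capable
>> instruction on 64-bit data, no forced AVX fallback, default k1 masking, and
>> subject to vector-length constraints:
>>
>>    InstructionAttr attributes(AVX_512bit, /* rex_vex_w */ true,
>>                               /* legacy_mode */ false, /* no_reg_mask */ false,
>>                               /* uses_vl */ true);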
>>
>> Thanks,
>> Michael
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, February 11, 2016 10:09 AM
>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: CR for RFR 8149421
>>
>> On 2/11/16 9:55 AM, Berg, Michael C wrote:
>>> Yes, that is pretty close to it. The unrolled loop, once it initially succeeds as an atomic (unity-unroll) segment of one vector size, is what I replicated into the drain loop, before super unrolling occurs. In fact, it's precisely what we need.
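>>>
>>> As a conceptual C-level sketch of the resulting shape (illustrative only, not code from the webrev), assuming a 4-float vector width and a 4x super-unrolled main loop:
>>>
>>>    void axpy(float* y, const float* x, float a, int n) {
>>>      int i = 0;
>>>      // main loop: super unrolled after vectorization, 4 vectors (16 floats) per trip
>>>      for (; i + 16 <= n; i += 16) {
>>>        for (int j = 0; j < 16; j++) y[i + j] += a * x[i + j];
>>>      }
>>>      // vectorized post ("drain") loop: one vector width per trip
>>>      for (; i + 4 <= n; i += 4) {
>>>        for (int j = 0; j < 4; j++) y[i + j] += a * x[i + j];
>>>      }
>>>      // scalar fix-up for any remaining elements
>>>      for (; i < n; i++) y[i] += a * x[i];
>>>    }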
>>
>> Will you do more to improve it?
>>
>>> I migrated the ss/sd insns to k1 usage for uniformity reasons. You will notice some b and w SIMD components going the other way; that is preparatory (not a bug yet, but it could have become one if left as it was): they now use k0, which has all bits set for masking, in the auto code generation path. The only exception is stub code, for which a webrev will soon be made available; it has programmable versions of the w and b components that do not fit in the auto code generation path, and for those the k1 contents are left to the responsibility of the stub writer. The other insns, like movdqu, are set to false in preparation for programmable SIMD, which will need to apply programmed masks in fix-up segments. Since programmable SIMD is for int/float and long/double sizes only, there will be no conflict. Basically, the w and b components do not have enough ISA mapping to complete more than very basic vector expressions, so we confine their usage model by idiom wrt masking and exclude them from programmable SIMD.
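>>>
>>> To make the stub-side convention concrete, here is a rough sketch of how a stub writer might program k1 for a residual fix-up segment (hypothetical code, not part of this webrev; it assumes the existing mov64 and kmovql(KRegister, Register) helpers, a scratch register rtmp, and a known residual element count tail_count):
>>>
>>>    __ mov64(rtmp, (1LL << tail_count) - 1);  // set only the low tail_count mask bits
>>>    __ kmovql(k1, rtmp);                      // k1 now covers just the residual lanes
>>>    // masked EVEX loads/stores in the fix-up segment then operate under k1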
>>
>> Please add comments about this in InstructionAttr. And add comments to all fields of InstructionAttr - briefly describe them. It will help us in the future to set the correct values.
>>
>> I may need to be educated about "programmable SIMD" :)
>>
>> Thanks,
>> Vladimir
>>
>>>
>>> Regards,
>>> Michael
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Wednesday, February 10, 2016 10:05 PM
>>> To: hotspot-compiler-dev at openjdk.java.net
>>> Cc: Berg, Michael C
>>> Subject: Re: CR for RFR 8149421
>>>
>>> What are the changes in assembler_x86.cpp? You changed the no_mask_reg argument values. Was it a bug?
>>>
>>> Looks like you copy-pasted code from insert_pre_post_loops(), which is fine.
>>> One thing that worries me is that, due to the ratio between the unrolling done before vectorization and the vector size, you can end up with several repetitive vector operations. It would be nice if we unrolled by the vector size first, then vectorized to generate one vector instruction, then cloned that to create the vector_post_loop, and only then unrolled the main loop further.
>>> Or are you already doing something like that?
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 2/9/16 3:16 PM, Berg, Michael C wrote:
>>>> Hi Folks,
>>>>
>>>> I would like to contribute vectorized post loops. This patch is
>>>> initially targeted at x86. The design is versatile so as to be
>>>> portable to other targets as well. This code proposes the addition of
>>>> atomic unrolled drain loops which precede fix-up segments and which
>>>> are significantly faster than scalar code. The requirement is that
>>>> the main loop is super unrolled after vectorization. I see up to 54% uplift on micro benchmarks on x86 targets for loops which pass superword vectorization and meet the above criteria. Scimark metrics in SpecJvm2008, such as lu.small and fft.small, also show the benefit of this design on x86.
>>>>
>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8149421
>>>>
>>>>
>>>> webrev:
>>>>
>>>> http://cr.openjdk.java.net/~mcberg/8149421/webrev.01/
>>>>
>>>> Thanks,
>>>>
>>>> Michael
>>>>