CR for RFR 8153998

Thu Apr 14 00:40:44 UTC 2016

Hi Michael,

Please split the changes. The _rex_vex_w_reverted (and other assembler)
changes can be pushed first. The evmovdqul -> evmovdquq and Vectors
element_size() changes could be pushed separately too.

You don't need MachMskNode placeholder methods in the other platforms'
.ad files. I think Matcher::has_predicated_vectors() will be enough,
since MachMskNode is generated only when has_predicated_vectors() is
true. This is how we usually do it.

macroAssembler_x86.cpp

Why do you use a table and not instructions to generate the mask value?
Looking at the table, it is very easy to generate (you would need an
additional instruction, but I think that is better than a load from
memory):

(1 << src) - 1

src == 0 could be treated specially.
You can leave the table as a comment to show which values are expected.
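
To illustrate, here is a minimal standalone C++ sketch of computing the
mask arithmetically. It is not the assembler code from the webrev: the
helper name low_bits_mask and the 64-bit width are assumptions made for
illustration, and it assumes the table entry for n is simply the value
with the low n bits set.

```cpp
#include <cstdint>

// Hypothetical helper (not from the webrev): returns a value with the
// low n bits set, i.e. (1 << n) - 1, computed without a lookup table.
static inline uint64_t low_bits_mask(unsigned n) {
  if (n == 0) {
    return 0;                 // explicit special case; the formula also yields 0
  }
  if (n >= 64) {
    return ~uint64_t(0);      // shifting by >= the type width is undefined in C++
  }
  return (uint64_t(1) << n) - 1;
}
```

In generated code this would correspond to a move, a shift and a
decrement instead of a load from memory. Whether src == 0 should produce
an empty mask or something else depends on how the post loop uses it, so
that case is kept as a separate branch in the sketch.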

x86.ad

Names should be consistent: MaskCreateINode -> CreateMaskINode, set_mask
-> createMask. You can also use Matcher::has_predicated_vectors() in the
predicate:

+instruct createMask(rRegI dst, rRegI src) %{
+  predicate(Matcher::has_predicated_vectors());
+  match(Set dst (CreateMaskI src));
+  effect(TEMP dst);
+  format %{ "createmsk   $dst, $src" %}

Maybe it should be setMask, as the reverse of restoreMask. Or, more
precisely, setvectmask/restorevectmask.

MaskCreateINode or SetVectMaskINode should be defined in vector.hpp and 
not in subnode.hpp.

block.cpp
Matcher::has_predicated_vectors() should be checked together with the
if (found_fixup_loops) test to avoid useless looping.

I don't like how you inject MachMskNode. It should be generated on exit
from the loop in which you created MaskCreateINode.

This will need additional review after you address the above comments.

Thanks,
Vladimir

On 4/12/16 11:26 PM, Berg, Michael C wrote:
> Hi Folks,
>
> I would like to contribute Programmable SIMD as implemented on
> multi-versioned post loops. See:
> https://bugs.openjdk.java.net/browse/JDK-8151573 for the first half of
> the implementation.
>
> This component delivers mask programmed post loops which execute in a
> single iteration in place of fixup scalar loops which used to take many
> iterations to complete work for user loops.
>
> Currently I have enabled this optimization for x86 only, specifically
> for machines with masked data predication implemented as per fully
> enabled EVEX targets.  It delivers up to 2x performance and has been
> modeled over a large number of loop lengths and forms of loops.
>
> This code was tested as follows (see JBS entry below):
>
>
> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8153998
>
>
> webrev:
>
> http://cr.openjdk.java.net/~mcberg/8153998/webrev.01a/
>
> Thanks,
>
> Michael
>