RFR(L): 8076284: Improve vectorization of parallel streams

Tue May 5 04:53:06 UTC 2015

I reviewed it and tested in JPRT. If nobody objects I will push it.

Thanks,
Vladimir

On 4/24/15 12:03 PM, Vladimir Kozlov wrote:
> Updated webrev:
>
> http://cr.openjdk.java.net/~kvn/8076284/webrev.01/
>
> Vladimir
>
> On 4/20/15 11:45 PM, Civlin, Jan wrote:
>> Vladimir,
>>
>> Here is the description and new patch with the changes you recommended (except the last one - see below my explanation).
>>
>>
>> The patch description.
>>
>> This patch provides on-demand vectorization/SIMD'ing of a <method> specified in JVM command as
>> -XX:CompileCommand=option,<method>,Vectorize.
>> This optimization may be globally disabled by setting the flag -XX: -AllowVectorizeOnDemand (by default it is true).
>>
>> For each method that was specified with Vectorize option we do the following:
>>
>> 1. On each iteration of loop unroll for a given method (loopopts.cpp) we generate the next _clone_idx (which will be
>> common for all the nodes cloned in this iteration); and on each node cloning we hash _idx of the origin of the node
>> that is cloned (_idx_clone_orig ) and the _clone_idx (cm.verify_insert_and_clone).
>> CloneMap belongs to Compile and is created in CompilerWrapper.
>>
>> 2. In SuperWord optimization, after max_depth has been built, we are hoisting the loads.
>> For this we for each Load_X (subject of the hoisting) find some Load_0 that has the same origin as Load_X but belongs
>> to the first iteration, i.e. if the MemNode::Memory input of Load_0 is memory Phi (collected previously in
>> memory_slice) we set this Phi also as the MemNode::Memory input of Load_X. After this rebuild of the graph we restart
>> the Superword optimization.
>>
>> The major routines here.
>> - SuperWord::mark_generations: computes _ii_first (the index _clone_idx) of the nodes that have MemNode::Memory input
>> coming from a phi node in some slice; computes list of the nodes in the first and last iterations of the loop.
>> - SuperWord::hoist_loads_in_graph: for each memory slice (a phi node) visits each load that has this phi as a memory
>> input and then for each other load that has the same origin makes the memory input coming from the phi.  This routine
>> does not use marking generations mechanism.
>> - SuperWord::pack_parallel - this routine is called only if SuperWord fails to produce any pack after
>> extend_packlist(); it is another algorithm for packing instructions into SIMD.
>> It goes thru the list of all instructions in the _iteration_first, and if it is a Load, Store, Add or Mul it starts a
>> new pack and adds this instruction to this pack. Then the algorithm circulates thru the iterations of the loop (gen <
>> _ii_order.length()) and over the instructions list and finds the node with the origin coinciding with the origin of
>> the nodes already in the pack - then it adds this node to the pack. Once packs are built, SuperWord returns to the
>> normal processing (combine_packs()).
>>
>> Note, that neither 2 or 3 goes thru the data dependency analysis, since the correctness of parallelization was
>> guaranteed by the user.
>> Note, that some checks in added code could be omitted. But we are not assume that there is no optimization (now or in
>> the future) that can change the graph structure between loop unrolling and the SuperWord, so we prefer to run many
>> (probably unneeded) checks in the SuperWord.
>>
>>
>>
>>>> Can you also utilize changes done by Michael Berg for reduction
>>>> optimization (the code in jdk9/hs-comp already)? I mean marking some
>>>> nodes before unrolling and searching Phis.
>> Michael Berg and I looked at what you suggested (phi marking before unroll?) and we both think this marking is very
>> different than what I do: Michael's phi marking is just one bit per node (in the flags) whereas I collect the
>> _idx_clone_orig and the unroll generation (clone_idx).
>>
>>
>>   -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, April 16, 2015 3:31 PM
>> To: Civlin, Jan; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR(S): 8076284: Improve vectorization of parallel streams
>>
>> Hi Jan,
>>
>> You did not describe your changes in details (what they do).
>>
>> IgnoreVectorizeMethod flag should positive and enabled by default.
>> Rename it to AllowVectorizeOnDemand (or something similar):
>>
>> +  product(bool, AllowVectorizeOnDemand, true,
>>        \
>>
>> Instead of next you should add intrinsic definition to
>> and classfile/vmSymbols.hpp and then check method()->intrinsic_id():
>>
>> +    if (strcmp("forEachRemaining", method()->name()->as_quoted_ascii())
>> == 0 && method()->signature() != 0
>> +      && method()->signature()->as_symbol() != 0 &&
>> method()->signature()->as_symbol()->as_quoted_ascii() != 0 ) {
>> +      if
>> (strstr(method()->signature()->as_symbol()->as_quoted_ascii(),"Ljava/util/function/IntConsumer"))
>> {
>> +        set_do_vector_loop(true);
>> +      }
>> +    }
>>
>> And that should be under flag too because in general forEachRemaining
>> should be vectorized only if it is safe.
>>
>> Can you also utilize changes done by Michael Berg for reduction
>> optimization (the code in jdk9/hs-comp already)? I mean marking some
>> nodes before unrolling and searching Phis.
>>
>> Regards,
>> Vladimir
>>
>> On 4/13/15 3:33 AM, Civlin, Jan wrote:
>>> Hi All,
>>>
>>>
>>>    We would like to contribute the improvement of vectorization of
>>>    parallel streams  from Intel.
>>>
>>> The contribution Bug ID: 8076284.
>>>
>>> Please review this patch:
>>>
>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8076284
>>>
>>> webrev: http://cr.openjdk.java.net/~kvn/8076284/webrev/
>>>
>>>
>>>        *Description*
>>>
>>> Improve vectorization of the unordered parallel streams (by vectorizing
>>> forEachRemaining method).
>>>
>>> For example, this forEach will be vectorized:
>>>
>>> java.util.stream.IntStream iStream = java.util.stream.IntStream.range(0,
>>> RANGE - 1).parallel();
>>>
>>> iStream.forEach( id -> c[id] = c[id] + c[id+1] );
>>>
>>> It also enables on-demand loop vectorization in a given method (by
>>> providing more hints to SuperWord optimization).
>>>
>>> For example, use -XX:CompileCommand=option,computeCall,Vectorizeto
>>> vectorize this loop
>>>
>>> void computeCall(double [] Call, double  puByDf, double  pdByDf)
>>>
>>> {
>>>
>>> for(int i = timeStep; i > 0; i--)
>>>
>>> for(int j = 0; j <= i - 1; j++)
>>>
>>> Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j];
>>>
>>> }
>>>
>>>
>>> This enhancement is contributed by Intel and sponsored by the hotspot
>>> compiler team.
>>>
>>