RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction

Thu Apr 2 10:14:53 UTC 2020

>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
> 
> Looks good. I'll submit it for testing.

Test results are clean.

Best regards,
Vladimir Ivanov

>> This removes the optimized NotV handling for AVX3; as suggested, it will be
>> brought in via the vectorIntrinsics branch.
>>
>> Thanks for your help in shaping up this patch. Please let me know if there
>> are other comments.
>>
>> Best Regards,
>> Jatin
>> ________________________________________
>> From: Bhateja, Jatin
>> Sent: Wednesday, March 25, 2020 12:14 PM
>> To: Vladimir Ivanov
>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>
>> Hi Vladimir,
>>
>> I have placed the updated patches at the following links:
>>
>>   1)  Optimized NotV handling:
>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>
>>   2)  Changes for MacroLogic opt:
>>   http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
>>
>> Kindly review and let me know your feedback.
>>
>> Thanks,
>> Jatin
>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Wednesday, March 25, 2020 12:33 AM
>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>>> <sandhya.viswanathan at intel.com>
>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>>
>>> Hi Jatin,
>>>
>>> I tried to submit the patches for testing, but the windows-x64 build failed
>>> with the following errors:
>>>
>>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not evaluate to a constant
>>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read of a variable outside its lifetime
>>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
>>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int ['function']' is not assignable
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 24.03.2020 10:34, Bhateja, Jatin wrote:
>>>> Hi Vladimir,
>>>>
>>>> Thanks for your comments. I have split the original patch into two
>>>> sub-patches.
>>>>
>>>> 1)  Optimized NotV handling:
>>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>>>
>>>> 2)  Changes for MacroLogic opt:
>>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>>>>
>>>> Added a new flag, "UseVectorMacroLogic", which guards the MacroLogic
>>>> optimization.
>>>>
>>>> Kindly review and let me know your feedback.
>>>>
>>>> Best Regards,
>>>> Jatin
>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Tuesday, March 17, 2020 4:31 PM
>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>>>>
>>>>>
>>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>>>>
>>>>> Very nice contribution, Jatin!
>>>>>
>>>>> Some comments after a brief review pass:
>>>>>
>>>>>      * Please, contribute NotV part separately.
>>>>>
>>>>>      * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
>>>>> transformation during GVN instead?
>>>>>
>>>>>      * As of now, vector nodes are only produced by SuperWord
>>>>> analysis. It makes sense to limit the new optimization pass to the
>>>>> SuperWord pass only (probably by introducing a new dedicated Phase).
>>>>> Once the Vector API is available, it can be extended to cases when
>>>>> vector nodes are present (C->max_vector_size() > 0).
>>>>>
>>>>>      * There are more efficient ways to produce a vector of all-1s [1] [2].
>>>>>
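The (XorV v 0xFF..FF) => (NotV v) rewrite suggested above is a local, per-node transformation, which is why it could plausibly be applied when a node is idealized rather than in a separate pass. A minimal sketch in Java over a hypothetical miniature IR follows; the Node, Input, AllOnes, Xor and Not classes and the idealize method are illustrative names only, not HotSpot classes or APIs.

```java
final class XorToNotSketch {
    // Hypothetical miniature IR for illustration only; not HotSpot's node classes.
    static abstract class Node {}
    static final class Input extends Node {}
    static final class AllOnes extends Node {}
    static final class Xor extends Node {
        final Node a, b;
        Xor(Node a, Node b) { this.a = a; this.b = b; }
    }
    static final class Not extends Node {
        final Node in;
        Not(Node in) { this.in = in; }
    }

    // Local rewrite: Xor(x, all-ones) or Xor(all-ones, x) becomes Not(x).
    static Node idealize(Node n) {
        if (n instanceof Xor) {
            Xor x = (Xor) n;
            if (x.b instanceof AllOnes) return new Not(x.a);
            if (x.a instanceof AllOnes) return new Not(x.b);
        }
        return n; // nothing to simplify
    }

    // Evaluates a tree for one 32-bit lane, with every Input bound to the same value.
    static int eval(Node n, int inputValue) {
        if (n instanceof Input) return inputValue;
        if (n instanceof AllOnes) return -1;
        if (n instanceof Xor) {
            return eval(((Xor) n).a, inputValue) ^ eval(((Xor) n).b, inputValue);
        }
        if (n instanceof Not) return ~eval(((Not) n).in, inputValue);
        throw new IllegalArgumentException("unknown node");
    }

    public static void main(String[] args) {
        Node before = new Xor(new Input(), new AllOnes());
        Node after = idealize(before);
        System.out.println(after instanceof Not);
        System.out.println(eval(before, 0x1234) == eval(after, 0x1234));
    }
}
```

The rewrite only inspects a node and its direct inputs, so it needs no knowledge of the surrounding expression tree.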
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>>>>
>>>>> [2]
>>>>> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>>>>>
>>>>>>
>>>>>> A new optimization pass has been added post Auto-Vectorization which
>>>>>> folds expression trees involving vector boolean logic operations
>>>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>>>>> The optimization pass has the following stages:
>>>>>>
>>>>>>      1.  Collection stage:
>>>>>>         *   Performs a DFS traversal over the Ideal Graph and collects
>>>>>>             the root nodes of all vector logic expression trees.
>>>>>>      2.  Processing stage:
>>>>>>         *   Performs a bottom-up traversal over each expression tree and
>>>>>>             simultaneously folds specific DAG patterns involving Boolean
>>>>>>             logic parent and child nodes.
>>>>>>         *   Transforms (XORV INP, -1) -> (NOTV INP) to promote logic
>>>>>>             folding.
>>>>>>         *   Folding is performed under a constraint on the total number
>>>>>>             of inputs which a MacroLogic node can have; in this case
>>>>>>             it is 3.
>>>>>>         *   A partition is created around a DAG pattern involving a
>>>>>>             logic parent and one or two logic child nodes; it
>>>>>>             encapsulates the nodes in post-order fashion.
>>>>>>         *   This partition is then evaluated by traversing over the
>>>>>>             nodes, assigning boolean values to its inputs and performing
>>>>>>             operations over them based on each node's Opcode. Each node,
>>>>>>             along with its computed result, is stored in a map which is
>>>>>>             accessed during the evaluation of its user/parent node.
>>>>>>         *   Post-evaluation, a MacroLogic node is created which is
>>>>>>             equivalent to a three-input truth table. The expression
>>>>>>             tree's leaf-level inputs, along with the result of its
>>>>>>             evaluation, are the inputs fed to this new node.
>>>>>>         *   The entire expression tree is eventually subsumed/replaced
>>>>>>             by the newly created MacroLogic node.
>>>>>>
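The partition evaluation step described above can be illustrated with a small sketch. It assumes the common trick of seeding the three inputs with the bit patterns 0xF0, 0xCC and 0xAA, so that a single post-order evaluation yields the full 8-row truth table. The row ordering (first input as the most significant index bit) is an assumption chosen to match the usual vpternlog immediate convention. All names here (MacroLogicSketch, Node, truthTable) are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MacroLogicSketch {
    enum Op { INPUT, AND, OR, XOR, NOT }

    // Hypothetical stand-in for a vector logic node; identity is used as the map key.
    static final class Node {
        final Op op;
        final Node in1, in2;
        Node(Op op, Node in1, Node in2) { this.op = op; this.in1 = in1; this.in2 = in2; }
    }

    static Node input() { return new Node(Op.INPUT, null, null); }
    static Node and(Node a, Node b) { return new Node(Op.AND, a, b); }
    static Node or(Node a, Node b) { return new Node(Op.OR, a, b); }
    static Node xor(Node a, Node b) { return new Node(Op.XOR, a, b); }
    static Node not(Node a) { return new Node(Op.NOT, a, null); }

    // Bit i of each constant is that input's value in truth-table row i.
    private static final int[] INPUT_COLUMNS = {0xF0, 0xCC, 0xAA};

    // Evaluates the logic nodes of a partition (listed in post-order, root last)
    // and returns the 8-bit truth table of the root over at most three inputs.
    static int truthTable(List<Node> postOrder, List<Node> inputs) {
        if (inputs.size() > 3 || postOrder.isEmpty()) {
            throw new IllegalArgumentException("need 1..3 inputs and a non-empty partition");
        }
        Map<Node, Integer> values = new HashMap<>();
        for (int i = 0; i < inputs.size(); i++) {
            values.put(inputs.get(i), INPUT_COLUMNS[i]);
        }
        int result = 0;
        for (Node n : postOrder) {
            int v;
            switch (n.op) {
                case AND: v = values.get(n.in1) & values.get(n.in2); break;
                case OR:  v = values.get(n.in1) | values.get(n.in2); break;
                case XOR: v = values.get(n.in1) ^ values.get(n.in2); break;
                case NOT: v = ~values.get(n.in1) & 0xFF; break;
                default:  throw new IllegalArgumentException("inputs do not belong in the partition");
            }
            values.put(n, v); // looked up later when the user/parent node is evaluated
            result = v;
        }
        return result;
    }

    public static void main(String[] args) {
        Node a = input(), b = input(), c = input();
        Node ab = and(a, b);
        Node root = xor(ab, c); // (a & b) ^ c
        int table = truthTable(Arrays.asList(ab, root), Arrays.asList(a, b, c));
        System.out.println("0x" + Integer.toHexString(table)); // 0x6a
    }
}
```

In this sketch the returned value plays the role of the truth-table operand; together with the three leaf inputs it carries everything a single three-input logic node would need.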
>>>>>>
>>>>>> The following are the JMH benchmark results with and without the changes.
>>>>>>
>>>>>> Without Changes:
>>>>>>
>>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
>>>>>> MacroLogicOpt.workload1_caller        64  thrpt       2904.480         ops/s
>>>>>> MacroLogicOpt.workload1_caller       128  thrpt       2219.252         ops/s
>>>>>> MacroLogicOpt.workload1_caller       256  thrpt       1507.267         ops/s
>>>>>> MacroLogicOpt.workload1_caller       512  thrpt        860.926         ops/s
>>>>>> MacroLogicOpt.workload1_caller      1024  thrpt        470.163         ops/s
>>>>>> MacroLogicOpt.workload1_caller      2048  thrpt        246.608         ops/s
>>>>>> MacroLogicOpt.workload1_caller      4096  thrpt        108.031         ops/s
>>>>>> MacroLogicOpt.workload2_caller        64  thrpt        344.633         ops/s
>>>>>> MacroLogicOpt.workload2_caller       128  thrpt        209.818         ops/s
>>>>>> MacroLogicOpt.workload2_caller       256  thrpt        111.678         ops/s
>>>>>> MacroLogicOpt.workload2_caller       512  thrpt         53.360         ops/s
>>>>>> MacroLogicOpt.workload2_caller      1024  thrpt         27.888         ops/s
>>>>>> MacroLogicOpt.workload2_caller      2048  thrpt         12.103         ops/s
>>>>>> MacroLogicOpt.workload2_caller      4096  thrpt          6.018         ops/s
>>>>>> MacroLogicOpt.workload3_caller        64  thrpt       3110.669         ops/s
>>>>>> MacroLogicOpt.workload3_caller       128  thrpt       1996.861         ops/s
>>>>>> MacroLogicOpt.workload3_caller       256  thrpt        870.166         ops/s
>>>>>> MacroLogicOpt.workload3_caller       512  thrpt        389.629         ops/s
>>>>>> MacroLogicOpt.workload3_caller      1024  thrpt        151.203         ops/s
>>>>>> MacroLogicOpt.workload3_caller      2048  thrpt         75.086         ops/s
>>>>>> MacroLogicOpt.workload3_caller      4096  thrpt         37.576         ops/s
>>>>>>
>>>>>> With Changes:
>>>>>>
>>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
>>>>>> MacroLogicOpt.workload1_caller        64  thrpt       3306.670         ops/s
>>>>>> MacroLogicOpt.workload1_caller       128  thrpt       2936.851         ops/s
>>>>>> MacroLogicOpt.workload1_caller       256  thrpt       2413.827         ops/s
>>>>>> MacroLogicOpt.workload1_caller       512  thrpt       1440.291         ops/s
>>>>>> MacroLogicOpt.workload1_caller      1024  thrpt        707.576         ops/s
>>>>>> MacroLogicOpt.workload1_caller      2048  thrpt        384.863         ops/s
>>>>>> MacroLogicOpt.workload1_caller      4096  thrpt        132.753         ops/s
>>>>>> MacroLogicOpt.workload2_caller        64  thrpt        450.856         ops/s
>>>>>> MacroLogicOpt.workload2_caller       128  thrpt        323.925         ops/s
>>>>>> MacroLogicOpt.workload2_caller       256  thrpt        135.191         ops/s
>>>>>> MacroLogicOpt.workload2_caller       512  thrpt         69.424         ops/s
>>>>>> MacroLogicOpt.workload2_caller      1024  thrpt         35.744         ops/s
>>>>>> MacroLogicOpt.workload2_caller      2048  thrpt         14.168         ops/s
>>>>>> MacroLogicOpt.workload2_caller      4096  thrpt          7.245         ops/s
>>>>>> MacroLogicOpt.workload3_caller        64  thrpt       3333.550         ops/s
>>>>>> MacroLogicOpt.workload3_caller       128  thrpt       2269.428         ops/s
>>>>>> MacroLogicOpt.workload3_caller       256  thrpt        995.691         ops/s
>>>>>> MacroLogicOpt.workload3_caller       512  thrpt        412.452         ops/s
>>>>>> MacroLogicOpt.workload3_caller      1024  thrpt        151.157         ops/s
>>>>>> MacroLogicOpt.workload3_caller      2048  thrpt         75.079         ops/s
>>>>>> MacroLogicOpt.workload3_caller      4096  thrpt         37.158         ops/s
>>>>>>
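To make the two tables easier to compare, the throughput ratio (patched score divided by baseline score) can be computed per row. The sketch below hard-codes the scores exactly as posted above; the class and field names are illustrative. Since no Error column was reported, small differences in either direction may simply be noise.

```java
final class SpeedupSketch {
    static final int[] VECLEN = {64, 128, 256, 512, 1024, 2048, 4096};

    // Scores in ops/s, copied from the "Without Changes" and "With Changes" tables.
    static final double[] W1_BASE    = {2904.480, 2219.252, 1507.267, 860.926, 470.163, 246.608, 108.031};
    static final double[] W1_PATCHED = {3306.670, 2936.851, 2413.827, 1440.291, 707.576, 384.863, 132.753};
    static final double[] W2_BASE    = {344.633, 209.818, 111.678, 53.360, 27.888, 12.103, 6.018};
    static final double[] W2_PATCHED = {450.856, 323.925, 135.191, 69.424, 35.744, 14.168, 7.245};
    static final double[] W3_BASE    = {3110.669, 1996.861, 870.166, 389.629, 151.203, 75.086, 37.576};
    static final double[] W3_PATCHED = {3333.550, 2269.428, 995.691, 412.452, 151.157, 75.079, 37.158};

    // Throughput ratio: values above 1.0 mean the patched build was faster.
    static double speedup(double base, double patched) {
        return patched / base;
    }

    static void report(String name, double[] base, double[] patched) {
        for (int i = 0; i < VECLEN.length; i++) {
            System.out.println(String.format("%s VECLEN=%4d speedup=%.3f",
                    name, VECLEN[i], speedup(base[i], patched[i])));
        }
    }

    public static void main(String[] args) {
        report("workload1", W1_BASE, W1_PATCHED);
        report("workload2", W2_BASE, W2_PATCHED);
        report("workload3", W3_BASE, W3_PATCHED);
    }
}
```

For example, workload1 at VECLEN=512 comes out at roughly 1.67x, whereas workload3 at VECLEN=1024 and above is flat to marginally lower.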
>>>>>> Please review the patch.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jatin
>>>>>>
>>>>>> [1] Section 17.7 :
>>>>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>>>>