RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Nils Eliasson
nils.eliasson at oracle.com
Thu Apr 2 09:28:45 UTC 2020
Hi Jatin,
The patch is nice and clean.
Reviewed.
Best regards
Nils Eliasson
On 2020-04-01 20:23, Bhateja, Jatin wrote:
> Hi Vladimir,
>
> Please find an updated unified patch at the following link.
>
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
>
> This removes the optimized NotV handling for AVX3; as suggested, it will be
> brought in via the vectorIntrinsics branch.
>
> Thanks for your help in shaping up this patch. Please let me know if there
> are any other comments.
>
> Best Regards,
> Jatin
> ________________________________________
> From: Bhateja, Jatin
> Sent: Wednesday, March 25, 2020 12:14 PM
> To: Vladimir Ivanov
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
> Hi Vladimir,
>
> I have placed the updated patches at the following links:
>
> 1) Optimized NotV handling:
> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>
> 2) Changes for MacroLogic opt:
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
>
> Kindly review and let me know your feedback.
>
> Thanks,
> Jatin
>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Wednesday, March 25, 2020 12:33 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>> <sandhya.viswanathan at intel.com>
>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>
>> Hi Jatin,
>>
>> I tried to submit the patches for testing, but the windows-x64 build failed
>> with the following errors:
>>
>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
>> evaluate to a constant
>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
>> of a variable outside its lifetime
>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
>> ['function']' is not assignable
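
For what it's worth, C2131 at that line usually points at a variable-length
array, i.e. a local array sized by a runtime value, which MSVC (and standard
C++) rejects. Below is a minimal standalone sketch of the pattern and a
portable workaround; the names are made up for illustration and are not the
actual compile.cpp code:

  #include <vector>

  // Illustrative only: 'partition' stands in for whatever runtime value
  // was used to size a local array in compile.cpp.
  void evaluate(int partition) {
    // int results[partition];            // C2131 on MSVC: the size is not a
                                          // compile-time constant (a VLA).
    std::vector<int> results(partition);  // portable: size decided at runtime
    for (int i = 0; i < partition; i++) {
      results[i] = i & 1;                 // placeholder work
    }
  }

  int main() {
    evaluate(8);
    return 0;
  }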
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 24.03.2020 10:34, Bhateja, Jatin wrote:
>>> Hi Vladimir,
>>>
>>> Thanks for your comments. I have split the original patch into two
>>> sub-patches.
>>> 1) Optimized NotV handling:
>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>>
>>> 2) Changes for MacroLogic opt:
>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>>>
>>> Added a new flag, "UseVectorMacroLogic", which guards the MacroLogic
>>> optimization.
>>> Kindly review and let me know your feedback.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Tuesday, March 17, 2020 4:31 PM
>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
>>>> hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
>>>> Instruction
>>>>
>>>>
>>>>> Patch: http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>>> Very nice contribution, Jatin!
>>>>
>>>> Some comments after a brief review pass:
>>>>
>>>>     * Please contribute the NotV part separately.
>>>>
>>>>     * Why don't you perform the (XorV v 0xFF..FF) => (NotV v)
>>>>       transformation during GVN instead?
>>>>
>>>>     * As of now, vector nodes are only produced by SuperWord analysis.
>>>>       It makes sense to limit the new optimization pass to the SuperWord
>>>>       pass only (probably by introducing a new dedicated Phase). Once the
>>>>       Vector API is available, it can be extended to cases where vector
>>>>       nodes are present (C->max_vector_size() > 0).
>>>>
>>>> * There are more efficient ways to produce a vector of all-1s [1] [2].
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1]
>>>> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>>>
>>>> [2]
>>>> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
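
A rough standalone sketch of the suggested (XorV v, all-ones) => (NotV v)
rewrite, using a toy IR rather than HotSpot's Node/PhaseGVN classes; all
names below are made up for illustration:

  #include <cstdio>

  // Toy IR: just enough structure to show folding an Xor with an all-ones
  // constant into a Not during an idealization step.
  enum class Op { Input, Not, Xor };

  struct Node {
    Op    op;
    Node* in[2]       = {nullptr, nullptr};
    bool  is_all_ones = false;   // stands in for a Replicate(-1) constant
  };

  // If one input of the Xor is the all-ones vector constant, replace the
  // whole Xor with a Not of the other input.
  Node* ideal_xor(Node* n) {
    if (n->op != Op::Xor) return n;
    for (int i = 0; i < 2; i++) {
      if (n->in[i] != nullptr && n->in[i]->is_all_ones) {
        return new Node{Op::Not, {n->in[1 - i], nullptr}};
      }
    }
    return n;
  }

  int main() {
    Node v{Op::Input};
    Node ones{Op::Input};
    ones.is_all_ones = true;
    Node x{Op::Xor, {&v, &ones}};
    Node* r = ideal_xor(&x);
    std::printf("folded to Not: %s\n", r->op == Op::Not ? "yes" : "no");
    return 0;
  }

On x86 the all-ones constant itself is typically materialized by comparing a
register against itself (e.g. vpcmpeqd), which is what [1] and [2] above are
about.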
>>>>
>>>>> A new optimization pass has been added after auto-vectorization; it
>>>>> folds expression trees involving vector boolean logic operations
>>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>>>>
>>>>> The optimization pass has the following stages:
>>>>>
>>>>> 1. Collection stage:
>>>>>    * Performs a DFS traversal over the ideal graph and collects the
>>>>>      root nodes of all vector logic expression trees.
>>>>>
>>>>> 2. Processing stage:
>>>>>    * Performs a bottom-up traversal over each expression tree and
>>>>>      simultaneously folds specific DAG patterns involving a boolean
>>>>>      logic parent and its child nodes.
>>>>>    * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
>>>>>    * Folding is performed under a constraint on the total number of
>>>>>      inputs a MacroLogic node can have, which in this case is 3.
>>>>>    * A partition is created around a DAG pattern involving a logic
>>>>>      parent and one or two logic child nodes; it encapsulates the
>>>>>      nodes in post-order fashion.
>>>>>    * This partition is then evaluated by traversing over its nodes,
>>>>>      assigning boolean values to the inputs and performing operations
>>>>>      over them based on each node's opcode. Each node, together with
>>>>>      its computed result, is stored in a map which is consulted while
>>>>>      evaluating its user/parent node.
>>>>>    * After evaluation, a MacroLogic node equivalent to a three-input
>>>>>      truth table is created. The leaf-level inputs of the expression
>>>>>      tree, along with the result of the evaluation, are the inputs
>>>>>      fed to this new node.
>>>>>    * The entire expression tree is eventually subsumed/replaced by the
>>>>>      newly created MacroLogic node.
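
To make the evaluation step concrete, here is a small standalone sketch
(illustrative only, not the actual compile.cpp code): each of the three leaf
inputs is assigned a canonical 8-bit pattern, and evaluating the boolean
expression bitwise over those patterns yields the 8-bit truth table, i.e. the
immediate that AVX-512's vpternlogd/vpternlogq instructions consume.

  #include <cstdint>
  #include <cstdio>

  // Canonical input patterns: bit k of each pattern is the value of that
  // input in the k-th of the 8 possible (A,B,C) combinations.
  constexpr uint8_t A = 0xF0;
  constexpr uint8_t B = 0xCC;
  constexpr uint8_t C = 0xAA;

  int main() {
    // Example expression (A & B) ^ (~C); evaluating it bitwise over the
    // patterns produces the truth-table immediate for that expression.
    uint8_t imm = (A & B) ^ static_cast<uint8_t>(~C);
    std::printf("truth-table immediate: 0x%02X\n", imm);   // prints 0x95
    return 0;
  }

The MacroLogic node then takes the (up to three) leaf inputs of the
expression tree plus this computed table as its inputs.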
>>>>>
>>>>> Following are the JMH benchmark results with and without the changes.
>>>>>
>>>>> Without Changes:
>>>>>
>>>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
>>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
>>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
>>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
>>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
>>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
>>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
>>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
>>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
>>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
>>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
>>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
>>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
>>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
>>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
>>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
>>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
>>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
>>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
>>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
>>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
>>>>>
>>>>> With Changes:
>>>>>
>>>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
>>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
>>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
>>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
>>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
>>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
>>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
>>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
>>>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
>>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
>>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
>>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
>>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
>>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
>>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
>>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
>>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
>>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
>>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
>>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
>>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
>>>>>
>>>>> Please review the patch.
>>>>>
>>>>> Best Regards,
>>>>> Jatin
>>>>>
>>>>> [1] Section 17.7:
>>>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>>>