RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Mar 24 19:03:27 UTC 2020
Hi Jatin,
I tried to submit the patches for testing, but windows-x64 build failed
with the following errors:
src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not evaluate to a constant
src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read of a variable outside its lifetime
src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int ['function']' is not assignable
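
For reference, these look like the usual MSVC complaints about a
variable-length array (a GCC/Clang extension, not standard C++). A minimal
standalone illustration of the pattern and one possible fix (made-up names,
not the actual compile.cpp code):

  #include <vector>

  void evaluate(int partition_size) {
    // int results[partition_size];   // VLA: accepted by GCC/Clang, rejected
    //                                // by MSVC with C2131 / C3863 as above
    std::vector<int> results(partition_size);   // portable alternative; inside
    // HotSpot a ResourceArea-backed array or GrowableArray would be the more
    // idiomatic choice.
    (void)results;
  }
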
Best regards,
Vladimir Ivanov
On 24.03.2020 10:34, Bhateja, Jatin wrote:
> Hi Vladimir,
>
> Thanks for your comments. I have split the original patch into two sub-patches.
>
> 1) Optimized NotV handling:
> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>
> 2) Changes for MacroLogic opt:
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>
> Added a new flag "UseVectorMacroLogic" which guards MacroLogic optimization.
>
> Kindly review and let me know your feedback.
>
> Best Regards,
> Jatin
>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Tuesday, March 17, 2020 4:31 PM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
>> dev at openjdk.java.net
>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>
>>
>>> Patch : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>
>> Very nice contribution, Jatin!
>>
>> Some comments after a brief review pass:
>>
>> * Please, contribute NotV part separately.
>>
>> * Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation
>> during GVN instead?
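>> E.g., something along these lines in XorVNode::Ideal() (just a rough,
>> hypothetical sketch: NotVNode and the is_all_ones_vector() helper are
>> assumed to come from your NotV patch, they are not existing code):
>>
>>   // Hypothetical sketch only; NotVNode / is_all_ones_vector() are assumed.
>>   Node* XorVNode::Ideal(PhaseGVN* phase, bool can_reshape) {
>>     // (XorV v all_ones) => (NotV v), done as a local GVN rewrite
>>     if (is_all_ones_vector(phase, in(2))) {
>>       return new NotVNode(in(1), bottom_type()->is_vect());
>>     }
>>     if (is_all_ones_vector(phase, in(1))) {
>>       return new NotVNode(in(2), bottom_type()->is_vect());
>>     }
>>     return NULL;  // no transformation found
>>   }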
>>
>> * As of now, vector nodes are only produced by SuperWord analysis. It makes
>> sense to limit the new optimization pass to the SuperWord pass only
>> (probably, introduce a new dedicated Phase). Once the Vector API is
>> available, it can be extended to cases where vector nodes are present
>> (C->max_vector_size() > 0).
>>
>> * There are more efficient ways to produce a vector of all-1s [1] [2].
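>>
>> For example (a sketch of the idioms from [1]/[2], written as intrinsics;
>> the matcher/assembler equivalents would be vpcmpeqd / vpternlogd):
>>
>>   #include <immintrin.h>
>>
>>   // Materialize an all-ones vector without loading a constant from memory.
>>   __m256i all_ones_avx2(__m256i any) {
>>     return _mm256_cmpeq_epi32(any, any);          // x == x => all bits set
>>   }
>>
>>   __m512i all_ones_avx512(__m512i any) {
>>     return _mm512_ternarylogic_epi32(any, any, any, 0xFF);  // truth table 0xFF
>>   }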
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>
>> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>>
>>>
>>> A new optimization pass has been added after auto-vectorization; it folds
>>> expression trees involving vector boolean logic operations
>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>> The optimization pass has the following stages:
>>>
>>> 1. Collection stage:
>>>     * Performs a DFS traversal over the Ideal graph and collects the root
>>>       nodes of all vector logic expression trees.
>>> 2. Processing stage:
>>>     * Performs a bottom-up traversal over each expression tree and
>>>       simultaneously folds specific DAG patterns involving boolean logic
>>>       parent and child nodes.
>>>     * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
>>>     * Folding is performed under a constraint on the total number of
>>>       inputs which a MacroLogic node can have; in this case it is 3.
>>>     * A partition is created around a DAG pattern involving a logic parent
>>>       and one or two logic child nodes; it encapsulates the nodes in
>>>       post-order fashion.
>>>     * This partition is then evaluated by traversing over its nodes,
>>>       assigning boolean values to the inputs and performing operations
>>>       over them based on each node's opcode. Each node, along with its
>>>       computed result, is stored in a map which is consulted during the
>>>       evaluation of its user/parent node.
>>>     * After evaluation, a MacroLogic node is created which is equivalent
>>>       to a three-input truth table. The leaf-level inputs of the expression
>>>       tree, along with the result of its evaluation, are fed as inputs to
>>>       this new node (see the sketch below).
>>>     * The entire expression tree is eventually subsumed/replaced by the
>>>       newly created MacroLogic node.
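>>>
>>> For illustration, a small standalone sketch of how the 3-input truth-table
>>> immediate can be computed by evaluating the folded expression over all 8
>>> boolean input combinations (made-up helper, not the webrev code):
>>>
>>>   #include <cstdint>
>>>   #include <cstdio>
>>>
>>>   // Evaluate a 3-input boolean expression for every combination of its
>>>   // inputs and pack the results into an 8-bit truth-table immediate.
>>>   template <typename F>
>>>   uint8_t truth_table_imm(F func) {
>>>     uint8_t imm = 0;
>>>     for (int i = 0; i < 8; i++) {
>>>       bool a = (i >> 2) & 1;   // first input  -> bit 2 of the row index
>>>       bool b = (i >> 1) & 1;   // second input -> bit 1
>>>       bool c = (i >> 0) & 1;   // third input  -> bit 0
>>>       if (func(a, b, c)) {
>>>         imm |= (uint8_t)(1 << i);
>>>       }
>>>     }
>>>     // With this ordering the row index is A*4 + B*2 + C, which should
>>>     // match the vpternlog[d|q] immediate convention.
>>>     return imm;
>>>   }
>>>
>>>   int main() {
>>>     // (A & B) ^ ~C -- note (x XORV all_ones) is first folded to (NOTV x)
>>>     uint8_t imm = truth_table_imm([](bool a, bool b, bool c) {
>>>       return (a & b) ^ !c;
>>>     });
>>>     printf("0x%02X\n", (unsigned)imm);
>>>     return 0;
>>>   }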
>>>
>>>
>>> Following are the JMH benchmark results with and without the changes.
>>>
>>> Without Changes:
>>>
>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
>>>
>>> With Changes:
>>>
>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
>>>
>>> Please review the patch.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>> [1] Section 17.7:
>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>