RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Mar 24 19:03:27 UTC 2020
Hi Jatin,
I tried to submit the patches for testing, but windows-x64 build failed
with the following errors:
src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not evaluate to a constant
src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read of a variable outside its lifetime
src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int ['function']' is not assignable
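
For reference, these look like the usual MSVC complaints about a
variable-length array (a GCC/Clang extension, not standard C++). A minimal
standalone illustration of the pattern and one possible fix (made-up names,
not the actual compile.cpp code):

  #include <vector>

  void evaluate(int partition_size) {
    // int results[partition_size];   // VLA: accepted by GCC/Clang, rejected
    //                                // by MSVC with C2131 / C3863 as above
    std::vector<int> results(partition_size);   // portable alternative; inside
    // HotSpot a ResourceArea-backed array or GrowableArray would be the more
    // idiomatic choice.
    (void)results;
  }
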
Best regards,
Vladimir Ivanov
On 24.03.2020 10:34, Bhateja, Jatin wrote:
> Hi Vladimir,
>
> Thanks for your comments. I have split the original patch into two sub-patches.
>
> 1) Optimized NotV handling:
> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>
> 2) Changes for MacroLogic opt:
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>
> Added a new flag "UseVectorMacroLogic" which guards MacroLogic optimization.
>
> Kindly review and let me know your feedback.
>
> Best Regards,
> Jatin
>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Tuesday, March 17, 2020 4:31 PM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
>> dev at openjdk.java.net
>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>
>>
>>> Patch : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>
>> Very nice contribution, Jatin!
>>
>> Some comments after a brief review pass:
>>
>> * Please, contribute NotV part separately.
>>
>> * Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation
>> during GVN instead?
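>> E.g., something along these lines in XorVNode::Ideal() (just a rough,
>> hypothetical sketch: NotVNode and the is_all_ones_vector() helper are
>> assumed to come from your NotV patch, they are not existing code):
>>
>>   // Hypothetical sketch only; NotVNode / is_all_ones_vector() are assumed.
>>   Node* XorVNode::Ideal(PhaseGVN* phase, bool can_reshape) {
>>     // (XorV v all_ones) => (NotV v), done as a local GVN rewrite
>>     if (is_all_ones_vector(phase, in(2))) {
>>       return new NotVNode(in(1), bottom_type()->is_vect());
>>     }
>>     if (is_all_ones_vector(phase, in(1))) {
>>       return new NotVNode(in(2), bottom_type()->is_vect());
>>     }
>>     return NULL;  // no transformation found
>>   }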
>>
>> * As of now, vector nodes are only produced by SuperWord analysis. It makes
>> sense to limit the new optimization pass to the SuperWord pass only
>> (probably, introduce a new dedicated Phase). Once the Vector API is
>> available, it can be extended to cases where vector nodes are present
>> (C->max_vector_size() > 0).
>>
>> * There are more efficient ways to produce a vector of all-1s [1] [2].
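>>
>> For example (a sketch of the idioms from [1]/[2], written as intrinsics;
>> the matcher/assembler equivalents would be vpcmpeqd / vpternlogd):
>>
>>   #include <immintrin.h>
>>
>>   // Materialize an all-ones vector without loading a constant from memory.
>>   __m256i all_ones_avx2(__m256i any) {
>>     return _mm256_cmpeq_epi32(any, any);          // x == x => all bits set
>>   }
>>
>>   __m512i all_ones_avx512(__m512i any) {
>>     return _mm512_ternarylogic_epi32(any, any, any, 0xFF);  // truth table 0xFF
>>   }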
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>
>> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>>
>>>
>>> A new optimization pass has been added after auto-vectorization; it folds
>>> expression trees involving vector boolean logic operations
>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>> The optimization pass has the following stages:
>>>
>>> 1. Collection stage:
>>>     * Performs a DFS traversal over the Ideal graph and collects the root
>>>       nodes of all vector logic expression trees.
>>> 2. Processing stage:
>>>     * Performs a bottom-up traversal over each expression tree and
>>>       simultaneously folds specific DAG patterns involving boolean logic
>>>       parent and child nodes.
>>>     * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
>>>     * Folding is performed under a constraint on the total number of
>>>       inputs which a MacroLogic node can have; in this case it is 3.
>>>     * A partition is created around a DAG pattern involving a logic parent
>>>       and one or two logic child nodes; it encapsulates the nodes in
>>>       post-order fashion.
>>>     * This partition is then evaluated by traversing over its nodes,
>>>       assigning boolean values to the inputs and performing operations
>>>       over them based on each node's opcode. Each node, along with its
>>>       computed result, is stored in a map which is consulted during the
>>>       evaluation of its user/parent node.
>>>     * After evaluation, a MacroLogic node is created which is equivalent
>>>       to a three-input truth table. The leaf-level inputs of the expression
>>>       tree, along with the result of its evaluation, are fed as inputs to
>>>       this new node (see the sketch below).
>>>     * The entire expression tree is eventually subsumed/replaced by the
>>>       newly created MacroLogic node.
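>>>
>>> For illustration, a small standalone sketch of how the 3-input truth-table
>>> immediate can be computed by evaluating the folded expression over all 8
>>> boolean input combinations (made-up helper, not the webrev code):
>>>
>>>   #include <cstdint>
>>>   #include <cstdio>
>>>
>>>   // Evaluate a 3-input boolean expression for every combination of its
>>>   // inputs and pack the results into an 8-bit truth-table immediate.
>>>   template <typename F>
>>>   uint8_t truth_table_imm(F func) {
>>>     uint8_t imm = 0;
>>>     for (int i = 0; i < 8; i++) {
>>>       bool a = (i >> 2) & 1;   // first input  -> bit 2 of the row index
>>>       bool b = (i >> 1) & 1;   // second input -> bit 1
>>>       bool c = (i >> 0) & 1;   // third input  -> bit 0
>>>       if (func(a, b, c)) {
>>>         imm |= (uint8_t)(1 << i);
>>>       }
>>>     }
>>>     // With this ordering the row index is A*4 + B*2 + C, which should
>>>     // match the vpternlog[d|q] immediate convention.
>>>     return imm;
>>>   }
>>>
>>>   int main() {
>>>     // (A & B) ^ ~C -- note (x XORV all_ones) is first folded to (NOTV x)
>>>     uint8_t imm = truth_table_imm([](bool a, bool b, bool c) {
>>>       return (a & b) ^ !c;
>>>     });
>>>     printf("0x%02X\n", (unsigned)imm);
>>>     return 0;
>>>   }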
>>>
>>>
>>> Following are the JMH benchmark results with and without the changes.
>>>
>>> Without Changes:
>>>
>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
>>>
>>> With Changes:
>>>
>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
>>>
>>> Please review the patch.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>> [1] Section 17.7:
>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>