RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Tue Mar 24 07:34:49 UTC 2020
Hi Vladimir,
Thanks for your comments. I have split the original patch into two sub-patches:
1) Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
2) Changes for MacroLogic opt:
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
I have also added a new flag, "UseVectorMacroLogic", which guards the MacroLogic optimization.
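As a boolean -XX flag it can presumably be toggled on the command line in the
usual way (whether it also needs an unlock option depends on how the flag is
declared in the patch), e.g.:

    java -XX:+UseVectorMacroLogic <application>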
Kindly review and let me know your feedback.
Best Regards,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Tuesday, March 17, 2020 4:31 PM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
>
> > Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>
> Very nice contribution, Jatin!
>
> Some comments after a brief review pass:
>
> * Please, contribute NotV part separately.
>
> * Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation
> during GVN instead?
>
> * As of now, vector nodes are only produced by SuperWord analysis. It makes
> sense to limit the new optimization pass to the SuperWord pass only (probably
> by introducing a new dedicated Phase). Once the Vector API is available, it
> can be extended to cases where vector nodes are present
> (C->max_vector_size() > 0).
>
> * There are more efficient ways to produce a vector of all-1s [1] [2].
>
> Best regards,
> Vladimir Ivanov
>
> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>
> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>
> >
> > A new optimization pass has been added after auto-vectorization which
> > folds expression trees involving vector boolean logic operations
> > (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> > The optimization pass has the following stages:
> >
> > 1. Collection stage:
> > * Performs a DFS traversal over the Ideal Graph and collects the root
> > nodes of all vector logic expression trees.
> > 2. Processing stage:
> > * Performs a bottom-up traversal over the expression tree and
> > simultaneously folds specific DAG patterns involving boolean logic parent
> > and child nodes.
> > * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> > * Folding is performed under a constraint on the total number of inputs
> > a MacroLogic node can have, which in this case is 3.
> > * A partition is created around a DAG pattern involving a logic parent and
> > one or two logic child nodes; it encapsulates the nodes in post-order
> > fashion.
> > * This partition is then evaluated by traversing over its nodes, assigning
> > boolean values to the leaf inputs and applying each node's operation
> > according to its opcode. Each node, along with its computed result, is
> > stored in a map which is consulted during the evaluation of its
> > user/parent node (see the sketch below).
> > * After evaluation, a MacroLogic node is created which is equivalent to a
> > three-input truth table. The leaf-level inputs of the expression tree,
> > along with the result of the evaluation, are fed to this new node.
> > * The entire expression tree is eventually subsumed/replaced by the newly
> > created MacroLogic node.
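For illustration, the evaluation stage amounts to computing an 8-bit truth
table for the folded expression over its three leaf inputs. A minimal Java
sketch of that idea (class and method names are hypothetical and not taken
from the patch; the expression (A & B) ^ ~C merely stands in for a folded
tree):

    // Hypothetical sketch: derive the 8-bit truth-table constant that a
    // three-input ternary-logic operation consumes, by evaluating the
    // folded expression for every combination of its leaf inputs A, B, C.
    public class TruthTableSketch {
        // Stand-in for a folded expression tree: (A & B) ^ (~C).
        static int eval(int a, int b, int c) {
            return ((a & b) ^ (~c)) & 1;          // keep only the low bit
        }

        static int truthTableImm() {
            int imm = 0;
            for (int i = 0; i < 8; i++) {
                int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
                imm |= eval(a, b, c) << i;        // bit i = result for (a, b, c)
            }
            return imm;                           // fits in a single byte
        }

        public static void main(String[] args) {
            System.out.printf("imm8 = 0x%02X%n", truthTableImm());
        }
    }

The exact bit ordering expected by the instruction encoding is a backend
detail; the point is that an arbitrary three-input boolean expression
collapses into one constant plus its three leaf inputs.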
> >
> >
> > Following are the JMH benchmark results with and without the changes (a
> > hypothetical sketch of the kind of loop this optimization targets follows
> > the tables).
> >
> > Without Changes:
> >
> > Benchmark (VECLEN) Mode Score Units
> > MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
> > MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
> > MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
> > MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
> > MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
> > MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
> > MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
> > MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
> > MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
> > MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
> > MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
> > MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
> > MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
> > MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
> > MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
> > MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
> > MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
> > MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
> > MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
> > MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
> > MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
> >
> > With Changes:
> >
> > Benchmark (VECLEN) Mode Score Units
> > MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
> > MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
> > MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
> > MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
> > MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
> > MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
> > MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
> > MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
> > MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
> > MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
> > MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
> > MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
> > MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
> > MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
> > MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
> > MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
> > MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
> > MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
> > MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
> > MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
> > MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
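The workload bodies are not reproduced here; purely as a hypothetical
example, the kind of loop this pass targets is one that SuperWord can
vectorize and whose body combines up to three inputs with bitwise logic:

    // Hypothetical kernel (not the actual JMH workload): SuperWord can
    // vectorize this loop, and the new pass can then fold the resulting
    // AndV/OrV/XorV/NotV chain into a single MacroLogic node per vector,
    // since only three distinct inputs (a, b, c) feed the expression.
    static void kernel(int[] r, int[] a, int[] b, int[] c) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (a[i] & b[i]) ^ (~c[i] | b[i]);
        }
    }

On AVX-512 hardware the four bitwise operations per element can then, in
principle, be covered by a single ternary-logic instruction per vector.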
> >
> > Please review the patch.
> >
> > Best Regards,
> > Jatin
> >
> > [1] Section 17.7:
> > https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> >