RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Tue Mar 24 07:34:49 UTC 2020
Hi Vladimir,
Thanks for your comments. I have split the original patch into two sub-patches:
1) Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
2) Changes for MacroLogic opt:
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
I have also added a new flag, "UseVectorMacroLogic", which guards the MacroLogic optimization.
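As a boolean -XX flag it can presumably be toggled on the command line in the
usual way (whether it also needs an unlock option depends on how the flag is
declared in the patch), e.g.:

    java -XX:+UseVectorMacroLogic <application>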
Kindly review and let me know your feedback.
Best Regards,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Tuesday, March 17, 2020 4:31 PM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
>
> > Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>
> Very nice contribution, Jatin!
>
> Some comments after a brief review pass:
>
> * Please, contribute NotV part separately.
>
> * Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation
> during GVN instead?
>
> * As of now, vector nodes are only produced by SuperWord analysis. It makes
> sense to limit the new optimization pass to the SuperWord pass only (probably
> by introducing a new dedicated Phase). Once the Vector API is available, it
> can be extended to cases where vector nodes are present
> (C->max_vector_size() > 0).
>
> * There are more efficient ways to produce a vector of all-1s [1] [2].
>
> Best regards,
> Vladimir Ivanov
>
> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>
> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>
> >
> > A new optimization pass has been added after auto-vectorization which
> > folds expression trees involving vector boolean logic operations
> > (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> > The optimization pass has the following stages:
> >
> > 1. Collection stage:
> > * Performs a DFS traversal over the Ideal Graph and collects the root
> > nodes of all vector logic expression trees.
> > 2. Processing stage:
> > * Performs a bottom-up traversal over the expression tree and
> > simultaneously folds specific DAG patterns involving boolean logic parent
> > and child nodes.
> > * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> > * Folding is performed under a constraint on the total number of inputs
> > a MacroLogic node can have, which in this case is 3.
> > * A partition is created around a DAG pattern involving a logic parent and
> > one or two logic child nodes; it encapsulates the nodes in post-order
> > fashion.
> > * This partition is then evaluated by traversing over its nodes, assigning
> > boolean values to the leaf inputs and applying each node's operation
> > according to its opcode. Each node, along with its computed result, is
> > stored in a map which is consulted during the evaluation of its
> > user/parent node (see the sketch below).
> > * After evaluation, a MacroLogic node is created which is equivalent to a
> > three-input truth table. The leaf-level inputs of the expression tree,
> > along with the result of the evaluation, are fed to this new node.
> > * The entire expression tree is eventually subsumed/replaced by the newly
> > created MacroLogic node.
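For illustration, the evaluation stage amounts to computing an 8-bit truth
table for the folded expression over its three leaf inputs. A minimal Java
sketch of that idea (class and method names are hypothetical and not taken
from the patch; the expression (A & B) ^ ~C merely stands in for a folded
tree):

    // Hypothetical sketch: derive the 8-bit truth-table constant that a
    // three-input ternary-logic operation consumes, by evaluating the
    // folded expression for every combination of its leaf inputs A, B, C.
    public class TruthTableSketch {
        // Stand-in for a folded expression tree: (A & B) ^ (~C).
        static int eval(int a, int b, int c) {
            return ((a & b) ^ (~c)) & 1;          // keep only the low bit
        }

        static int truthTableImm() {
            int imm = 0;
            for (int i = 0; i < 8; i++) {
                int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
                imm |= eval(a, b, c) << i;        // bit i = result for (a, b, c)
            }
            return imm;                           // fits in a single byte
        }

        public static void main(String[] args) {
            System.out.printf("imm8 = 0x%02X%n", truthTableImm());
        }
    }

The exact bit ordering expected by the instruction encoding is a backend
detail; the point is that an arbitrary three-input boolean expression
collapses into one constant plus its three leaf inputs.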
> >
> >
> > Following are the JMH benchmark results with and without the changes (a
> > hypothetical sketch of the kind of loop this optimization targets follows
> > the tables).
> >
> > Without Changes:
> >
> > Benchmark (VECLEN) Mode Score Units
> > MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
> > MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
> > MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
> > MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
> > MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
> > MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
> > MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
> > MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
> > MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
> > MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
> > MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
> > MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
> > MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
> > MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
> > MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
> > MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
> > MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
> > MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
> > MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
> > MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
> > MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
> >
> > With Changes:
> >
> > Benchmark (VECLEN) Mode Score Units
> > MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
> > MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
> > MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
> > MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
> > MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
> > MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
> > MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
> > MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
> > MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
> > MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
> > MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
> > MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
> > MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
> > MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
> > MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
> > MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
> > MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
> > MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
> > MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
> > MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
> > MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
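The workload bodies are not reproduced here; purely as a hypothetical
example, the kind of loop this pass targets is one that SuperWord can
vectorize and whose body combines up to three inputs with bitwise logic:

    // Hypothetical kernel (not the actual JMH workload): SuperWord can
    // vectorize this loop, and the new pass can then fold the resulting
    // AndV/OrV/XorV/NotV chain into a single MacroLogic node per vector,
    // since only three distinct inputs (a, b, c) feed the expression.
    static void kernel(int[] r, int[] a, int[] b, int[] c) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (a[i] & b[i]) ^ (~c[i] | b[i]);
        }
    }

On AVX-512 hardware the four bitwise operations per element can then, in
principle, be covered by a single ternary-logic instruction per vector.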
> >
> > Please review the patch.
> >
> > Best Regards,
> > Jatin
> >
> > [1] Section 17.7:
> > https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> >