RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction

Tue Mar 17 11:01:28 UTC 2020

> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/

Very nice contribution, Jatin!

Some comments after a brief review pass:

   * Please contribute the NotV part separately.

   * Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation 
during GVN instead?

   * As of now, vector nodes are only produced by SuperWord analysis. It 
makes sense to limit the new optimization pass to the SuperWord pass only 
(probably by introducing a new dedicated Phase). Once the Vector API is 
available, it can be extended to cases where vector nodes are present 
(C->max_vector_size() > 0).

   * There are more efficient ways to produce a vector of all-1s [1] [2].
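
To illustrate the XorV point above: XOR with an all-ones value is the same as bitwise NOT in every lane, so the rewrite is a purely local identity. The sketch below is plain Java rather than C2 code; the class and method names are hypothetical, and each int array element merely stands in for one vector lane.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the webrev.
final class XorAllOnesIsNot {
    // Scalar stand-in for (XorV v 0xFF..FF): one array element per lane.
    static int[] xorWithAllOnes(int[] v) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[i] ^ -1; // -1 is 0xFFFFFFFF, i.e. all bits set
        }
        return r;
    }

    // Scalar stand-in for (NotV v).
    static int[] not(int[] v) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = ~v[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {0, 1, -1, 0x0F0F0F0F, Integer.MIN_VALUE};
        // Both forms produce identical lanes for every input.
        System.out.println(Arrays.equals(xorWithAllOnes(v), not(v))); // prints "true"
    }
}
```

Because the identity holds for each node in isolation, it does not need any of the context gathered by the new pass.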

Best regards,
Vladimir Ivanov

[1] 
https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently

[2] 
https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits

> 
> A new optimization pass has been added after Auto-Vectorization which folds expression trees involving vector boolean logic operations (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> The optimization pass has the following stages:
> 
>    1.  Collection stage:
>       *   This performs a DFS traversal over the Ideal Graph and collects the root nodes of all vector logic expression trees.
>    2.  Processing stage:
>       *   Performs a bottom-up traversal over the expression tree and simultaneously folds specific DAG patterns involving Boolean logic parent and child nodes.
>       *   Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding.
>       *   Folding is performed under a constraint on the total number of inputs a MacroLogic node can have, which in this case is 3.
>       *   A partition is created around a DAG pattern involving a logic parent and one or two logic child nodes; it encapsulates the nodes in post-order fashion.
>       *   This partition is then evaluated by traversing over the nodes, assigning boolean values to its inputs and performing operations over them based on each node's Opcode. Each node, along with its computed result, is stored in a map which is accessed during the evaluation of its user/parent node.
>       *   After evaluation, a MacroLogic node is created which is equivalent to a three-input truth table. The expression tree's leaf-level inputs, along with the result of its evaluation, are the inputs fed to this new node.
>       *   The entire expression tree is eventually subsumed/replaced by the newly created MacroLogic node.
> 
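
For readers following along, the evaluation step described above can be sketched in plain Java. This is not the code from the webrev: the class, interface, and method names are hypothetical, and the row ordering (row index = a:b:c, with a as the most significant bit) is an assumption chosen to match the usual way an 8-bit ternary-logic immediate is described. The idea is that feeding the three inputs the bit patterns 0xF0, 0xCC and 0xAA evaluates all eight truth-table rows at once.

```java
// Hypothetical illustration only; names and input ordering are assumptions.
final class TernaryTruthTable {
    @FunctionalInterface
    interface Logic3 {
        int apply(int a, int b, int c);
    }

    // Bit i of each constant is the value of that input in truth-table row i,
    // where i = (a << 2) | (b << 1) | c.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Evaluates all eight rows at once by applying f to the patterns above.
    static int truthTable(Logic3 f) {
        return f.apply(A, B, C) & 0xFF;
    }

    // Reference version: evaluates one row at a time with 0/1 inputs.
    static int truthTableByRows(Logic3 f) {
        int table = 0;
        for (int row = 0; row < 8; row++) {
            int a = (row >> 2) & 1;
            int b = (row >> 1) & 1;
            int c = row & 1;
            table |= (f.apply(a, b, c) & 1) << row;
        }
        return table;
    }

    public static void main(String[] args) {
        Logic3 andOr = (a, b, c) -> (a & b) | c;
        Logic3 xor3 = (a, b, c) -> a ^ b ^ c;
        System.out.printf("0x%02X%n", truthTable(andOr)); // prints 0xEA
        System.out.printf("0x%02X%n", truthTable(xor3));  // prints 0x96
        System.out.println(truthTable(andOr) == truthTableByRows(andOr)); // prints "true"
    }
}
```

Under the same assumptions, a function that ignores its inputs and always returns all-ones yields the table 0xFF, and both `~a` and `a ^ -1` yield 0x0F.
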
> 
> Following are the JMH benchmark results with and without the changes.
> 
> Without Changes:
> 
> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> MacroLogicOpt.workload1_caller             64  thrpt       2904.480          ops/s
> MacroLogicOpt.workload1_caller            128  thrpt       2219.252          ops/s
> MacroLogicOpt.workload1_caller            256  thrpt       1507.267          ops/s
> MacroLogicOpt.workload1_caller            512  thrpt        860.926          ops/s
> MacroLogicOpt.workload1_caller           1024  thrpt        470.163          ops/s
> MacroLogicOpt.workload1_caller           2048  thrpt        246.608          ops/s
> MacroLogicOpt.workload1_caller           4096  thrpt        108.031          ops/s
> MacroLogicOpt.workload2_caller             64  thrpt        344.633          ops/s
> MacroLogicOpt.workload2_caller            128  thrpt        209.818          ops/s
> MacroLogicOpt.workload2_caller            256  thrpt        111.678          ops/s
> MacroLogicOpt.workload2_caller            512  thrpt         53.360          ops/s
> MacroLogicOpt.workload2_caller           1024  thrpt         27.888          ops/s
> MacroLogicOpt.workload2_caller           2048  thrpt         12.103          ops/s
> MacroLogicOpt.workload2_caller           4096  thrpt          6.018          ops/s
> MacroLogicOpt.workload3_caller             64  thrpt       3110.669          ops/s
> MacroLogicOpt.workload3_caller            128  thrpt       1996.861          ops/s
> MacroLogicOpt.workload3_caller            256  thrpt        870.166          ops/s
> MacroLogicOpt.workload3_caller            512  thrpt        389.629          ops/s
> MacroLogicOpt.workload3_caller           1024  thrpt        151.203          ops/s
> MacroLogicOpt.workload3_caller           2048  thrpt         75.086          ops/s
> MacroLogicOpt.workload3_caller           4096  thrpt         37.576          ops/s
> 
> With Changes:
> 
> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> MacroLogicOpt.workload1_caller             64  thrpt       3306.670          ops/s
> MacroLogicOpt.workload1_caller            128  thrpt       2936.851          ops/s
> MacroLogicOpt.workload1_caller            256  thrpt       2413.827          ops/s
> MacroLogicOpt.workload1_caller            512  thrpt       1440.291          ops/s
> MacroLogicOpt.workload1_caller           1024  thrpt        707.576          ops/s
> MacroLogicOpt.workload1_caller           2048  thrpt        384.863          ops/s
> MacroLogicOpt.workload1_caller           4096  thrpt        132.753          ops/s
> MacroLogicOpt.workload2_caller             64  thrpt        450.856          ops/s
> MacroLogicOpt.workload2_caller            128  thrpt        323.925          ops/s
> MacroLogicOpt.workload2_caller            256  thrpt        135.191          ops/s
> MacroLogicOpt.workload2_caller            512  thrpt         69.424          ops/s
> MacroLogicOpt.workload2_caller           1024  thrpt         35.744          ops/s
> MacroLogicOpt.workload2_caller           2048  thrpt         14.168          ops/s
> MacroLogicOpt.workload2_caller           4096  thrpt          7.245          ops/s
> MacroLogicOpt.workload3_caller             64  thrpt       3333.550          ops/s
> MacroLogicOpt.workload3_caller            128  thrpt       2269.428          ops/s
> MacroLogicOpt.workload3_caller            256  thrpt        995.691          ops/s
> MacroLogicOpt.workload3_caller            512  thrpt        412.452          ops/s
> MacroLogicOpt.workload3_caller           1024  thrpt        151.157          ops/s
> MacroLogicOpt.workload3_caller           2048  thrpt         75.079          ops/s
> MacroLogicOpt.workload3_caller           4096  thrpt         37.158          ops/s
> 
> Please review the patch.
> 
> Best Regards,
> Jatin
> 
> [1] Section 17.7 : https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>