RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Mon Mar 16 04:20:56 UTC 2020
Hi All,
Please find below a patch to support AVX-512 Ternary Logic instruction[1].
JBS : https://bugs.openjdk.java.net/browse/JDK-8241040
Patch : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
A new optimization pass has been added after Auto-Vectorization. It folds expression trees involving vector Boolean logic operations (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
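As an illustration only (the actual benchmark kernels live in the webrev and are not reproduced in this mail), a scalar loop of the following shape is the kind of code whose vectorized form would contain chained vector logic nodes. The class and method names are hypothetical, and whether this exact loop gets vectorized and folded depends on the JIT and the target CPU.

```java
public class LogicLoopSketch {
    // Hypothetical kernel: three inputs combined by AND, XOR and NOT.
    // In the Ideal Graph, ~c[i] is expected to appear as (c[i] XOR -1).
    static void kernel(int[] a, int[] b, int[] c, int[] res) {
        for (int i = 0; i < res.length; i++) {
            res[i] = (a[i] & b[i]) ^ ~c[i];
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        int[] a = new int[n], b = new int[n], c = new int[n], res = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = i * 3;
            c[i] = i * 7;
        }
        // Repeated calls give the JIT a chance to compile the loop.
        for (int iter = 0; iter < 20_000; iter++) {
            kernel(a, b, c, res);
        }
        System.out.println(res[n - 1]);
    }
}
```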
The optimization pass has the following stages:
1. Collection stage:
* Performs a DFS traversal over the Ideal Graph and collects the root nodes of all vector logic expression trees.
2. Processing stage:
* Performs a bottom-up traversal over each expression tree, folding specific DAG patterns of Boolean logic parent and child nodes along the way.
* Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding.
* Folding is constrained by the maximum number of inputs a MacroLogic node can have, which in this case is 3.
* A partition is created around a DAG pattern involving a logic parent and one or two logic child nodes; it encapsulates the nodes in post-order fashion.
* This partition is then evaluated by traversing its nodes, assigning Boolean values to its inputs and applying each node's operation according to its opcode. Each node, along with its computed result, is stored in a map which is accessed during the evaluation of its user/parent node.
* After evaluation, a MacroLogic node is created which is equivalent to a three-input truth table. The leaf-level inputs of the expression tree, along with the result of its evaluation, are the inputs fed to this new node.
* The entire expression tree is eventually subsumed/replaced by the newly created MacroLogic node.
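The evaluation step can be sketched in plain Java. This is a minimal illustration, not the code from the patch: it uses the common technique of feeding the bit patterns 0xF0, 0xCC and 0xAA as the three inputs, so that the low 8 bits of the result form the truth table of a three-input function (the same encoding the ternary logic instruction takes as its 8-bit immediate). The class, interface and method names are hypothetical.

```java
public class TruthTableSketch {
    // Bit i of each constant holds that input's value in row i of the
    // 8-row truth table, with row index = (a << 2) | (b << 1) | c.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Hypothetical stand-in for a folded logic expression tree.
    interface Logic3 {
        int apply(int a, int b, int c);
    }

    // Evaluates the expression once over the canonical patterns and
    // keeps the low 8 bits as the truth table.
    static int truthTable(Logic3 f) {
        return f.apply(A, B, C) & 0xFF;
    }

    public static void main(String[] args) {
        int imm = truthTable((a, b, c) -> (a & b) ^ ~c);
        System.out.printf("0x%02X%n", imm); // prints 0x95

        // (XOR x, -1) and (NOT x) yield the same truth table.
        int viaXor = truthTable((a, b, c) -> a ^ -1);
        int viaNot = truthTable((a, b, c) -> ~a);
        System.out.println(viaXor == viaNot); // prints true
    }
}
```

Evaluating all eight rows at once this way avoids an explicit loop over input combinations; an implementation that assigns Boolean values row by row would produce the same table.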
Following are the JMH benchmark results with and without the changes.
Without Changes:
Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
MacroLogicOpt.workload1_caller        64  thrpt       2904.480         ops/s
MacroLogicOpt.workload1_caller       128  thrpt       2219.252         ops/s
MacroLogicOpt.workload1_caller       256  thrpt       1507.267         ops/s
MacroLogicOpt.workload1_caller       512  thrpt        860.926         ops/s
MacroLogicOpt.workload1_caller      1024  thrpt        470.163         ops/s
MacroLogicOpt.workload1_caller      2048  thrpt        246.608         ops/s
MacroLogicOpt.workload1_caller      4096  thrpt        108.031         ops/s
MacroLogicOpt.workload2_caller        64  thrpt        344.633         ops/s
MacroLogicOpt.workload2_caller       128  thrpt        209.818         ops/s
MacroLogicOpt.workload2_caller       256  thrpt        111.678         ops/s
MacroLogicOpt.workload2_caller       512  thrpt         53.360         ops/s
MacroLogicOpt.workload2_caller      1024  thrpt         27.888         ops/s
MacroLogicOpt.workload2_caller      2048  thrpt         12.103         ops/s
MacroLogicOpt.workload2_caller      4096  thrpt          6.018         ops/s
MacroLogicOpt.workload3_caller        64  thrpt       3110.669         ops/s
MacroLogicOpt.workload3_caller       128  thrpt       1996.861         ops/s
MacroLogicOpt.workload3_caller       256  thrpt        870.166         ops/s
MacroLogicOpt.workload3_caller       512  thrpt        389.629         ops/s
MacroLogicOpt.workload3_caller      1024  thrpt        151.203         ops/s
MacroLogicOpt.workload3_caller      2048  thrpt         75.086         ops/s
MacroLogicOpt.workload3_caller      4096  thrpt         37.576         ops/s
With Changes:
Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
MacroLogicOpt.workload1_caller        64  thrpt       3306.670         ops/s
MacroLogicOpt.workload1_caller       128  thrpt       2936.851         ops/s
MacroLogicOpt.workload1_caller       256  thrpt       2413.827         ops/s
MacroLogicOpt.workload1_caller       512  thrpt       1440.291         ops/s
MacroLogicOpt.workload1_caller      1024  thrpt        707.576         ops/s
MacroLogicOpt.workload1_caller      2048  thrpt        384.863         ops/s
MacroLogicOpt.workload1_caller      4096  thrpt        132.753         ops/s
MacroLogicOpt.workload2_caller        64  thrpt        450.856         ops/s
MacroLogicOpt.workload2_caller       128  thrpt        323.925         ops/s
MacroLogicOpt.workload2_caller       256  thrpt        135.191         ops/s
MacroLogicOpt.workload2_caller       512  thrpt         69.424         ops/s
MacroLogicOpt.workload2_caller      1024  thrpt         35.744         ops/s
MacroLogicOpt.workload2_caller      2048  thrpt         14.168         ops/s
MacroLogicOpt.workload2_caller      4096  thrpt          7.245         ops/s
MacroLogicOpt.workload3_caller        64  thrpt       3333.550         ops/s
MacroLogicOpt.workload3_caller       128  thrpt       2269.428         ops/s
MacroLogicOpt.workload3_caller       256  thrpt        995.691         ops/s
MacroLogicOpt.workload3_caller       512  thrpt        412.452         ops/s
MacroLogicOpt.workload3_caller      1024  thrpt        151.157         ops/s
MacroLogicOpt.workload3_caller      2048  thrpt         75.079         ops/s
MacroLogicOpt.workload3_caller      4096  thrpt         37.158         ops/s
Please review the patch.
Best Regards,
Jatin
[1] Section 17.7 : https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf