RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Mar 17 11:01:28 UTC 2020
> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
Very nice contribution, Jatin!
Some comments after a brief review pass:
* Please contribute the NotV part separately.
* Why don't you perform the (XorV v 0xFF..FF) => (NotV v) transformation
during GVN instead?
* As of now, vector nodes are only produced by SuperWord analysis. It
makes sense to limit the new optimization pass to the SuperWord pass only
(probably by introducing a new dedicated Phase). Once the Vector API is
available, it can be extended to cases when vector nodes are present
(C->max_vector_size() > 0).
* There are more efficient ways to produce a vector of all-1s [1] [2].
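
To illustrate the (XorV v 0xFF..FF) => (NotV v) suggestion, here is a minimal sketch of such a local rewrite on a toy expression IR. The classes Node, Input, Const, Xor and Not and the method idealize are hypothetical names made up for this example; they are not HotSpot's node classes or its GVN API, and 64-bit longs stand in for vector lanes.

```java
final class XorToNot {
    // Hypothetical toy IR; eval(v) computes the node's value for input v.
    static abstract class Node { abstract long eval(long v); }

    static final class Input extends Node {
        long eval(long v) { return v; }
    }

    static final class Const extends Node {
        final long bits;
        Const(long bits) { this.bits = bits; }
        long eval(long v) { return bits; }
    }

    static final class Xor extends Node {
        final Node left, right;
        Xor(Node left, Node right) { this.left = left; this.right = right; }
        long eval(long v) { return left.eval(v) ^ right.eval(v); }
    }

    static final class Not extends Node {
        final Node in;
        Not(Node in) { this.in = in; }
        long eval(long v) { return ~in.eval(v); }
    }

    static boolean isAllOnes(Node n) {
        return n instanceof Const && ((Const) n).bits == -1L;
    }

    // Local rewrite: Xor(x, all-ones) or Xor(all-ones, x) becomes Not(x).
    // Any other node is returned unchanged.
    static Node idealize(Node n) {
        if (n instanceof Xor) {
            Xor x = (Xor) n;
            if (isAllOnes(x.right)) return new Not(x.left);
            if (isAllOnes(x.left)) return new Not(x.right);
        }
        return n;
    }

    public static void main(String[] args) {
        Node before = new Xor(new Input(), new Const(-1L));
        Node after = idealize(before);
        long v = 0x0123456789ABCDEFL;
        System.out.println(after instanceof Not);
        System.out.println(before.eval(v) == after.eval(v));
    }
}
```

Because the rule inspects only a single node and its direct inputs, it needs no separate traversal of the graph, which is why it fits a per-node transformation step.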
Best regards,
Vladimir Ivanov
[1]
https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
[2]
https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>
> A new optimization pass has been added after auto-vectorization; it folds expression trees involving vector boolean logic operations (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> The optimization pass has the following stages:
>
> 1. Collection stage:
> * This performs a DFS traversal over the Ideal Graph and collects the root nodes of all vector logic expression trees.
> 2. Processing stage:
> * Performs a bottom-up traversal over the expression tree and simultaneously folds specific DAG patterns involving Boolean logic parent and child nodes.
> * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> * Folding is performed under a constraint on the total number of inputs which a MacroLogic node can have; in this case it's 3.
> * A partition is created around a DAG pattern involving a logic parent and one or two logic child nodes; it encapsulates the nodes in post-order fashion.
> * This partition is then evaluated by traversing its nodes, assigning boolean values to its inputs and performing operations on them based on each node's opcode. Each node, along with its computed result, is stored in a map which is accessed during the evaluation of its user/parent node.
> * After evaluation, a MacroLogic node is created which is equivalent to a three-input truth table. The expression tree's leaf-level inputs, along with the result of its evaluation, are the inputs fed to this new node.
> * The entire expression tree is eventually subsumed/replaced by the newly created MacroLogic node.
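
The evaluation step described above can be sketched as follows. This is a minimal illustration, not the code from the webrev: the class TernaryLogicTable, the Expr interface and the helper names are assumptions made for this example. It evaluates a three-input logic expression over all eight input combinations at once by giving each input a constant truth-table column. The row index (a << 2) | (b << 1) | c is the layout the Intel documentation gives for the vpternlog immediate; whether the patch evaluates its partitions in exactly this way is not shown in the description.

```java
final class TernaryLogicTable {
    // Truth-table columns for the three inputs. Bit i of each constant is the
    // value of that input in row i, where i = (a << 2) | (b << 1) | c.
    static final int A = 0xF0, B = 0xCC, C = 0xAA;

    // An expression evaluated on all 8 rows at once (one row per bit).
    interface Expr { int eval(); }

    static Expr input(int column)   { return () -> column; }
    static Expr and(Expr l, Expr r) { return () -> l.eval() & r.eval(); }
    static Expr or(Expr l, Expr r)  { return () -> l.eval() | r.eval(); }
    static Expr xor(Expr l, Expr r) { return () -> l.eval() ^ r.eval(); }
    static Expr not(Expr e)         { return () -> ~e.eval(); }

    // Fold a whole expression tree into one 8-bit truth table.
    static int truthTable(Expr root) { return root.eval() & 0xFF; }

    // Reference semantics of a three-input table applied bitwise to ints:
    // for every bit position, the three input bits select one table entry.
    static int applyTable(int table, int a, int b, int c) {
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            int row = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    |  ((c >>> bit) & 1);
            result |= ((table >>> row) & 1) << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        // (a AND b) XOR (NOT c): three logic nodes folded into one table.
        Expr tree = xor(and(input(A), input(B)), not(input(C)));
        int table = truthTable(tree);
        System.out.printf("table = 0x%02X%n", table);

        int a = 0x12345678, b = 0x0F0F0F0F, c = 0xDEADBEEF;
        System.out.println(applyTable(table, a, b, c) == ((a & b) ^ ~c));
    }
}
```

In this sketch the single table value plays the role of the extra input carried by the folded node: the three leaf inputs plus the table reproduce the original expression.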
>
>
> The following are the JMH benchmark results with and without the changes.
>
> Without Changes:
>
> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
> MacroLogicOpt.workload1_caller        64  thrpt       2904.480         ops/s
> MacroLogicOpt.workload1_caller       128  thrpt       2219.252         ops/s
> MacroLogicOpt.workload1_caller       256  thrpt       1507.267         ops/s
> MacroLogicOpt.workload1_caller       512  thrpt        860.926         ops/s
> MacroLogicOpt.workload1_caller      1024  thrpt        470.163         ops/s
> MacroLogicOpt.workload1_caller      2048  thrpt        246.608         ops/s
> MacroLogicOpt.workload1_caller      4096  thrpt        108.031         ops/s
> MacroLogicOpt.workload2_caller        64  thrpt        344.633         ops/s
> MacroLogicOpt.workload2_caller       128  thrpt        209.818         ops/s
> MacroLogicOpt.workload2_caller       256  thrpt        111.678         ops/s
> MacroLogicOpt.workload2_caller       512  thrpt         53.360         ops/s
> MacroLogicOpt.workload2_caller      1024  thrpt         27.888         ops/s
> MacroLogicOpt.workload2_caller      2048  thrpt         12.103         ops/s
> MacroLogicOpt.workload2_caller      4096  thrpt          6.018         ops/s
> MacroLogicOpt.workload3_caller        64  thrpt       3110.669         ops/s
> MacroLogicOpt.workload3_caller       128  thrpt       1996.861         ops/s
> MacroLogicOpt.workload3_caller       256  thrpt        870.166         ops/s
> MacroLogicOpt.workload3_caller       512  thrpt        389.629         ops/s
> MacroLogicOpt.workload3_caller      1024  thrpt        151.203         ops/s
> MacroLogicOpt.workload3_caller      2048  thrpt         75.086         ops/s
> MacroLogicOpt.workload3_caller      4096  thrpt         37.576         ops/s
>
> With Changes:
>
> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
> MacroLogicOpt.workload1_caller        64  thrpt       3306.670         ops/s
> MacroLogicOpt.workload1_caller       128  thrpt       2936.851         ops/s
> MacroLogicOpt.workload1_caller       256  thrpt       2413.827         ops/s
> MacroLogicOpt.workload1_caller       512  thrpt       1440.291         ops/s
> MacroLogicOpt.workload1_caller      1024  thrpt        707.576         ops/s
> MacroLogicOpt.workload1_caller      2048  thrpt        384.863         ops/s
> MacroLogicOpt.workload1_caller      4096  thrpt        132.753         ops/s
> MacroLogicOpt.workload2_caller        64  thrpt        450.856         ops/s
> MacroLogicOpt.workload2_caller       128  thrpt        323.925         ops/s
> MacroLogicOpt.workload2_caller       256  thrpt        135.191         ops/s
> MacroLogicOpt.workload2_caller       512  thrpt         69.424         ops/s
> MacroLogicOpt.workload2_caller      1024  thrpt         35.744         ops/s
> MacroLogicOpt.workload2_caller      2048  thrpt         14.168         ops/s
> MacroLogicOpt.workload2_caller      4096  thrpt          7.245         ops/s
> MacroLogicOpt.workload3_caller        64  thrpt       3333.550         ops/s
> MacroLogicOpt.workload3_caller       128  thrpt       2269.428         ops/s
> MacroLogicOpt.workload3_caller       256  thrpt        995.691         ops/s
> MacroLogicOpt.workload3_caller       512  thrpt        412.452         ops/s
> MacroLogicOpt.workload3_caller      1024  thrpt        151.157         ops/s
> MacroLogicOpt.workload3_caller      2048  thrpt         75.079         ops/s
> MacroLogicOpt.workload3_caller      4096  thrpt         37.158         ops/s
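
To read the two tables side by side, the throughput ratio (score with changes divided by score without) can be computed per row. The sketch below does this for three rows whose scores are copied from the tables above; the class and method names are made up for this example, and since the Error column is empty in both tables, small differences should be read with caution.

```java
final class SpeedupCheck {
    // Throughput ratio: values above 1.0 mean the patched build is faster.
    static double speedup(double without, double with) {
        return with / without;
    }

    public static void main(String[] args) {
        // Scores in ops/s, copied from the two tables above.
        System.out.printf("workload1, VECLEN=64:   %.3f%n", speedup(2904.480, 3306.670));
        System.out.printf("workload1, VECLEN=512:  %.3f%n", speedup(860.926, 1440.291));
        System.out.printf("workload3, VECLEN=1024: %.3f%n", speedup(151.203, 151.157));
    }
}
```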
>
> Please review the patch.
>
> Best Regards,
> Jatin
>
> [1] Section 17.7 : https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>