RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Wed Mar 25 06:44:48 UTC 2020
Hi Vladimir,
I have placed the updated patches at the following links:
1) Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
2) Changes for MacroLogic opt:
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
Kindly review and let me know your feedback.
Thanks,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Wednesday, March 25, 2020 12:33 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
> Hi Jatin,
>
> I tried to submit the patches for testing, but windows-x64 build failed with the
> following errors:
>
> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
> evaluate to a constant
> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
> of a variable outside its lifetime
> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
> ['function']' is not assignable
>
> Best regards,
> Vladimir Ivanov
>
> On 24.03.2020 10:34, Bhateja, Jatin wrote:
> > Hi Vladimir,
> >
> > Thanks for your comments. I have split the original patch into two
> > sub-patches.
> >
> > 1) Optimized NotV handling:
> > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >
> > 2) Changes for MacroLogic opt:
> > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
> >
> > Added a new flag "UseVectorMacroLogic", which guards the MacroLogic
> > optimization.
> >
> > Kindly review and let me know your feedback.
> >
> > Best Regards,
> > Jatin
> >
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Tuesday, March 17, 2020 4:31 PM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
> >> hotspot-compiler-dev at openjdk.java.net
> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >> Instruction
> >>
> >>
> >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
> >>
> >> Very nice contribution, Jatin!
> >>
> >> Some comments after a brief review pass:
> >>
> >> * Please contribute the NotV part separately.
> >>
> >> * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
> >> transformation during GVN instead?
> >>
> >> * As of now, vector nodes are only produced by SuperWord
> >> analysis. It makes sense to limit the new optimization pass to the
> >> SuperWord pass only (probably by introducing a new dedicated Phase).
> >> Once the Vector API is available, it can be extended to cases when
> >> vector nodes are present (C->max_vector_size() > 0).
> >>
> >> * There are more efficient ways to produce a vector of all-1s [1] [2].
> >>
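On the (XorV v 0xFF..FF) => (NotV v) point above: the rewrite rests on the identity x ^ all-ones == ~x, which holds per bit and therefore per vector lane. Below is a minimal Java sketch that checks it on scalar longs standing in for lanes. The class and method names are hypothetical and this is not code from either webrev.

```java
public class XorAllOnesIsNot {
    // Scalar stand-in for one vector lane: XOR with an all-ones mask.
    static long xorAllOnes(long lane) {
        return lane ^ -1L; // -1L has all 64 bits set
    }

    // Returns true when XOR-with-all-ones matches bitwise NOT for every lane.
    static boolean equivalentOn(long[] lanes) {
        for (long lane : lanes) {
            if (xorAllOnes(lane) != ~lane) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] lanes = {0L, -1L, 0x0F0F_0F0F_0F0F_0F0FL, Long.MIN_VALUE, 42L};
        System.out.println(equivalentOn(lanes));
    }
}
```

Because the identity is purely bitwise, it does not depend on the lane width or the vector length.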
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> [1]
> >> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
> >>
> >> [2]
> >> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
> >>
> >>>
> >>> A new optimization pass has been added after auto-vectorization. It
> >>> folds expression trees involving vector boolean logic operations
> >>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> >>> The optimization pass has the following stages:
> >>>
> >>> 1. Collection stage:
> >>> * Performs a DFS traversal over the Ideal Graph and collects the root
> >>> nodes of all vector logic expression trees.
> >>> 2. Processing stage:
> >>> * Performs a bottom-up traversal over each expression tree and
> >>> simultaneously folds specific DAG patterns involving Boolean logic
> >>> parent and child nodes.
> >>> * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> >>> * Folding is performed under a constraint on the total number of
> >>> inputs a MacroLogic node can have, which in this case is 3.
> >>> * A partition is created around a DAG pattern involving a logic
> >>> parent and one or two logic child nodes; it encapsulates the nodes in
> >>> post-order fashion.
> >>> * This partition is then evaluated by traversing its nodes,
> >>> assigning boolean values to its inputs and performing operations on
> >>> them based on each node's opcode. Each node, along with its computed
> >>> result, is stored in a map which is accessed during the evaluation of
> >>> its user/parent node.
> >>> * After evaluation, a MacroLogic node is created which is equivalent
> >>> to a three-input truth table. The leaf-level inputs of the expression
> >>> tree, along with the result of its evaluation, are the inputs fed to
> >>> this new node.
> >>> * The entire expression tree is eventually subsumed/replaced by the
> >>> newly created MacroLogic node.
> >>>
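To make the evaluation step concrete, here is a minimal Java sketch of how a three-input boolean expression can be reduced to an 8-bit truth table. It assumes the row index is (a << 2) | (b << 1) | c, which gives the three inputs the bit patterns 0xF0, 0xCC and 0xAA; that is the convention commonly used for the ternary logic immediate. The class, interface and method names are hypothetical, and this is not the code from the webrev, which walks Ideal Graph nodes rather than lambdas.

```java
public class TernaryTruthTable {
    // Hypothetical helper type: a boolean function of three inputs,
    // applied bitwise to int operands.
    interface Logic3 {
        int apply(int a, int b, int c);
    }

    // Bit i of each constant is the value that input takes in row i of the
    // 8-row truth table, where i = (a << 2) | (b << 1) | c.
    static final int A = 0xF0;
    static final int B = 0xCC;
    static final int C = 0xAA;

    // Evaluates the expression once on the canonical patterns; the low
    // 8 bits of the result are the truth table of the whole expression.
    static int truthTable(Logic3 f) {
        return f.apply(A, B, C) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.printf("a & b & c    -> 0x%02X%n", truthTable((a, b, c) -> a & b & c));
        System.out.printf("a | b | c    -> 0x%02X%n", truthTable((a, b, c) -> a | b | c));
        System.out.printf("a ^ b ^ c    -> 0x%02X%n", truthTable((a, b, c) -> a ^ b ^ c));
        System.out.printf("(a & b) | ~c -> 0x%02X%n", truthTable((a, b, c) -> (a & b) | ~c));
    }
}
```

With these patterns the four expressions evaluate to 0x80, 0xFE, 0x96 and 0xD5 respectively. A single pass over the expression is enough because each bit position of the operands acts as an independent row of the table.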
> >>>
> >>> The following are the JMH benchmark results with and without the changes.
> >>>
> >>> Without Changes:
> >>>
> >>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
> >>> MacroLogicOpt.workload1_caller        64  thrpt       2904.480         ops/s
> >>> MacroLogicOpt.workload1_caller       128  thrpt       2219.252         ops/s
> >>> MacroLogicOpt.workload1_caller       256  thrpt       1507.267         ops/s
> >>> MacroLogicOpt.workload1_caller       512  thrpt        860.926         ops/s
> >>> MacroLogicOpt.workload1_caller      1024  thrpt        470.163         ops/s
> >>> MacroLogicOpt.workload1_caller      2048  thrpt        246.608         ops/s
> >>> MacroLogicOpt.workload1_caller      4096  thrpt        108.031         ops/s
> >>> MacroLogicOpt.workload2_caller        64  thrpt        344.633         ops/s
> >>> MacroLogicOpt.workload2_caller       128  thrpt        209.818         ops/s
> >>> MacroLogicOpt.workload2_caller       256  thrpt        111.678         ops/s
> >>> MacroLogicOpt.workload2_caller       512  thrpt         53.360         ops/s
> >>> MacroLogicOpt.workload2_caller      1024  thrpt         27.888         ops/s
> >>> MacroLogicOpt.workload2_caller      2048  thrpt         12.103         ops/s
> >>> MacroLogicOpt.workload2_caller      4096  thrpt          6.018         ops/s
> >>> MacroLogicOpt.workload3_caller        64  thrpt       3110.669         ops/s
> >>> MacroLogicOpt.workload3_caller       128  thrpt       1996.861         ops/s
> >>> MacroLogicOpt.workload3_caller       256  thrpt        870.166         ops/s
> >>> MacroLogicOpt.workload3_caller       512  thrpt        389.629         ops/s
> >>> MacroLogicOpt.workload3_caller      1024  thrpt        151.203         ops/s
> >>> MacroLogicOpt.workload3_caller      2048  thrpt         75.086         ops/s
> >>> MacroLogicOpt.workload3_caller      4096  thrpt         37.576         ops/s
> >>>
> >>> With Changes:
> >>>
> >>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
> >>> MacroLogicOpt.workload1_caller        64  thrpt       3306.670         ops/s
> >>> MacroLogicOpt.workload1_caller       128  thrpt       2936.851         ops/s
> >>> MacroLogicOpt.workload1_caller       256  thrpt       2413.827         ops/s
> >>> MacroLogicOpt.workload1_caller       512  thrpt       1440.291         ops/s
> >>> MacroLogicOpt.workload1_caller      1024  thrpt        707.576         ops/s
> >>> MacroLogicOpt.workload1_caller      2048  thrpt        384.863         ops/s
> >>> MacroLogicOpt.workload1_caller      4096  thrpt        132.753         ops/s
> >>> MacroLogicOpt.workload2_caller        64  thrpt        450.856         ops/s
> >>> MacroLogicOpt.workload2_caller       128  thrpt        323.925         ops/s
> >>> MacroLogicOpt.workload2_caller       256  thrpt        135.191         ops/s
> >>> MacroLogicOpt.workload2_caller       512  thrpt         69.424         ops/s
> >>> MacroLogicOpt.workload2_caller      1024  thrpt         35.744         ops/s
> >>> MacroLogicOpt.workload2_caller      2048  thrpt         14.168         ops/s
> >>> MacroLogicOpt.workload2_caller      4096  thrpt          7.245         ops/s
> >>> MacroLogicOpt.workload3_caller        64  thrpt       3333.550         ops/s
> >>> MacroLogicOpt.workload3_caller       128  thrpt       2269.428         ops/s
> >>> MacroLogicOpt.workload3_caller       256  thrpt        995.691         ops/s
> >>> MacroLogicOpt.workload3_caller       512  thrpt        412.452         ops/s
> >>> MacroLogicOpt.workload3_caller      1024  thrpt        151.157         ops/s
> >>> MacroLogicOpt.workload3_caller      2048  thrpt         75.079         ops/s
> >>> MacroLogicOpt.workload3_caller      4096  thrpt         37.158         ops/s
> >>>
> >>> Please review the patch.
> >>>
> >>> Best Regards,
> >>> Jatin
> >>>
> >>> [1] Section 17.7:
> >>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> >>>