RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Wed Apr 1 18:23:29 UTC 2020
Hi Vladimir,
Please find an updated unified patch at the following link.
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
This removes the optimized NotV handling for AVX3; as suggested, it will be
brought in via the vectorIntrinsics branch.
Thanks for your help in shaping up this patch. Please let me know if there
are any other comments.
Best Regards,
Jatin
________________________________________
From: Bhateja, Jatin
Sent: Wednesday, March 25, 2020 12:14 PM
To: Vladimir Ivanov
Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Hi Vladimir,
I have placed the updated patches at the following links:
1) Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
2) Changes for MacroLogic opt:
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
Kindly review and let me know your feedback.
Thanks,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Wednesday, March 25, 2020 12:33 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
> Hi Jatin,
>
> I tried to submit the patches for testing, but windows-x64 build failed with the
> following errors:
>
> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
> evaluate to a constant
> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
> of a variable outside its lifetime
> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
> ['function']' is not assignable
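
For context, C2131 is MSVC rejecting a local array whose size is not a
compile-time constant (a variable-length array, accepted as an extension by
GCC/Clang but not by MSVC), and C3863 is a follow-on error for the same
construct. A minimal, hedged illustration of the failure mode and a portable
alternative follows; the names are invented and this is not the actual
compile.cpp code:

    // Illustration only, not the compile.cpp change: MSVC requires array
    // bounds to be compile-time constants, so a runtime-sized local array
    // fails with C2131, while a fixed upper bound (or heap allocation)
    // builds everywhere.
    void process_partition(int partition_size) {
      // int results[partition_size];       // VLA: error C2131 under MSVC
      const int MAX_PARTITION = 4;          // hypothetical fixed upper bound
      int results[MAX_PARTITION];           // constant bound: portable
      for (int i = 0; i < partition_size && i < MAX_PARTITION; i++) {
        results[i] = 0;                     // placeholder work
      }
    }
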
>
> Best regards,
> Vladimir Ivanov
>
> On 24.03.2020 10:34, Bhateja, Jatin wrote:
> > Hi Vladimir,
> >
> > Thanks for your comments. I have split the original patch into two
> > sub-patches.
> >
> > 1) Optimized NotV handling:
> > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >
> > 2) Changes for MacroLogic opt:
> > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
> >
> > Added a new flag "UseVectorMacroLogic" which guards the MacroLogic
> > optimization.
> >
> > Kindly review and let me know your feedback.
> >
> > Best Regards,
> > Jatin
> >
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Tuesday, March 17, 2020 4:31 PM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
> >> hotspot-compiler-dev at openjdk.java.net
> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >> Instruction
> >>
> >>
> >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
> >>
> >> Very nice contribution, Jatin!
> >>
> >> Some comments after a brief review pass:
> >>
> >> * Please, contribute NotV part separately.
> >>
> >> * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
> >> transformation during GVN instead?
> >>
> >>     * As of now, vector nodes are only produced by SuperWord
> >> analysis. It makes sense to limit the new optimization pass to the
> >> SuperWord pass only (probably by introducing a new dedicated Phase).
> >> Once the Vector API is available, it can be extended to cases where
> >> vector nodes are present (C->max_vector_size() > 0).
> >>
> >> * There are more efficient ways to produce a vector of all-1s [1] [2].
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
> >>
> >> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
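
As an aside on the last review point, the idiom behind [1] and [2] is that a
lane-wise compare of a register with itself produces all-ones regardless of
the register's contents. A minimal, hedged sketch with AVX2 intrinsics, not
code from the webrev:

    #include <immintrin.h>

    // Illustration of the all-ones idiom from [1]/[2]: comparing a register
    // with itself sets every bit in every lane, with no constant load from
    // memory. (Compile with AVX2 enabled, e.g. -mavx2.)
    static inline __m256i all_ones_256() {
      __m256i x = _mm256_undefined_si256();  // contents are irrelevant
      return _mm256_cmpeq_epi32(x, x);       // x == x in each lane -> all bits set
    }
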
> >>
> >>>
> >>> A new optimization pass has been added after Auto-Vectorization which
> >>> folds expression trees involving vector boolean logic operations
> >>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> >>> The optimization pass has the following stages:
> >>>
> >>> 1. Collection stage:
> >>>     * Performs a DFS traversal over the Ideal Graph and collects the
> >>>       root nodes of all vector logic expression trees.
> >>> 2. Processing stage:
> >>>     * Performs a bottom-up traversal over each expression tree and
> >>>       simultaneously folds specific DAG patterns involving boolean
> >>>       logic parent and child nodes.
> >>>     * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> >>>     * Folding is performed under a constraint on the total number of
> >>>       inputs which a MacroLogic node can have; in this case it is 3.
> >>>     * A partition is created around a DAG pattern involving a logic
> >>>       parent and one or two logic child nodes; it encapsulates the
> >>>       nodes in post-order fashion.
> >>>     * This partition is then evaluated by traversing over its nodes,
> >>>       assigning boolean values to the inputs and performing operations
> >>>       over them based on each node's Opcode. Each node, along with its
> >>>       computed result, is stored in a map which is consulted during the
> >>>       evaluation of its user/parent node.
> >>>     * After evaluation, a MacroLogic node equivalent to a three-input
> >>>       truth table is created. The expression tree's leaf-level inputs,
> >>>       along with the result of the evaluation, are the inputs fed to
> >>>       this new node.
> >>>     * The entire expression tree is eventually subsumed/replaced by the
> >>>       newly created MacroLogic node.
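
The evaluation step described above boils down to computing a three-input
truth table. The following standalone sketch is illustrative only (not code
from the webrev; all names are invented for the example): it evaluates a
small boolean expression tree for all eight input combinations and packs the
results into the 8-bit table that an AVX-512 VPTERNLOG instruction, and hence
a MacroLogic node, encodes as its immediate.

    #include <cstdint>
    #include <cstdio>

    // Node kinds for a tiny three-input boolean expression tree.
    enum Op { OP_A, OP_B, OP_C, OP_NOT, OP_AND, OP_OR, OP_XOR };

    struct Expr {
      Op op;
      const Expr* left;    // null for leaves
      const Expr* right;   // null for leaves and OP_NOT
    };

    // Evaluate one node for a concrete assignment of the three leaf inputs.
    static bool eval(const Expr* e, bool a, bool b, bool c) {
      switch (e->op) {
        case OP_A:   return a;
        case OP_B:   return b;
        case OP_C:   return c;
        case OP_NOT: return !eval(e->left, a, b, c);
        case OP_AND: return eval(e->left, a, b, c) && eval(e->right, a, b, c);
        case OP_OR:  return eval(e->left, a, b, c) || eval(e->right, a, b, c);
        case OP_XOR: return eval(e->left, a, b, c) != eval(e->right, a, b, c);
      }
      return false;
    }

    // Bit (a<<2 | b<<1 | c) of the result holds the expression's value for
    // that input combination -- the same encoding VPTERNLOG's immediate uses.
    static uint8_t truth_table(const Expr* root) {
      uint8_t imm8 = 0;
      for (int i = 0; i < 8; i++) {
        bool a = ((i >> 2) & 1) != 0;
        bool b = ((i >> 1) & 1) != 0;
        bool c = (i & 1) != 0;
        if (eval(root, a, b, c)) {
          imm8 |= (uint8_t)(1u << i);
        }
      }
      return imm8;
    }

    int main() {
      Expr A    = { OP_A, nullptr, nullptr };
      Expr B    = { OP_B, nullptr, nullptr };
      Expr C    = { OP_C, nullptr, nullptr };
      Expr ab   = { OP_XOR, &A, &B };
      Expr root = { OP_XOR, &ab, &C };               // A ^ B ^ C
      printf("imm8 = 0x%02X\n", truth_table(&root)); // prints 0x96
      return 0;
    }

For A ^ B ^ C this yields 0x96, the conventional VPTERNLOGD immediate for a
three-way XOR; a real implementation would additionally enforce the
three-input limit mentioned above before collapsing the tree.
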
> >>>
> >>>
> >>> Following are the JMH benchmark results with and without the changes.
> >>>
> >>> Without Changes:
> >>>
> >>> Benchmark (VECLEN) Mode Cnt Score Error Units
> >>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
> >>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
> >>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
> >>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
> >>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
> >>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
> >>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
> >>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
> >>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
> >>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
> >>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
> >>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
> >>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
> >>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
> >>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
> >>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
> >>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
> >>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
> >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
> >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
> >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
> >>>
> >>> With Changes:
> >>>
> >>> Benchmark (VECLEN) Mode Cnt Score Error Units
> >>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
> >>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
> >>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
> >>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
> >>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
> >>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
> >>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
> >>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
> >>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
> >>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
> >>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
> >>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
> >>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
> >>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
> >>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
> >>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
> >>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
> >>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
> >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
> >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
> >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
> >>>
> >>> Please review the patch.
> >>>
> >>> Best Regards,
> >>> Jatin
> >>>
> >>> [1] Section 17.7:
> >>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> >>>