RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction

Bhateja, Jatin jatin.bhateja at intel.com
Wed Mar 25 06:44:48 UTC 2020


Hi Vladimir,

I have placed the updated patches at the following links:

 1)  Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/

 2)  Changes for MacroLogic opt:
 http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/

Kindly review and let me know your feedback.

Thanks,
Jatin

> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Wednesday, March 25, 2020 12:33 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
> 
> Hi Jatin,
> 
> I tried to submit the patches for testing, but windows-x64 build failed with the
> following errors:
> 
> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
> evaluate to a constant
> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
> of a variable outside its lifetime
> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
> ['function']' is not assignable
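> 
> (For context, an illustrative reduction of the kind of construct MSVC
> rejects here, assuming 'partition' is an array sized by a runtime value;
> the names are hypothetical, not the actual patch code:)
> 
>     void example(int n) {
>       int partition[n];   // ill-formed in standard C++: n is not a constant
>                           // expression, hence MSVC's C2131/C3863 diagnostics
>     }
> 
>     // HotSpot code usually sidesteps the VLA with a resource-arena array:
>     //   int* partition = NEW_RESOURCE_ARRAY(int, n);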
> 
> Best regards,
> Vladimir Ivanov
> 
> On 24.03.2020 10:34, Bhateja, Jatin wrote:
> > Hi Vladimir,
> >
> > Thanks for your comments. I have split the original patch into two
> > sub-patches.
> >
> > 1)  Optimized NotV handling:
> > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >
> > 2)  Changes for MacroLogic opt:
> > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
> >
> > Added a new flag "UseVectorMacroLogic" which guards the MacroLogic
> > optimization.
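> >
> > (For reference, a C2 flag of this kind would presumably be declared along
> > the following lines in src/hotspot/share/opto/c2_globals.hpp; this is a
> > sketch only, the actual default value and description are whatever the
> > webrev contains:)
> >
> >     product(bool, UseVectorMacroLogic, true,                             \
> >             "Fold vector boolean logic expressions into MacroLogic "     \
> >             "(vpternlog) nodes")                                         \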
> >
> > Kindly review and let me know your feedback.
> >
> > Best Regards,
> > Jatin
> >
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Tuesday, March 17, 2020 4:31 PM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
> >> dev at openjdk.java.net
> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >> Instruction
> >>
> >>
> >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
> >>
> >> Very nice contribution, Jatin!
> >>
> >> Some comments after a brief review pass:
> >>
> >>     * Please, contribute NotV part separately.
> >>
> >>     * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
> >> transformation during GVN instead?
> >>
> >>     * As of now, vector nodes are only produced by SuperWord
> >> analysis. It makes sense to limit new optimization pass to SuperWord
> >> pass only (probably, introduce a new dedicated Phase ). Once Vector
> >> API is available, it can be extended to cases when vector nodes are
> >> present
> >> (C->max_vector_size() > 0).
> >>
> >>     * There are more efficient ways to produce a vector of all-1s [1] [2].
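> >>
> >>       (Illustrative only, not taken from the patch: the usual idiom is a
> >>       register-vs-itself compare rather than loading a constant from
> >>       memory, e.g. with AVX2 intrinsics:)
> >>
> >>         #include <immintrin.h>
> >>
> >>         __m256i all_ones() {
> >>           // Compilers typically lower this to vpcmpeqd ymm, ymm, ymm,
> >>           // which sets every bit without a memory load.
> >>           return _mm256_set1_epi32(-1);
> >>         }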
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> [1]
> >> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
> >>
> >> [2]
> >> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
> >>
> >>>
> >>> A new optimization pass has been added after Auto-Vectorization which
> >>> folds expression trees involving vector boolean logic operations
> >>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> >>> The optimization pass has the following stages:
> >>>
> >>>     1.  Collection stage:
> >>>        *   Performs a DFS traversal over the Ideal Graph and collects
> >>>            the root nodes of all vector logic expression trees.
> >>>     2.  Processing stage:
> >>>        *   Performs a bottom-up traversal over the expression tree and
> >>>            simultaneously folds specific DAG patterns involving a
> >>>            boolean logic parent and its child nodes.
> >>>        *   Transforms (XORV INP, -1) -> (NOTV INP) to promote logic
> >>>            folding.
> >>>        *   Folding is performed under a constraint on the total number
> >>>            of inputs which a MacroLogic node can have; in this case it
> >>>            is 3.
> >>>        *   A partition is created around a DAG pattern involving a
> >>>            logic parent and one or two logic child nodes; it
> >>>            encapsulates the nodes in post-order fashion.
> >>>        *   This partition is then evaluated by traversing over the
> >>>            nodes, assigning boolean values to its inputs and performing
> >>>            operations over them based on its opcode. Each node, along
> >>>            with its computed result, is stored in a map which is
> >>>            accessed during the evaluation of its user/parent node (see
> >>>            the small sketch after this description).
> >>>        *   After evaluation, a MacroLogic node is created which is
> >>>            equivalent to a three-input truth table. The leaf-level
> >>>            inputs of the expression tree, along with the result of the
> >>>            evaluation, are the inputs fed to this new node.
> >>>        *   The entire expression tree is eventually subsumed/replaced
> >>>            by the newly created MacroLogic node.
> >>>
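> >>> To make the evaluation step concrete, here is a small standalone sketch
> >>> (illustrative code, not taken from the patch) of how the three-input
> >>> truth table, i.e. the immediate of a vpternlog-style MacroLogic node,
> >>> can be derived for an expression such as (A & B) ^ (~C): the expression
> >>> is evaluated once for each boolean assignment of its three leaf inputs,
> >>> and each result contributes one bit of the table.
> >>>
> >>>   #include <cstdint>
> >>>   #include <cstdio>
> >>>
> >>>   // Evaluate the example expression (A & B) ^ (~C) for one boolean
> >>>   // assignment of its leaves.  In the pass this corresponds to the
> >>>   // bottom-up, opcode-driven evaluation of the partition.
> >>>   static bool eval(bool a, bool b, bool c) {
> >>>     return (a && b) != !c;      // (A & B) ^ (~C) over booleans
> >>>   }
> >>>
> >>>   int main() {
> >>>     uint8_t imm = 0;
> >>>     // Bit i of the truth table is the expression value for the
> >>>     // assignment where A, B, C are bits 2, 1, 0 of i (the vpternlog
> >>>     // convention with A as the first source).
> >>>     for (int i = 0; i < 8; i++) {
> >>>       bool a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
> >>>       if (eval(a, b, c)) imm |= (uint8_t)(1 << i);
> >>>     }
> >>>     printf("truth table (imm8) = 0x%02X\n", (unsigned)imm);  // prints 0x95
> >>>     return 0;
> >>>   }
> >>>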
> >>>
> >>> Following are the JMH benchmark results with and without the changes.
> >>>
> >>> Without Changes:
> >>>
> >>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>> MacroLogicOpt.workload1_caller             64  thrpt       2904.480          ops/s
> >>> MacroLogicOpt.workload1_caller            128  thrpt       2219.252          ops/s
> >>> MacroLogicOpt.workload1_caller            256  thrpt       1507.267          ops/s
> >>> MacroLogicOpt.workload1_caller            512  thrpt        860.926          ops/s
> >>> MacroLogicOpt.workload1_caller           1024  thrpt        470.163          ops/s
> >>> MacroLogicOpt.workload1_caller           2048  thrpt        246.608          ops/s
> >>> MacroLogicOpt.workload1_caller           4096  thrpt        108.031          ops/s
> >>> MacroLogicOpt.workload2_caller             64  thrpt        344.633          ops/s
> >>> MacroLogicOpt.workload2_caller            128  thrpt        209.818          ops/s
> >>> MacroLogicOpt.workload2_caller            256  thrpt        111.678          ops/s
> >>> MacroLogicOpt.workload2_caller            512  thrpt         53.360          ops/s
> >>> MacroLogicOpt.workload2_caller           1024  thrpt         27.888          ops/s
> >>> MacroLogicOpt.workload2_caller           2048  thrpt         12.103          ops/s
> >>> MacroLogicOpt.workload2_caller           4096  thrpt          6.018          ops/s
> >>> MacroLogicOpt.workload3_caller             64  thrpt       3110.669          ops/s
> >>> MacroLogicOpt.workload3_caller            128  thrpt       1996.861          ops/s
> >>> MacroLogicOpt.workload3_caller            256  thrpt        870.166          ops/s
> >>> MacroLogicOpt.workload3_caller            512  thrpt        389.629          ops/s
> >>> MacroLogicOpt.workload3_caller           1024  thrpt        151.203          ops/s
> >>> MacroLogicOpt.workload3_caller           2048  thrpt         75.086          ops/s
> >>> MacroLogicOpt.workload3_caller           4096  thrpt         37.576          ops/s
> >>>
> >>> With Changes:
> >>>
> >>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>> MacroLogicOpt.workload1_caller             64  thrpt       3306.670          ops/s
> >>> MacroLogicOpt.workload1_caller            128  thrpt       2936.851          ops/s
> >>> MacroLogicOpt.workload1_caller            256  thrpt       2413.827          ops/s
> >>> MacroLogicOpt.workload1_caller            512  thrpt       1440.291          ops/s
> >>> MacroLogicOpt.workload1_caller           1024  thrpt        707.576          ops/s
> >>> MacroLogicOpt.workload1_caller           2048  thrpt        384.863          ops/s
> >>> MacroLogicOpt.workload1_caller           4096  thrpt        132.753          ops/s
> >>> MacroLogicOpt.workload2_caller             64  thrpt        450.856          ops/s
> >>> MacroLogicOpt.workload2_caller            128  thrpt        323.925          ops/s
> >>> MacroLogicOpt.workload2_caller            256  thrpt        135.191          ops/s
> >>> MacroLogicOpt.workload2_caller            512  thrpt         69.424          ops/s
> >>> MacroLogicOpt.workload2_caller           1024  thrpt         35.744          ops/s
> >>> MacroLogicOpt.workload2_caller           2048  thrpt         14.168          ops/s
> >>> MacroLogicOpt.workload2_caller           4096  thrpt          7.245          ops/s
> >>> MacroLogicOpt.workload3_caller             64  thrpt       3333.550          ops/s
> >>> MacroLogicOpt.workload3_caller            128  thrpt       2269.428          ops/s
> >>> MacroLogicOpt.workload3_caller            256  thrpt        995.691          ops/s
> >>> MacroLogicOpt.workload3_caller            512  thrpt        412.452          ops/s
> >>> MacroLogicOpt.workload3_caller           1024  thrpt        151.157          ops/s
> >>> MacroLogicOpt.workload3_caller           2048  thrpt         75.079          ops/s
> >>> MacroLogicOpt.workload3_caller           4096  thrpt         37.158          ops/s
> >>>
> >>> Please review the patch.
> >>>
> >>> Best Regards,
> >>> Jatin
> >>>
> >>> [1] Section 17.7:
> >>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
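> >>>
> >>> (For readers without the manual at hand: the underlying instruction is
> >>> also exposed as a compiler intrinsic.  An illustrative use, not patch
> >>> code, continuing the (A & B) ^ (~C) example and its 0x95 table from the
> >>> sketch above:)
> >>>
> >>>   #include <immintrin.h>
> >>>
> >>>   // One vpternlogd computes the whole three-input expression per lane.
> >>>   __m512i fused(__m512i a, __m512i b, __m512i c) {
> >>>     return _mm512_ternarylogic_epi32(a, b, c, 0x95);
> >>>   }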
> >>>

