RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Bhateja, Jatin
jatin.bhateja at intel.com
Wed Apr 1 18:23:29 UTC 2020
Hi Vladimir,
Please find an updated unified patch at the following link.
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
This removes the optimized NotV handling for AVX3; as suggested, it will be
brought in via the vectorIntrinsics branch.
Thanks for your help in shaping up this patch. Please let me know if there
are any other comments.
Best Regards,
Jatin
________________________________________
From: Bhateja, Jatin
Sent: Wednesday, March 25, 2020 12:14 PM
To: Vladimir Ivanov
Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Hi Vladimir,
I have placed the updated patches at the following links:
1) Optimized NotV handling:
http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
2) Changes for MacroLogic opt:
http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
Kindly review and let me know your feedback.
Thanks,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Wednesday, March 25, 2020 12:33 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
> Hi Jatin,
>
> I tried to submit the patches for testing, but windows-x64 build failed with the
> following errors:
>
> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
> evaluate to a constant
> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
> of a variable outside its lifetime
> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
> ['function']' is not assignable
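
For context, C2131 is MSVC rejecting a local array whose size is not a
compile-time constant (a variable-length array, accepted as an extension by
GCC/Clang but not by MSVC), and C3863 is a follow-on error for the same
construct. A minimal, hedged illustration of the failure mode and a portable
alternative follows; the names are invented and this is not the actual
compile.cpp code:

    // Illustration only, not the compile.cpp change: MSVC requires array
    // bounds to be compile-time constants, so a runtime-sized local array
    // fails with C2131, while a fixed upper bound (or heap allocation)
    // builds everywhere.
    void process_partition(int partition_size) {
      // int results[partition_size];       // VLA: error C2131 under MSVC
      const int MAX_PARTITION = 4;          // hypothetical fixed upper bound
      int results[MAX_PARTITION];           // constant bound: portable
      for (int i = 0; i < partition_size && i < MAX_PARTITION; i++) {
        results[i] = 0;                     // placeholder work
      }
    }
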
>
> Best regards,
> Vladimir Ivanov
>
> On 24.03.2020 10:34, Bhateja, Jatin wrote:
> > Hi Vladimir,
> >
> > Thanks for your comments. I have split the original patch into two
> > sub-patches.
> >
> > 1) Optimized NotV handling:
> > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >
> > 2) Changes for MacroLogic opt:
> > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
> >
> > Added a new flag "UseVectorMacroLogic" which guards the MacroLogic
> > optimization.
> >
> > Kindly review and let me know your feedback.
> >
> > Best Regards,
> > Jatin
> >
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Tuesday, March 17, 2020 4:31 PM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
> >> hotspot-compiler-dev at openjdk.java.net
> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >> Instruction
> >>
> >>
> >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
> >>
> >> Very nice contribution, Jatin!
> >>
> >> Some comments after a brief review pass:
> >>
> >> * Please, contribute NotV part separately.
> >>
> >> * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
> >> transformation during GVN instead?
> >>
> >>     * As of now, vector nodes are only produced by SuperWord
> >> analysis. It makes sense to limit the new optimization pass to the
> >> SuperWord pass only (probably by introducing a new dedicated Phase).
> >> Once the Vector API is available, it can be extended to cases where
> >> vector nodes are present (C->max_vector_size() > 0).
> >>
> >> * There are more efficient ways to produce a vector of all-1s [1] [2].
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> [1] https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
> >>
> >> [2] https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
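
As an aside on the last review point, the idiom behind [1] and [2] is that a
lane-wise compare of a register with itself produces all-ones regardless of
the register's contents. A minimal, hedged sketch with AVX2 intrinsics, not
code from the webrev:

    #include <immintrin.h>

    // Illustration of the all-ones idiom from [1]/[2]: comparing a register
    // with itself sets every bit in every lane, with no constant load from
    // memory. (Compile with AVX2 enabled, e.g. -mavx2.)
    static inline __m256i all_ones_256() {
      __m256i x = _mm256_undefined_si256();  // contents are irrelevant
      return _mm256_cmpeq_epi32(x, x);       // x == x in each lane -> all bits set
    }
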
> >>
> >>>
> >>> A new optimization pass has been added after Auto-Vectorization which
> >>> folds expression trees involving vector boolean logic operations
> >>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> >>> The optimization pass has the following stages:
> >>>
> >>> 1. Collection stage:
> >>>     * Performs a DFS traversal over the Ideal Graph and collects the
> >>>       root nodes of all vector logic expression trees.
> >>> 2. Processing stage:
> >>>     * Performs a bottom-up traversal over each expression tree and
> >>>       simultaneously folds specific DAG patterns involving boolean
> >>>       logic parent and child nodes.
> >>>     * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
> >>>     * Folding is performed under a constraint on the total number of
> >>>       inputs which a MacroLogic node can have; in this case it is 3.
> >>>     * A partition is created around a DAG pattern involving a logic
> >>>       parent and one or two logic child nodes; it encapsulates the
> >>>       nodes in post-order fashion.
> >>>     * This partition is then evaluated by traversing over its nodes,
> >>>       assigning boolean values to the inputs and performing operations
> >>>       over them based on each node's Opcode. Each node, along with its
> >>>       computed result, is stored in a map which is consulted during the
> >>>       evaluation of its user/parent node.
> >>>     * After evaluation, a MacroLogic node equivalent to a three-input
> >>>       truth table is created. The expression tree's leaf-level inputs,
> >>>       along with the result of the evaluation, are the inputs fed to
> >>>       this new node.
> >>>     * The entire expression tree is eventually subsumed/replaced by the
> >>>       newly created MacroLogic node.
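
The evaluation step described above boils down to computing a three-input
truth table. The following standalone sketch is illustrative only (not code
from the webrev; all names are invented for the example): it evaluates a
small boolean expression tree for all eight input combinations and packs the
results into the 8-bit table that an AVX-512 VPTERNLOG instruction, and hence
a MacroLogic node, encodes as its immediate.

    #include <cstdint>
    #include <cstdio>

    // Node kinds for a tiny three-input boolean expression tree.
    enum Op { OP_A, OP_B, OP_C, OP_NOT, OP_AND, OP_OR, OP_XOR };

    struct Expr {
      Op op;
      const Expr* left;    // null for leaves
      const Expr* right;   // null for leaves and OP_NOT
    };

    // Evaluate one node for a concrete assignment of the three leaf inputs.
    static bool eval(const Expr* e, bool a, bool b, bool c) {
      switch (e->op) {
        case OP_A:   return a;
        case OP_B:   return b;
        case OP_C:   return c;
        case OP_NOT: return !eval(e->left, a, b, c);
        case OP_AND: return eval(e->left, a, b, c) && eval(e->right, a, b, c);
        case OP_OR:  return eval(e->left, a, b, c) || eval(e->right, a, b, c);
        case OP_XOR: return eval(e->left, a, b, c) != eval(e->right, a, b, c);
      }
      return false;
    }

    // Bit (a<<2 | b<<1 | c) of the result holds the expression's value for
    // that input combination -- the same encoding VPTERNLOG's immediate uses.
    static uint8_t truth_table(const Expr* root) {
      uint8_t imm8 = 0;
      for (int i = 0; i < 8; i++) {
        bool a = ((i >> 2) & 1) != 0;
        bool b = ((i >> 1) & 1) != 0;
        bool c = (i & 1) != 0;
        if (eval(root, a, b, c)) {
          imm8 |= (uint8_t)(1u << i);
        }
      }
      return imm8;
    }

    int main() {
      Expr A    = { OP_A, nullptr, nullptr };
      Expr B    = { OP_B, nullptr, nullptr };
      Expr C    = { OP_C, nullptr, nullptr };
      Expr ab   = { OP_XOR, &A, &B };
      Expr root = { OP_XOR, &ab, &C };               // A ^ B ^ C
      printf("imm8 = 0x%02X\n", truth_table(&root)); // prints 0x96
      return 0;
    }

For A ^ B ^ C this yields 0x96, the conventional VPTERNLOGD immediate for a
three-way XOR; a real implementation would additionally enforce the
three-input limit mentioned above before collapsing the tree.
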
> >>>
> >>>
> >>> Following are the JMH benchmark results with and without the changes.
> >>>
> >>> Without Changes:
> >>>
> >>> Benchmark (VECLEN) Mode Cnt Score Error Units
> >>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
> >>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
> >>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
> >>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
> >>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
> >>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
> >>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
> >>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
> >>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
> >>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
> >>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
> >>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
> >>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
> >>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
> >>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
> >>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
> >>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
> >>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
> >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
> >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
> >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
> >>>
> >>> With Changes:
> >>>
> >>> Benchmark (VECLEN) Mode Cnt Score Error Units
> >>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
> >>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
> >>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
> >>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
> >>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
> >>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
> >>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
> >>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
> >>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
> >>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
> >>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
> >>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
> >>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
> >>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
> >>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
> >>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
> >>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
> >>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
> >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
> >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
> >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
> >>>
> >>> Please review the patch.
> >>>
> >>> Best Regards,
> >>> Jatin
> >>>
> >>> [1] Section 17.7:
> >>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> >>>