RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction

Bhateja, Jatin jatin.bhateja at intel.com
Thu Apr 2 17:09:16 UTC 2020


Thanks Nils, Vladimir.

Changes have been pushed.
http://hg.openjdk.java.net/jdk/jdk/rev/29d878d3af35

Best Regards,
Jatin

> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Thursday, April 2, 2020 3:45 PM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>
> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
> 
> 
> >> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
> >
> > Looks good. I'll submit it for testing.
> 
> Test results are clean.
> 
> Best regards,
> Vladimir Ivanov
> 
> >> As suggested, this removes the optimized NotV handling for AVX3; it
> >> will be brought in via the vectorIntrinsics branch.
> >>
> >> Thanks for your help in shaping up this patch. Please let me know if
> >> there are any other comments.
> >>
> >> Best Regards,
> >> Jatin
> >> ________________________________________
> >> From: Bhateja, Jatin
> >> Sent: Wednesday, March 25, 2020 12:14 PM
> >> To: Vladimir Ivanov
> >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> >> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >> Instruction
> >>
> >> Hi Vladimir,
> >>
> >> I have placed the updated patches at the following links:
> >>
> >>   1)  Optimized NotV handling:
> >> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >>
> >>   2)  Changes for MacroLogic opt:
> >>   http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
> >>
> >> Kindly review and let me know your feedback.
> >>
> >> Thanks,
> >> Jatin
> >>
> >>> -----Original Message-----
> >>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>> Sent: Wednesday, March 25, 2020 12:33 AM
> >>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> >>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> >>> <sandhya.viswanathan at intel.com>
> >>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >>> Instruction
> >>>
> >>> Hi Jatin,
> >>>
> >>> I tried to submit the patches for testing, but the windows-x64 build
> >>> failed with the following errors:
> >>>
> >>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression
> >>> did not evaluate to a constant
> >>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused
> >>> by a read of a variable outside its lifetime
> >>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
> >>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type
> >>> 'int ['function']' is not assignable
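> >>>
> >>> These errors typically indicate a variable-length array: GCC and Clang
> >>> accept a runtime array bound as an extension, but MSVC rejects it. A
> >>> minimal sketch of the pattern and a portable alternative, using
> >>> hypothetical names rather than the actual compile.cpp code:
> >>>
> >>>   #include <cassert>
> >>>
> >>>   // Rejected by MSVC: the array bound is not a compile-time constant.
> >>>   void evaluate_partition_vla(int partition_size) {
> >>>     int results[partition_size];   // error C2131 on windows-x64
> >>>     results[0] = 0;
> >>>   }
> >>>
> >>>   // Portable alternative: a compile-time upper bound (chosen here for
> >>>   // illustration), or an arena/heap allocation for a dynamic bound.
> >>>   void evaluate_partition_fixed(int partition_size) {
> >>>     const int max_partition_size = 4;
> >>>     assert(partition_size <= max_partition_size);
> >>>     int results[max_partition_size] = {0};
> >>>     results[0] = 0;
> >>>   }
> >>>
> >>> Replacing the runtime bound with a compile-time maximum, or allocating
> >>> the scratch array dynamically, avoids both diagnostics.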
> >>>
> >>> Best regards,
> >>> Vladimir Ivanov
> >>>
> >>> On 24.03.2020 10:34, Bhateja, Jatin wrote:
> >>>> Hi Vladimir,
> >>>>
> >>>> Thanks for your comments. I have split the original patch into two
> >>>> sub-patches.
> >>>>
> >>>> 1)  Optimized NotV handling:
> >>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
> >>>>
> >>>> 2)  Changes for MacroLogic opt:
> >>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
> >>>>
> >>>> Added a new flag "UseVectorMacroLogic" which guards the MacroLogic
> >>>> optimization.
> >>>>
> >>>> Kindly review and let me know your feedback.
> >>>>
> >>>> Best Regards,
> >>>> Jatin
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>>>> Sent: Tuesday, March 17, 2020 4:31 PM
> >>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; hotspot-compiler-
> >>>>> dev at openjdk.java.net
> >>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
> >>>>> Instruction
> >>>>>
> >>>>>
> >>>>>> Path: http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
> >>>>>
> >>>>> Very nice contribution, Jatin!
> >>>>>
> >>>>> Some comments after a brief review pass:
> >>>>>
> >>>>>      * Please contribute the NotV part separately.
> >>>>>
> >>>>>      * Why don't you perform the (XorV v 0xFF..FF) => (NotV v)
> >>>>> transformation during GVN instead? (See the sketch below these
> >>>>> comments.)
> >>>>>
> >>>>>      * As of now, vector nodes are only produced by SuperWord
> >>>>> analysis. It makes sense to limit the new optimization pass to the
> >>>>> SuperWord pass only (probably by introducing a new dedicated Phase).
> >>>>> Once the Vector API is available, it can be extended to cases where
> >>>>> vector nodes are present (C->max_vector_size() > 0).
> >>>>>
> >>>>>      * There are more efficient ways to produce a vector of all-1s
> >>>>> [1] [2].
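> >>>>>
> >>>>> A minimal sketch of what such a GVN rewrite could look like (NotVNode
> >>>>> and is_all_ones_vector() are assumed node/helper names used for
> >>>>> illustration, not the actual webrev code):
> >>>>>
> >>>>>   // Rewrite (XorV v, all-ones) into (NotV v) during idealization.
> >>>>>   Node* XorVNode::Ideal(PhaseGVN* phase, bool can_reshape) {
> >>>>>     for (uint i = 1; i <= 2; i++) {
> >>>>>       if (is_all_ones_vector(in(i))) {
> >>>>>         // The other input is the value being negated.
> >>>>>         return new NotVNode(in(3 - i), bottom_type()->is_vect());
> >>>>>       }
> >>>>>     }
> >>>>>     return NULL;  // no transformation applies
> >>>>>   }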
> >>>>>
> >>>>> Best regards,
> >>>>> Vladimir Ivanov
> >>>>>
> >>>>> [1]
> >>>>> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
> >>>>>
> >>>>> [2]
> >>>>> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
> >>>>>
> >>>>>>
> >>>>>> A new optimization pass has been added post Auto-Vectorization which
> >>>>>> folds expression trees involving vector boolean logic operations
> >>>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
> >>>>>>
> >>>>>> The optimization pass has the following stages:
> >>>>>>
> >>>>>>      1.  Collection stage:
> >>>>>>         *   Performs a DFS traversal over the Ideal graph and
> >>>>>> collects the root nodes of all vector logic expression trees.
> >>>>>>      2.  Processing stage:
> >>>>>>         *   Performs a bottom-up traversal over each expression tree
> >>>>>> and simultaneously folds specific DAG patterns involving a boolean
> >>>>>> logic parent and its child nodes.
> >>>>>>         *   Transforms (XORV INP, -1) -> (NOTV INP) to promote logic
> >>>>>> folding.
> >>>>>>         *   Folding is performed under a constraint on the total
> >>>>>> number of inputs which a MacroLogic node can have; in this case it
> >>>>>> is 3.
> >>>>>>         *   A partition is created around a DAG pattern involving a
> >>>>>> logic parent and one or two logic child nodes; it encapsulates the
> >>>>>> nodes in post-order fashion.
> >>>>>>         *   This partition is then evaluated by traversing over its
> >>>>>> nodes, assigning boolean values to its inputs and performing
> >>>>>> operations over them based on each node's opcode. Each node, along
> >>>>>> with its computed result, is stored in a map which is accessed
> >>>>>> during the evaluation of its user/parent node.
> >>>>>>         *   Post evaluation, a MacroLogic node is created which is
> >>>>>> equivalent to a three-input truth table. The expression tree's
> >>>>>> leaf-level inputs, along with the result of the evaluation, are the
> >>>>>> inputs fed to this new node (see the sketch below).
> >>>>>>         *   The entire expression tree is eventually
> >>>>>> subsumed/replaced by the newly created MacroLogic node.
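> >>>>>>
> >>>>>> A minimal sketch of how such a three-input truth table (the
> >>>>>> immediate byte consumed by the AVX-512 ternary logic instruction)
> >>>>>> can be derived from an evaluation of this kind; the expr callback
> >>>>>> stands in for the folded boolean expression and is illustrative
> >>>>>> only, not the actual patch code:
> >>>>>>
> >>>>>>   #include <stdint.h>
> >>>>>>
> >>>>>>   // Enumerate all 8 combinations of the three leaf inputs A, B, C.
> >>>>>>   // Bit i of the result corresponds to A = bit 2 of i, B = bit 1,
> >>>>>>   // C = bit 0, matching the vpternlog imm8 encoding.
> >>>>>>   uint8_t compute_truth_table(bool (*expr)(bool, bool, bool)) {
> >>>>>>     uint8_t imm8 = 0;
> >>>>>>     for (int i = 0; i < 8; i++) {
> >>>>>>       bool a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
> >>>>>>       if (expr(a, b, c)) {
> >>>>>>         imm8 |= (uint8_t)(1 << i);
> >>>>>>       }
> >>>>>>     }
> >>>>>>     return imm8;
> >>>>>>   }
> >>>>>>
> >>>>>>   // Example: (A & B) ^ C gives 0xC0 ^ 0xAA = 0x6A.
> >>>>>>   static bool and_xor(bool a, bool b, bool c) { return (a && b) != c; }
> >>>>>>   // compute_truth_table(and_xor) == 0x6A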
> >>>>>>
> >>>>>>
> >>>>>> Following are the JMH benchmark results with and without the changes.
> >>>>>>
> >>>>>> Without Changes:
> >>>>>>
> >>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller        64  thrpt        2904.480          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       128  thrpt        2219.252          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       256  thrpt        1507.267          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       512  thrpt         860.926          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      1024  thrpt         470.163          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      2048  thrpt         246.608          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      4096  thrpt         108.031          ops/s
> >>>>>> MacroLogicOpt.workload2_caller        64  thrpt         344.633          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       128  thrpt         209.818          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       256  thrpt         111.678          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       512  thrpt          53.360          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      1024  thrpt          27.888          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      2048  thrpt          12.103          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      4096  thrpt           6.018          ops/s
> >>>>>> MacroLogicOpt.workload3_caller        64  thrpt        3110.669          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       128  thrpt        1996.861          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       256  thrpt         870.166          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       512  thrpt         389.629          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      1024  thrpt         151.203          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      2048  thrpt          75.086          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      4096  thrpt          37.576          ops/s
> >>>>>>
> >>>>>> With Changes:
> >>>>>>
> >>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller        64  thrpt        3306.670          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       128  thrpt        2936.851          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       256  thrpt        2413.827          ops/s
> >>>>>> MacroLogicOpt.workload1_caller       512  thrpt        1440.291          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      1024  thrpt         707.576          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      2048  thrpt         384.863          ops/s
> >>>>>> MacroLogicOpt.workload1_caller      4096  thrpt         132.753          ops/s
> >>>>>> MacroLogicOpt.workload2_caller        64  thrpt         450.856          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       128  thrpt         323.925          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       256  thrpt         135.191          ops/s
> >>>>>> MacroLogicOpt.workload2_caller       512  thrpt          69.424          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      1024  thrpt          35.744          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      2048  thrpt          14.168          ops/s
> >>>>>> MacroLogicOpt.workload2_caller      4096  thrpt           7.245          ops/s
> >>>>>> MacroLogicOpt.workload3_caller        64  thrpt        3333.550          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       128  thrpt        2269.428          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       256  thrpt         995.691          ops/s
> >>>>>> MacroLogicOpt.workload3_caller       512  thrpt         412.452          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      1024  thrpt         151.157          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      2048  thrpt          75.079          ops/s
> >>>>>> MacroLogicOpt.workload3_caller      4096  thrpt          37.158          ops/s
> >>>>>>
> >>>>>> Please review the patch.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Jatin
> >>>>>>
> >>>>>> [1] Section 17.7:
> >>>>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

