RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Thu Apr 2 10:14:53 UTC 2020
>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
>
> Looks good. I'll submit it for testing.
Test results are clean.
Best regards,
Vladimir Ivanov
>> This revision removes the optimized NotV handling for AVX3; as suggested,
>> it will be brought in via the vectorIntrinsics branch.
>>
>> Thanks for your help in shaping up this patch. Please let me know if
>> there are any other comments.
>>
>> Best Regards,
>> Jatin
>> ________________________________________
>> From: Bhateja, Jatin
>> Sent: Wednesday, March 25, 2020 12:14 PM
>> To: Vladimir Ivanov
>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
>> Instruction
>>
>> Hi Vladimir,
>>
>> I have placed the updated patches at the following links:
>>
>> 1) Optimized NotV handling:
>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>
>> 2) Changes for MacroLogic opt:
>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
>>
>> Kindly review and let me know your feedback.
>>
>> Thanks,
>> Jatin
>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Wednesday, March 25, 2020 12:33 AM
>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>>> <sandhya.viswanathan at intel.com>
>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
>>> Instruction
>>>
>>> Hi Jatin,
>>>
>>> I tried to submit the patches for testing, but the windows-x64 build
>>> failed with the following errors:
>>>
>>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not evaluate to a constant
>>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read of a variable outside its lifetime
>>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
>>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int ['function']' is not assignable
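
The code at compile.cpp(2345) is not shown in this thread, so the following is only a guess at the kind of construct involved. MSVC typically reports C2131 with the "read of a variable outside its lifetime" note when an array bound is a runtime value (a variable-length array, which GCC and Clang accept as an extension). A minimal sketch of a portable alternative, assuming a hypothetical helper name and using std::vector only to keep the example self-contained (HotSpot code would more likely use its own resource-allocated arrays):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the patch: returns a zero-filled
// buffer whose length is only known at run time.
std::vector<int> make_partition_buffer(std::size_t partition_size) {
  // int buf[partition_size];  // VLA: a GCC/Clang extension that MSVC
  //                           // rejects, since the bound is not a constant.
  std::vector<int> buf(partition_size, 0);  // portable: heap-backed storage
  return buf;
}
```

Unlike a raw array, the returned container is also assignable, which sidesteps the kind of complaint C3863 makes about assigning to an array type.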
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 24.03.2020 10:34, Bhateja, Jatin wrote:
>>>> Hi Vladimir,
>>>>
>>>> Thanks for your comments. I have split the original patch into two
>>>> sub-patches.
>>>>
>>>> 1) Optimized NotV handling:
>>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>>>
>>>> 2) Changes for MacroLogic opt:
>>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>>>>
>>>> I have also added a new flag, "UseVectorMacroLogic", which guards the
>>>> MacroLogic optimization.
>>>>
>>>> Kindly review and let me know your feedback.
>>>>
>>>> Best Regards,
>>>> Jatin
>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Tuesday, March 17, 2020 4:31 PM
>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
>>>>> Instruction
>>>>>
>>>>>
>>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>>>>
>>>>> Very nice contribution, Jatin!
>>>>>
>>>>> Some comments after a brief review pass:
>>>>>
>>>>> * Please, contribute NotV part separately.
>>>>>
>>>>> * Why don't you perform (XorV v 0xFF..FF) => (NotV v)
>>>>> transformation during GVN instead?
>>>>>
>>>>> * As of now, vector nodes are only produced by SuperWord
>>>>> analysis. It makes sense to limit the new optimization pass to the
>>>>> SuperWord pass only (probably by introducing a new dedicated Phase).
>>>>> Once the Vector API is available, it can be extended to cases when
>>>>> vector nodes are present (C->max_vector_size() > 0).
>>>>>
>>>>> * There are more efficient ways to produce a vector of all-1s
>>>>> [1] [2].
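
The XorV-to-NotV point rests on the identity v ^ all-1s == ~v. A minimal scalar sketch of that identity follows, assuming a hypothetical four-lane model of a vector register (the names `Lanes`, `all_ones_via_compare`, `xor_lanes` and `not_lanes` are illustrative and do not come from the patch). It also models one widely used way of producing an all-1s value, comparing a register with itself for equality (as `pcmpeqd` does); the linked answers are not reproduced here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a 128-bit vector with four 32-bit lanes.
using Lanes = std::array<std::uint32_t, 4>;

// Models a compare-equal of a register with itself: every lane compares
// equal, so every lane becomes all-1s regardless of the input value.
Lanes all_ones_via_compare(const Lanes& any) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i) {
    r[i] = (any[i] == any[i]) ? 0xFFFFFFFFu : 0u;
  }
  return r;
}

// Lane-wise XOR, standing in for XorV.
Lanes xor_lanes(const Lanes& a, const Lanes& b) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] ^ b[i];
  return r;
}

// Lane-wise NOT, standing in for NotV.
Lanes not_lanes(const Lanes& v) {
  Lanes r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = ~v[i];
  return r;
}
```

For any input `v`, `xor_lanes(v, all_ones_via_compare(v))` and `not_lanes(v)` produce the same lanes, which is what makes the rewrite safe to apply as a local transformation.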
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>>>>
>>>>> [2]
>>>>> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
>>>>>
>>>>>>
>>>>>> A new optimization pass has been added after auto-vectorization. It
>>>>>> folds expression trees of vector boolean logic operations
>>>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>>>>> The optimization pass has the following stages:
>>>>>>
>>>>>> 1. Collection stage:
>>>>>>    * Performs a DFS traversal over the Ideal Graph and collects the
>>>>>>      root nodes of all vector logic expression trees.
>>>>>> 2. Processing stage:
>>>>>>    * Performs a bottom-up traversal over each expression tree and
>>>>>>      simultaneously folds specific DAG patterns involving Boolean
>>>>>>      logic parent and child nodes.
>>>>>>    * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
>>>>>>    * Folding is performed under a constraint on the total number of
>>>>>>      inputs a MacroLogic node can have; in this case it is 3.
>>>>>>    * A partition is created around a DAG pattern involving a logic
>>>>>>      parent and one or two logic child nodes; it encapsulates the
>>>>>>      nodes in post-order fashion.
>>>>>>    * The partition is then evaluated by traversing its nodes,
>>>>>>      assigning boolean values to its inputs and performing operations
>>>>>>      over them based on each node's opcode. Each node, along with its
>>>>>>      computed result, is stored in a map that is accessed during the
>>>>>>      evaluation of its user/parent node.
>>>>>>    * After evaluation, a MacroLogic node is created which is
>>>>>>      equivalent to a three-input truth table. The leaf-level inputs of
>>>>>>      the expression tree, along with the result of its evaluation, are
>>>>>>      the inputs fed to this new node.
>>>>>>    * The entire expression tree is eventually subsumed/replaced by the
>>>>>>      newly created MacroLogic node.
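
The evaluation step above can be sketched in a few lines. This is not the code from the webrev: it swaps the per-node map for a plain recursive walk and uses the usual bit-parallel trick, where the three inputs are represented by the constants 0xF0, 0xCC and 0xAA, so that evaluating the tree once yields the whole 8-bit truth table. The `Op`, `Expr`, `truth_table` and `apply_ternary` names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcodes for this sketch; these are not HotSpot node classes.
enum class Op { Input, And, Or, Xor, Not };

struct Expr {
  Op op;
  int input;        // 0..2 when op == Op::Input, ignored otherwise
  const Expr* lhs;  // operand (also the single operand of Not)
  const Expr* rhs;  // second operand, nullptr for Input and Not
};

// Column of each input in a three-variable truth table. Bit i of a table
// is the function value for the input combination i = (a << 2) | (b << 1) | c.
constexpr std::uint8_t kInputColumn[3] = {0xF0, 0xCC, 0xAA};

// Evaluates the expression for all eight input combinations at once.
std::uint8_t truth_table(const Expr& e) {
  switch (e.op) {
    case Op::Input:
      assert(e.input >= 0 && e.input < 3);
      return kInputColumn[e.input];
    case Op::And: return truth_table(*e.lhs) & truth_table(*e.rhs);
    case Op::Or:  return truth_table(*e.lhs) | truth_table(*e.rhs);
    case Op::Xor: return truth_table(*e.lhs) ^ truth_table(*e.rhs);
    case Op::Not: return static_cast<std::uint8_t>(~truth_table(*e.lhs));
  }
  return 0;
}

// Scalar model of a three-input logic operation driven by a truth table:
// each result bit is looked up from the corresponding bits of a, b and c.
std::uint64_t apply_ternary(std::uint8_t table, std::uint64_t a,
                            std::uint64_t b, std::uint64_t c) {
  std::uint64_t result = 0;
  for (int bit = 0; bit < 64; ++bit) {
    unsigned idx = static_cast<unsigned>((((a >> bit) & 1) << 2) |
                                         (((b >> bit) & 1) << 1) |
                                         ((c >> bit) & 1));
    result |= static_cast<std::uint64_t>((table >> idx) & 1) << bit;
  }
  return result;
}
```

For example, the tree for (a & b) ^ c evaluates to the table (0xF0 & 0xCC) ^ 0xAA = 0x6A, and applying that table to three words gives the same result as computing (a & b) ^ c directly. A table of 0xFF yields all-1s for any inputs.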
>>>>>>
>>>>>>
>>>>>> Following are the JMH benchmark results with and without the changes.
>>>>>>
>>>>>> Without Changes:
>>>>>>
>>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
>>>>>> MacroLogicOpt.workload1_caller        64  thrpt       2904.480         ops/s
>>>>>> MacroLogicOpt.workload1_caller       128  thrpt       2219.252         ops/s
>>>>>> MacroLogicOpt.workload1_caller       256  thrpt       1507.267         ops/s
>>>>>> MacroLogicOpt.workload1_caller       512  thrpt        860.926         ops/s
>>>>>> MacroLogicOpt.workload1_caller      1024  thrpt        470.163         ops/s
>>>>>> MacroLogicOpt.workload1_caller      2048  thrpt        246.608         ops/s
>>>>>> MacroLogicOpt.workload1_caller      4096  thrpt        108.031         ops/s
>>>>>> MacroLogicOpt.workload2_caller        64  thrpt        344.633         ops/s
>>>>>> MacroLogicOpt.workload2_caller       128  thrpt        209.818         ops/s
>>>>>> MacroLogicOpt.workload2_caller       256  thrpt        111.678         ops/s
>>>>>> MacroLogicOpt.workload2_caller       512  thrpt         53.360         ops/s
>>>>>> MacroLogicOpt.workload2_caller      1024  thrpt         27.888         ops/s
>>>>>> MacroLogicOpt.workload2_caller      2048  thrpt         12.103         ops/s
>>>>>> MacroLogicOpt.workload2_caller      4096  thrpt          6.018         ops/s
>>>>>> MacroLogicOpt.workload3_caller        64  thrpt       3110.669         ops/s
>>>>>> MacroLogicOpt.workload3_caller       128  thrpt       1996.861         ops/s
>>>>>> MacroLogicOpt.workload3_caller       256  thrpt        870.166         ops/s
>>>>>> MacroLogicOpt.workload3_caller       512  thrpt        389.629         ops/s
>>>>>> MacroLogicOpt.workload3_caller      1024  thrpt        151.203         ops/s
>>>>>> MacroLogicOpt.workload3_caller      2048  thrpt         75.086         ops/s
>>>>>> MacroLogicOpt.workload3_caller      4096  thrpt         37.576         ops/s
>>>>>>
>>>>>> With Changes:
>>>>>>
>>>>>> Benchmark                       (VECLEN)   Mode  Cnt     Score  Error  Units
>>>>>> MacroLogicOpt.workload1_caller        64  thrpt       3306.670         ops/s
>>>>>> MacroLogicOpt.workload1_caller       128  thrpt       2936.851         ops/s
>>>>>> MacroLogicOpt.workload1_caller       256  thrpt       2413.827         ops/s
>>>>>> MacroLogicOpt.workload1_caller       512  thrpt       1440.291         ops/s
>>>>>> MacroLogicOpt.workload1_caller      1024  thrpt        707.576         ops/s
>>>>>> MacroLogicOpt.workload1_caller      2048  thrpt        384.863         ops/s
>>>>>> MacroLogicOpt.workload1_caller      4096  thrpt        132.753         ops/s
>>>>>> MacroLogicOpt.workload2_caller        64  thrpt        450.856         ops/s
>>>>>> MacroLogicOpt.workload2_caller       128  thrpt        323.925         ops/s
>>>>>> MacroLogicOpt.workload2_caller       256  thrpt        135.191         ops/s
>>>>>> MacroLogicOpt.workload2_caller       512  thrpt         69.424         ops/s
>>>>>> MacroLogicOpt.workload2_caller      1024  thrpt         35.744         ops/s
>>>>>> MacroLogicOpt.workload2_caller      2048  thrpt         14.168         ops/s
>>>>>> MacroLogicOpt.workload2_caller      4096  thrpt          7.245         ops/s
>>>>>> MacroLogicOpt.workload3_caller        64  thrpt       3333.550         ops/s
>>>>>> MacroLogicOpt.workload3_caller       128  thrpt       2269.428         ops/s
>>>>>> MacroLogicOpt.workload3_caller       256  thrpt        995.691         ops/s
>>>>>> MacroLogicOpt.workload3_caller       512  thrpt        412.452         ops/s
>>>>>> MacroLogicOpt.workload3_caller      1024  thrpt        151.157         ops/s
>>>>>> MacroLogicOpt.workload3_caller      2048  thrpt         75.079         ops/s
>>>>>> MacroLogicOpt.workload3_caller      4096  thrpt         37.158         ops/s
>>>>>>
>>>>>> Please review the patch.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jatin
>>>>>>
>>>>>> [1] Section 17.7 :
>>>>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>>>>