RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
Nils Eliasson
nils.eliasson at oracle.com
Thu Apr 2 09:28:45 UTC 2020
Hi Jatin,
The patch is nice and clean.
Reviewed.
Best regards
Nils Eliasson
On 2020-04-01 20:23, Bhateja, Jatin wrote:
> Hi Vladimir,
>
> Please find an updated unified patch at the following link.
>
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/
>
> This removes the optimized NotV handling for AVX3; as suggested, it will be
> brought in via the vectorIntrinsics branch.
>
> Thanks for your help in shaping up this patch. Please let me know if there
> are any other comments.
>
> Best Regards,
> Jatin
> ________________________________________
> From: Bhateja, Jatin
> Sent: Wednesday, March 25, 2020 12:14 PM
> To: Vladimir Ivanov
> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>
> Hi Vladimir,
>
> I have placed the updated patches at the following links:
>
> 1) Optimized NotV handling:
> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>
> 2) Changes for MacroLogic opt:
> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/
>
> Kindly review and let me know your feedback.
>
> Thanks,
> Jatin
>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Wednesday, March 25, 2020 12:33 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya
>> <sandhya.viswanathan at intel.com>
>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction
>>
>> Hi Jatin,
>>
>> I tried to submit the patches for testing, but the windows-x64 build failed
>> with the following errors:
>>
>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not
>> evaluate to a constant
>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read
>> of a variable outside its lifetime
>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition'
>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int
>> ['function']' is not assignable
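
For what it's worth, C2131 at that line usually points at a variable-length
array, i.e. a local array sized by a runtime value, which MSVC (and standard
C++) rejects. Below is a minimal standalone sketch of the pattern and a
portable workaround; the names are made up for illustration and are not the
actual compile.cpp code:

  #include <vector>

  // Illustrative only: 'partition' stands in for whatever runtime value
  // was used to size a local array in compile.cpp.
  void evaluate(int partition) {
    // int results[partition];            // C2131 on MSVC: the size is not a
                                          // compile-time constant (a VLA).
    std::vector<int> results(partition);  // portable: size decided at runtime
    for (int i = 0; i < partition; i++) {
      results[i] = i & 1;                 // placeholder work
    }
  }

  int main() {
    evaluate(8);
    return 0;
  }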
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 24.03.2020 10:34, Bhateja, Jatin wrote:
>>> Hi Vladimir,
>>>
>>> Thanks for your comments. I have split the original patch into two
>>> sub-patches.
>>> 1) Optimized NotV handling:
>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/
>>>
>>> 2) Changes for MacroLogic opt:
>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/
>>>
>>> Added a new flag, "UseVectorMacroLogic", which guards the MacroLogic
>>> optimization.
>>> Kindly review and let me know your feedback.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Tuesday, March 17, 2020 4:31 PM
>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>;
>>>> hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic
>>>> Instruction
>>>>
>>>>
>>>>> Patch: http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/
>>>> Very nice contribution, Jatin!
>>>>
>>>> Some comments after a brief review pass:
>>>>
>>>>     * Please contribute the NotV part separately.
>>>>
>>>>     * Why don't you perform the (XorV v 0xFF..FF) => (NotV v)
>>>>       transformation during GVN instead?
>>>>
>>>>     * As of now, vector nodes are only produced by SuperWord analysis.
>>>>       It makes sense to limit the new optimization pass to the SuperWord
>>>>       pass only (probably by introducing a new dedicated Phase). Once the
>>>>       Vector API is available, it can be extended to cases where vector
>>>>       nodes are present (C->max_vector_size() > 0).
>>>>
>>>> * There are more efficient ways to produce a vector of all-1s [1] [2].
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1]
>>>> https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
>>>>
>>>> [2]
>>>> https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits
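
A rough standalone sketch of the suggested (XorV v, all-ones) => (NotV v)
rewrite, using a toy IR rather than HotSpot's Node/PhaseGVN classes; all
names below are made up for illustration:

  #include <cstdio>

  // Toy IR: just enough structure to show folding an Xor with an all-ones
  // constant into a Not during an idealization step.
  enum class Op { Input, Not, Xor };

  struct Node {
    Op    op;
    Node* in[2]       = {nullptr, nullptr};
    bool  is_all_ones = false;   // stands in for a Replicate(-1) constant
  };

  // If one input of the Xor is the all-ones vector constant, replace the
  // whole Xor with a Not of the other input.
  Node* ideal_xor(Node* n) {
    if (n->op != Op::Xor) return n;
    for (int i = 0; i < 2; i++) {
      if (n->in[i] != nullptr && n->in[i]->is_all_ones) {
        return new Node{Op::Not, {n->in[1 - i], nullptr}};
      }
    }
    return n;
  }

  int main() {
    Node v{Op::Input};
    Node ones{Op::Input};
    ones.is_all_ones = true;
    Node x{Op::Xor, {&v, &ones}};
    Node* r = ideal_xor(&x);
    std::printf("folded to Not: %s\n", r->op == Op::Not ? "yes" : "no");
    return 0;
  }

On x86 the all-ones constant itself is typically materialized by comparing a
register against itself (e.g. vpcmpeqd), which is what [1] and [2] above are
about.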
>>>>
>>>>> A new optimization pass has been added after auto-vectorization; it
>>>>> folds expression trees involving vector boolean logic operations
>>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node.
>>>>>
>>>>> The optimization pass has the following stages:
>>>>>
>>>>> 1. Collection stage:
>>>>>    * Performs a DFS traversal over the ideal graph and collects the
>>>>>      root nodes of all vector logic expression trees.
>>>>>
>>>>> 2. Processing stage:
>>>>>    * Performs a bottom-up traversal over each expression tree and
>>>>>      simultaneously folds specific DAG patterns involving a boolean
>>>>>      logic parent and its child nodes.
>>>>>    * Transforms (XORV INP, -1) -> (NOTV INP) to promote logic folding.
>>>>>    * Folding is performed under a constraint on the total number of
>>>>>      inputs a MacroLogic node can have, which in this case is 3.
>>>>>    * A partition is created around a DAG pattern involving a logic
>>>>>      parent and one or two logic child nodes; it encapsulates the
>>>>>      nodes in post-order fashion.
>>>>>    * This partition is then evaluated by traversing over its nodes,
>>>>>      assigning boolean values to the inputs and performing operations
>>>>>      over them based on each node's opcode. Each node, together with
>>>>>      its computed result, is stored in a map which is consulted while
>>>>>      evaluating its user/parent node.
>>>>>    * After evaluation, a MacroLogic node equivalent to a three-input
>>>>>      truth table is created. The leaf-level inputs of the expression
>>>>>      tree, along with the result of the evaluation, are the inputs
>>>>>      fed to this new node.
>>>>>    * The entire expression tree is eventually subsumed/replaced by the
>>>>>      newly created MacroLogic node.
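
To make the evaluation step concrete, here is a small standalone sketch
(illustrative only, not the actual compile.cpp code): each of the three leaf
inputs is assigned a canonical 8-bit pattern, and evaluating the boolean
expression bitwise over those patterns yields the 8-bit truth table, i.e. the
immediate that AVX-512's vpternlogd/vpternlogq instructions consume.

  #include <cstdint>
  #include <cstdio>

  // Canonical input patterns: bit k of each pattern is the value of that
  // input in the k-th of the 8 possible (A,B,C) combinations.
  constexpr uint8_t A = 0xF0;
  constexpr uint8_t B = 0xCC;
  constexpr uint8_t C = 0xAA;

  int main() {
    // Example expression (A & B) ^ (~C); evaluating it bitwise over the
    // patterns produces the truth-table immediate for that expression.
    uint8_t imm = (A & B) ^ static_cast<uint8_t>(~C);
    std::printf("truth-table immediate: 0x%02X\n", imm);   // prints 0x95
    return 0;
  }

The MacroLogic node then takes the (up to three) leaf inputs of the
expression tree plus this computed table as its inputs.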
>>>>>
>>>>> Following are the JMH benchmark results with and without the changes.
>>>>>
>>>>> Without Changes:
>>>>>
>>>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s
>>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s
>>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s
>>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s
>>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s
>>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s
>>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s
>>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s
>>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s
>>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s
>>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s
>>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s
>>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s
>>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s
>>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s
>>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s
>>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s
>>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s
>>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s
>>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s
>>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s
>>>>>
>>>>> With Changes:
>>>>>
>>>>> Benchmark (VECLEN) Mode Cnt Score Error Units
>>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s
>>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s
>>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s
>>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s
>>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s
>>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s
>>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s
>>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s
>>>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s
>>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s
>>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s
>>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s
>>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s
>>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s
>>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s
>>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s
>>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s
>>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s
>>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s
>>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s
>>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s
>>>>>
>>>>> Please review the patch.
>>>>>
>>>>> Best Regards,
>>>>> Jatin
>>>>>
>>>>> [1] Section 17.7:
>>>>> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>>>>>