[aarch64-port-dev ] RFR: C2: Canonicalize (x & 16 == 16) [Was: AARCH64 optimization: using TBZ instruction for bit check]

Boris Ulasevich boris.ulasevich at bell-sw.com
Mon Jun 22 14:45:13 UTC 2020


Hi Vladimir,

 > Would be nice to know if any Java benchmark is affected.

With the change we get a 5% performance boost on the Lucene tokenizer
method on ARM64. At the same time, there is no visible improvement on
x86 for the same tokenizer test.

thanks,
Boris

import org.apache.lucene.analysis.standard.StandardTokenizerImpl;
import java.nio.file.Files;
import java.io.*;

class Test {
   public static void main(String[] args) {
     long count = 0;
     try {
       // Read the input file once; every iteration tokenizes the same bytes.
       byte[] content = Files.readAllBytes(new File("aarch64.ad").toPath());
       for (int i = 0; i < 1000; i++) {
         Reader reader = new InputStreamReader(new ByteArrayInputStream(content));
         StandardTokenizerImpl sti = new StandardTokenizerImpl(reader);
         // getNextToken() returns -1 at end of input.
         while (sti.getNextToken() != -1) {
           count++;
         }
       }
     } catch (Exception ex) { System.out.println(ex); }
     // Print the total token count so the loop is not optimized away.
     System.out.println(count);
   }
}


On 19.06.2020 21:36, Vladimir Kozlov wrote:
> Nice optimization.
>
> I don't think we should turn it off on any machine. In a real
> application you will not see such tight loops consisting only of
> such a branch. On the other hand, reducing code size should help in
> all cases.
>
> Would be nice to know if any Java benchmark is affected.
>
> I will try to run our set of benchmarks with these changes.
>
> Regards,
> Vladimir K
>
> On 6/19/20 10:07 AM, Andrew Haley wrote:
>> Hi,
>>
>> On 19/06/2020 17:49, Boris Ulasevich wrote:
>>> I added the expression canonicalization in the BoolNode::Ideal method:
>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.02b
>>>
>>> The change reduces the number of generated machine instructions on
>>> all ARM/x86/PPC architectures. The benchmark shows positive results
>>> on ARM64 and ARM32 with the given change.
>>>
>>> On x86, benchmark performance improves by +1% to +13% depending on
>>> the CPU generation, except for machines affected by the Intel Erratum
>>> (JDK-8234160) issue. The maximum decrease observed is -11%. It does
>>> not look like a problem with the proposed benchmark though, but
>>> rather like an issue with the Erratum mitigation.
>>>
>>> On PowerPC the result of the micro-benchmark is also positive. I
>>> changed the micro-benchmark to make it a little bulkier so that we
>>> don't hit the limitations of architectures with a less elaborate
>>> branch prediction mechanism. The original application performance
>>> does not change on PowerPC.
>>
>> Fantastic work, thanks! You've done a remarkably thorough job. It's
>> slightly unfortunate that one of the targets regresses. If there had
>> been no regressions, I'd approve this straight away.
>>
>> Forwarding to hotspot-compiler-dev for more comments.
>>
>> VladimirK, what do you think? I guess we could turn this off on the
>> machines affected by JDK-8234160. Should we?
>>
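
For readers following the thread, here is a minimal sketch of the
source-level equivalence behind the canonicalization named in the
subject line. It is not taken from the webrev: the class and method
names are hypothetical, and the target form (x & mask) != 0 is an
assumption about what the C2 change produces. The sketch only shows
that, for a single-bit mask such as 16, comparing the masked value with
the mask gives the same answer as comparing it with zero, which is the
shape a test-bit-and-branch instruction such as TBZ can handle directly.

```java
class BitCheckEquivalence {
    // Form as written in the subject line: masked value compared with the mask.
    static boolean maskedEqualsMask(int x, int mask) {
        return (x & mask) == mask;
    }

    // Assumed canonical form: masked value compared with zero.
    static boolean maskedNonZero(int x, int mask) {
        return (x & mask) != 0;
    }

    // Checks that both forms agree for every single-bit int mask
    // over a small set of sample inputs.
    static boolean formsAgreeForSingleBitMasks() {
        int[] samples = {0, 1, 15, 16, 17, 31, 32, -1,
                         Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int bit = 0; bit < 32; bit++) {
            int mask = 1 << bit;
            for (int x : samples) {
                if (maskedEqualsMask(x, mask) != maskedNonZero(x, mask)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(formsAgreeForSingleBitMasks());
        // With a multi-bit mask (48 = 32 + 16) the two forms differ,
        // so the rewrite is only valid when the mask has one bit set.
        System.out.println(maskedEqualsMask(16, 48) + " " + maskedNonZero(16, 48));
    }
}
```

Running main prints "true" and then "false true". The second line is
the reason the equivalence is restricted to power-of-two masks.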


