[aarch64-port-dev ] RFR: C2: Canonicalize (x & 16 == 16) [Was: AARCH64 optimization: using TBZ instruction for bit check]

Mon Jun 22 15:48:07 UTC 2020

On 6/22/20 7:45 AM, Boris Ulasevich wrote:
> Hi Vladimir,
> 
>  > Would be nice to know if any Java benchmark is affected.
> 
> With the change we got a 5% performance boost on the Lucene tokenizer method on ARM64. At the same time, there is no
> visible improvement on the Lucene tokenizer on x86.

Good.

I ran our benchmarks (mostly jvm2008) on x86 and don't see any effect either.

Thanks,
Vladimir

> 
> thanks,
> Boris
> 
> import org.apache.lucene.analysis.standard.StandardTokenizerImpl;
> import java.nio.file.Files;
> import java.io.*;
> 
> class Test {
>    public static void main(String[] args) {
>      long count = 0;  // total number of tokens seen over all passes
>      try {
>        // Tokenize the same input file repeatedly to exercise the scanner loop.
>        byte[] content = Files.readAllBytes(new File("aarch64.ad").toPath());
>        for (int i = 0; i < 1000; i++) {
>          Reader reader = new InputStreamReader(new ByteArrayInputStream(content));
>          StandardTokenizerImpl sti = new StandardTokenizerImpl(reader);
>          // The loop ends when getNextToken() returns -1.
>          while (sti.getNextToken() != -1) {
>            count++;
>          }
>        }
>      } catch (Exception ex) { System.out.println(ex); }
>      System.out.println(count);
>    }
> }
> 
> 
> On 19.06.2020 21:36, Vladimir Kozlov wrote:
>> Nice optimization.
>>
>> I don't think we should turn it off on any machine. In a real application you will not see such tight loops containing only 
>> such a branch. On the other hand, reducing code size should help in all cases.
>>
>> Would be nice to know if any Java benchmark is affected.
>>
>> I will try to run our set of benchmarks with these changes.
>>
>> Regards,
>> Vladimir K
>>
>> On 6/19/20 10:07 AM, Andrew Haley wrote:
>>> Hi,
>>>
>>> On 19/06/2020 17:49, Boris Ulasevich wrote:
>>>> I added the expression canonicalization in the BoolNode::Ideal method:
>>>> http://cr.openjdk.java.net/~bulasevich/8247408/webrev.02b
>>>>
>>>> The change reduces the number of generated machine instructions on all
>>>> ARM/x86/PPC architectures. The benchmark shows positive results on ARM64 and
>>>> ARM32 with the given change.
>>>>
>>>> On x86, benchmark performance improves by +1% to +13% depending on the
>>>> CPU generation, except for machines affected by the Intel Erratum (JDK-8234160)
>>>> issue. The maximum decrease observed is -11%. It does not look like a problem
>>>> with the proposed benchmark, though, but rather like an issue with
>>>> Erratum mitigation.
>>>>
>>>> On PowerPC the result of the micro-benchmark is also positive. I changed the
>>>> micro-benchmark to make it a little bulkier so that we don't hit the
>>>> limitations of architectures with a less elaborate branch prediction
>>>> mechanism. The original application performance does not change on PowerPC.
>>>
>>> Fantastic work, thanks! You've done a remarkably thorough job. It's
>>> slightly unfortunate that one of the targets regresses. If there had
>>> been no regressions, I'd approve this straight away.
>>>
>>> Forwarding to hotspot-compiler-dev for more comments.
>>>
>>> VladimirK, what do you think? I guess we could turn this off on the
>>> machines affected by JDK-8234160. Should we?
>>>
>
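
For readers following the thread, here is a minimal sketch of the bit-check identity named in the subject line. It is not the C2 patch itself. It assumes the canonicalization rewrites (x & mask) == mask into (x & mask) != 0 when mask has exactly one bit set; the class and method names below are hypothetical and exist only for illustration. The sketch checks that the two forms agree for single-bit masks and shows that they can differ for a multi-bit mask, which is why the rewrite would need a power-of-two constant.

```java
public class BitCheckEquivalence {
    // Form as written in source code: compare the masked value with the mask.
    static boolean originalForm(int x, int mask) {
        return (x & mask) == mask;
    }

    // Assumed canonical form: test the masked value against zero.
    static boolean canonicalForm(int x, int mask) {
        return (x & mask) != 0;
    }

    // True when exactly one bit of mask is set.
    static boolean isSingleBit(int mask) {
        return mask != 0 && (mask & (mask - 1)) == 0;
    }

    // Compare both forms over a small range of inputs plus the int extremes.
    static boolean formsAgree(int mask) {
        for (int x = -1024; x <= 1024; x++) {
            if (originalForm(x, mask) != canonicalForm(x, mask)) {
                return false;
            }
        }
        return originalForm(Integer.MIN_VALUE, mask) == canonicalForm(Integer.MIN_VALUE, mask)
            && originalForm(Integer.MAX_VALUE, mask) == canonicalForm(Integer.MAX_VALUE, mask);
    }

    public static void main(String[] args) {
        // Single-bit mask from the subject line: the two forms agree.
        System.out.println("mask 16 single bit: " + isSingleBit(16)
            + ", forms agree: " + formsAgree(16));
        // Multi-bit mask: x = 8 passes the != 0 test but not the == mask test.
        System.out.println("mask 24 single bit: " + isSingleBit(24)
            + ", forms agree: " + formsAgree(24));
    }
}
```

With a single-bit mask the masked value can only be 0 or the mask itself, so both comparisons give the same answer. On AArch64 a test of one bit against zero is the kind of check that a single TBZ/TBNZ instruction can perform, which matches the original subject of the thread.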