From ci_notify at linaro.org  Fri Nov  1 04:18:42 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Fri, 1 Nov 2019 04:18:42 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 11u on AArch64
Message-ID: <486311059.11170.1572581922901.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/summary/2019/304/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/15 pass: 5,737; fail: 5; not run: 11,623
Build 1: aarch64/2019/jun/27 pass: 5,737; fail: 5
Build 2: aarch64/2019/jul/02 pass: 5,737; fail: 5
Build 3: aarch64/2019/aug/03 pass: 5,746; fail: 4
Build 4: aarch64/2019/aug/10 pass: 5,747; fail: 4
Build 5: aarch64/2019/aug/15 pass: 5,753; fail: 4
Build 6: aarch64/2019/aug/22 pass: 5,755; fail: 4
Build 7: aarch64/2019/sep/04 pass: 5,764; fail: 2
Build 8: aarch64/2019/sep/05 pass: 5,764; fail: 2
Build 9: aarch64/2019/sep/10 pass: 5,764; fail: 2
Build 10: aarch64/2019/sep/17 pass: 5,763; fail: 3
Build 11: aarch64/2019/sep/21 pass: 5,764; fail: 2
Build 12: aarch64/2019/oct/04 pass: 5,764; fail: 2
Build 13: aarch64/2019/oct/17 pass: 5,764; fail: 2
Build 14: aarch64/2019/oct/31 pass: 5,784; fail: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/15 pass: 8,409; fail: 506; error: 20
Build 1: aarch64/2019/jun/27 pass: 8,401; fail: 512; error: 22
Build 2: aarch64/2019/jul/02 pass: 8,407; fail: 498; error: 31
Build 3: aarch64/2019/aug/03 pass: 8,429; fail: 509; error: 18
Build 4: aarch64/2019/aug/10 pass: 8,450; fail: 485; error: 16
Build 5: aarch64/2019/aug/15 pass: 8,443; fail: 496; error: 13
Build 6: aarch64/2019/aug/22 pass: 8,446; fail: 494; error: 15
Build 7: aarch64/2019/sep/04 pass: 8,483; fail: 465; error: 10
Build 8: aarch64/2019/sep/05 pass: 8,465; fail: 479; error: 14
Build 9: aarch64/2019/sep/10 pass: 8,444; fail: 500; error: 14
Build 10: aarch64/2019/sep/17 pass: 8,462; fail: 482; error: 12
Build 11: aarch64/2019/sep/21 pass: 8,467; fail: 478; error: 13
Build 12: aarch64/2019/oct/04 pass: 8,444; fail: 498; error: 16
Build 13: aarch64/2019/oct/17 pass: 8,452; fail: 493; error: 16
Build 14: aarch64/2019/oct/31 pass: 8,468; fail: 490; error: 14

5 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/15 pass: 3,908
Build 1: aarch64/2019/jun/27 pass: 3,908
Build 2: aarch64/2019/jul/02 pass: 3,908
Build 3: aarch64/2019/aug/03 pass: 3,908
Build 4: aarch64/2019/aug/10 pass: 3,909
Build 5: aarch64/2019/aug/15 pass: 3,909
Build 6: aarch64/2019/aug/22 pass: 3,909
Build 7: aarch64/2019/sep/04 pass: 3,910
Build 8: aarch64/2019/sep/05 pass: 3,910
Build 9: aarch64/2019/sep/10 pass: 3,910
Build 10: aarch64/2019/sep/17 pass: 3,910
Build 11: aarch64/2019/sep/21 pass: 3,910
Build 12: aarch64/2019/oct/04 pass: 3,910
Build 13: aarch64/2019/oct/17 pass: 3,910
Build 14: aarch64/2019/oct/31 pass: 3,910

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.46x
Relative performance: Server critical-jOPS (nc): 7.91x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 210.67

Server 210.67 / Server 2014-04-01 (71.00): 2.97x
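
The ratio above is plain division of the current Server score by the
2014-04-01 Server baseline. A minimal sketch, assuming the report rounds to
two decimal places (relative_performance() is a hypothetical helper, not part
of the test harness):

```cpp
#include <cassert>
#include <cmath>

// Ratio of the current score to a baseline score, rounded to two decimals.
// Rounding to two decimals is an assumption about how the report is produced.
inline double relative_performance(double current, double baseline) {
  return std::round(current / baseline * 100.0) / 100.0;
}
```

With the figures above, relative_performance(210.67, 71.00) gives 2.97.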

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-06-16 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/166/results/
2019-06-28 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/178/results/
2019-07-03 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/183/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/215/results/
2019-08-11 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/222/results/
2019-08-16 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/227/results/
2019-08-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/234/results/
2019-09-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/247/results/
2019-09-07 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/248/results/
2019-09-11 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/253/results/
2019-09-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/260/results/
2019-09-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/264/results/
2019-10-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/277/results/
2019-10-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/290/results/
2019-11-01 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/304/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/

From aph at redhat.com  Fri Nov  1 10:15:37 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 1 Nov 2019 10:15:37 +0000
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
Message-ID: <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>

On 10/31/19 6:48 PM, Zhengyu Gu wrote:
> Right now, the decisions on whether a load needs a load reference
> barrier, if so what kind, and whether the reference needs to be kept
> alive are scattered across the interpreter/C1/C2 load barrier code,
> which makes it hard to keep them consistent.
> 
> I would like to centralize the decision making in
> ShenandoahBarrierSet, so they can be consistent and easy to maintain.

You should say, at the start of every routine you touch, which
registers are inputs, which are outputs, and (important) which may
alias with rscratch1 and rscratch2. Please also mark clobbers of
rscratch1 and 2.
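
As an illustration only, a header comment of that kind might look like the
sketch below. The Reg enum, all_different() and load_reference_contract_ok()
are hypothetical stand-ins, not code from the webrev; the check is loosely in
the spirit of HotSpot's assert_different_registers().

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-in for a machine register; HotSpot has its own Register type.
enum Reg { r0, r1, r2, rscratch1, rscratch2 };

// Returns true when no register appears twice in the list.
inline bool all_different(std::initializer_list<Reg> regs) {
  for (auto i = regs.begin(); i != regs.end(); ++i) {
    for (auto j = i + 1; j != regs.end(); ++j) {
      if (*i == *j) return false;
    }
  }
  return true;
}

// Hypothetical routine, used only to show the header-comment convention.
//
// Inputs:   src - register holding the address to load from
// Outputs:  dst - register receiving the loaded reference
// Aliasing: dst may alias src; neither may alias rscratch1 or rscratch2
// Clobbers: rscratch1, rscratch2
//
// Returns whether the caller's register choice honours the aliasing rule.
inline bool load_reference_contract_ok(Reg dst, Reg src) {
  return all_different({dst, rscratch1, rscratch2}) &&
         all_different({src, rscratch1, rscratch2});
}
```

Stating the contract in the comment and checking it at the top of the routine
keeps the two from drifting apart.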

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From zgu at redhat.com  Fri Nov  1 14:15:56 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Fri, 1 Nov 2019 10:15:56 -0400
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
Message-ID: <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>

>>
>> I would like to centralize the decision making in
>> ShenandoahBarrierSet, so they can be consistent and easy to maintain.
> 
> You should say, at the start of every routine you touch, which
> registers are inputs, which are outputs, and (important) which may
> alias with rscratch1 and rscratch2. Please also mark clobbers of
> rscratch1 and 2.
> 
Okay, updated:

Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.01/index.html

Thanks,

-Zhengyu

From shade at redhat.com  Fri Nov  1 15:43:20 2019
From: shade at redhat.com (Aleksey Shipilev)
Date: Fri, 1 Nov 2019 16:43:20 +0100
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
 <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
Message-ID: <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>

On 11/1/19 3:15 PM, Zhengyu Gu wrote:
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.01/index.html

To be honest, it does not look like much of an improvement at first glance. Maybe we should
massage the code a bit to make it more readable? Roman also needs to take a look.

*) shenandoahBarrierSetAssembler_x86.cpp, I believe it would be more straightforward to save
branching on local variable "need_load_reference_barrier" by spelling out the "disabled" path
directly (in fact, I think you are almost there in shenandoahBarrierSetC1.cpp!):

  if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
    BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
    return;
  }

  ... code that assumes need_load_reference_barrier = true follows ...

  Register result_dst = dst;
  bool use_tmp1_for_dst = false;

*) shenandoahBarrierSetC1.cpp: local variable "need_load_reference_barrier" is not needed, there is
only a single use

*) shenandoahBarrierSetC2.cpp: this block should go all the way up:

  if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
    return load;
  }

*) shenandoahBarrierSet.cpp: this is just "return is_reference_type(type)". Saves some inversions.

  if (!is_reference_type(type)) return false;
  return true;

*) shenandoahBarrierSet.cpp: the assert message should be "Should be subset of LRB":

  assert(need_load_reference_barrier(decorators, type), "Why ask?");

*) shenandoahBarrierSet.cpp: seems like this assert is subsumed by the previous one?

  assert(is_reference_type(type), "Why we here?");
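
A minimal sketch of the shenandoahBarrierSet.cpp simplification suggested
above, under stated assumptions: the BasicType enum and is_reference_type()
here are simplified placeholders for the HotSpot ones, the decorators
parameter is omitted, and the need_lrb_* and forms_agree() names are
hypothetical.

```cpp
#include <cassert>

// Simplified placeholders; HotSpot defines its own BasicType and is_reference_type().
enum BasicType { T_INT, T_LONG, T_OBJECT, T_ARRAY };

inline bool is_reference_type(BasicType t) {
  return t == T_OBJECT || t == T_ARRAY;
}

// Shape quoted from the webrev: negate, branch, return constants.
inline bool need_lrb_original(BasicType type) {
  if (!is_reference_type(type)) return false;
  return true;
}

// Suggested shape: return the predicate directly.
inline bool need_lrb_simplified(BasicType type) {
  return is_reference_type(type);
}

// Checks that both shapes give the same answer for every placeholder type.
inline bool forms_agree() {
  const BasicType all[] = { T_INT, T_LONG, T_OBJECT, T_ARRAY };
  for (BasicType t : all) {
    if (need_lrb_original(t) != need_lrb_simplified(t)) return false;
  }
  return true;
}
```

Both shapes return the same value for every input, so the change is purely
about readability.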


-- 
Thanks,
-Aleksey


From zgu at redhat.com  Fri Nov  1 17:37:49 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Fri, 1 Nov 2019 13:37:49 -0400
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
 <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
 <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>
Message-ID: <d9ad3216-1645-b88d-1cc6-d59f4caab2a7@redhat.com>

Hi Aleksey,

On 11/1/19 11:43 AM, Aleksey Shipilev wrote:
> On 11/1/19 3:15 PM, Zhengyu Gu wrote:
>> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.01/index.html
> 
> To be honest, it does not look like much of an improvement at first glance. Maybe we should
> massage the code a bit to make it more readable? Roman also needs to take a look.

Right, it is not. But I believe that should be done in a separate CR, as
it may cause backport headaches, right?

Filed: https://bugs.openjdk.java.net/browse/JDK-8233401

As a matter of fact, I would like to hold off on this code review until the
refactoring is done.

Thanks,

-Zhengyu


From ci_notify at linaro.org  Fri Nov  1 21:07:36 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Fri, 1 Nov 2019 21:07:36 +0000 (UTC)
Subject: [aarch64-port-dev ] Linaro OpenJDK AArch64 jdk/jdk build 2277
	Failure
Message-ID: <1322569899.11330.1572642456920.JavaMail.javamailuser@localhost>

OpenJDK AArch64 jdk/jdk build status is Failure
Build details -  https://ci.linaro.org/job/jdkX-ci-build/2277/

Changes -
  kbarrett: 4ec9fc2b2f0deeb8eabbb816269f5a7f6484be3e
	- src/hotspot/share/memory/operator_new.cpp
--"8233359: Add global sized operator delete definitions
Summary: Added new definitions.
Reviewed-by: dholmes"

  bpb: 76638c631869b6da4a7b116dac9a17438cc819c7
	- src/java.base/share/classes/java/nio/file/FileStore.java
	- src/java.base/unix/classes/sun/nio/fs/UnixFileStore.java
	- src/java.base/windows/classes/sun/nio/fs/WindowsFileStore.java
--"8162520: (fs) FileStore should support file stores with > Long.MAX_VALUE capacity
Reviewed-by: alanb, darcy, rriggs"

Build output -
   Building target 'images' in configuration '/home/buildslave/workspace/jdkX-ci-build/build'
   Compiling 8 files for BUILD_TOOLS_LANGTOOLS
   Compiling 1 files for BUILD_JFR_TOOLS
   Creating hotspot/variant-server/tools/adlc/adlc from 13 file(s)
   Compiling 2 files for BUILD_JVMTI_TOOLS
   Compiling 10 properties into resource bundles for jdk.javadoc
   Parsing 2 properties into enum-like class for jdk.compiler
   Compiling 19 properties into resource bundles for jdk.compiler
   Compiling 12 properties into resource bundles for jdk.jdeps
   Compiling 7 properties into resource bundles for jdk.jshell
   Compiling 117 files for BUILD_java.compiler.interim
   Creating support/modules_libs/java.base/server/libjvm.so from 1006 file(s)
   Creating hotspot/variant-server/libjvm/gtest/libjvm.so from 126 file(s)
   Creating hotspot/variant-server/libjvm/gtest/gtestLauncher from 1 file(s)
   Compiling 401 files for BUILD_jdk.compiler.interim
   /home/buildslave/workspace/jdkX-ci-build/jdkX/src/hotspot/share/memory/operator_new.cpp:92:6: error: 'void operator delete(void*, size_t)' is a usual (non-placement) deallocation function in C++14 (or with -fsized-deallocation) [-Werror=c++14-compat]
    void operator delete(void* p, size_t size) throw() {
         ^
   /home/buildslave/workspace/jdkX-ci-build/jdkX/src/hotspot/share/memory/operator_new.cpp:96:6: error: 'void operator delete [](void*, size_t)' is a usual (non-placement) deallocation function in C++14 (or with -fsized-deallocation) [-Werror=c++14-compat]
    void operator delete [](void* p, size_t size) throw() {
         ^
   cc1plus: error: unrecognized command line option '-Wno-cast-function-type' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-misleading-indentation' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-implicit-fallthrough' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-int-in-bool-context' [-Werror]
   cc1plus: all warnings being treated as errors
   lib/CompileJvm.gmk:176: recipe for target '/home/buildslave/workspace/jdkX-ci-build/build/hotspot/variant-server/libjvm/objs/operator_new.o' failed
   make[3]: *** [/home/buildslave/workspace/jdkX-ci-build/build/hotspot/variant-server/libjvm/objs/operator_new.o] Error 1
   make[3]: *** Waiting for unfinished jobs....
   make/Main.gmk:268: recipe for target 'hotspot-server-libs' failed
   make[2]: *** [hotspot-server-libs] Error 1
   make[2]: *** Waiting for unfinished jobs....
   Compiling 218 files for BUILD_jdk.javadoc.interim
   
   ERROR: Build failed for target 'images' in configuration '/home/buildslave/workspace/jdkX-ci-build/build' (exit code 2) 
   
   === Output from failing command(s) repeated here ===
   * For target hotspot_variant-server_libjvm_objs_operator_new.o:
   /home/buildslave/workspace/jdkX-ci-build/jdkX/src/hotspot/share/memory/operator_new.cpp:92:6: error: 'void operator delete(void*, size_t)' is a usual (non-placement) deallocation function in C++14 (or with -fsized-deallocation) [-Werror=c++14-compat]
    void operator delete(void* p, size_t size) throw() {
         ^
   /home/buildslave/workspace/jdkX-ci-build/jdkX/src/hotspot/share/memory/operator_new.cpp:96:6: error: 'void operator delete [](void*, size_t)' is a usual (non-placement) deallocation function in C++14 (or with -fsized-deallocation) [-Werror=c++14-compat]
    void operator delete [](void* p, size_t size) throw() {
         ^
   cc1plus: error: unrecognized command line option '-Wno-cast-function-type' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-misleading-indentation' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-implicit-fallthrough' [-Werror]
   cc1plus: error: unrecognized command line option '-Wno-int-in-bool-context' [-Werror]
   cc1plus: all warnings being treated as errors
   
   * All command lines available in /home/buildslave/workspace/jdkX-ci-build/build/make-support/failure-logs.
   === End of repeated output ===
   
   === Make failed targets repeated here ===
   lib/CompileJvm.gmk:176: recipe for target '/home/buildslave/workspace/jdkX-ci-build/build/hotspot/variant-server/libjvm/objs/operator_new.o' failed
   make/Main.gmk:268: recipe for target 'hotspot-server-libs' failed
   === End of repeated output ===
   
   Hint: Try searching the build log for the name of the first failed target.
   Hint: See doc/building.html#troubleshooting for assistance.
   
   /home/buildslave/workspace/jdkX-ci-build/jdkX/make/Init.gmk:307: recipe for target 'main' failed
   make[1]: *** [main] Error 1
   /home/buildslave/workspace/jdkX-ci-build/jdkX/make/Init.gmk:186: recipe for target 'images' failed
   make: *** [images] Error 2

From ci_notify at linaro.org  Sat Nov  2 02:42:18 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sat, 2 Nov 2019 02:42:18 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <2110647221.11377.1572662539493.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/305/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/16 pass: 5,726
Build 1: aarch64/2019/sep/18 pass: 5,727
Build 2: aarch64/2019/sep/20 pass: 5,728
Build 3: aarch64/2019/sep/23 pass: 5,727
Build 4: aarch64/2019/oct/07 pass: 5,750
Build 5: aarch64/2019/oct/09 pass: 5,747; fail: 1
Build 6: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 7: aarch64/2019/oct/14 pass: 5,753
Build 8: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 9: aarch64/2019/oct/18 pass: 5,760
Build 10: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 11: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 12: aarch64/2019/oct/28 pass: 5,766
Build 13: aarch64/2019/oct/30 pass: 5,768
Build 14: aarch64/2019/nov/01 pass: 5,768; fail: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/16 pass: 8,687; fail: 501; error: 21
Build 1: aarch64/2019/sep/18 pass: 8,675; fail: 517; error: 18
Build 2: aarch64/2019/sep/20 pass: 8,685; fail: 503; error: 22
Build 3: aarch64/2019/sep/23 pass: 8,696; fail: 497; error: 19
Build 4: aarch64/2019/oct/07 pass: 8,683; fail: 517; error: 18
Build 5: aarch64/2019/oct/09 pass: 8,692; fail: 507; error: 21
Build 6: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 7: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 8: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 9: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 10: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 11: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 12: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 13: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 14: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/16 pass: 3,978
Build 1: aarch64/2019/sep/18 pass: 3,978
Build 2: aarch64/2019/sep/20 pass: 3,979
Build 3: aarch64/2019/sep/23 pass: 3,979
Build 4: aarch64/2019/oct/07 pass: 3,979
Build 5: aarch64/2019/oct/09 pass: 3,979
Build 6: aarch64/2019/oct/11 pass: 3,979
Build 7: aarch64/2019/oct/14 pass: 3,979
Build 8: aarch64/2019/oct/16 pass: 3,979
Build 9: aarch64/2019/oct/18 pass: 3,979
Build 10: aarch64/2019/oct/21 pass: 3,979
Build 11: aarch64/2019/oct/23 pass: 3,980
Build 12: aarch64/2019/oct/28 pass: 3,980
Build 13: aarch64/2019/oct/30 pass: 3,980
Build 14: aarch64/2019/nov/01 pass: 3,980

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 8.09x
Relative performance: Server critical-jOPS (nc): 10.05x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 210.67

Server 210.67 / Server 2014-04-01 (71.00): 2.97x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-09-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/259/results/
2019-09-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/261/results/
2019-09-21 pass rate: 10487/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/263/results/
2019-09-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/266/results/
2019-10-08 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/280/results/
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From ci_notify at linaro.org  Sat Nov  2 13:36:40 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sat, 2 Nov 2019 13:36:40 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 8u on AArch64
Message-ID: <284063212.11475.1572701801018.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk8u/openjdk-jtreg-nightly-tests/summary/2019/306/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/19 pass: 814; fail: 20; error: 4
Build 1: aarch64/2019/jul/25 pass: 802; fail: 25; error: 11
Build 2: aarch64/2019/jul/30 pass: 787; fail: 40; error: 11
Build 3: aarch64/2019/aug/01 pass: 800; fail: 26; error: 12
Build 4: aarch64/2019/aug/04 pass: 808; fail: 30; error: 2
Build 5: aarch64/2019/aug/06 pass: 799; fail: 29; error: 12
Build 6: aarch64/2019/aug/08 pass: 830; fail: 9; error: 1
Build 7: aarch64/2019/aug/11 pass: 825; fail: 14; error: 1
Build 8: aarch64/2019/aug/13 pass: 830; fail: 9; error: 1
Build 9: aarch64/2019/aug/15 pass: 837; fail: 9; error: 1
Build 10: aarch64/2019/aug/17 pass: 837; fail: 9; error: 1
Build 11: aarch64/2019/aug/22 pass: 837; fail: 9; error: 1
Build 12: aarch64/2019/sep/10 pass: 838; fail: 13; error: 1
Build 13: aarch64/2019/sep/21 pass: 838; fail: 13; error: 1
Build 14: aarch64/2019/nov/02 pass: 843; fail: 9; error: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/19 pass: 5,940; fail: 278; error: 22
Build 1: aarch64/2019/jul/25 pass: 5,938; fail: 276; error: 26
Build 2: aarch64/2019/jul/30 pass: 5,942; fail: 273; error: 25
Build 3: aarch64/2019/aug/01 pass: 5,945; fail: 271; error: 24
Build 4: aarch64/2019/aug/04 pass: 5,949; fail: 270; error: 24
Build 5: aarch64/2019/aug/06 pass: 5,945; fail: 275; error: 23
Build 6: aarch64/2019/aug/08 pass: 5,953; fail: 267; error: 23
Build 7: aarch64/2019/aug/11 pass: 5,947; fail: 272; error: 25
Build 8: aarch64/2019/aug/13 pass: 5,962; fail: 258; error: 24
Build 9: aarch64/2019/aug/15 pass: 5,955; fail: 266; error: 23
Build 10: aarch64/2019/aug/17 pass: 5,951; fail: 269; error: 24
Build 11: aarch64/2019/aug/22 pass: 5,945; fail: 279; error: 20
Build 12: aarch64/2019/sep/10 pass: 5,951; fail: 273; error: 23
Build 13: aarch64/2019/sep/21 pass: 5,964; fail: 261; error: 22
Build 14: aarch64/2019/nov/02 pass: 5,956; fail: 278; error: 18

1 fatal error was detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/19 pass: 3,116; fail: 2
Build 1: aarch64/2019/jul/25 pass: 3,116; fail: 2
Build 2: aarch64/2019/jul/30 pass: 3,116; fail: 2
Build 3: aarch64/2019/aug/01 pass: 3,116; fail: 2
Build 4: aarch64/2019/aug/04 pass: 3,116; fail: 2
Build 5: aarch64/2019/aug/06 pass: 3,116; fail: 2
Build 6: aarch64/2019/aug/08 pass: 3,116; fail: 2
Build 7: aarch64/2019/aug/11 pass: 3,116; fail: 2
Build 8: aarch64/2019/aug/13 pass: 3,116; fail: 2
Build 9: aarch64/2019/aug/15 pass: 3,116; fail: 2
Build 10: aarch64/2019/aug/17 pass: 3,116; fail: 2
Build 11: aarch64/2019/aug/22 pass: 3,116; fail: 2
Build 12: aarch64/2019/sep/10 pass: 3,116; fail: 2
Build 13: aarch64/2019/sep/21 pass: 3,116; fail: 2
Build 14: aarch64/2019/nov/02 pass: 3,116; fail: 2

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk8u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 6.89x
Relative performance: Server critical-jOPS (nc): 8.59x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk8u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 178.67

Server 178.67 / Server 2014-04-01 (71.00): 2.52x
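
The ratio above is the current server score divided by the 2014-04-01
baseline score. A minimal sketch of that arithmetic follows; the function
names are hypothetical and are not part of the benchmark harness.

```cpp
#include <cmath>

// Hypothetical helper: ratio of the current score to the baseline score.
double relative_performance(double current, double baseline) {
  return current / baseline;
}

// Hypothetical helper: round to two decimal places, as the report prints it.
double round_to_hundredths(double value) {
  return std::round(value * 100.0) / 100.0;
}
```

With the figures reported above, relative_performance(178.67, 71.00) rounds
to 2.52.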

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk8u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-20 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/200/results/
2019-07-26 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/206/results/
2019-07-31 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/211/results/
2019-08-02 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/213/results/
2019-08-05 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/216/results/
2019-08-07 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/218/results/
2019-08-09 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/220/results/
2019-08-12 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/223/results/
2019-08-13 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/225/results/
2019-08-16 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/227/results/
2019-08-17 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/229/results/
2019-08-23 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/234/results/
2019-09-11 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/253/results/
2019-09-22 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/264/results/
2019-11-02 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/306/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/

From zgu at redhat.com  Sat Nov  2 15:07:31 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Sat, 2 Nov 2019 11:07:31 -0400
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
Message-ID: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>

Please review this refactor of the Shenandoah load barrier. The goal is to 
make the barrier structurally similar across the interpreter, C1 and C2, and to 
improve readability and maintainability.

Bug: https://bugs.openjdk.java.net/browse/JDK-8233401
Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.00/index.html

Test:
   hotspot_gc_shenandoah (fastdebug and release)
   x86_64 and x86_32 on Linux
   AArch64 on Linux

Thanks,

-Zhengyu


From shade at redhat.com  Mon Nov  4 09:10:29 2019
From: shade at redhat.com (Aleksey Shipilev)
Date: Mon, 4 Nov 2019 10:10:29 +0100
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
Message-ID: <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>

On 11/2/19 4:07 PM, Zhengyu Gu wrote:
> Please review this refactor of the Shenandoah load barrier. The goal is to make the barrier structurally
> similar across the interpreter, C1 and C2, and to improve readability and maintainability.
> 
> Bug: https://bugs.openjdk.java.net/browse/JDK-8233401
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.00/index.html

This is a cute patch.

*) Typo "non-reference load":

 207   // 1: none-reference load, no additional barrier is needed

*) The comment style is inconsistent with other places:

 537 Node* ShenandoahBarrierSetC2::load_at_resolved(C2Access& access, const Type* val_type) const {
 538   // 1: load reference
 539   Node* load = BarrierSetC2::load_at_resolved(access, val_type);
 540   // For none-reference load, no additional barrier is needed

*) In constructions like this, it seems more consistent to introduce the local variable for matching
the decorator?

 387     // Native barrier is for concurrent root processing
 388     if (((decorators & IN_NATIVE) != 0) &&
 389         ShenandoahConcurrentRoots::can_do_concurrent_roots()) {
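
A minimal sketch of the local-variable style suggested here. The bit value
of IN_NATIVE and the stub function are assumptions made for illustration
only; the real definitions live in HotSpot.

```cpp
#include <cstdint>

using DecoratorSet = std::uint64_t;

// Assumed bit value for illustration; HotSpot defines the real IN_NATIVE.
constexpr DecoratorSet IN_NATIVE = 1ULL << 3;

// Hypothetical stand-in for ShenandoahConcurrentRoots::can_do_concurrent_roots().
bool can_do_concurrent_roots_stub() {
  return true;
}

bool needs_native_barrier(DecoratorSet decorators) {
  // Name the decorator match once, then use the name in the condition.
  const bool in_native = (decorators & IN_NATIVE) != 0;
  return in_native && can_do_concurrent_roots_stub();
}
```

The named boolean reads the same way in every place that tests a decorator,
which is the consistency being asked for.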

Otherwise looks good. Roman needs to take a look as well.

-- 
Thanks,
-Aleksey


From aph at redhat.com  Mon Nov  4 09:44:34 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Nov 2019 09:44:34 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
Message-ID: <ae3917ce-39f6-d77b-37b6-081e272a7ef7@redhat.com>

On 11/2/19 3:07 PM, Zhengyu Gu wrote:
> Please review this refactor of the Shenandoah load barrier. The goal is to 
> make the barrier structurally similar across the interpreter, C1 and C2, and to 
> improve readability and maintainability.
> 
> Bug: https://bugs.openjdk.java.net/browse/JDK-8233401
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.00/index.html
> 
> Test:
>    hotspot_gc_shenandoah (fastdebug and release)
>    x86_64 and x86_32 on Linux
>    AArch64 on Linux

Thanks, this is an improvement. However, it's still weird.

//
// Arguments:
//
// Inputs:
//   src:        oop location to load from, might be clobbered
//   tmp1:       unused
//   tmp_thread: unused
//
// Output:
//   dst:        oop loaded from src location
//
// Kill:
//   rscratch1 (scratch reg)
//
// Alias:
//   dst: rscratch1 (might use rscratch1 as temporary output register to avoid clobbering src)
//
void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
                                            Register dst, Address src, Register tmp1, Register tmp_thread) {

tmp1 and tmp_thread are unused? It'd be a good idea, then, to say if they are
safe to use or not. Or maybe even better do this if you want to keep the same
arg list:

void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
                                            Register dst, Address src, Register, Register) {

I guess it really isn't safe to use "tmp1" as a tmp, regardless of its name.

If so, better pass it as noreg.
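
A minimal sketch of the unnamed-parameter idea, using hypothetical stand-in
types rather than the HotSpot classes. It only shows that an override can
keep the base class's argument list while leaving the unused temporaries
without names, so the body cannot refer to them.

```cpp
// Hypothetical stand-ins for illustration; these are not the HotSpot types.
struct Register {
  int encoding;
};
const Register noreg{-1};

struct BaseAssembler {
  virtual ~BaseAssembler() = default;
  // The base signature names both temporaries, so every override must
  // accept them whether or not it uses them.
  virtual Register load_at(Register dst, Register tmp1, Register tmp_thread) {
    (void)tmp1;
    (void)tmp_thread;
    return dst;
  }
};

struct DerivedAssembler : BaseAssembler {
  // Leaving the two parameters unnamed keeps the override's signature
  // identical while making it impossible for the body to touch them.
  Register load_at(Register dst, Register, Register) override {
    return dst;
  }
};
```

A caller that has no temporaries to offer can still pass noreg for both
arguments; the override ignores them either way.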

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Mon Nov  4 11:42:01 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Mon, 4 Nov 2019 11:42:01 +0000
Subject: [aarch64-port-dev ] RFR (trivial) : fix aarch64-8u type profile
 bug
In-Reply-To: <8ad0aa09-bfb1-c891-e17a-be7d14b3a2ae@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED5FE7C9A@dggeml527-mbx.china.huawei.com>
 <222f9c0b-7320-8d22-cd44-c4f3af7c1311@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FE7E9C@dggeml527-mbx.china.huawei.com>
 <880f5072-91ba-66bd-94be-429556e7c132@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FF3BEC@dggeml527-mbx.china.huawei.com>
 <8ad0aa09-bfb1-c891-e17a-be7d14b3a2ae@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED602769E@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Thursday, October 17, 2019 9:06 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-runtime-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR (trivial) : fix aarch64-8u type profile bug
> 
> On 9/26/19 2:59 AM, Yangfei (Felix) wrote:
> > CCing to hotspot-runtime-dev list.
> >
> > This has passed hotspot jtreg test on aarch64-linux.  Is it OK to go?
> 
> I'll have a look.

Hi,

    I opened a new bug for this: https://bugs.openjdk.java.net/browse/JDK-8233466
    Webrev: http://cr.openjdk.java.net/~fyang/8233466/webrev.00/
    Passed tier1-3 testing.  Is it OK to go?

Thanks,
Felix

From zgu at redhat.com  Mon Nov  4 14:08:42 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Mon, 4 Nov 2019 09:08:42 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <ae3917ce-39f6-d77b-37b6-081e272a7ef7@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ae3917ce-39f6-d77b-37b6-081e272a7ef7@redhat.com>
Message-ID: <fef3a5cf-5d57-9c0b-50a8-09fba7d4e25e@redhat.com>

Hi Andrew,

Thanks for the review.

> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>                                              Register dst, Address src, Register tmp1, Register tmp_thread) {
> 
> tmp1 and tmp_thread are unused? It'd be a good idea, then, to say if they are
> safe to use or not. Or maybe even better do this if you want to keep the same
> arg list:
> 
> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>                                              Register dst, Address src, Register, Register) {
> 

This is an overridden method. What you get for tmp1 and tmp_thread is 
really platform dependent.

On AArch64, you usually get noreg for tmp1 and tmp_thread. I cannot 
tell if you can safely use tmp1 if it is valid.

I don't use tmp1 here, since I don't think it is worth the trouble, as 
we have spare scratch registers. I do use tmp1 in x86 though.

What do you suggest the comment should be?

Thanks,

-Zhengyu

> I guess it really isn't safe to use "tmp1" as a tmp, regardless of its name.
> 
> If so, better pass it as noreg.
> 

From aph at redhat.com  Mon Nov  4 14:32:36 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Nov 2019 14:32:36 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <fef3a5cf-5d57-9c0b-50a8-09fba7d4e25e@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ae3917ce-39f6-d77b-37b6-081e272a7ef7@redhat.com>
 <fef3a5cf-5d57-9c0b-50a8-09fba7d4e25e@redhat.com>
Message-ID: <6ff66df6-cba8-e2a3-30ba-0ba5656e15fb@redhat.com>

On 11/4/19 2:08 PM, Zhengyu Gu wrote:
> 
>> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>                                              Register dst, Address src, Register tmp1, Register tmp_thread) {
>>
>> tmp1 and tmp_thread are unused? It'd be a good idea, then, to say if they are
>> safe to use or not. Or maybe even better do this if you want to keep the same
>> arg list:
>>
>> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>                                              Register dst, Address src, Register, Register) {
>>
> 
> This is an overridden method. What you get for tmp1 and tmp_thread is 
> really platform dependent.
> 
> On AArch64, you usually get noreg for tmp1 and tmp_thread. I cannot 
> tell if you can safely use tmp1 if it is valid.
> 
> I don't use tmp1 here, since I don't think it is worth the trouble, as 
> we have spare scratch registers. I do use tmp1 in x86 though.

OK, so please just do this for now:

>> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>                                              Register dst, Address src, Register, Register) {

I'm working on a redesign of the way that scratch registers are used in
AArch64, and this code is likely to have to be changed. Accurate information
about register usage is likely to be crucial for that.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From thomas.stuefe at gmail.com  Mon Nov  4 15:21:32 2019
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Mon, 4 Nov 2019 16:21:32 +0100
Subject: [aarch64-port-dev ] Fwd: RFR(xs): 8233019:
 java.lang.Class.isPrimitive() (C1) returns wrong result if Klass* is
 aligned to 32bit
In-Reply-To: <CAA-vtUzdW000k20ZUEH03PbCA5Zt2S6MUGupt_F842fL36Uwvg@mail.gmail.com>
References: <CAA-vtUzdW000k20ZUEH03PbCA5Zt2S6MUGupt_F842fL36Uwvg@mail.gmail.com>
Message-ID: <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>

Hi,

could some aarch64 people please take a quick look at this small patch?

The aarch64 part is really tiny, but I have no possibility to test this.

Last webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8233019--c1-intrinsic-for-java.lang.class.isprimitive()-does-32bit-compare/webrev.01/webrev/
Issue: https://bugs.openjdk.java.net/browse/JDK-8233019

Thank you,

Thomas

---------- Forwarded message ---------
From: Thomas Stüfe <thomas.stuefe at gmail.com>
Date: Wed, Oct 30, 2019 at 11:47 AM
Subject: RFR(xs): 8233019: java.lang.Class.isPrimitive() (C1) returns wrong
result if Klass* is aligned to 32bit
To: hotspot compiler <hotspot-compiler-dev at openjdk.java.net>
Cc: Doerr, Martin <martin.doerr at sap.com>, Schmidt, Lutz <
lutz.schmidt at sap.com>


Hi all,

second attempt at a fix (please find first review thread here:
https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-October/035608.html
)

Issue: https://bugs.openjdk.java.net/browse/JDK-8233019
webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8233019--c1-intrinsic-for-java.lang.class.isprimitive()-does-32bit-compare/webrev.00/webrev/

In short, the C1 intrinsic for jlC::isPrimitive compares the Klass*
pointer for the class to find out whether it is NULL and hence a primitive type.
That compare is done using a 32bit cmp, so it gives wrong results when the
Klass* pointer is aligned to 32bit (its low 32 bits are all zero).

In the generator I changed the comparison constant type from intConst(0) to
metadataConst(0) and implemented the missing code paths for all CPUs. Since
on most architectures we do not seem to have a comparison with a 64bit
immediate (at least I could not find one) I kept the change simple and only
implemented comparison with NULL for now.

I tested the fix in our nightlies (jtreg tier1, jck and others) as well as
manually testing it.

I did not test on aarch64 and arm though, and would be thankful if someone
knowledgeable about these platforms could take a look.

Thanks to Martin and Lutz for eyeballing the ppc and s390 parts.

Thanks, Thomas

From rkennke at redhat.com  Mon Nov  4 15:35:52 2019
From: rkennke at redhat.com (Roman Kennke)
Date: Mon, 4 Nov 2019 16:35:52 +0100
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
Message-ID: <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>

>> Please review this refactor of the Shenandoah load barrier. The goal is to make the barrier structurally
>> similar across the interpreter, C1 and C2, and to improve readability and maintainability.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8233401
>> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.00/index.html
> 
> This is a cute patch.
> 
> *) Typo "non-reference load":
> 
>  207   // 1: none-reference load, no additional barrier is needed
> 
> *) The comment style is inconsistent with other places:
> 
>  537 Node* ShenandoahBarrierSetC2::load_at_resolved(C2Access& access, const Type* val_type) const {
>  538   // 1: load reference
>  539   Node* load = BarrierSetC2::load_at_resolved(access, val_type);
>  540   // For none-reference load, no additional barrier is needed
> 
> *) In constructions like this, it seems more consistent to introduce the local variable for matching
> the decorator?
> 
>  387     // Native barrier is for concurrent root processing
>  388     if (((decorators & IN_NATIVE) != 0) &&
>  389         ShenandoahConcurrentRoots::can_do_concurrent_roots()) {
> 
> Otherwise looks good. Roman needs to take a look as well.

Yes, otherwise looks good.

Thanks,
Roman


From rkennke at redhat.com  Mon Nov  4 15:54:10 2019
From: rkennke at redhat.com (Roman Kennke)
Date: Mon, 4 Nov 2019 16:54:10 +0100
Subject: [aarch64-port-dev ] [8u] 8231366: Shenandoah: Shenandoah String
 Dedup thread is not properly initialized
In-Reply-To: <d45f273f-b9bf-f8dc-375b-f996807f69a6@redhat.com>
References: <d45f273f-b9bf-f8dc-375b-f996807f69a6@redhat.com>
Message-ID: <d429d7cc-86ab-0ab6-b158-22440f723a51@redhat.com>

Looks good to me.

Thanks,
Roman


> This bug seems to have existed since day one of the 8u backport. The
> ConcurrentGCThread API is different in 8u and we leave
> ShenandoahDedupThread not properly initialized before it enters the work loop.
> 
> In the Shenandoah String Deduplication tests, the bug results in an assertion
> failure that shows Thread::current() == NULL.
> 
> The bug only manifests on Windows, due to a discrepancy in the java_start()
> implementation on different OSes, e.g. it sets *thread* on Linux.
> 
> Bug: https://bugs.openjdk.java.net/browse/JDK-8231366
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8231366/webrev.00/
> 
> Test:
>   hotspot_gc_shenandoah on Windows and Linux.
> 
> Thanks,
> 
> -Zhengyu
> 


From aph at redhat.com  Mon Nov  4 17:04:22 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Nov 2019 17:04:22 +0000
Subject: [aarch64-port-dev ] Fwd: RFR(xs): 8233019:
 java.lang.Class.isPrimitive() (C1) returns wrong result if Klass* is
 aligned to 32bit
In-Reply-To: <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>
References: <CAA-vtUzdW000k20ZUEH03PbCA5Zt2S6MUGupt_F842fL36Uwvg@mail.gmail.com>
 <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>
Message-ID: <8850def2-0d1f-6aa1-17ac-1b9dd4d50a34@redhat.com>

On 11/4/19 3:21 PM, Thomas Stüfe wrote:
> Hi,
> 
> could some aarch64 people please take a quick look at this small patch?
> 
> The aarch64 part is really tiny, but I have no possibility to test this.
> 
> Last webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8233019--c1-intrinsic-for-java.lang.class.isprimitive()-does-32bit-compare/webrev.01/webrev/
> Issue: https://bugs.openjdk.java.net/browse/JDK-8233019

Looking.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Mon Nov  4 17:22:03 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Nov 2019 17:22:03 +0000
Subject: [aarch64-port-dev ] Fwd: RFR(xs): 8233019:
 java.lang.Class.isPrimitive() (C1) returns wrong result if Klass* is
 aligned to 32bit
In-Reply-To: <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>
References: <CAA-vtUzdW000k20ZUEH03PbCA5Zt2S6MUGupt_F842fL36Uwvg@mail.gmail.com>
 <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>
Message-ID: <bb704c28-528a-316a-f465-055114f2b97f@redhat.com>

On 11/4/19 3:21 PM, Thomas Stüfe wrote:
> could some aarch64 people please take a quick look at this small patch?
> 
> The aarch64 part is really tiny, but I have no possibility to test this.
> 
> Last webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8233019--c1-intrinsic-for-java.lang.class.isprimitive()-does-32bit-compare/webrev.01/webrev/
> Issue: https://bugs.openjdk.java.net/browse/JDK-8233019

Seems fine.

Before:

 ;;  block B0 [0, 4]
  0x0000ffffa1d93f54:   ldr	x0, [x1, #80]               ; implicit exception: dispatches to 0x0000ffffa1d93f78
  0x0000ffffa1d93f58:   cmp	w0, #0x0
  0x0000ffffa1d93f5c:   cset	x0, eq  // eq = none        ;*invokevirtual isPrimitive {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - IsPrimitiveTest::isPrimitive at 1 (line 4)

After:

 ;;  block B0 [0, 4]
  0x0000ffff71dc75d4:   ldr	x0, [x1, #80]               ; implicit exception: dispatches to 0x0000ffff71dc75f8
  0x0000ffff71dc75d8:   cmp	x0, #0x0
  0x0000ffff71dc75dc:   cset	x0, eq  // eq = none        ;*invokevirtual isPrimitive {reexecute=0 rethrow=0 return_oop=0}
                                                           ; - IsPrimitiveTest::isPrimitive at 1 (line 4)

i.e. the first test is "cmp w0, #0x0", the second is "cmp x0, #0x0".
The first is a 32-bit comparison, the second 64-bit.
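
A minimal sketch of why the width matters, written as plain C++ rather than
HotSpot code. The helper names are hypothetical; they only model what the
two cmp forms examine.

```cpp
#include <cstdint>

// Models "cmp w0, #0x0": only the low 32 bits of the value are examined.
bool is_null_32bit_compare(std::uint64_t klass_bits) {
  return static_cast<std::uint32_t>(klass_bits) == 0;
}

// Models "cmp x0, #0x0": all 64 bits of the value are examined.
bool is_null_64bit_compare(std::uint64_t klass_bits) {
  return klass_bits == 0;
}
```

For a non-NULL value such as 0x0000ffff00000000, whose low 32 bits are all
zero, the 32-bit form reports NULL while the 64-bit form does not.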

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From zgu at redhat.com  Mon Nov  4 17:32:11 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Mon, 4 Nov 2019 12:32:11 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <6ff66df6-cba8-e2a3-30ba-0ba5656e15fb@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ae3917ce-39f6-d77b-37b6-081e272a7ef7@redhat.com>
 <fef3a5cf-5d57-9c0b-50a8-09fba7d4e25e@redhat.com>
 <6ff66df6-cba8-e2a3-30ba-0ba5656e15fb@redhat.com>
Message-ID: <e226840c-001c-db44-f1e1-47438f583bda@redhat.com>


>> On AArch64, you usually get noreg for tmp1 and tmp_thread. I cannot
>> tell if you can safely use tmp1 if it is valid.
>>
>> I don't use tmp1 here, since I don't think it is worth the trouble, as
>> we have spare scratch registers. I do use tmp1 in x86 though.
> 
> OK, so please just do this for now:

Thanks!

-Zhengyu

> 
>>> void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>>                                               Register dst, Address src, Register, Register) {
> 
> I'm working on a redesign of the way that scratch registers are used in
> AArch64, and this code is likely to have to be changed. Accurate information
> about register usage is likely to be crucial for that.
> 

From zgu at redhat.com  Mon Nov  4 17:33:20 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Mon, 4 Nov 2019 12:33:20 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
Message-ID: <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>

Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.01/index.html

Okay now?

Thanks,

-Zhengyu

On 11/4/19 10:35 AM, Roman Kennke wrote:
>>> Please review this refactor of the Shenandoah load barrier. The goal is to make the barrier structurally
>>> similar across the interpreter, C1 and C2, and to improve readability and maintainability.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8233401
>>> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.00/index.html
>>
>> This is a cute patch.
>>
>> *) Typo "non-reference load":
>>
>>   207   // 1: none-reference load, no additional barrier is needed
>>
>> *) The comment style is inconsistent with other places:
>>
>>   537 Node* ShenandoahBarrierSetC2::load_at_resolved(C2Access& access, const Type* val_type) const {
>>   538   // 1: load reference
>>   539   Node* load = BarrierSetC2::load_at_resolved(access, val_type);
>>   540   // For none-reference load, no additional barrier is needed
>>
>> *) In constructions like this, it seems more consistent to introduce the local variable for matching
>> the decorator?
>>
>>   387     // Native barrier is for concurrent root processing
>>   388     if (((decorators & IN_NATIVE) != 0) &&
>>   389         ShenandoahConcurrentRoots::can_do_concurrent_roots()) {
>>
>> Otherwise looks good. Roman needs to take a look as well.
> 
> Yes, otherwise looks good.
> 
> Thanks,
> Roman
> 
> 

From aph at redhat.com  Mon Nov  4 17:38:14 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 4 Nov 2019 17:38:14 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
Message-ID: <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>

On 11/4/19 5:33 PM, Zhengyu Gu wrote:
> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.01/index.html
> 
> Okay now?
AArch64 still says

 void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
                                             Register dst, Address src, Register tmp1, Register tmp_thread) {

instead of

 void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
                                             Register dst, Address src, Register, Register) {

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From zgu at redhat.com  Mon Nov  4 18:18:38 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Mon, 4 Nov 2019 13:18:38 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
Message-ID: <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>


On 11/4/19 12:38 PM, Andrew Haley wrote:
> On 11/4/19 5:33 PM, Zhengyu Gu wrote:
>> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.01/index.html
>>
>> Okay now?
> AArch64 still says
> 
>   void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>                                               Register dst, Address src, Register tmp1, Register tmp_thread) {
> 
> instead of
> 
>   void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>                                               Register dst, Address src, Register, Register) {

They are still needed for calling the superclass's load_at(), even though 
they are not used there either.

   // 1: non-reference load, no additional barrier is needed
   if (!is_reference_type(type)) {
     BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
     return;
   }


-Zhengyu

> 

From zgu at redhat.com  Mon Nov  4 18:23:12 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Mon, 4 Nov 2019 13:23:12 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
Message-ID: <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>


On 11/4/19 1:18 PM, Zhengyu Gu wrote:
> 
> 
> On 11/4/19 12:38 PM, Andrew Haley wrote:
>> On 11/4/19 5:33 PM, Zhengyu Gu wrote:
>>> Updated: 
>>> http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.01/index.html
>>>
>>> Okay now?
>> AArch64 still says
>>
>>   void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>                                               Register dst, Address src, Register tmp1, Register tmp_thread) {
>>
>> instead of
>>
>>   void ShenandoahBarrierSetAssembler::load_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
>>                                               Register dst, Address src, Register, Register) {
> 
> They are still needed for calling the superclass's load_at(), even though 
> they are not used there either.
Or I should say, they are not used there right now, but may be used in 
the future ...

-Zhengyu

> 
>    // 1: non-reference load, no additional barrier is needed
>    if (!is_reference_type(type) ) {
>      BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
>      return;
>    }
> 
> 
> -Zhengyu
> 
>>

From thomas.stuefe at gmail.com  Mon Nov  4 18:24:22 2019
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Mon, 4 Nov 2019 19:24:22 +0100
Subject: [aarch64-port-dev ] Fwd: RFR(xs): 8233019:
 java.lang.Class.isPrimitive() (C1) returns wrong result if Klass* is
 aligned to 32bit
In-Reply-To: <bb704c28-528a-316a-f465-055114f2b97f@redhat.com>
References: <CAA-vtUzdW000k20ZUEH03PbCA5Zt2S6MUGupt_F842fL36Uwvg@mail.gmail.com>
 <CAA-vtUw43yQMgyTudNKyMx9YGwtcEt_JqqBOkJZR4yT+qBJjSw@mail.gmail.com>
 <bb704c28-528a-316a-f465-055114f2b97f@redhat.com>
Message-ID: <CAA-vtUztFqq3HgSrczeSp_1OxqZE1bBHZTcG_z+za4eBRPqS_Q@mail.gmail.com>

Thanks Andrew!

On Mon, Nov 4, 2019, 18:22 Andrew Haley <aph at redhat.com> wrote:

> On 11/4/19 3:21 PM, Thomas Stüfe wrote:
> > could some aarch64 people please take a quick look at this small patch?
> >
> > The aarch64 part is really tiny, but I have no possibility to test this.
> >
> > Last webrev:
> >
> http://cr.openjdk.java.net/~stuefe/webrevs/8233019--c1-intrinsic-for-java.lang.class.isprimitive()-does-32bit-compare/webrev.01/webrev/
> > Issue: https://bugs.openjdk.java.net/browse/JDK-8233019
>
> Seems fine.
>
> Before:
>
>  ;;  block B0 [0, 4]
>   0x0000ffffa1d93f54:   ldr     x0, [x1, #80]               ; implicit exception: dispatches to 0x0000ffffa1d93f78
>   0x0000ffffa1d93f58:   cmp     w0, #0x0
>   0x0000ffffa1d93f5c:   cset    x0, eq  // eq = none        ;*invokevirtual isPrimitive {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - IsPrimitiveTest::isPrimitive at 1 (line 4)
>
> After:
>
>  ;;  block B0 [0, 4]
>   0x0000ffff71dc75d4:   ldr     x0, [x1, #80]               ; implicit
> exception: dispatches to 0x0000ffff71dc75f8
>   0x0000ffff71dc75d8:   cmp     x0, #0x0
>   0x0000ffff71dc75dc:   cset    x0, eq  // eq = none
> ;*invokevirtual isPrimitive {reexecute=0 rethrow=0 return_oop=0}
>                                                            ; -
> IsPrimitiveTest::isPrimitive at 1 (line 4)
>
> i.e. the first test is "cmp w0, #0x0", the second is "cmp x0, #0x0".
> The first is a 32-bit comparison, the second 64-bit.
>
> --
> Andrew Haley  (he/him)
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> https://keybase.io/andrewhaley
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>
>
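
The difference between the two listings can be sketched in plain C++. This is only an illustration, not HotSpot code: the helper names and the address constant are made up, and the value is simply one whose low 32 bits are zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers modelling the two comparisons; this is not HotSpot code.

// Models "cmp w0, #0x0": only the low 32 bits of the loaded value are tested.
bool is_null_cmp32(uint64_t klass_bits) {
  return static_cast<uint32_t>(klass_bits) == 0;
}

// Models "cmp x0, #0x0": all 64 bits of the loaded value are tested.
bool is_null_cmp64(uint64_t klass_bits) {
  return klass_bits == 0;
}

// Made-up non-null address whose low 32 bits are all zero.
const uint64_t kAlignedKlass = 0x0000000800000000ULL;

// Returns true when the two comparisons disagree for the given value.
bool comparisons_disagree(uint64_t klass_bits) {
  return is_null_cmp32(klass_bits) != is_null_cmp64(klass_bits);
}
```

With the 32-bit form, a non-null Klass* such as kAlignedKlass compares equal to zero, which is the wrong isPrimitive() answer described in the subject line; the 64-bit form does not have that problem.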

From patrick at os.amperecomputing.com  Tue Nov  5 01:39:29 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Tue, 5 Nov 2019 01:39:29 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <MN2PR01MB60933365EFE204BFD94261938F7E0@MN2PR01MB6093.prod.exchangelabs.com>

Reformat the description below. Please help review, thanks.

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Tuesday, October 29, 2019 5:59 PM
To: aarch64-port-dev at openjdk.java.net
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Hi,

Could you please review this patch, thanks.

JBS: https://bugs.openjdk.java.net/browse/JDK-8229351 
Webrev: https://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
(this starts from .02 since there had been some internal review and updates)

Changes:
1. Split the hard-coded STUB_THRESHOLD of 72 into CompareLongStringLimitLatin and CompareLongStringLimitUTF, giving more flexible control over the stub thresholds of the string_compare intrinsics, especially across different uArchs.
2. In MacroAssembler::string_compare, LL and UU shared the same threshold. UU may actually need only half of LL's threshold (counted in chars), because a UU character takes two bytes while a compacted LL character takes one byte. In addition, LU/UL may need a separate threshold, as its stub function differs from the same-encoding one and the performance may vary as well.
3. In generate_compare_long_string_same_encoding, the hard-coded 72 originally ensured that there were always at least 64 bytes for the prefetch code path. Once a smaller stub threshold is set, a new condition is needed to tell whether this still holds or whether the code has to take the NO_PREFETCH branch. This change ensures correctness.
4. In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are no longer valid; cleaned up strU and strL, and the related pointer initialization to the next U (cnt1) and L (tmp2).
5. In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
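
The arithmetic behind item 2 can be sketched as follows. This is a hypothetical model, not code from the webrev: the function names are made up, and whether the real threshold test uses ">=" or ">" is an assumption.

```cpp
#include <cassert>

// Hypothetical model: bytes occupied by a string of `chars` characters.
// Latin1 (the "L" encoding) uses one byte per char, UTF-16 ("U") uses two.
int byte_length(int chars, bool is_utf16) {
  return is_utf16 ? chars * 2 : chars;
}

// Hypothetical check: does a string of `chars` characters reach a threshold
// expressed in chars? The ">=" here is an assumption, not taken from the patch.
bool reaches_threshold(int chars, int threshold_chars) {
  return chars >= threshold_chars;
}
```

Under this model 72 LL chars and 36 UU chars both cover 72 bytes, which is why a single char-based threshold treats the two encodings differently.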

Tests:
1. For function check, I have run
jdk jtreg tier1 tests, with default vm flags
hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
some specific test cases were also run to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToSameLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
2. For performance check, I have run
string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
FYI, on an Ampere eMAG system, microbenchmarks [5][6] show a consistent 1.5x perf gain for LU/UL comparison of shorter strings (<72 chars, smaller stub thresholds), and a slight improvement (5~10%) for LL/UU cases.

Refs:
[1] https://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
[2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string 
[3] https://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java 
[4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
[5] https://cr.openjdk.java.net/~shade/density/string-density-bench.jar
[6] https://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java

Regards
Patrick


From ci_notify at linaro.org  Tue Nov  5 03:04:02 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Tue, 5 Nov 2019 03:04:02 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <368223805.11982.1572923043042.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/308/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/18 pass: 5,727
Build 1: aarch64/2019/sep/20 pass: 5,728
Build 2: aarch64/2019/sep/23 pass: 5,727
Build 3: aarch64/2019/oct/07 pass: 5,750
Build 4: aarch64/2019/oct/09 pass: 5,747; fail: 1
Build 5: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 6: aarch64/2019/oct/14 pass: 5,753
Build 7: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 8: aarch64/2019/oct/18 pass: 5,760
Build 9: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 10: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 11: aarch64/2019/oct/28 pass: 5,766
Build 12: aarch64/2019/oct/30 pass: 5,768
Build 13: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 14: aarch64/2019/nov/04 pass: 5,769

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/18 pass: 8,675; fail: 517; error: 18
Build 1: aarch64/2019/sep/20 pass: 8,685; fail: 503; error: 22
Build 2: aarch64/2019/sep/23 pass: 8,696; fail: 497; error: 19
Build 3: aarch64/2019/oct/07 pass: 8,683; fail: 517; error: 18
Build 4: aarch64/2019/oct/09 pass: 8,692; fail: 507; error: 21
Build 5: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 6: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 7: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 8: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 9: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 10: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 11: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 12: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 13: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 14: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/18 pass: 3,978
Build 1: aarch64/2019/sep/20 pass: 3,979
Build 2: aarch64/2019/sep/23 pass: 3,979
Build 3: aarch64/2019/oct/07 pass: 3,979
Build 4: aarch64/2019/oct/09 pass: 3,979
Build 5: aarch64/2019/oct/11 pass: 3,979
Build 6: aarch64/2019/oct/14 pass: 3,979
Build 7: aarch64/2019/oct/16 pass: 3,979
Build 8: aarch64/2019/oct/18 pass: 3,979
Build 9: aarch64/2019/oct/21 pass: 3,979
Build 10: aarch64/2019/oct/23 pass: 3,980
Build 11: aarch64/2019/oct/28 pass: 3,980
Build 12: aarch64/2019/oct/30 pass: 3,980
Build 13: aarch64/2019/nov/01 pass: 3,980
Build 14: aarch64/2019/nov/04 pass: 3,980

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 8.34x
Relative performance: Server critical-jOPS (nc): 10.57x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 201.64

Server 201.64 / Server 2014-04-01 (71.00): 2.84x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-09-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/261/results/
2019-09-21 pass rate: 10487/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/263/results/
2019-09-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/266/results/
2019-10-08 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/280/results/
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From felix.yang at huawei.com  Tue Nov  5 06:20:40 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 5 Nov 2019 06:20:40 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
	operations
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>

Hi,

Please review these small improvements to the aarch64 atomic operations.
This eliminates the use of full memory barriers.
Passed tier1-3 testing.


Patch:

diff -r 2700c409ff10 src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
--- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Sun Nov 03 18:02:29 2019 -0500
+++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 06 14:13:00 2019 +0800
@@ -40,8 +40,7 @@
{
   template<typename I, typename D>
   D add_and_fetch(I add_value, D volatile* dest, atomic_memory_order order) const {
-    D res = __atomic_add_fetch(dest, add_value, __ATOMIC_RELEASE);
-    FULL_MEM_BARRIER;
+    D res = __atomic_add_fetch(dest, add_value, __ATOMIC_ACQ_REL);
     return res;
   }
};
@@ -52,8 +51,7 @@
                                                      T volatile* dest,
                                                      atomic_memory_order order) const {
   STATIC_ASSERT(byte_size == sizeof(T));
-  T res = __sync_lock_test_and_set(dest, exchange_value);
-  FULL_MEM_BARRIER;
+  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_ACQ_REL);
   return res;
}
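
For readers who want to try the two forms outside HotSpot, here is a minimal standalone sketch using the GCC/Clang builtins from the diff. The function names are made up, and __sync_synchronize() is used as a stand-in for FULL_MEM_BARRIER, which is an assumption about how that macro is defined. The sketch only shows that both forms return the same values in a single thread; it says nothing about whether __ATOMIC_ACQ_REL gives callers the ordering they relied on from the full barrier, which is the question for review.

```cpp
#include <cassert>

// Old form: release-ordered add followed by an explicit full barrier.
// __sync_synchronize() stands in for FULL_MEM_BARRIER (assumption).
template <typename T>
T add_and_fetch_release_plus_barrier(T add_value, T volatile* dest) {
  T res = __atomic_add_fetch(dest, add_value, __ATOMIC_RELEASE);
  __sync_synchronize();
  return res;
}

// Proposed form: a single acquire-release add, no separate barrier.
template <typename T>
T add_and_fetch_acq_rel(T add_value, T volatile* dest) {
  return __atomic_add_fetch(dest, add_value, __ATOMIC_ACQ_REL);
}

// Proposed form of the exchange: returns the previous value of *dest.
template <typename T>
T xchg_acq_rel(T exchange_value, T volatile* dest) {
  return __atomic_exchange_n(dest, exchange_value, __ATOMIC_ACQ_REL);
}
```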

From aph at redhat.com  Tue Nov  5 08:52:20 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 5 Nov 2019 08:52:20 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
Message-ID: <6c110878-a477-df8a-e566-84b113806044@redhat.com>

On 11/4/19 6:23 PM, Zhengyu Gu wrote:
>> They are still needed for calling the super class's load_at(), even though
>> they are not used there either.

Aha! Sorry, I missed that.

> Or I should say, they are not used there right now, but may be used in the
> future ...

So add them in the future, surely. All you're doing by passing unused
args is confusing the reader. It definitely succeeded with me...

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From adinn at redhat.com  Tue Nov  5 12:24:02 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 5 Nov 2019 12:24:02 +0000
Subject: [aarch64-port-dev ] RFD: scratch registers cleanup [Long post]
In-Reply-To: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
References: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
Message-ID: <47b66917-cd4c-d3bf-5e48-f7054fe45eec@redhat.com>

Hi Andrew,

This is a good start and it could be used as is. However, I think it
needs some polishing to improve the way it works. That could involve
upgrading the model for how scratch registers are declared, allocated,
released and re-allocated. Alternatively, it could just involve
improving the way the current definitions are employed.

I'll start with general comments which address that broader point before
going on to mention specific issues to do with the current patch. These
latter comments will also help to motivate the general point but I think
it is better to make them in context. See after the sig for these comments.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

----- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---

I think this is a very good start to remedying a nasty problem. However,
I'm not yet convinced the model for how you manage and use scratch
registers is the best way to do this. I'll explain why below and suggest
a variation that might or might not work. If it doesn't then correct me
and that will at least help understand why you ended up with the current
design. If it does work or, at least, suggests how to move towards
something better then let's try another round.

What troubles me most of all is that the mechanism you have provided can
be quite opaque about ownership/liveness of registers. I think that
happens because at root it does not require that declaration +
allocation, release and subsequent re-allocation are co-located in the
same program scope (or, at least, in the same method).

Let's work through some examples to clarify that view. As it stands it
is possible to allocate some scratch registers in a caller method and
then release those registers in a called method (possibly more than one
call down the stack) -- indeed, the latter operation occurs in this code
from the patch:

c1_LIRAssembler_aarch64.cpp:

 960 void LIR_Assembler::mem2reg(LIR_Opr src, LIR_Opr dest, BasicType
type, LIR_PatchCode patch_code, CodeEmitInfo* info, bool wide, bool /*
unaligned */) {
 961   LIR_Address* addr = src->as_address_ptr();
       . . .
 976   int null_check_here = code_offset();
 977   FreeScratchRegs dummy(__ as());
       . . .

Note that we are not in the scope of a visible ScratchRegister
declaration here. So, the use of a FreeScratchRegs declaration implies
that we may be in the scope of a declaration up the call chain. If that
may be the case then how can the callee safely cancel that allocation?
How can it be sure that the caller is not relying on a scratch register
retaining its value across the call?

Likewise from the POV of the caller which might have declared a
ScratchRegister there is nothing in the code (except perhaps for
comments) to indicate that said scratch register might be overwritten
under the call to this method (or perhaps even a call to an indirect
callee). To those who do not know the details of the callee a
ScratchRegister declared in the caller will appear to remain valid
across the call even though it may be overwritten.

Another example occurs later in the same file

1464 void LIR_Assembler::emit_opTypeCheck(LIR_OpTypeCheck* op) {
1465   const bool should_profile = op->should_profile();
1466
1467   LIR_Code code = op->code();
1468   if (code == lir_store_check) {
       . . .
1495     if (should_profile) {
1496       Label not_null;
1497       __ cbnz(value, not_null);
1498       // Object is null; update MDO and exit
1499       ScratchRegister rscratch1(__ as(), r8);
1500       ScratchRegister rscratch2(__ as(), r9);
       . . .
1554   } else if (code == lir_checkcast) {
1555     FreeScratchRegs dummy(__ as());
       . . .
1564   } else if (code == lir_instanceof) {
1565     FreeScratchRegs dummy(__ as());
       . . .

Once again the release operations occur in a scope where there is no
prior declaration in the same scope. So, this use of release
declarations in the else-if clauses implies that a caller may have
allocated a scratch register yet they presume it is safe to release it.

Meanwhile, to make matters worse, the ScratchRegister declarations in
the initial if clause imply that a caller which exercises this path will
/not/ have declared a scratch register. That further implies that the
validity of these combined assumptions about what declarations might be
in scope depends on the callee's conditional logic. Now that's starting
to get a tad too opaque and I fear we are heading down a garden path.

Of course, this may just be down to the fact that the inclusion of those
release operations was superfluous; that they could be removed from
these examples. However, I don't like the fact that one can construct
such examples. I would much prefer it if this usage was avoided where
possible by requiring the release to occur in the scope where the
scratch register was declared. i.e. the caller must arrange to free
scratch registers if a caller might need them. That would make it much
easier to detect circumstances where the caller was trying to have its
cake and eat it.

This model of usage is indeed what happens much of the time e.g. also
from the same file:

1280 void LIR_Assembler::type_profile_helper(Register mdo,
1281                                         ciMethodData *md,
ciProfileData *data,
1282                                         Register recv, Label*
update_done) {
1283   ScratchRegister rscratch1(__ as(), r8);
1284   ScratchRegister rscratch2(__ as(), r9);
       . . .
1293     {
1294       FreeScratchRegs dummy(__ as());
       . . .
1298     }
       . . .

With this usage it is very clear that any scratch value established
before line 1294 cannot be relied on in code after line 1298.
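
The scope-local pattern in that excerpt can be modelled with a small RAII sketch. Everything here is an assumption made for illustration: ToyAssembler is a made-up stand-in for the assembler, registers are plain integers, and the two classes only imitate what ScratchRegister and FreeScratchRegs appear to do in the patch; the patch's real implementation is not reproduced in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the assembler: it only records which registers are claimed.
class ToyAssembler {
 public:
  bool is_claimed(int reg) const { return ((_claimed >> reg) & 1u) != 0; }
  void claim(int reg)   { assert(!is_claimed(reg)); _claimed |= (1u << reg); }
  void release(int reg) { assert(is_claimed(reg));  _claimed &= ~(1u << reg); }
  uint32_t claimed_mask() const { return _claimed; }
  void set_claimed_mask(uint32_t mask) { _claimed = mask; }
 private:
  uint32_t _claimed = 0;
};

// Claims one register for the lifetime of the enclosing scope.
class ScratchRegister {
 public:
  ScratchRegister(ToyAssembler& as, int reg) : _as(as), _reg(reg) { _as.claim(_reg); }
  ~ScratchRegister() { _as.release(_reg); }
  ScratchRegister(const ScratchRegister&) = delete;
  ScratchRegister& operator=(const ScratchRegister&) = delete;
 private:
  ToyAssembler& _as;
  int _reg;
};

// Drops every claim for the lifetime of the enclosing scope, then restores them.
class FreeScratchRegs {
 public:
  explicit FreeScratchRegs(ToyAssembler& as) : _as(as), _saved(as.claimed_mask()) {
    _as.set_claimed_mask(0);
  }
  ~FreeScratchRegs() { _as.set_claimed_mask(_saved); }
  FreeScratchRegs(const FreeScratchRegs&) = delete;
  FreeScratchRegs& operator=(const FreeScratchRegs&) = delete;
 private:
  ToyAssembler& _as;
  uint32_t _saved;
};

// Same shape as the type_profile_helper excerpt: declare, free in a nested
// block, then carry on with the registers claimed again.
bool scoped_release_demo(ToyAssembler& as) {
  ScratchRegister rscratch1(as, 8);
  ScratchRegister rscratch2(as, 9);
  bool free_inside = false;
  {
    FreeScratchRegs dummy(as);
    free_inside = !as.is_claimed(8) && !as.is_claimed(9);
  }
  return free_inside && as.is_claimed(8) && as.is_claimed(9);
}
```

Because declaration and release sit in the same function, a reader can see from the braces alone where the scratch values stop being reliable.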

I suspect that there are circumstances where a caller can or even must
free scratch registers down the call tree from an allocation so perhaps
we cannot afford always to dispense with this current usage for
FreeScratchRegs altogether. However, I'd prefer it if an explicit
scope-local release was used wherever possible to avoid this
2nd-guessing of callers' intentions. In which case, it would be much
better if we could use the grammar of language definitions to support
that . . .

. . . which also provides the opportunity to remedy another shortcoming.
The current release operation does not specify which scratch register(s)
is (are) to be released. The only option is to free all registers. However,
if a release operation must always be located in the scope of a
ScratchRegister then it is perfectly possible to do so by passing the
relevant ScratchRegister as a parameter. So, perhaps we could use macros
like:

  FreeScratchRegister dummy(rscratch1);

  FreeScratchRegisters dummy(rscratch1, rscratch2);

That makes it harder for scratch registers to be freed down the stack in
a callee because the callee will not have access to rscratch1 and
rscratch2 unless they are passed in by reference or pointer. I'm assuming
code won't pass a ScratchRegister as a call argument, merely the
encapsulated Register (we might perhaps be able to knobble that by
ensuring that copying/dereferencing reset the register to noreg).

If this revised model works then the current FreeScratchRegs really
needs renaming to something less appealing like
DeleteCallerScratchBindings that won't tempt anyone to use it.

Also, it would perhaps be useful to provide a reverse operation for
re-allocating a previously declared and freed scratch register e.g.

public void Foo::bar(...)
  ScratchRegister rscratch1(...);
  ScratchRegister rscratch2(...);
  . . .
  <use rscratch1 and rscratch2>
  . . .
  FreeScratchRegister dummy(rscratch1);
  call_user_of_rscratch1(...);
  UseScratchRegister dummy2(rscratch1);
  ...

This explicitly acknowledges that the call may want to use rscratch1
and explicitly signals in the caller code that rscratch1 cannot be
expected to remain constant across the call. Of course, one can always
achieve the same effect with block nesting but I think this is neater.
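
A rough sketch of how such a targeted free/re-use pair might behave, again on a toy model. The names FreeScratchRegister and UseScratchRegister come from the suggestion above; ToyAssembler, the integer register numbers, the free()/reuse() members and callee_can_claim are all hypothetical and exist only to make the example self-contained.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the assembler: it only records which registers are claimed.
class ToyAssembler {
 public:
  bool is_claimed(int reg) const { return ((_claimed >> reg) & 1u) != 0; }
  void claim(int reg)   { assert(!is_claimed(reg)); _claimed |= (1u << reg); }
  void release(int reg) { assert(is_claimed(reg));  _claimed &= ~(1u << reg); }
 private:
  uint32_t _claimed = 0;
};

// Claims a register on construction; remembers whether it currently holds it.
class ScratchRegister {
 public:
  ScratchRegister(ToyAssembler& as, int reg) : _as(as), _reg(reg), _held(true) {
    _as.claim(_reg);
  }
  ~ScratchRegister() { if (_held) _as.release(_reg); }
  ScratchRegister(const ScratchRegister&) = delete;
  ScratchRegister& operator=(const ScratchRegister&) = delete;
  void free()  { assert(_held);  _as.release(_reg); _held = false; }
  void reuse() { assert(!_held); _as.claim(_reg);   _held = true; }
 private:
  ToyAssembler& _as;
  int _reg;
  bool _held;
};

// Statement-like markers: each names the register it acts on.
struct FreeScratchRegister {
  explicit FreeScratchRegister(ScratchRegister& r) { r.free(); }
};
struct UseScratchRegister {
  explicit UseScratchRegister(ScratchRegister& r) { r.reuse(); }
};

// Hypothetical callee: reports whether it was able to claim `reg` for itself.
bool callee_can_claim(ToyAssembler& as, int reg) {
  if (as.is_claimed(reg)) return false;
  ScratchRegister tmp(as, reg);
  return true;
}

// Caller frees r8 around the call and takes it back afterwards.
bool free_then_reuse_demo(ToyAssembler& as) {
  ScratchRegister rscratch1(as, 8);
  bool blocked_before = !callee_can_claim(as, 8);
  FreeScratchRegister dummy(rscratch1);
  bool ok_during = callee_can_claim(as, 8);
  UseScratchRegister dummy2(rscratch1);
  bool blocked_after = !callee_can_claim(as, 8);
  return blocked_before && ok_during && blocked_after;
}
```

Since the markers take the ScratchRegister object itself, code that never had the declaration in scope has nothing to pass, which is the property the proposal is after.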

There is a bit more to this to do with conditional allocation of a
scratch register that I will discuss in the detailed comments (for
c1_LIRAssembler) but that needs to be done in context so ...

... on to the specific comments:

aarch64.ad:

Firstly, just a humdrum error. You missed a conversion for a specific
use of rscratch1

2851   enc_class aarch64_enc_stlrb(iRegI src, memory mem) %{
2852     MOV_VOLATILE(as_Register($src$$reg), $mem$$base, $mem$$index,
$mem$$scale, $mem$$disp,
2853                  rscratch1, stlrb);
2854   %}

This is still passing rscratch1 in the MOV_VOLATILE macro call where
subsequent encoding classes use r8_scratch. It doesn't actually matter
(i.e. it doesn't fail to compile) because MOV_VOLATILE ignores the
SCRATCH argument. I know you are trying to minimize changes but dropping
that argument in all uses would be a whole lot better . . . yes, maybe
in the next patch.

This story continues . . .

2930   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
2931     MOV_VOLATILE(r8_scratch, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
2932              r8_scratch, ldarw);
2933     __ fmovs(as_FloatRegister($dst$$reg), r8_scratch);
2934   %}

I'm not sure why you invented the name r8_scratch as an alias for r8.
Does that really help? I couldn't discern why you sometimes used one and
why sometimes the other. If there is a rationale it probably ought to be
documented somewhere (at least at the point where r8_scratch and
r9_scratch are declared).

Anyway, what I don't follow is why these uses just employ a bare
register name. Why are they exempt from the need to declare a scratch
register for r8 before using it?
Why not:

   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
     ScratchRegister rscratch1(__ as(), r8);
     MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
              rscratch1, ldarw);
     __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
     . . .

or, if need be

   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
     {
       ScratchRegister rscratch1(__ as(), r8);
       MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
$mem$$disp,
                rscratch1, ldarw);
       __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
     }
     . . .

I realize this is extra 'protocol' for cases where register use is
managed by other means and where the invoked assembler methods are not
going to risk reuse of the register. However, the bare use of r8 (or
even under the name r8_scratch) is actually quite confusing. Using the
same protocol for acquiring a scratch register has the important virtue
of consistency. It also allows us to find a way later to avoid having to
explicitly name r8 and r9 in the declaration and, equally, having to
explicitly name them in these encodings e.g. we might just pass an index
in the constructor:

   ScratchRegister rscratch1(__ as(), 1);
   ScratchRegister rscratch2(__ as(), 2);

This is not the only place where your usage is confusing. You do
(although not consistently) use declarations in some of the later
encodings. For example, further down the file:

3526   enc_class aarch64_enc_fast_lock(iRegP object, iRegP box, iRegP
tmp, iRegP tmp2) %{
       . . .
3570     // markWord of object (disp_hdr) with the stack pointer.
3571     __ mov(r8_scratch, sp);
3572     __ sub(disp_hdr, disp_hdr, r8_scratch);
       . . .

has no declaration ... but is immediately followed by

3604   enc_class aarch64_enc_fast_unlock(iRegP object, iRegP box, iRegP
tmp, iRegP tmp2) %{
       . . .
3644     ScratchRegister rscratch1(__ as(), r8);
       . . .

Declarations are also used in some of the instruction definitions:

12103 instruct rolI_rReg(iRegINoSp dst, iRegI src, iRegI shift,
rFlagsReg cr)
       . . .
12109   ins_encode %{
12110     ScratchRegister rscratch1(__ as(), r8);
12111     __ subw(rscratch1, zr, as_Register($shift$$reg));
       . . .

Is there a clear rationale for whether or not to declare a
ScratchRegister that I am missing? I'd be happier with them being used
everywhere, avoiding explicit mention of Register names at points of use.

Also, one other thing I don't understand but I think is just an error:

3008   enc_class aarch64_enc_stlxr(iRegLNoSp src, memory mem) %{
3009     MacroAssembler _masm(&cbuf);
3010     Register src_reg = as_Register($src$$reg);
3011     Register base = as_Register($mem$$base);
3012     ScratchRegister rscratch2(__ as(), r9);
         . . .
3017        if (disp != 0) {
3018         __ lea(r9_scratch, Address(base, disp));
3019         __ stlxr(r8_scratch, src_reg, r9_scratch);
         . . .

Why are you declaring r9 as temp rscratch2, not declaring r8 as
rscratch1 and then using both as r9_scratch and r8_scratch? Note also
that in the previous encoding (aarch64_enc_ldaxr) you use r9_scratch and
r8_scratch and don't declare any ScratchRegister.

Oh and

3117   enc_class aarch64_enc_prefetchw(memory mem) %{
       . . .
3130         ScratchRegister r8_scratch(__ as(), r8);
3131         __ lea(r8_scratch, Address(base, disp));
3132         __ prfm(Address(r8_scratch, index_reg,
Address::lsl(scale)), PSTL1KEEP);

Why name the scratch register r8_scratch in this decl (shadowing an
existing name) rather than calling it rscratch1?


c1_LIRAssembler_aarch64.cpp:

I found the comment here to be misleading

1581 void LIR_Assembler::casw(Register addr, Register newval, Register
cmpval) {
1582   // r8 is used to pass an argument here, not as scratch. See
1583   // LIRGenerator::atomic_cmpxchg.
1584   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/
true, /* release*/ true, /* weak*/ false, r8);
1585   __ cset(r8, Assembler::NE);
1586   __ membar(__ AnyAny);
1587 }

Firstly, to be more strict/precise, r8 is used to return a result from
cmpxchg in order to then pass it on to cset (it read to me like you were
saying that it passes a value into cmpxchg).

More importantly, what the comment does not clarify is that the reason
you cannot allocate r8 as a scratch register here at the point of call
with a ScratchRegister decl is because cmpxchg itself conditionally
reserves r8 as a scratch register in the case where a client passes
noreg. So using a ScratchRegister here would break cmpxchg and not using
a ScratchRegister in cmpxchg would disallow other callers from passing
in noreg or require it to provide two paths to handle the two cases,
noreg or a scratch reg, allocating a ScratchRegister in only one of
those paths. This is all really a bit clumsy and certainly unclear.

This sort of case where a called method conditionally allocates a
scratch register is an important one to be able to handle. I think a way
to deal with this might be to allow methods like cmpxchg to declare a
local scratch register which /uses/ a supplied register if provided and
only /allocates/ a specific scratch register if noreg is supplied. That
would allow the caller to allocate a scratch register and pass it in or
not allocate the register and pass in noreg. Of course it also allows a
caller to pass in a non-scratch register.

So, given the above definition and your current definition for cmpxchg, viz:

2497 void MacroAssembler::cmpxchg(Register addr, Register expected,
2498                              Register new_val,
2499                              enum operand_size size,
2500                              bool acquire, bool release,
2501                              bool weak,
2502                              Register result) {
2503   ScratchRegister rscratch1(as(), r8);
2504   if (result == noreg)  result = rscratch1;
       <blithely use result>

we could replace them with

 void LIR_Assembler::casw(Register addr, Register newval, Register cmpval) {
   // allocate scratch register to return result
   // and pass it on to cset
   ScratchRegister rscratch1( __ as(), r8);
   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/ true,
/* release*/ true, /* weak*/ false, rscratch1);
   __ cset(rscratch1, Assembler::NE);
   __ membar(__ AnyAny);
 }

and

 void MacroAssembler::cmpxchg(Register addr, Register expected,
                              Register new_val,
                              enum operand_size size,
                              bool acquire, bool release,
                              bool weak,
                              Register result) {
   ScratchRegister rscratch1(as(),
                        /*use without alloc if reg*/ result,
                        /*alloc if noreg*/ r8);
   <blithely use rscratch1 instead of result>

This doesn't really work quite right because the caller has to inhibit
allocation in the called method either by passing 1) an explicit
non-scratch register or 2) an allocated scratch register i.e. it is not
quite uniform. However, in cases where noreg is passed the implication
is clear that rscratch1 (possibly also rscratch2 if two noreg args
may be passed) will be allocated by the callee as an alternative.
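
To make the intended semantics concrete, here is a minimal self-contained
sketch of such a conditional declaration. It is only an illustration under
assumptions: RegTracker, the int-valued Register and cmpxchg_sketch are
hypothetical stand-ins, not the types in the patch, and the three-argument
constructor is the proposed extension rather than anything that exists today.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical stand-ins for illustration; not the HotSpot definitions.
using Register = int;
constexpr Register noreg = -1, r0 = 0, r8 = 8;

// Records which registers are currently claimed as scratch.
struct RegTracker { std::bitset<32> in_use; };

class ScratchRegister {
  RegTracker& _t;
  Register _r;
  bool _owned;  // true only if this declaration did the allocation

  void claim() { assert(!_t.in_use.test(_r)); _t.in_use.set(_r); }

public:
  // Existing form: always allocate r.
  ScratchRegister(RegTracker& t, Register r) : _t(t), _r(r), _owned(true) {
    claim();
  }

  // Proposed form: use 'supplied' without allocating if it is a real
  // register; allocate 'fallback' only when noreg was passed.
  ScratchRegister(RegTracker& t, Register supplied, Register fallback)
      : _t(t),
        _r(supplied != noreg ? supplied : fallback),
        _owned(supplied == noreg) {
    if (_owned) claim();
  }

  // Release only what this declaration allocated.
  ~ScratchRegister() { if (_owned) _t.in_use.reset(_r); }

  ScratchRegister(const ScratchRegister&) = delete;
  ScratchRegister& operator=(const ScratchRegister&) = delete;

  operator Register() const { return _r; }
};

// Stand-in for MacroAssembler::cmpxchg: returns the register that the
// result would be written to.
Register cmpxchg_sketch(RegTracker& t, Register result) {
  ScratchRegister rscratch1(t, result, r8);
  return rscratch1;
}
```

With this, passing noreg makes the callee claim r8 for the duration of the
call, whereas passing either an already-allocated scratch register or an
ordinary register leaves the tracker untouched.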

n.b. the comment in the original version of LIR_Assembler::casw refers
to LIRGenerator::atomic_cmpxchg. Do you not actually mean
MacroAssembler::cmpxchg?


c1_Runtime_aarch64.cpp

I don't really like this:

  51 int StubAssembler::call_RT(Register oop_result1, Register
metadata_result, address entry, int args_size) {
  52   // setup registers
       . . .
  58   FreeScratchRegs dummy(as());
  59   ScratchRegister rscratch1(as(), r8);
  60   ScratchRegister rscratch2(as(), r9);

Why are the regs freed here rather than in the caller? I realise this
case is special because the callee can know that scratch regs are now
invalid (since we are about to plant a blr all bets are off). However,
this is a disparity with other use cases where the caller needs to make
the decision to free temps. Intentions and expectations would be clearer
if the caller were required to explicitly release scratch vars before a
call to call_RT (and then maybe reallocate them again afterwards).

interp_masm_aarch64.cpp:

You have these top level declarations:

  47 static const Register rscratch1 = r8;
  48 static const Register rscratch2 = r9;

So, we don't (now or in future) use ScratchRegister here? Why not? Or is
this just an expedient hack? Once again, I'd prefer scratch uses to be
uniform across all the code.

interp_masm_aarch64.hpp:

 293     set_last_Java_frame(esp, rfp, (address) pc(), r8);

So, here you are using r8 rather than r8_scratch. Was there a rationale
for that?

interpreterRT_aarch64.cpp:

  41 static const Register rscratch1 = r8;
  42 static const Register rscratch2 = r9;

Same comment as above for interp_masm_aarch64.cpp.


macroAssembler_aarch64.cpp:

Several more occurrences of callee freeing that make sense given that a
call is being planted but which I think would be clearer if done in the
relevant callers:

 679 void MacroAssembler::call_VM_base(Register oop_result,
 680                                   Register java_thread,
 681                                   Register last_java_sp,
 682                                   address  entry_point,
 683                                   int      number_of_arguments,
 684                                   bool     check_exceptions) {
 685   FreeScratchRegs regs(as());
       . . .

 804 address MacroAssembler::emit_trampoline_stub(int
insts_call_instruction_offset,
 805                                              address dest) {
       . . .
 821   FreeScratchRegs dummy(as());
       . . .

1456 void MacroAssembler::call_VM_leaf_base(address entry_point,
1457                                        int number_of_arguments,
1458                                        Label *retaddr) {
1459   Label E, L;
1460
1461   FreeScratchRegs dummy(as());  // VM calls clobber all registers but
1462                                 // we preserve rscratch1.
1463   ScratchRegister rscratch1(as(), r8);
       . . .

Also, I don't follow why you sometimes use r8 and other times
r9_scratch in this file.

methodHandles_aarch64.cpp

Once again free in a callee that I'd prefer to see done in the callers:

  97 void MethodHandles::jump_from_method_handle(MacroAssembler* _masm,
Register method, Register temp,
  98                                             bool for_compiler_entry) {
  99   FreeScratchRegs dummy(__ as());
 100   ScratchRegister rscratch1(__ as(), r8);


 128 void MethodHandles::jump_to_lambda_form(MacroAssembler* _masm,
 129                                         Register recv, Register
method_temp,
 130                                         Register temp2,
 131                                         bool for_compiler_entry) {
 132   FreeScratchRegs(__ as());


stubGenerator_aarch64.cpp

Two cases where it is clear a more selective release would be useful

2104     {
2105       FreeScratchRegs dummy(__ as());
2106       ScratchRegister rscratch2(__ as(), r9);
         . . .
2188     {
2189       FreeScratchRegs dummy(__ as());
2190       ScratchRegister rscratch2(__ as(), r9);


templateTable_aarch64.cpp

  46 static const Register rscratch1 = r8;
  47 static const Register rscratch2 = r9;

Same comment as above for interp_masm_aarch64.cpp.


From zgu at redhat.com  Tue Nov  5 12:33:19 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Tue, 5 Nov 2019 07:33:19 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <6c110878-a477-df8a-e566-84b113806044@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
Message-ID: <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>


On 11/5/19 3:52 AM, Andrew Haley wrote:
> On 11/4/19 6:23 PM, Zhengyu Gu wrote:
>>> They are still needed for calling super class's load_at(). Even though,
>>> they are not used there neither.
> 
> Aha! Sorry, I missed that.
> 
>> Or I should say, they are not used there right now, but may be used in
>> future ...
> 
> So add them in the future, surely. All you're doing by passing unused
> args is confusing the reader. It definitely succeeded with me...
> 
Sorry, I should just remove 'unused' comments. Okay with you?


Thanks,

-Zhengyu


From aph at redhat.com  Tue Nov  5 14:58:33 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 5 Nov 2019 14:58:33 +0000
Subject: [aarch64-port-dev ] RFD: scratch registers cleanup [Long post]
In-Reply-To: <47b66917-cd4c-d3bf-5e48-f7054fe45eec@redhat.com>
References: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
 <47b66917-cd4c-d3bf-5e48-f7054fe45eec@redhat.com>
Message-ID: <957a57a3-2cdb-f199-0f74-9a9795cab3d7@redhat.com>

On 11/5/19 12:24 PM, Andrew Dinn wrote:

> I think this is a very good start to remedying a nasty problem. However,
> I'm not yet convinced the model for how you manage and use scratch
> registers is the best way to do this. I'll explain why below and suggest
> a variation that might or might not work. If it doesn't then correct me
> and that will at least help understand why you ended up with the current
> design. If it does work or, at least, suggests how to move towards
> something better then let's try another round.
> 
> What troubles me most of all is that the mechanism you have provided can
> be quite opaque about ownership/liveness of registers. I think that
> happens because at root it does not require that declaration +
> allocation, release and subsequent re-allocation are co-located in the
> same program scope (or,at least, in the same method).
>
> Let's work through some examples to clarify that view. As it stands it
> is possible to allocate some scratch registers in a caller method and
> then release those registers in a called method (possibly more than one
> call down the stack) -- indeed, the latter operation occurs in this code
> from the patch:

Absolutely so, yes. Note that at this point I am not trying to reorganize
code but to make it clear(er) what is going on, and when a programmer does
something random to force that programmer to declare that randomness.

Abuses are possible, I agree.

> c1_LIRAssembler_aarch64.cpp:
> 
>  960 void LIR_Assembler::mem2reg(LIR_Opr src, LIR_Opr dest, BasicType
> type, LIR_PatchCode patch_code, CodeEmitInfo* info, bool wide, bool /*
> unaligned */) {
>  961   LIR_Address* addr = src->as_address_ptr();
>        . . .
>  976   int null_check_here = code_offset();
>  977   FreeScratchRegs dummy(__ as());
>        . . .
> 
> Note that we are not in the scope of a visible ScratchRegister
> declaration here. So, the use of a FreeScratchRegs declaration implies
> that we may be in the scope of a declaration up the call chain. If that
> may be the case then how can the callee safely cancel that allocation?
> How can it be sure that the caller is not relying on a scratch register
> retaining its value across the call?

It can't. "Naked" FreeScratchRegs are an abomination to be avoided
wherever possible. The only really justified use of them is when we're
making a callout to runtime code, at which point the programmer is
expected to know that the native ABI will clobber the scratch regs.

I don't want to clutter every single call to the runtime with
FreeScratchRegs. I don't think it would help anyone.

> Likewise from the POV of the caller which might have declared a
> ScratchRegister there is nothing in the code (except perhaps for
> comments) to indicate that said scratch register might be overwritten
> under the call to this method (or perhaps even a call to an indirect
> caller). To those who do not know the details of the callee a
> ScratchRegister declared in the caller will appear to remain valid
> across the call even though it may be overwritten.
> 
> Another example occurs later in the same file
> 
> 1464 void LIR_Assembler::emit_opTypeCheck(LIR_OpTypeCheck* op) {
> 1465   const bool should_profile = op->should_profile();
> 1466
> 1467   LIR_Code code = op->code();
> 1468   if (code == lir_store_check) {
>        . . .
> 1495     if (should_profile) {
> 1496       Label not_null;
> 1497       __ cbnz(value, not_null);
> 1498       // Object is null; update MDO and exit
> 1499       ScratchRegister rscratch1(__ as(), r8);
> 1500       ScratchRegister rscratch2(__ as(), r9);
>        . . .
> 1554   } else if (code == lir_checkcast) {
> 1555     FreeScratchRegs dummy(__ as());
>        . . .
> 1564   } else if (code == lir_instanceof) {
> 1565     FreeScratchRegs dummy(__ as());
>        . . .
> 
> Once again the release operations occur in a scope where there is no
> prior declaration in the same scope. So, this use of release
> declarations in the else-if clauses implies that a caller may have
> allocated a scratch register yet they presume it is safe to release it.

That one looks like a mistake. It's perfectly possible to FreeScratchRegs
where none are allocated. I don't believe that any registers will be
allocated before the start of emit_opTypeCheck().

> Meanwhile, to make matters worse, the ScratchRegister declarations in
> the initial if clause imply that a caller which exercises this path will
> /not/ have declared a scratch register. That further implies that the
> validity of these combined assumptions about what declarations might be
> in scope depends on the callee's conditional logic. Now that's starting
> to get a tad too opaque and I fear we are heading down a garden path.

Indeed.

> Of course, this may just be down to the fact that the inclusion of those
> release operations was superfluous; that they could be removed from
> these examples. However, I don't like the fact that one can construct
> such examples. I would much prefer it if this usage was avoided where
> possible by requiring the release to occur in the scope where the
> scratch register was declared. i.e. the caller must arrange to free
> scratch registers if a callee might need them. That would make it much
> easier to detect circumstances where the caller was trying to have its
> cake and eat it.

Sure, I'm happy with that in all cases except in the special case of
the runtime calls.

> This model of usage is indeed what happens much of the time e.g. also
> from the same file:
> 
> 1280 void LIR_Assembler::type_profile_helper(Register mdo,
> 1281                                         ciMethodData *md,
> ciProfileData *data,
> 1282                                         Register recv, Label*
> update_done) {
> 1283   ScratchRegister rscratch1(__ as(), r8);
> 1284   ScratchRegister rscratch2(__ as(), r9);
>        . . .
> 1293     {
> 1294       FreeScratchRegs dummy(__ as());
>        . . .
> 1298     }
>        . . .
> 
> With this usage it is very clear that any scratch value established
> before line 1294 cannot be relied on in code after line 1298.

That's what it's supposed to look like!
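
Roughly, as a self-contained sketch of that nested-scope discipline. The
names RegTracker and the int-valued Register are stand-ins assumed purely
for illustration, and throwing on a double allocation is a choice made for
the sketch; none of this is the code in the patch:

```cpp
#include <bitset>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for illustration; not the HotSpot definitions.
using Register = int;
constexpr Register r8 = 8, r9 = 9;

// Records which registers are currently claimed as scratch.
class RegTracker {
  std::bitset<32> _in_use;
public:
  bool in_use(Register r) const { return _in_use.test(r); }
  void claim(Register r) {
    if (_in_use.test(r))
      throw std::logic_error("scratch register already allocated");
    _in_use.set(r);
  }
  void release(Register r) { _in_use.reset(r); }
  std::bitset<32> snapshot() const { return _in_use; }
  void restore(const std::bitset<32>& s) { _in_use = s; }
  void clear() { _in_use.reset(); }
};

// Claims one register for the lifetime of the enclosing scope.
class ScratchRegister {
  RegTracker& _t;
  Register _r;
public:
  ScratchRegister(RegTracker& t, Register r) : _t(t), _r(r) { _t.claim(r); }
  ~ScratchRegister() { _t.release(_r); }
  ScratchRegister(const ScratchRegister&) = delete;
  ScratchRegister& operator=(const ScratchRegister&) = delete;
  operator Register() const { return _r; }
};

// Marks every scratch register free for the enclosing scope and puts the
// previous claims back when that scope ends.
class FreeScratchRegs {
  RegTracker& _t;
  std::bitset<32> _saved;
public:
  explicit FreeScratchRegs(RegTracker& t) : _t(t), _saved(t.snapshot()) {
    _t.clear();
  }
  ~FreeScratchRegs() { _t.restore(_saved); }
  FreeScratchRegs(const FreeScratchRegs&) = delete;
  FreeScratchRegs& operator=(const FreeScratchRegs&) = delete;
};
```

Because the release sits in a block nested inside the scope that made the
allocation, the reader can see both ends of the window in which the scratch
values are not to be trusted.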

I don't really believe that it makes sense for us to prevent "Naked"
FreeScratchRegs automagically, but it would be good to reject such
things in code review.

> I suspect that there are circumstances where a caller can or even
> must free scratch registers down the call tree from an allocation so
> perhaps we cannot afford always to dispense with this current usage
> for FreeScratchRegs altogether. However, I'd prefer it if an
> explicit scope-local release was used for wherever possible to avoid
> this 2nd-guessing of callers' intentions. In which case, it would be
> much better if we could use the grammar of language definitions to
> support that .. .

Now that's a good idea. I'm not sure how you'd actually enforce it in
C++, but you could have something like CalloutFreeScratchRegs for
callouts. Actually, maybe I *can* think of a way to do it with macro magic.

> . . . which also provides the opportunity to remedy another shortcoming.
> The current release operation does not specify which scratch register(s)
> is (are) to be released. The only option is free all registers. However,
> if a release operation must always be located in the scope of a
> ScratchRegister then it is perfectly possible to do so by passing the
> relevant ScratchRegister as a parameter. So, perhaps we could use macros
> like:
> 
>   FreeScratchRegister dummy(rscratch1)
> 
>   FreeScratchRegisters dummy(rscratch1, rscratch2);

Sure, it is, but IMO it's an over-elaboration. When calling down to a
sub-macro it's easier to follow what's going on if you just say "this
macro clobbers scratch", but I won't fight you if you're really keen
to do that.
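
For concreteness, a rough sketch of what such a selective release could look
like. Everything here is assumed for illustration only (RegTracker, the
int-valued Register, the tracker() accessor); FreeScratchRegister is the
name from the suggestion above, not something in the patch:

```cpp
#include <bitset>
#include <cassert>

// Hypothetical stand-ins for illustration; not the HotSpot definitions.
using Register = int;
constexpr Register r8 = 8, r9 = 9;

// Records which registers are currently claimed as scratch.
struct RegTracker { std::bitset<32> in_use; };

// Claims one register for the lifetime of the enclosing scope.
class ScratchRegister {
  RegTracker& _t;
  Register _r;
public:
  ScratchRegister(RegTracker& t, Register r) : _t(t), _r(r) {
    assert(!_t.in_use.test(r));
    _t.in_use.set(r);
  }
  ~ScratchRegister() { _t.in_use.reset(_r); }
  ScratchRegister(const ScratchRegister&) = delete;
  ScratchRegister& operator=(const ScratchRegister&) = delete;
  RegTracker& tracker() const { return _t; }
  operator Register() const { return _r; }
};

// Releases exactly one named scratch register for the enclosing scope and
// re-claims it on exit. It needs the ScratchRegister object itself, so it
// can only be written where that declaration is visible.
class FreeScratchRegister {
  RegTracker& _t;
  Register _r;
public:
  explicit FreeScratchRegister(const ScratchRegister& s)
      : _t(s.tracker()), _r(s) {
    _t.in_use.reset(_r);
  }
  ~FreeScratchRegister() { _t.in_use.set(_r); }
  FreeScratchRegister(const FreeScratchRegister&) = delete;
  FreeScratchRegister& operator=(const FreeScratchRegister&) = delete;
};
```

The cost is a second class and a second thing to read at every use, which is
the extra API surface being weighed here.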

> That makes it harder for scratch registers to be freed down the stack in
> a caller because the caller will not have access to rscratch1 and
> rscratch2 unless they are passed in by reference or pointer. I'm assuming
> code won't pass a ScratchRegister as a call argument, merely the
> encapsulated Register (we might perhaps be able to knobble that by
> ensuring that copying/dereferencing reset the register to noreg).

Yeah. Passing scratch registers by other names is one of the worst
abuses I've had to deal with, and after some version of this patch
goes in such things may be fixed.

> If this revised model works then the current FreeScratchRegs really
> needs renaming to something less appealing like
> DeleteCallerScratchBindings that won't tempt anyone to use it.

That's a nice idea.

> Also, it would perhaps be useful to provide a reverse operation for
> re-allocating a previously declared and freed scratch register e.g.
> 
> public void Foo::bar(...)
>   ScratchRegister rscratch1(...);
>   ScratchRegister rscratch2(...);
>   . . .
>   <use rscratch1 and rscratch2>
>   . . .
>   FreeScratchRegister dummy(rscratch1);
>   call_user_of_rscratch1(...);
>   UseScratchRegister dummy2(rscratch1);
>   ...

No: too complicated, excessive API service. Using nested scopes for
allocation and release corresponds well with the usage of C++, and
that's a good thing.

> This explicitly acknowledges that the call may want to use rscratch1
> and explicitly signals in the caller code that rscratch1 cannot be
> expected to remain constant across the call. Of course, one can always
> achieve the same effect with block nesting but I think this is neater.

I don't. That's what nesting is for!

> There is a bit more to this to do with conditional allocation of a
> scratch register that I will discuss in the detailed comments (for
> c1_LIRAssembler) but that needs to be done in context so ...
> 
> ... on to the specific comments:
> 
> aarch64.ad:
> 
> Firstly, just a humdrum error. You missed a conversion for a specific
> use of rscratch1
> 
> 2851   enc_class aarch64_enc_stlrb(iRegI src, memory mem) %{
> 2852     MOV_VOLATILE(as_Register($src$$reg), $mem$$base, $mem$$index,
> $mem$$scale, $mem$$disp,
> 2853                  rscratch1, stlrb);
> 2854   %}
> 
> This is still passing rscratch1 in the MOV_VOLATILE macro call where
> subsequent encoding classes use r8_scratch. It doesn't actually matter
> (i.e. it doesn't fail to compile) because MOV_VOLATILE ignores the
> SCRATCH argument. I know you are trying to minimize changes but dropping
> that argument in all uses would be a whole lot better . . . yes, maybe
> in the next patch.

Yes.

> This story continues . . .
> 
> 2930   enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
> 2931     MOV_VOLATILE(r8_scratch, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
> 2932              r8_scratch, ldarw);
> 2933     __ fmovs(as_FloatRegister($dst$$reg), r8_scratch);
> 2934   %}
> 
> I'm not sure why you invented the name r8_scratch as an alias for r8.
> Does that really help? I couldn't discern why you sometimes used one and
> why sometimes the other. If there is a rationale it probably ought to be
> documented somewhere (at least at the point where r8_scratch and
> r9_scratch are declared).

I'm not sure, which is perhaps why it wasn't used consistently. Just a
reminder, I guess.

> Anyway, what I don't follow is why these uses just employ a bare
> register name. Why are they exempt from the need to declare a scratch
> register for r8 before using it?
> Why not:
> 
>    enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
>      ScratchRegister rscratch1(__ as(), r8);
>      MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
>               rscratch1, ldarw);
>      __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
>      . . .
> 
> or, if need be
> 
>    enc_class aarch64_enc_fldars(vRegF dst, memory mem) %{
>      {
>        ScratchRegister rscratch1(__ as(), r8);
>        MOV_VOLATILE(rscratch1, $mem$$base, $mem$$index, $mem$$scale,
> $mem$$disp,
>                 rscratch1, ldarw);
>        __ fmovs(as_FloatRegister($dst$$reg), rscratch1);
>      }
>      . . .

I guess I could live with that. I don't want to introduce extra noise
where it can be avoided.

> This is not the only place where your usage is confusing. You do
> (although not consistently) use declarations in some of the later
> encodings. For example, further down the file:
> 
> 3526   enc_class aarch64_enc_fast_lock(iRegP object, iRegP box, iRegP
> tmp, iRegP tmp2) %{
>        . . .
> 3570     // markWord of object (disp_hdr) with the stack pointer.
> 3571     __ mov(r8_scratch, sp);
> 3572     __ sub(disp_hdr, disp_hdr, r8_scratch);
>        . . .
> 
> has no declaration ... but is immediately followed by
> 
> 3604   enc_class aarch64_enc_fast_unlock(iRegP object, iRegP box, iRegP
> tmp, iRegP tmp2) %{
>        . . .
> 3644     ScratchRegister rscratch1(__ as(), r8);
>        . . .

Fair enough. It's PoC, WIP. :-)

> Declarations are also used in some of the instruction definitions:
> 
> 12103 instruct rolI_rReg(iRegINoSp dst, iRegI src, iRegI shift,
> rFlagsReg cr)
>        . . .
> 12109   ins_encode %{
> 12110     ScratchRegister rscratch1(__ as(), r8);
> 12111     __ subw(rscratch1, zr, as_Register($shift$$reg));
>        . . .
> 
> Is there a clear rationale for whether or not to declare a
> ScratchRegister that I am missing? I'd be happier with them being used
> everywhere, avoiding explicit mention of Register names at points of use.

No, I might have just changed my mind partway through. In principle,
for the sake of the sanity of the maintenance programmer, it might be
simplest to insist that scratch register declarations are always used
in AD files. It would mean a change which moves the register tracker
from Assembler to CodeBuffer, because C2 creates Assemblers on the fly,
but such a change is not hard to do.

> Also, one other thing I don't understand but I think is just an error:
> 
> 3008   enc_class aarch64_enc_stlxr(iRegLNoSp src, memory mem) %{
> 3009     MacroAssembler _masm(&cbuf);
> 3010     Register src_reg = as_Register($src$$reg);
> 3011     Register base = as_Register($mem$$base);
> 3012     ScratchRegister rscratch2(__ as(), r9);
>          . . .
> 3017        if (disp != 0) {
> 3018         __ lea(r9_scratch, Address(base, disp));
> 3019         __ stlxr(r8_scratch, src_reg, r9_scratch);
>          . . .
> 
> Why are you declaring r9 as temp rscratch2, not declaring r8 as
> rscratch1 and then using both as r9_scratch and r8_scratch? Note also
> that in the previous encoding (aarch64_enc_ldaxr) you use r9_scratch and
> r8_scratch and don't declare any ScratchRegister.

I'm sure there was a reason, but...

> Why name the scratch register r8_scratch in this decl (shadowing an
> existing name) rather than calling it rscratch1?

That too.


> c1_LIRAssembler_aarch64.cpp:
> 
> I found the comment here to be misleading
> 
> 1581 void LIR_Assembler::casw(Register addr, Register newval, Register
> cmpval) {
> 1582   // r8 is used to pass an argument here, not as scratch. See
> 1583   // LIRGenerator::atomic_cmpxchg.
> 1584   __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/
> true, /* release*/ true, /* weak*/ false, r8);
> 1585   __ cset(r8, Assembler::NE);
> 1586   __ membar(__ AnyAny);
> 1587 }
> 
> Firstly, to be more strict/precise, r8 is used to return a result from
> cmpxchg in order to then pass it on to cset (it read to me like you were
> saying that it passes a value into cmpxchg).

Passing scratch registers to macros as arguments is really awkward. It's
just about the most confusing and error-prone thing you can do. In a
later patch I'd like to get rid of all of it.

Probably the greatest contribution of this work is that it detects and
forces us to do something about all such usages.

> More importantly, what the comment does not clarify is that the reason
> you cannot allocate r8 as a scratch register here at the point of call
> with a ScratchRegister decl is because cmpxchg itself conditionally
> reserves r8 as a scratch register in the case where a client passes
> noreg. So using a ScratchRegister here would break cmpxchg and not using
> a ScratchRegister in cmpxchg would disallow other callers from passing
> in noreg or require it to provide two paths to handle the two cases,
> noreg or a scratch reg, allocating a ScratchRegister in only one of
> those paths. This is all really a bit clumsy and certainly unclear.

Right. The problem here is that the way the registers are used is
confusing and clumsy; the declarations (and comments) reflect that
clumsiness.

> This sort of case where a called method conditionally allocates a
> scratch register is an important one to be able to handle. I think a
> way to deal with this might be to allow methods like cmpxchg to
> declare a local scratch register which /uses/ a supplied register if
> provided and only /allocates/ a specific scratch register if noreg
> is supplied. That would allow the caller to allocate a scratch
> register and pass it in or not allocate the register and pass in
> noreg. Of course it also allows a caller to pass in a non-scratch
> register.

Better still, I think, for the caller to allocate the scratch register
and pass it down under the name rscratch1; ownership remains with the
caller.

> So, given the above definition and your current definition for cmpxchg, viz:
> 
> 2497 void MacroAssembler::cmpxchg(Register addr, Register expected,
> 2498                              Register new_val,
> 2499                              enum operand_size size,
> 2500                              bool acquire, bool release,
> 2501                              bool weak,
> 2502                              Register result) {
> 2503   ScratchRegister rscratch1(as(), r8);
> 2504   if (result == noreg)  result = rscratch1;
>        <blithely use result>
> 
> we could replace them with
> 
>  void LIR_Assembler::casw(Register addr, Register newval, Register cmpval) {
>    // allocate scratch register to return result
>    // and pass it on to cset
>    ScratchRegister rscratch1( __ as(), r8);
>    __ cmpxchg(addr, cmpval, newval, Assembler::word, /* acquire*/ true,
> /* release*/ true, /* weak*/ false, rscratch1);
>    __ cset(rscratch1, Assembler::NE);
>    __ membar(__ AnyAny);
>  }
> 
> and
> 
>  void MacroAssembler::cmpxchg(Register addr, Register expected,
>                               Register new_val,
>                               enum operand_size size,
>                               bool acquire, bool release,
>                               bool weak,
>                               Register result) {
>    ScratchRegister rscratch1(as(),
>                         /*use without alloc if reg*/ result,
>                         /*alloc if noreg*/ r8);
>    <blithely use rscratch1 instead of result>
> 
> This doesn't really work quite right because the caller has to inhibit
> allocation in the called method either by passing 1) an explicit
> non-scratch register or 2) an allocated scratch register i.e. it is not
> quite uniform. However, in cases where noreg is passed the implication
> is clear that rscratch1 (possibly also rscratch2 if two noreg args
> may be passed) will be allocated by the callee as an alternative.

Hmm, maybe. I suggest that we put up with the ugliness of nakedly
using r8 for now (accompanied by suitable comments) and then fix it up
later.

> n.b. the comment in the original version of LIR_Assembler::casw refers
> to LIRGenerator::atomic_cmpxchg. Do you not actually mean
> MacroAssembler::cmpxchg?

I can't remember.

> c1_Runtime_aarch64.cpp
> 
> I don't really like this:
> 
>   51 int StubAssembler::call_RT(Register oop_result1, Register
> metadata_result, address entry, int args_size) {
>   52   // setup registers
>        . . .
>   58   FreeScratchRegs dummy(as());
>   59   ScratchRegister rscratch1(as(), r8);
>   60   ScratchRegister rscratch2(as(), r9);
> 
> Why are the regs freed here rather than in the caller? I realise this
> case is special because the callee can know that scratch regs are now
> invalid (since we are about to plant a blr all bets are off). However,
> this is a disparity with other use cases where the caller needs to make
> the decision to free temps. Intentions and expectations would be clearer
> if the caller were required to explicitly release scratch vars before a
> call to call_RT (and then maybe reallocate them again afterwards).

No, I don't want to do that, as I said above. It's a callout, which
nukes more than just the scratch registers.

> interp_masm_aarch64.cpp:
> 
> You have these top level declarations:
> 
>   47 static const Register rscratch1 = r8;
>   48 static const Register rscratch2 = r9;
> 
> So, we don't (now or in future) use ScratchRegister here? Why not? Or is
> this just an expedient hack? Once again, I'd prefer scratch uses to be
> uniform across all the code.

It's very expedient, and in particular the interpreter has its own
convention for register usage. I don't believe that rewriting all of
the templates etc. to use scratch register declarations would reduce
errors; it might even introduce them.

I could live with the interpreter being changed later, but there's no
need for it now.

> interp_masm_aarch64.hpp:
> 
>  293     set_last_Java_frame(esp, rfp, (address) pc(), r8);
> 
> So, here you are using r8 rather than r8_scratch. Was there a rationale
> for that?

It's because the declaration is in the wrong place. If we moved it, it
could just use the same name as the rest of the interpreter.

> interpreterRT_aarch64.cpp:
> 
>   41 static const Register rscratch1 = r8;
>   42 static const Register rscratch2 = r9;
> 
> Same comment as above for interp_masm_aarch64.cpp.
> 
> 
> macroAssembler_aarch64.cpp:
> 
> Several more occurrences of callee freeing that make sense given that a
> call is being planted but which I think would be clearer if done in the
> relevant callers:

See above.

> Also, I don't follow why you sometimes use r8 and other times
> r9_scratch in this file.

Some cleanups required.

> methodHandles_aarch64.cpp
> 
> Once again free in a callee that I'd prefer to see done in the callers:
> 
>   97 void MethodHandles::jump_from_method_handle(MacroAssembler* _masm,
> Register method, Register temp,
>   98                                             bool for_compiler_entry) {
>   99   FreeScratchRegs dummy(__ as());
>  100   ScratchRegister rscratch1(__ as(), r8);

There's even less justification for putting FreeScratchRegs in the
caller at this point: jump_from_method_handle is not going to return.

> stubGenerator_aarch64.cpp
> 
> Two cases where it is clear a more selective release would be useful
> 
> 2104     {
> 2105       FreeScratchRegs dummy(__ as());
> 2106       ScratchRegister rscratch2(__ as(), r9);
>          . . .
> 2188     {
> 2189       FreeScratchRegs dummy(__ as());
> 2190       ScratchRegister rscratch2(__ as(), r9);

Maybe, but it would make for a more complicated API, which I'm not
convinced would really carry its weight. I can go with a more
complicated FreeScratchRegs which says what you've freed rather than
what you've reserved if you're really keen. But it's not obvious to
me that

FreeScratchRegs dummy(r8, __ as());

is really better than

FreeScratchRegs dummy(__ as());
ScratchRegister rscratch2(__ as(), r9);

... especially since the latter actually relies on the programmer
reading upwards to see that r9 is allocated.

> templateTable_aarch64.cpp
> 
>   46 static const Register rscratch1 = r8;
>   47 static const Register rscratch2 = r9;
> 
> Same comment as above for interp_masm_aarch64.cpp.

Same reply.  :-)

Thanks.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Tue Nov  5 15:08:37 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 5 Nov 2019 15:08:37 +0000
Subject: [aarch64-port-dev ] RFD: scratch registers cleanup [Long post]
In-Reply-To: <957a57a3-2cdb-f199-0f74-9a9795cab3d7@redhat.com>
References: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
 <47b66917-cd4c-d3bf-5e48-f7054fe45eec@redhat.com>
 <957a57a3-2cdb-f199-0f74-9a9795cab3d7@redhat.com>
Message-ID: <5ff0143f-881e-4e02-13a8-56bd71b0086d@redhat.com>

On 11/5/19 2:58 PM, Andrew Haley wrote:
> No: too complicated, excessive API service.

s/service/surface/

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Tue Nov  5 16:26:19 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 5 Nov 2019 16:26:19 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
Message-ID: <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>

On 11/5/19 12:33 PM, Zhengyu Gu wrote:
> 
> 
> On 11/5/19 3:52 AM, Andrew Haley wrote:
>> On 11/4/19 6:23 PM, Zhengyu Gu wrote:
>>>> They are still needed for calling super class's load_at(). Even though,
>>>> they are not used there either.
>>
>> Aha! Sorry, I missed that.
>>
>>> Or I should say, they are not used there right now, but may be used in
>>> future ...
>>
>> So add them in the future, surely. All you're doing by passing unused
>> args is confusing the reader. It definitely succeeded with me...
>>
> Sorry, I should just remove 'unused' comments. Okay with you?

OK.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov  6 09:45:01 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 09:45:01 +0000
Subject: [aarch64-port-dev ] RFR: AArch64: JDK-8232046: AArch64 build
 failure after JDK-8225681
In-Reply-To: <48557f4d-bd8c-ff22-7c3e-fe8ec3f532dd@redhat.com>
References: <41cfc3b9-eb1e-b445-3136-9de93eb66cb2@redhat.com>
 <87fd08bf-82c0-0bb8-e322-311c878b43b4@redhat.com>
 <c016f9e1-c04d-2226-86e2-fafaf7394772@oracle.com>
 <48557f4d-bd8c-ff22-7c3e-fe8ec3f532dd@redhat.com>
Message-ID: <97bf1ce3-0cb4-b229-d5b0-c48f0ee84647@redhat.com>

On 10/11/19 1:51 PM, Andrew Dinn wrote:
> Hi Erik,
> 
> On 11/10/2019 13:04, erik.osterlund at oracle.com wrote:
>> Looks good to me. I feel like something is weird about the 0 is
>> logically -1 mapping (shouldn't it have populated the generic jump with
>> -1 in the first place instead?), but that weirdness should not hold back
>> this fix. Ship it.
> 
> Perhaps. Although -1 is not used anywhere else in the AArch64 code --
> all other sites use a self-reference (jump target address == address of
> jump) from the get-go as well as after a reset. They then lie
> consistently about that to keep the generic code happy. I am not sure
> there is any good reason to use that in place of -1 but I always default
> to the assumption that Andrew Haley had a reason for breaking with
> protocol.
> 
> So, I'd really have preferred to have used a self-reference as the
> initial value in this case too. Indeed, I tried that but it failed to
> relocate when the nmethod was installed. When debugging that failure I
> spotted a cryptic breadcrumb comment left by Andrew Haley about relocs
> not doing the right thing when the generate buffer was copied. So, I
> decided to leave well alone at that point.
> 
> This may only be an artefact of Andrew Haley not understanding relocs
> fully when he first wrote the code. When he is back I'll talk to him and
> see if we can correct this to use a self-reference or even switching
> all jumps to use -1 as an empty marker. Even if we can only manage
> consistent lying about the -1 that would be an improvement.

Getting back to this...

From what I remember, the AArch64 code is based on what x86 did at the
time.  It might well be that such kludges are no longer
necessary. There's no reason not to make this code consistent with all
of the other usages.
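
For readers who have not seen the convention Andrew Dinn describes, it can be modeled roughly as follows. This is a hypothetical sketch, not HotSpot code: `JumpSite` and its members are invented names, and a real jump is an encoded instruction rather than a stored pointer.

```cpp
#include <cassert>
#include <cstdint>

using address = unsigned char*;

// Hypothetical model of a patchable jump site that stores its target.
struct JumpSite {
  address target;

  // "Unset" is encoded as a jump to the site's own address.
  void reset() { target = reinterpret_cast<address>(this); }

  bool is_unset() const {
    return target == reinterpret_cast<const unsigned char*>(this);
  }

  // What shared code is told: -1 stands for "no destination yet".
  address destination() const {
    return is_unset() ? reinterpret_cast<address>(intptr_t(-1)) : target;
  }
};
```

In this toy model a plain copy of an unset `JumpSite` still points at the old location, so the copy no longer looks unset. That may be the kind of hazard the buffer-copy comment was about, but that is a guess.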

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov  6 09:46:24 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 09:46:24 +0000
Subject: [aarch64-port-dev ] RFR (trivial) : fix aarch64-8u type profile
 bug
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED5FF2814@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED5FE7C9A@dggeml527-mbx.china.huawei.com>
 <222f9c0b-7320-8d22-cd44-c4f3af7c1311@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FE7E9C@dggeml527-mbx.china.huawei.com>
 <880f5072-91ba-66bd-94be-429556e7c132@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FF2814@dggeml527-mbx.china.huawei.com>
Message-ID: <e450e5ac-81c9-8cd6-460c-ed74e562993e@redhat.com>

On 9/24/19 9:22 AM, Yangfei (Felix) wrote:
>>> This also reminds me of another two aarch64-specific profiling issues:
>>> https://bugs.openjdk.java.net/browse/JDK-8188221
>>> https://bugs.openjdk.java.net/browse/JDK-8189439
>>>
>>> I think they also should be incorporated in aarch64 8u.  What do you think?
>> I've always been reluctant to backport performance-only patches to 8u, but I
>> admit that version will be around for a long time, so OK.
>>
> Looks like the upstream patches can be simplified: 'mdp' is loaded by test_method_data_pointer which is called by profile_return_type & profile_parameters_type.  

Sorry for not replying before now.

Maybe they can, but this is a backport.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From zgu at redhat.com  Wed Nov  6 12:18:28 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Wed, 6 Nov 2019 07:18:28 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
 <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
Message-ID: <93330192-7143-ca82-9872-fe627a97772e@redhat.com>


On 11/5/19 11:26 AM, Andrew Haley wrote:
> On 11/5/19 12:33 PM, Zhengyu Gu wrote:
>>
>>
>> On 11/5/19 3:52 AM, Andrew Haley wrote:
>>> On 11/4/19 6:23 PM, Zhengyu Gu wrote:
>>>>> They are still needed for calling super class's load_at(). Even though,
>>>>> they are not used there either.
>>>
>>> Aha! Sorry, I missed that.
>>>
>>>> Or I should say, they are not used there right now, but may be used in
>>>> future ...
>>>
>>> So add them in the future, surely. All you're doing by passing unused
>>> args is confusing the reader. It definitely succeeded with me...
>>>
>> Sorry, I should just remove 'unused' comments. Okay with you?
> 
> OK.

Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.02/index.html

Thanks,

-Zhengyu

> 

From shade at redhat.com  Wed Nov  6 12:18:58 2019
From: shade at redhat.com (Aleksey Shipilev)
Date: Wed, 6 Nov 2019 13:18:58 +0100
Subject: [aarch64-port-dev ] RFR (XS) 8233695: AArch64 build failures after
	-Wno-extra removal
Message-ID: <d929bdb8-0a7d-dc93-9fa0-7558d7f8aae8@redhat.com>

Bug:
  https://bugs.openjdk.java.net/browse/JDK-8233695

Fix:
  https://cr.openjdk.java.net/~shade/8233695/webrev.01/

This is the simplest fix I can think of in orderAccess: cast away constness. Also, trivially unused
parameter in templateInterpreterGenerator_aarch64.cpp. The issue is exposed in 14, but it is
actually there in all releases down to 8-aarch64.

Testing: aarch64 build, tier1 (running)

-- 
Thanks,
-Aleksey


From aph at redhat.com  Wed Nov  6 12:33:29 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 12:33:29 +0000
Subject: [aarch64-port-dev ] RFR (XS) 8233695: AArch64 build failures
 after -Wno-extra removal
In-Reply-To: <d929bdb8-0a7d-dc93-9fa0-7558d7f8aae8@redhat.com>
References: <d929bdb8-0a7d-dc93-9fa0-7558d7f8aae8@redhat.com>
Message-ID: <46aa9d40-7353-c5b6-6bd9-4152eee0a3b8@redhat.com>

On 11/6/19 12:18 PM, Aleksey Shipilev wrote:
> Bug:
>   https://bugs.openjdk.java.net/browse/JDK-8233695
> 
> Fix:
>   https://cr.openjdk.java.net/~shade/8233695/webrev.01/
> 
> This is the simplest fix I can think of in orderAccess: cast away constness. Also, trivially unused
> parameter in templateInterpreterGenerator_aarch64.cpp. The issue is exposed in 14, but it is
> actually there in all releases down to 8-aarch64.

It's better to use const_cast<T> here. Otherwise OK, thanks.
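
As a generic illustration of the point (not the code in the webrev; `load_like` and `read_value` are hypothetical names), a `const_cast` can only add or remove cv-qualifiers, whereas a C-style cast would silently also allow a change of pointee type:

```cpp
#include <cassert>

// Hypothetical stand-in for an API that takes a non-const pointer
// but only reads through it.
inline int load_like(volatile int* p) { return *p; }

inline int read_value(const volatile int* p) {
  // const_cast strips only the const qualifier; writing
  // const_cast<volatile long*>(p) here would fail to compile.
  return load_like(const_cast<volatile int*>(p));
}
```

A call such as `read_value(&x)` for some `int x` then reads `x` without the caller having to give up its `const` view of the data.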

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov  6 12:35:54 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 12:35:54 +0000
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
 <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
 <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
Message-ID: <7e9ace3d-8d15-e87a-f01c-90fc4b6faa6a@redhat.com>

On 11/6/19 12:18 PM, Zhengyu Gu wrote:
> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.02/index.html

OK.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From shade at redhat.com  Wed Nov  6 12:46:55 2019
From: shade at redhat.com (Aleksey Shipilev)
Date: Wed, 6 Nov 2019 13:46:55 +0100
Subject: [aarch64-port-dev ] RFR (XS) 8233695: AArch64 build failures
 after -Wno-extra removal
In-Reply-To: <46aa9d40-7353-c5b6-6bd9-4152eee0a3b8@redhat.com>
References: <d929bdb8-0a7d-dc93-9fa0-7558d7f8aae8@redhat.com>
 <46aa9d40-7353-c5b6-6bd9-4152eee0a3b8@redhat.com>
Message-ID: <9f35ecb2-83e6-bf04-5f91-a671fd65070f@redhat.com>

On 11/6/19 1:33 PM, Andrew Haley wrote:
> On 11/6/19 12:18 PM, Aleksey Shipilev wrote:
>> Bug:
>>   https://bugs.openjdk.java.net/browse/JDK-8233695
>>
>> Fix:
>>   https://cr.openjdk.java.net/~shade/8233695/webrev.01/
>>
>> This is the simplest fix I can think of in orderAccess: cast away constness. Also, trivially unused
>> parameter in templateInterpreterGenerator_aarch64.cpp. The issue is exposed in 14, but it is
>> actually there in all releases down to 8-aarch64.
> 
> It's better to use const_cast<T> here. Otherwise OK, thanks.

Right. Like this?
  https://cr.openjdk.java.net/~shade/8233695/webrev.02/

-- 
Thanks,
-Aleksey


From aph at redhat.com  Wed Nov  6 13:14:15 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 13:14:15 +0000
Subject: [aarch64-port-dev ] RFR (XS) 8233695: AArch64 build failures
 after -Wno-extra removal
In-Reply-To: <9f35ecb2-83e6-bf04-5f91-a671fd65070f@redhat.com>
References: <d929bdb8-0a7d-dc93-9fa0-7558d7f8aae8@redhat.com>
 <46aa9d40-7353-c5b6-6bd9-4152eee0a3b8@redhat.com>
 <9f35ecb2-83e6-bf04-5f91-a671fd65070f@redhat.com>
Message-ID: <360b522d-ff3d-16af-b833-27b518e39fc3@redhat.com>

On 11/6/19 12:46 PM, Aleksey Shipilev wrote:
> Right. Like this?
>   https://cr.openjdk.java.net/~shade/8233695/webrev.02/

Exactly.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov  6 13:43:31 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 6 Nov 2019 13:43:31 +0000
Subject: [aarch64-port-dev ] RFD: scratch registers cleanup [Long post]
In-Reply-To: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
References: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
Message-ID: <7264d4c2-1520-c49f-2e73-f1527fb90da3@redhat.com>

One other thing: this exercise has shown that in many cases we trash
scratch registers in places where it really doesn't matter, and we'd
be much better off rewriting them not to do so.

This makes push_call_clobbered_registers() something that can safely
be used everywhere. But I'm holding off any of this because I want the
first patch to be, if at all possible, neutral with regard to code
generated.

diff -r 33f9271b3167 src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
--- a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp	Mon Nov 04 13:13:34 2019 -0500
+++ b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp	Wed Nov 06 08:36:08 2019 -0500
@@ -2624,15 +2624,17 @@
   int step = 4 * wordSize;
   push(RegSet::range(r0, r18) - RegSet::of(rscratch1, rscratch2), sp);
   sub(sp, sp, step);
-  mov(rscratch1, -step);
+  mov(r0, -step);
   // Push v0-v7, v16-v31.
   for (int i = 31; i>= 4; i -= 4) {
     if (i <= v7->encoding() || i >= v16->encoding())
       st1(as_FloatRegister(i-3), as_FloatRegister(i-2), as_FloatRegister(i-1),
-          as_FloatRegister(i), T1D, Address(post(sp, rscratch1)));
+          as_FloatRegister(i), T1D, Address(post(sp, r0)));
   }
   st1(as_FloatRegister(0), as_FloatRegister(1), as_FloatRegister(2),
       as_FloatRegister(3), T1D, Address(sp));
+  // Reload r0 from where it was saved before pushing v0-v7, v16-v31.
+  ldr(r0, Address(sp, (8 + 16) * wordSize));
 }
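
The `(8 + 16) * wordSize` offset in the reload can be checked with a little arithmetic. The sketch below (C++14 or later) assumes that each register stored with the `T1D` arrangement occupies 8 bytes, and that the earlier `push` leaves r0 at the lowest address of the integer-register block. Both are assumptions about the surrounding code rather than something shown in the diff, and `vector_slots` and `r0_offset_bytes` are names made up for the illustration.

```cpp
#include <cassert>

constexpr int wordSize = 8;  // bytes per 64-bit word on AArch64

// Number of 8-byte slots written by the vector stores, mirroring the loop
// in the patch: groups of four registers ending at i = 31, 27, ..., 7,
// skipping the groups that fall inside v8-v15, plus the final v0-v3 store.
constexpr int vector_slots() {
  int slots = 0;
  for (int i = 31; i >= 4; i -= 4) {
    if (i <= 7 || i >= 16) slots += 4;
  }
  return slots + 4;  // the trailing st1 of v0-v3
}

static_assert(vector_slots() == 8 + 16, "v0-v7 plus v16-v31");

// Byte offset from the final sp up to the slot holding the saved r0,
// assuming r0 sits directly above the vector save area.
constexpr int r0_offset_bytes() { return vector_slots() * wordSize; }
```

With these assumptions the vector area is 24 slots, so the saved r0 is found 192 bytes above the final sp, which agrees with the `ldr` in the patch.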

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From shade at redhat.com  Wed Nov  6 13:45:27 2019
From: shade at redhat.com (Aleksey Shipilev)
Date: Wed, 6 Nov 2019 14:45:27 +0100
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
 <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
 <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
Message-ID: <7f8dd01f-30f7-f8c9-544a-c06f2a49eea0@redhat.com>

On 11/6/19 1:18 PM, Zhengyu Gu wrote:
> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.02/index.html

Minor nits:

*) shenandoahBarrierSetAssembler_aarch64.cpp: excess space between parentheses:

 368   if (!is_reference_type(type) ) {

*) shenandoahBarrierSetC1.cpp: so, native oop loads used to call to
ShenandoahRuntime::load_reference_barrier_native before this refactoring? That would mean it is
enabled even when "passive" is enabled (which implies -ShenandoahLRB)? Current change looks fine,
but we need to recognize this is a behavioral change. Please link the issue where that regression
was introduced.

Otherwise looks fine to me, let Roman ack it too.

-- 
Thanks,
-Aleksey


From zgu at redhat.com  Wed Nov  6 14:15:55 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Wed, 6 Nov 2019 09:15:55 -0500
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <7f8dd01f-30f7-f8c9-544a-c06f2a49eea0@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
 <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
 <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
 <7f8dd01f-30f7-f8c9-544a-c06f2a49eea0@redhat.com>
Message-ID: <0251231d-047e-0117-25b0-8ecfc9b30b7f@redhat.com>

Thanks for the review, Aleksey.

On 11/6/19 8:45 AM, Aleksey Shipilev wrote:
> On 11/6/19 1:18 PM, Zhengyu Gu wrote:
>> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.02/index.html
> 
> Minor nits:
> 
> *) shenandoahBarrierSetAssembler_aarch64.cpp: excess space between parentheses:
> 
>   368   if (!is_reference_type(type) ) {

Will fix before push.

> 
> *) shenandoahBarrierSetC1.cpp: so, native oop loads used to call to
> ShenandoahRuntime::load_reference_barrier_native before this refactoring? That would mean it is
> enabled even when "passive" is enabled (which implies -ShenandoahLRB)? Current change looks fine,
> but we need to recognize this is the behavioral change. Please link the issue where that regression
> was introduced.

Correct, we don't need load_reference_barrier_native barrier if weak 
roots are processed at STW pauses.

Added comments about this behavioral change in CR and linked to JDK-8227635.

-Zhengyu


> 
> Otherwise looks fine to me, let Roman ack it too.
> 

From rkennke at redhat.com  Wed Nov  6 14:39:58 2019
From: rkennke at redhat.com (Roman Kennke)
Date: Wed, 6 Nov 2019 15:39:58 +0100
Subject: [aarch64-port-dev ] RFR 8233401: Shenandoah: Refactor/cleanup
 Shenandoah load barrier code
In-Reply-To: <7e9ace3d-8d15-e87a-f01c-90fc4b6faa6a@redhat.com>
References: <45287c04-370c-cb0b-1603-c93fe15da3d9@redhat.com>
 <ea93151c-7a73-b5df-4e43-a21e9188c938@redhat.com>
 <c60d30b0-e920-28f3-1077-d182662da48d@redhat.com>
 <87b115fb-5353-b21b-3cbc-f862bd932b3e@redhat.com>
 <3d70db1c-c927-48f8-23ab-8937838e0302@redhat.com>
 <0d347d16-f870-798f-0165-1ee4dfae511b@redhat.com>
 <859e48d6-9af5-b4af-32ac-4b07ce92e94d@redhat.com>
 <6c110878-a477-df8a-e566-84b113806044@redhat.com>
 <84394d85-1b99-8139-3baf-7fbedba702c0@redhat.com>
 <2be52de0-6f12-d989-cf69-5807b2160cb0@redhat.com>
 <93330192-7143-ca82-9872-fe627a97772e@redhat.com>
 <7e9ace3d-8d15-e87a-f01c-90fc4b6faa6a@redhat.com>
Message-ID: <b959ae55-78bd-dc52-bc40-489cc41623a3@redhat.com>

>> Updated: http://cr.openjdk.java.net/~zgu/JDK-8233401/webrev.02/index.html
> 
> OK.

Ok too.
Roman


From adinn at redhat.com  Wed Nov  6 14:58:37 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Wed, 6 Nov 2019 14:58:37 +0000
Subject: [aarch64-port-dev ] RFD: scratch registers cleanup [Long post]
In-Reply-To: <7264d4c2-1520-c49f-2e73-f1527fb90da3@redhat.com>
References: <3e4aaf79-59a9-b346-e6b0-69839acf723e@redhat.com>
 <7264d4c2-1520-c49f-2e73-f1527fb90da3@redhat.com>
Message-ID: <13a7fdcf-5360-f6d0-02fb-b163f1196de1@redhat.com>

On 06/11/2019 13:43, Andrew Haley wrote:
> One other thing: this exercise has shown that in many cases we trash
> scratch registers in places where it really doesn't matter, and we'd
> be much better off rewriting them not to do so.

Agreed.

> This makes push_call_clobbered_registers() something that can safely
> be used everywhere. But I'm holding off any of this because I want the
> first patch to be, if at all possible, neutral with regard to code
> generated.
> . . .
Yes, that patch is an improvement.

regards,


Andrew Dinn
-----------


From aleksei.voitylov at bell-sw.com  Wed Nov  6 16:53:02 2019
From: aleksei.voitylov at bell-sw.com (Aleksei Voitylov)
Date: Wed, 6 Nov 2019 19:53:02 +0300
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>

Hi Patrick,

I like the fact that this patch does not add much to the complexity of
the code. Here are some experiments that you could find useful.

Cortex A73
Benchmark                                    Size   base (ns/op)   patched (ns/op)      Diff
StringCompareBench.StringCompareLL            256    14422257,98       15302300,24    -6,10%
StringCompareBench.StringCompareLL            512    27998036,21       28317818,08    -1,14%

ThunderX2
Benchmark                                    Size   base (ns/op)   patched (ns/op)      Diff
StringCompareBench.StringCompareLL            128    4265122,232       13099099,67  -207,12%
StringCompareBench.StringCompareLL            256    3539452,533       3599407,432    -1,69%

StringCompareBench.StringCompareUU            128     6899938,75       7174601,241    -3,98%
StringCompareBench.StringCompareUU            256    7654538,841       7826599,466    -2,25%

StringCompareBench.cachedStringCompareLL      128         19,673            21,242    -7,98%
StringCompareBench.cachedStringCompareLL      256         34,179            36,452    -6,65%
StringCompareBench.cachedStringCompareLL      512         59,574            64,088    -7,58%
StringCompareBench.cachedStringCompareLL     1024         110,37           118,477    -7,35%
StringCompareBench.cachedStringCompareLL  1000000     114028,907        115388,681    -1,19%

StringCompareBench.cachedStringCompareUU      128         33,752            36,922    -9,39%
StringCompareBench.cachedStringCompareUU      256         60,939            64,096    -5,18%
StringCompareBench.cachedStringCompareUU      512        115,328            118,48    -2,73%
StringCompareBench.cachedStringCompareUU     1024        239,332            242,97    -1,52%
StringCompareBench.cachedStringCompareUU  1000000     226491,096        233638,328    -3,16%

It might be the case that the newly added branch is the culprit:

+      __ subs(rscratch2, cnt2, largeLoopExitCondition);
+      __ br(__ LT, NO_PREFETCH);

Maybe you could skip it when CompareLongStringLimitLatin and
CompareLongStringLimitUTF are large enough (then stub code is only
called with string length large enough to skip branch above). Then (the
properly commented) code would look like:

if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
    __ subs(rscratch2, cnt2, largeLoopExitCondition);
    __ br(__ LT, NO_PREFETCH);
}

and in this case we shouldn't see any performance penalties.
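
The generation-time test can be pulled out into a small predicate to see when the branch would be emitted. This is a sketch only: `needs_no_prefetch_guard` is a hypothetical helper, and the values used in the note below are illustrative (72 is the old hard-coded threshold, 64 comes from the "64 bytes" mentioned in the original message), not necessarily what the patch computes.

```cpp
#include <cassert>

constexpr int wordSize = 8;  // bytes per 64-bit word on AArch64

// True when the generated stub still needs the runtime length check before
// the prefetching loop, following the condition suggested above.
// All values are supplied by the caller; none are HotSpot defaults.
constexpr bool needs_no_prefetch_guard(int stub_threshold, bool isLL,
                                       int largeLoopExitCondition) {
  return (stub_threshold - wordSize / (isLL ? 1 : 2)) < largeLoopExitCondition;
}
```

For example, with a Latin threshold of 72 and an exit condition of 64 the predicate is false, so no branch would be emitted. Lowering the threshold to 32 makes it true and the check comes back.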

-Aleksei

On 29/10/2019 12:58, Patrick Zhang OS wrote:
> Hi,
>
> Could you please review this patch, thanks.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
> (this starts from .02 since there had been some internal review and updates)
>
> Changes:
>
> 1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
> 2.       MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
> 3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
> 4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
> 5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
> Tests:
>
>   1.  For function check, I have run
>
> jdk jtreg tier1 tests, with default vm flags
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
> jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToSameLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
>   2.  For performance check, I have run
>
> string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
> Refs:
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
> [5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
> [6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko
>
> Regards
> Patrick
>

From felix.yang at huawei.com  Thu Nov  7 01:17:05 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Thu, 7 Nov 2019 01:17:05 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8233466: aarch64: remove unnecessary
 load of mdo when profiling return and parameters type
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6027D6E@dggeml527-mbx.china.huawei.com>

Hi,

   Please review the following patch:

      Bug: https://bugs.openjdk.java.net/browse/JDK-8233466

Webrev: http://cr.openjdk.java.net/~fyang/8233466/webrev.00/


When profiling return and parameters type from the interpreter on aarch64 platform, 'mdp' is loaded by test_method_data_pointer which is called by profile_return_type & profile_parameters_type.

It's not necessary to load mdo before calling __ profile_return_type or __ profile_parameters_type.


Passed tier1-3 testing.


Thanks,

Felix

From ci_notify at linaro.org  Thu Nov  7 01:27:13 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 7 Nov 2019 01:27:13 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <1809406039.12368.1573090034194.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/310/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/20 pass: 5,728
Build 1: aarch64/2019/sep/23 pass: 5,727
Build 2: aarch64/2019/oct/07 pass: 5,750
Build 3: aarch64/2019/oct/09 pass: 5,747; fail: 1
Build 4: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 5: aarch64/2019/oct/14 pass: 5,753
Build 6: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 7: aarch64/2019/oct/18 pass: 5,760
Build 8: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 9: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 10: aarch64/2019/oct/28 pass: 5,766
Build 11: aarch64/2019/oct/30 pass: 5,768
Build 12: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 13: aarch64/2019/nov/04 pass: 5,769
Build 14: aarch64/2019/nov/06 pass: 5,766; fail: 2

1 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/20 pass: 8,685; fail: 503; error: 22
Build 1: aarch64/2019/sep/23 pass: 8,696; fail: 497; error: 19
Build 2: aarch64/2019/oct/07 pass: 8,683; fail: 517; error: 18
Build 3: aarch64/2019/oct/09 pass: 8,692; fail: 507; error: 21
Build 4: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 5: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 6: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 7: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 8: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 9: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 10: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 11: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 12: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 13: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 14: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19

7 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/sep/20 pass: 3,979
Build 1: aarch64/2019/sep/23 pass: 3,979
Build 2: aarch64/2019/oct/07 pass: 3,979
Build 3: aarch64/2019/oct/09 pass: 3,979
Build 4: aarch64/2019/oct/11 pass: 3,979
Build 5: aarch64/2019/oct/14 pass: 3,979
Build 6: aarch64/2019/oct/16 pass: 3,979
Build 7: aarch64/2019/oct/18 pass: 3,979
Build 8: aarch64/2019/oct/21 pass: 3,979
Build 9: aarch64/2019/oct/23 pass: 3,980
Build 10: aarch64/2019/oct/28 pass: 3,980
Build 11: aarch64/2019/oct/30 pass: 3,980
Build 12: aarch64/2019/nov/01 pass: 3,980
Build 13: aarch64/2019/nov/04 pass: 3,980
Build 14: aarch64/2019/nov/06 pass: 3,980

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.84x
Relative performance: Server critical-jOPS (nc): 9.75x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x
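
The ratio on the line above is the current server score divided by the
2014-04-01 server baseline. A minimal sketch of that arithmetic follows; the
class and method names are hypothetical, and two-decimal rounding is an
assumption about how the report formats the value.

```java
import java.util.Locale;

// Hypothetical helper, not part of the CI scripts: reproduces the
// "Server X / Server 2014-04-01 (71.00)" ratio from the report.
final class TerasortRatio {
    // Current score divided by the baseline score.
    static double relativeToBaseline(double currentScore, double baselineScore) {
        return currentScore / baselineScore;
    }

    // Two decimals plus an "x" suffix (assumed report format).
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.2fx", ratio);
    }

    public static void main(String[] args) {
        System.out.println(format(relativeToBaseline(207.57, 71.00)));
    }
}
```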

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-09-21 pass rate: 10487/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/263/results/
2019-09-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/266/results/
2019-10-08 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/280/results/
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From felix.yang at huawei.com  Thu Nov  7 01:27:11 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Thu, 7 Nov 2019 01:27:11 +0000
Subject: [aarch64-port-dev ] RFR (trivial) : fix aarch64-8u type profile
 bug
In-Reply-To: <e450e5ac-81c9-8cd6-460c-ed74e562993e@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED5FE7C9A@dggeml527-mbx.china.huawei.com>
 <222f9c0b-7320-8d22-cd44-c4f3af7c1311@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FE7E9C@dggeml527-mbx.china.huawei.com>
 <880f5072-91ba-66bd-94be-429556e7c132@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED5FF2814@dggeml527-mbx.china.huawei.com>
 <e450e5ac-81c9-8cd6-460c-ed74e562993e@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6027D7F@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Wednesday, November 6, 2019 5:46 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR (trivial) : fix aarch64-8u type profile bug
> 
> On 9/24/19 9:22 AM, Yangfei (Felix) wrote:
> >>> This also reminds me of another two aarch64-specific profiling issues:
> >>> https://bugs.openjdk.java.net/browse/JDK-8188221
> >>> https://bugs.openjdk.java.net/browse/JDK-8189439
> >>>
> >>> I think they also should be incorporated in aarch64 8u.  What do you think?
> >> I've always been reluctant to backport performance-only patches to
> >> 8u, but I admit that version will be around for a long time, so OK.
> >>
> > Looks like the upstream patches can be simplified: 'mdp' is loaded by
> > test_method_data_pointer which is called by profile_return_type &
> > profile_parameters_type.
> 
> Sorry for not replying before now.
> 
> Maybe they can, but this is a backport.

Hi,

  OK.  I have sent a separate mail to fix this for jdk14: https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2019-November/036877.html 
  Please review that one.  
  I want to fix this for jdk14 before I do the aarch64 8u backport.  

Thanks,
Felix

From patrick at os.amperecomputing.com  Thu Nov  7 10:55:40 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Thu, 7 Nov 2019 10:55:40 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>
Message-ID: <MN2PR01MB609316337FABB9DE034156EA8F780@MN2PR01MB6093.prod.exchangelabs.com>

Hi Aleksei,

Thanks for testing it and for the data. I only had the source of StringCompareBench.java [6]; my numbers (the diffs) are within 2%, so the -207.12% looks quite odd. I initially did not add a condition to control the br, since generate_compare_long_string_different_encoding has a similar unconditional br. By the way, the original logic allowed prefetching memory beyond the array border for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from reaching the large loop with prefetching. Further comments are welcome, thanks.

http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/

Regards
Patrick

From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
Sent: Thursday, November 7, 2019 12:53 AM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable


Hi Patrick,

I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
Cortex A73
Benchmark                                 Size     base (ns/op)  patched (ns/op)  Diff
StringCompareBench.StringCompareLL        256      14422257,98   15302300,24      -6,10%
StringCompareBench.StringCompareLL        512      27998036,21   28317818,08      -1,14%

ThunderX2
Benchmark                                 Size     base (ns/op)  patched (ns/op)  Diff
StringCompareBench.StringCompareLL        128      4265122,232   13099099,67      -207,12%
StringCompareBench.StringCompareLL        256      3539452,533   3599407,432      -1,69%

StringCompareBench.StringCompareUU        128      6899938,75    7174601,241      -3,98%
StringCompareBench.StringCompareUU        256      7654538,841   7826599,466      -2,25%

StringCompareBench.cachedStringCompareLL  128      19,673        21,242           -7,98%
StringCompareBench.cachedStringCompareLL  256      34,179        36,452           -6,65%
StringCompareBench.cachedStringCompareLL  512      59,574        64,088           -7,58%
StringCompareBench.cachedStringCompareLL  1024     110,37        118,477          -7,35%
StringCompareBench.cachedStringCompareLL  1000000  114028,907    115388,681       -1,19%

StringCompareBench.cachedStringCompareUU  128      33,752        36,922           -9,39%
StringCompareBench.cachedStringCompareUU  256      60,939        64,096           -5,18%
StringCompareBench.cachedStringCompareUU  512      115,328       118,48           -2,73%
StringCompareBench.cachedStringCompareUU  1024     239,332       242,97           -1,52%
StringCompareBench.cachedStringCompareUU  1000000  226491,096    233638,328       -3,16%
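
For reading the Diff column: the values are consistent with
(base - patched) / base, so a negative number means the patched build takes
more ns/op. The sketch below recomputes two rows under that assumption. The
class and method names are hypothetical, and the decimal commas in the table
are read as decimal points.

```java
import java.util.Locale;

// Hypothetical helper for re-deriving the Diff column of the table above.
final class BenchDiff {
    // Parses a number written with a decimal comma, e.g. "14422257,98".
    static double parse(String decimalComma) {
        return Double.parseDouble(decimalComma.replace(',', '.'));
    }

    // Relative change in percent; negative means patched is slower (more ns/op).
    static double diffPercent(double baseNsPerOp, double patchedNsPerOp) {
        return (baseNsPerOp - patchedNsPerOp) / baseNsPerOp * 100.0;
    }

    public static void main(String[] args) {
        // Cortex A73, StringCompareLL, size 256
        double a73 = diffPercent(parse("14422257,98"), parse("15302300,24"));
        // ThunderX2, StringCompareLL, size 128 (the outlier)
        double tx2 = diffPercent(parse("4265122,232"), parse("13099099,67"));
        System.out.println(String.format(Locale.ROOT, "%.2f%% %.2f%%", a73, tx2));
    }
}
```
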
It might be the case that the newly added branch is the culprit:

+      __ subs(rscratch2, cnt2, largeLoopExitCondition);
+      __ br(__ LT, NO_PREFETCH);

Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:

if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
     __ subs(rscratch2, cnt2, largeLoopExitCondition);
     __ br(__ LT, NO_PREFETCH);
}

and in this case we shouldn't see any performance penalties.
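
The compile-time condition suggested above can be modelled outside HotSpot as
a plain predicate. This is only a sketch: the class and method names are
hypothetical, wordSize is taken as 8 (64-bit AArch64), and the
largeLoopExitCondition values of 64 (LL) and 32 (UU) are assumptions derived
from the "64 bytes" mentioned in the original patch description, not values
read from the webrev.

```java
// Hypothetical model of the suggested stub-generation-time check.
final class PrefetchGuard {
    // True when the runtime "subs + br LT, NO_PREFETCH" pair must be emitted,
    // i.e. when the stub can be entered with fewer remaining characters than
    // the large prefetching loop requires.
    static boolean needsRuntimeGuard(int stubThreshold, int wordSize,
                                     boolean isLL, int largeLoopExitCondition) {
        int charsPerWord = wordSize / (isLL ? 1 : 2); // 8 Latin1 or 4 UTF-16 chars
        return (stubThreshold - charsPerWord) < largeLoopExitCondition;
    }

    public static void main(String[] args) {
        // Old hard-coded threshold of 72: 72 - 8 = 64, not below 64, no guard.
        System.out.println(needsRuntimeGuard(72, 8, true, 64));
        // A smaller tuned threshold: 24 - 8 = 16 < 64, guard required.
        System.out.println(needsRuntimeGuard(24, 8, true, 64));
    }
}
```

With the default threshold the predicate is false, so the generated stub
would be unchanged from the pre-patch code and the extra branch would only
appear when a smaller threshold is configured.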

-Aleksei

On 29/10/2019 12:58, Patrick Zhang OS wrote:

Hi,


Could you please review this patch, thanks.


JBS: https://bugs.openjdk.java.net/browse/JDK-8229351

Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02

(this starts from .02 since there had been some internal review and updates)


Changes:


1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.


2.       MacroAssembler::string_compare used the same threshold for LL and UU. UU may need only half of LL's threshold (measured in characters), because a UU character occupies two bytes, while a character of a compact LL string occupies one byte. In addition, LU/UL may need a separate threshold, as its stub function differs from the same-encoding one and its performance may vary as well.


3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.


4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).


5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
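
As an illustration of item 2 above, the same byte budget corresponds to
different character counts for the two encodings. The sketch below is not
taken from the patch; the class, enum, and method names are hypothetical, and
it only shows the bytes-to-characters conversion.

```java
// Hypothetical illustration: one byte threshold, two character thresholds.
final class StubThresholds {
    enum Encoding { LL, UU }

    // Latin1 (LL) uses one byte per character, UTF-16 (UU) uses two.
    static int bytesPerChar(Encoding encoding) {
        return encoding == Encoding.LL ? 1 : 2;
    }

    // Number of characters that fit into a threshold expressed in bytes.
    static int thresholdInChars(int thresholdBytes, Encoding encoding) {
        return thresholdBytes / bytesPerChar(encoding);
    }

    public static void main(String[] args) {
        System.out.println(thresholdInChars(72, Encoding.LL)); // 72 characters
        System.out.println(thresholdInChars(72, Encoding.UU)); // 36 characters
    }
}
```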


Tests:


  1.  For function check, I have run


jdk jtreg tier1 tests, with default vm flags


hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"


jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;


some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].


  2.  For performance check, I have run


string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,


and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).


FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.


Refs:

[1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string

[2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string

[3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko

[4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic

[5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev

[6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko


Regards

Patrick


From zgu at redhat.com  Thu Nov  7 14:55:13 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Thu, 7 Nov 2019 09:55:13 -0500
Subject: [aarch64-port-dev ] RFR(XS) 8233337: Shenandoah: Cleanup AArch64
 SBSA::load_reference_barrier_not_null()
Message-ID: <5de422c4-ea76-81b0-8413-d3e81f60a09d@redhat.com>

Please review this cleanup patch suggested by Andrew Haley. Please see 
[1] for details


Bug: https://bugs.openjdk.java.net/browse/JDK-8233337
Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233337/webrev.00/

Test:
   hotspot_gc_shenandoah (fastdebug and release)
   on AArch64 Linux

Thanks,

-Zhengyu


[1] 
https://mail.openjdk.java.net/pipermail/shenandoah-dev/2019-October/010976.html


From rkennke at redhat.com  Thu Nov  7 15:37:08 2019
From: rkennke at redhat.com (Roman Kennke)
Date: Thu, 7 Nov 2019 16:37:08 +0100
Subject: [aarch64-port-dev ] RFR(XS) 8233337: Shenandoah: Cleanup
 AArch64 SBSA::load_reference_barrier_not_null()
In-Reply-To: <5de422c4-ea76-81b0-8413-d3e81f60a09d@redhat.com>
References: <5de422c4-ea76-81b0-8413-d3e81f60a09d@redhat.com>
Message-ID: <ebb302ba-786a-3b89-e64a-188cfa68405c@redhat.com>

Looks good, thanks!

Roman


> Please review this cleanup patch suggested by Andrew Haley. Please see
> [1] for details
> 
> 
> Bug: https://bugs.openjdk.java.net/browse/JDK-8233337
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233337/webrev.00/
> 
> Test:
>   hotspot_gc_shenandoah (fastdebug and release)
>   on AArch64 Linux
> 
> Thanks,
> 
> -Zhengyu
> 
> 
> 
> 
> [1]
> https://mail.openjdk.java.net/pipermail/shenandoah-dev/2019-October/010976.html
> 
> 


From zgu at redhat.com  Thu Nov  7 19:01:42 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Thu, 7 Nov 2019 14:01:42 -0500
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <d9ad3216-1645-b88d-1cc6-d59f4caab2a7@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
 <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
 <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>
 <d9ad3216-1645-b88d-1cc6-d59f4caab2a7@redhat.com>
Message-ID: <9f4e51fd-dd2c-1f74-e695-51923c75a52a@redhat.com>

> 
> Filed: https://bugs.openjdk.java.net/browse/JDK-8233401

Rebased on top of JDK-8233401

Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.02/index.html

Thanks,

-Zhengyu


> 
> Matter of fact, I would like to hold off this code review, till refactor
> is done.
> 
> Thanks,
> 
> -Zhengyu
> 
>>
>> *) shenandoahBarrierSetAssembler_x86.cpp, I believe it would be more 
>> straightforward to save
>> branching on local variable "need_load_reference_barrier" by spelling 
>> out the "disabled" path
>> directly (in fact, I think you are almost there in 
>> shenandoahBarrierSetC1.cpp!):
>>
>>    if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>      BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
>>      return;
>>    }
>>
>>    ... code that assumes need_load_reference_barrier = true follows ...
>>
>>    Register result_dst = dst;
>>    bool use_tmp1_for_dst = false;
>>
>> *) shenandoahBarrierSetC1.cpp: local variable
>> "need_load_reference_barrier" is not needed, there is
>> only a single use
>>
>> *) shenandoahBarrierSetC2.cpp: this block should go all the way up:
>>
>>   557   if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>   558     return load;
>>   559   }
>>
>> *) shenandoahBarrierSet.cpp: this is just "return
>> is_reference_type(type)". Saves some inversions.
>>
>>    78   if (!is_reference_type(type)) return false;
>>    79   return true;
>>
>> *) shenandoahBarrierSet.cpp: should be "Should be subset of LRB":
>>
>>    83   assert(need_load_reference_barrier(decorators, type), "Why ask?");
>>
>> *) shenandoahBarrierSet.cpp: seems like this assert is subsumed by the
>> previous one?
>>
>>    84   assert(is_reference_type(type), "Why we here?");
>>
>>
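
Two of the review comments quoted above are about control-flow shape: return
the predicate directly instead of inverting it, and handle the "barrier
disabled" case with an early return. A simplified Java model of both is
sketched here. The enum, the method names, and the returned strings are
hypothetical stand-ins; the real HotSpot predicate also takes decorators,
which this model ignores.

```java
// Hypothetical stand-in for the C++ code under review.
final class BarrierDecision {
    enum BasicType { T_INT, T_LONG, T_OBJECT, T_ARRAY }

    static boolean isReferenceType(BasicType type) {
        return type == BasicType.T_OBJECT || type == BasicType.T_ARRAY;
    }

    // Shape criticised in the review: invert, return false, else return true.
    static boolean needLoadReferenceBarrierVerbose(BasicType type) {
        if (!isReferenceType(type)) return false;
        return true;
    }

    // Suggested shape: return the predicate itself.
    static boolean needLoadReferenceBarrier(BasicType type) {
        return isReferenceType(type);
    }

    // Early return for the "disabled" path; the rest assumes a barrier is needed.
    static String loadAt(BasicType type) {
        if (!needLoadReferenceBarrier(type)) {
            return "plain load";
        }
        return "load + barrier";
    }

    public static void main(String[] args) {
        for (BasicType type : BasicType.values()) {
            System.out.println(type + ": " + loadAt(type));
        }
    }
}
```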

From rkennke at redhat.com  Thu Nov  7 19:41:00 2019
From: rkennke at redhat.com (Roman Kennke)
Date: Thu, 7 Nov 2019 20:41:00 +0100
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <9f4e51fd-dd2c-1f74-e695-51923c75a52a@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
 <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
 <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>
 <d9ad3216-1645-b88d-1cc6-d59f4caab2a7@redhat.com>
 <9f4e51fd-dd2c-1f74-e695-51923c75a52a@redhat.com>
Message-ID: <4b24e6ac-2109-4a7d-83aa-c2427343e22b@redhat.com>

That looks good to me.

Thanks,
Roman

>>
>> Filed: https://bugs.openjdk.java.net/browse/JDK-8233401
> 
> Rebased on top of JDK-8233401
> 
> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.02/index.html
> 
> Thanks,
> 
> -Zhengyu
> 
> 
>>
>> Matter of fact, I would like to hold off this code review, till
>> refactor is done.
>>
>> Thanks,
>>
>> -Zhengyu
>>
>>>
>>> *) shenandoahBarrierSetAssembler_x86.cpp, I believe it would be more
>>> straightforward to save
>>> branching on local variable "need_load_reference_barrier" by spelling
>>> out the "disabled" path
>>> directly (in fact, I think you are almost there in
>>> shenandoahBarrierSetC1.cpp!):
>>>
>>>    if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>>      BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
>>>      return;
>>>    }
>>>
>>>    ... code that assumes need_load_reference_barrier = true follows ...
>>>
>>>    Register result_dst = dst;
>>>    bool use_tmp1_for_dst = false;
>>>
>>> *) shenandoahBarrierSetC1.cpp: local variable
>>> "need_load_reference_barrier" is not needed, there is
>>> only a single use
>>>
>>> *) shenandoahBarrierSetC2.cpp: this block should go all the way up:
>>>
>>>   557   if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>>   558     return load;
>>>   559   }
>>>
>>> *) shenandoahBarrierSet.cpp: this is just "return
>>> is_reference_type(type)". Saves some inversions.
>>>
>>>    78   if (!is_reference_type(type)) return false;
>>>    79   return true;
>>>
>>> *) shenandoahBarrierSet.cpp: should be "Should be subset of LRB":
>>>
>>>    83   assert(need_load_reference_barrier(decorators, type), "Why ask?");
>>>
>>> *) shenandoahBarrierSet.cpp: seems like this assert is subsumed by
>>> the previous one?
>>>
>>>    84   assert(is_reference_type(type), "Why we here?");
>>>
>>>


From zgu at redhat.com  Thu Nov  7 19:42:27 2019
From: zgu at redhat.com (Zhengyu Gu)
Date: Thu, 7 Nov 2019 14:42:27 -0500
Subject: [aarch64-port-dev ] RFR 8233339: Shenandoah: Centralize load
 barrier decisions into ShenandoahBarrierSet
In-Reply-To: <4b24e6ac-2109-4a7d-83aa-c2427343e22b@redhat.com>
References: <6ef89df6-84db-0ffe-d1fc-7ffde7e622bf@redhat.com>
 <4ed90469-8689-b49d-69f1-98f644e9edd0@redhat.com>
 <a11a7164-de20-79d0-72d7-8bfc2fe3bfd2@redhat.com>
 <0fb9cd70-0a89-8c14-7469-55205c4c3808@redhat.com>
 <d9ad3216-1645-b88d-1cc6-d59f4caab2a7@redhat.com>
 <9f4e51fd-dd2c-1f74-e695-51923c75a52a@redhat.com>
 <4b24e6ac-2109-4a7d-83aa-c2427343e22b@redhat.com>
Message-ID: <31415213-3464-619a-0741-ca14f7b9cbcf@redhat.com>

Thanks for the review, Roman

-Zhengyu

On 11/7/19 2:41 PM, Roman Kennke wrote:
> That looks good to me.
> 
> Thanks,
> Roman
> 
>>>
>>> Filed: https://bugs.openjdk.java.net/browse/JDK-8233401
>>
>> Rebased on top of JDK-8233401
>>
>> Webrev: http://cr.openjdk.java.net/~zgu/JDK-8233339/webrev.02/index.html
>>
>> Thanks,
>>
>> -Zhengyu
>>
>>
>>>
>>> Matter of fact, I would like to hold off this code review, till
>>> refactor is done.
>>>
>>> Thanks,
>>>
>>> -Zhengyu
>>>
>>>>
>>>> *) shenandoahBarrierSetAssembler_x86.cpp, I believe it would be more
>>>> straightforward to save
>>>> branching on local variable "need_load_reference_barrier" by spelling
>>>> out the "disabled" path
>>>> directly (in fact, I think you are almost there in
>>>> shenandoahBarrierSetC1.cpp!):
>>>>
>>>>    if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>>>      BarrierSetAssembler::load_at(masm, decorators, type, dst, src, tmp1, tmp_thread);
>>>>      return;
>>>>    }
>>>>
>>>>    ... code that assumes need_load_reference_barrier = true follows ...
>>>>
>>>>    Register result_dst = dst;
>>>>    bool use_tmp1_for_dst = false;
>>>>
>>>> *) shenandoahBarrierSetC1.cpp: local variable
>>>> "need_load_reference_barrier" is not needed, there is
>>>> only a single use
>>>>
>>>> *) shenandoahBarrierSetC2.cpp: this block should go all the way up:
>>>>
>>>>   557   if (!ShenandoahBarrierSet::need_load_reference_barrier(decorators, type)) {
>>>>   558     return load;
>>>>   559   }
>>>>
>>>> *) shenandoahBarrierSet.cpp: this is just "return
>>>> is_reference_type(type)". Saves some inversions.
>>>>
>>>>    78   if (!is_reference_type(type)) return false;
>>>>    79   return true;
>>>>
>>>> *) shenandoahBarrierSet.cpp: should be "Should be subset of LRB":
>>>>
>>>>    83   assert(need_load_reference_barrier(decorators, type), "Why ask?");
>>>>
>>>> *) shenandoahBarrierSet.cpp: seems like this assert is subsumed by
>>>> the previous one?
>>>>
>>>>    84   assert(is_reference_type(type), "Why we here?");
>>>>
>>>>
> 

From felix.yang at huawei.com  Fri Nov  8 08:30:00 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Fri, 8 Nov 2019 08:30:00 +0000
Subject: [aarch64-port-dev ] 8233839: aarch64: missing memory barrier in
 NewObjectArrayStub and NewTypeArrayStub
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED60280EF@dggeml527-mbx.china.huawei.com>

Hi,

I witnessed random failures of one jcstress test on my 128-core aarch64 server: "org.openjdk.jcstress.tests.defaultValues.arrays.small.plain.StringTest"
Bug: https://bugs.openjdk.java.net/browse/JDK-8233839

I used the latest aarch64 jdk8u release build.  Please refer to the bug report for details and the analysis.
  I checked the assembler code emitted by LIR_Assembler::emit_alloc_array:
For the fast path, the StoreStore memory barrier is there.  But it's not the case for the slow path.

  Patch adding the missing barrier for jdk14:

diff -r ad157fab6bf5 src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp
--- a/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Thu Nov 07 16:26:57 2019 -0800
+++ b/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Fri Nov 08 16:10:08 2019 +0800
@@ -840,6 +840,7 @@
           __ sub(arr_size, arr_size, t1);  // body length
           __ add(t1, t1, obj);       // body start
           __ initialize_body(t1, arr_size, 0, t2);
+          __ membar(Assembler::StoreStore);
           __ verify_oop(obj);

           __ ret(lr);

  JDK builds OK and passed tier1 test.
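
For readers less familiar with the ordering involved, here is a Java-level
analogue of what the stub has to guarantee: the zeroing of the array body
must become visible before the reference to the array does. This is only a
sketch with hypothetical class, field, and method names. The explicit
VarHandle.storeStoreFence() is there to mark where the barrier belongs; on a
correct JVM the allocation path already provides this ordering, so Java code
does not need the fence to see default values.

```java
import java.lang.invoke.VarHandle;

// Hypothetical model of the "plain" jcstress default-values test shape.
final class ArrayPublication {
    static String[] shared; // plain (non-volatile) field, written racily

    // Initialize the body, order the stores, then publish the reference.
    static void publish(int length) {
        String[] fresh = new String[length]; // body is zeroed by the allocation
        VarHandle.storeStoreFence();         // analogue of membar(StoreStore)
        shared = fresh;                      // reference becomes visible last
    }

    // A reader must never observe anything but null elements.
    static boolean readerSeesOnlyDefaults() {
        String[] seen = shared;
        if (seen == null) return true;
        for (String element : seen) {
            if (element != null) return false;
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        final boolean[] violation = {false};
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                if (!readerSeesOnlyDefaults()) violation[0] = true;
            }
        });
        reader.start();
        for (int i = 0; i < 1_000_000; i++) publish(4);
        reader.join(); // join makes the reader's write to violation[0] visible
        System.out.println(violation[0] ? "saw non-default element" : "only defaults observed");
    }
}
```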

Thanks,
Felix

From adinn at redhat.com  Fri Nov  8 09:04:08 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Fri, 8 Nov 2019 09:04:08 +0000
Subject: [aarch64-port-dev ] 8233839: aarch64: missing memory barrier in
 NewObjectArrayStub and NewTypeArrayStub
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED60280EF@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60280EF@dggeml527-mbx.china.huawei.com>
Message-ID: <382f7448-c392-5d98-ecf5-ac22e86f4888@redhat.com>

On 08/11/2019 08:30, Yangfei (Felix) wrote:
> I witnessed random fail of one jcstress test on my 128-core aarch64 server: "org.openjdk.jcstress.tests.defaultValues.arrays.small.plain.StringTest"
> Bug: https://bugs.openjdk.java.net/browse/JDK-8233839
> 
> I used the latest aarch64 jdk8u release build.  Please refer to the bugzilla for details and the analysis.
>   I checked the assembler code emitted by LIR_Assembler::emit_alloc_array:
> For the fast path, the StoreStore memory barrier is there.  But it's not the case for the slow path.
> 
>   Patch adding the missing barrier for 14:
> 
> diff -r ad157fab6bf5 src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp
> --- a/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Thu Nov 07 16:26:57 2019 -0800
> +++ b/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Fri Nov 08 16:10:08 2019 +0800
> @@ -840,6 +840,7 @@
>            __ sub(arr_size, arr_size, t1);  // body length
>            __ add(t1, t1, obj);       // body start
>            __ initialize_body(t1, arr_size, 0, t2);
> +          __ membar(Assembler::StoreStore);
>            __ verify_oop(obj);
> 
>            __ ret(lr);
> 
>   JDK builds OK and passed tier1 test.

Very nice detective work finding that one!

The jdk14 patch looks good. Also the same patch for jdk11 and the
variant for jdk8 are good.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From ci_notify at linaro.org  Sun Nov 10 02:42:42 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sun, 10 Nov 2019 02:42:42 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 11u on AArch64
Message-ID: <961463389.104.1573353762559.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/summary/2019/313/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/27 pass: 5,737; fail: 5
Build 1: aarch64/2019/jul/02 pass: 5,737; fail: 5
Build 2: aarch64/2019/aug/03 pass: 5,746; fail: 4
Build 3: aarch64/2019/aug/10 pass: 5,747; fail: 4
Build 4: aarch64/2019/aug/15 pass: 5,753; fail: 4
Build 5: aarch64/2019/aug/22 pass: 5,755; fail: 4
Build 6: aarch64/2019/sep/04 pass: 5,764; fail: 2
Build 7: aarch64/2019/sep/05 pass: 5,764; fail: 2
Build 8: aarch64/2019/sep/10 pass: 5,764; fail: 2
Build 9: aarch64/2019/sep/17 pass: 5,763; fail: 3
Build 10: aarch64/2019/sep/21 pass: 5,764; fail: 2
Build 11: aarch64/2019/oct/04 pass: 5,764; fail: 2
Build 12: aarch64/2019/oct/17 pass: 5,764; fail: 2
Build 13: aarch64/2019/oct/31 pass: 5,784; fail: 1
Build 14: aarch64/2019/nov/09 pass: 5,773; fail: 3

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/27 pass: 8,401; fail: 512; error: 22
Build 1: aarch64/2019/jul/02 pass: 8,407; fail: 498; error: 31
Build 2: aarch64/2019/aug/03 pass: 8,429; fail: 509; error: 18
Build 3: aarch64/2019/aug/10 pass: 8,450; fail: 485; error: 16
Build 4: aarch64/2019/aug/15 pass: 8,443; fail: 496; error: 13
Build 5: aarch64/2019/aug/22 pass: 8,446; fail: 494; error: 15
Build 6: aarch64/2019/sep/04 pass: 8,483; fail: 465; error: 10
Build 7: aarch64/2019/sep/05 pass: 8,465; fail: 479; error: 14
Build 8: aarch64/2019/sep/10 pass: 8,444; fail: 500; error: 14
Build 9: aarch64/2019/sep/17 pass: 8,462; fail: 482; error: 12
Build 10: aarch64/2019/sep/21 pass: 8,467; fail: 478; error: 13
Build 11: aarch64/2019/oct/04 pass: 8,444; fail: 498; error: 16
Build 12: aarch64/2019/oct/17 pass: 8,452; fail: 493; error: 16
Build 13: aarch64/2019/oct/31 pass: 8,468; fail: 490; error: 14
Build 14: aarch64/2019/nov/09 pass: 8,487; fail: 470; error: 16

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jun/27 pass: 3,908
Build 1: aarch64/2019/jul/02 pass: 3,908
Build 2: aarch64/2019/aug/03 pass: 3,908
Build 3: aarch64/2019/aug/10 pass: 3,909
Build 4: aarch64/2019/aug/15 pass: 3,909
Build 5: aarch64/2019/aug/22 pass: 3,909
Build 6: aarch64/2019/sep/04 pass: 3,910
Build 7: aarch64/2019/sep/05 pass: 3,910
Build 8: aarch64/2019/sep/10 pass: 3,910
Build 9: aarch64/2019/sep/17 pass: 3,910
Build 10: aarch64/2019/sep/21 pass: 3,910
Build 11: aarch64/2019/oct/04 pass: 3,910
Build 12: aarch64/2019/oct/17 pass: 3,910
Build 13: aarch64/2019/oct/31 pass: 3,910
Build 14: aarch64/2019/nov/09 pass: 3,910

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.71x
Relative performance: Server critical-jOPS (nc): 7.99x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 204.57

Server 204.57 / Server 2014-04-01 (71.00): 2.88x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-06-28 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/178/results/
2019-07-03 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/183/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/215/results/
2019-08-11 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/222/results/
2019-08-16 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/227/results/
2019-08-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/234/results/
2019-09-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/247/results/
2019-09-07 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/248/results/
2019-09-11 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/253/results/
2019-09-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/260/results/
2019-09-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/264/results/
2019-10-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/277/results/
2019-10-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/290/results/
2019-11-01 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/304/results/
2019-11-10 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/313/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/

From felix.yang at huawei.com  Mon Nov 11 01:41:22 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Mon, 11 Nov 2019 01:41:22 +0000
Subject: [aarch64-port-dev ] 8233839: aarch64: missing memory barrier in
 NewObjectArrayStub and NewTypeArrayStub
In-Reply-To: <382f7448-c392-5d98-ecf5-ac22e86f4888@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60280EF@dggeml527-mbx.china.huawei.com>
 <382f7448-c392-5d98-ecf5-ac22e86f4888@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED60327A2@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Dinn [mailto:adinn at redhat.com]
> Sent: Friday, November 8, 2019 5:04 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> hotspot-runtime-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
> Subject: Re: 8233839: aarch64: missing memory barrier in NewObjectArrayStub
> and NewTypeArrayStub
> 
> On 08/11/2019 08:30, Yangfei (Felix) wrote:
> > I witnessed random fail of one jcstress test on my 128-core aarch64 server:
> "org.openjdk.jcstress.tests.defaultValues.arrays.small.plain.StringTest"
> > Bug: https://bugs.openjdk.java.net/browse/JDK-8233839
> >
> > I used the latest aarch64 jdk8u release build.  Please refer to the bugzilla for
> details and the analysis.
> >   I checked the assembler code emitted by
> LIR_Assembler::emit_alloc_array:
> > For the fast path, the StoreStore memory barrier is there.  But it's not the
> case for the slow path.
> >
> >   Patch adding the missing barrier for 14:
> >
> > diff -r ad157fab6bf5 src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp
> > --- a/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Thu Nov 07
> 16:26:57 2019 -0800
> > +++ b/src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp   Fri Nov 08
> 16:10:08 2019 +0800
> > @@ -840,6 +840,7 @@
> >            __ sub(arr_size, arr_size, t1);  // body length
> >            __ add(t1, t1, obj);       // body start
> >            __ initialize_body(t1, arr_size, 0, t2);
> > +          __ membar(Assembler::StoreStore);
> >            __ verify_oop(obj);
> >
> >            __ ret(lr);
> >
> >   JDK builds OK and passed tier1 test.
> Very nice detective work finding that one!
> 
> The jdk14 patch looks good. Also the same patch for jdk11 and the variant for
> jdk8 are good.
> 

Thanks for reviewing this.  
The jdk14 patch has been pushed as: https://hg.openjdk.java.net/jdk/jdk/rev/90cf1d4e712f  
Will push to aarch64 jdk8u after the jdk11u-fix-request is approved.  

Felix

From aph at redhat.com  Mon Nov 11 10:07:10 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 10:07:10 +0000
Subject: [aarch64-port-dev ] 8233839: aarch64: missing memory barrier in
 NewObjectArrayStub and NewTypeArrayStub
In-Reply-To: <382f7448-c392-5d98-ecf5-ac22e86f4888@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60280EF@dggeml527-mbx.china.huawei.com>
 <382f7448-c392-5d98-ecf5-ac22e86f4888@redhat.com>
Message-ID: <9270b589-736e-fcce-064b-dcc6b6570406@redhat.com>

On 11/8/19 9:04 AM, Andrew Dinn wrote:
> On 08/11/2019 08:30, Yangfei (Felix) wrote:
>> I witnessed random fail of one jcstress test on my 128-core aarch64 server: "org.openjdk.jcstress.tests.defaultValues.arrays.small.plain.StringTest"
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8233839
>>   JDK builds OK and passed tier1 test.
> Very nice detective work finding that one!
> 
> The jdk14 patch looks good. Also the same patch for jdk11 and the
> variant for jdk8 are good.

Looks like ARM32 does not have the same bug. PowerPC doesn't even
attempt a fast path in this case.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From adinn at redhat.com  Mon Nov 11 11:04:25 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Mon, 11 Nov 2019 11:04:25 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
Message-ID: <421c54f0-43c0-a704-03c2-0d13c5dbeade@redhat.com>

Hi Felix,

On 05/11/2019 06:20, Yangfei (Felix) wrote:
> Please review this small improvements of aarch64 atomic operations.
> This eliminates the use of full memory barriers.
> Passed tier1-3 testing.
The patch looks ok to me.

regards,


Andrew Dinn
-----------


From aph at redhat.com  Mon Nov 11 11:17:01 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 11:17:01 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
Message-ID: <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>

On 11/5/19 6:20 AM, Yangfei (Felix) wrote:
> Please review this small improvements of aarch64 atomic operations.
> This eliminates the use of full memory barriers.
> Passed tier1-3 testing.

No, rejected.

Patch also must go to hotspot-dev.

Are you sure this is safe? The HotSpot internal barriers are specified
as being full two-way barriers, which these are not. Tier1 testing
really isn't going to do it. Now, you might argue that none of the
uses in HotSpot actually require anything stronger than acq/rel, but
good luck proving that.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Mon Nov 11 12:01:24 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Mon, 11 Nov 2019 12:01:24 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Monday, November 11, 2019 7:17 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> On 11/5/19 6:20 AM, Yangfei (Felix) wrote:
> > Please review this small improvements of aarch64 atomic operations.
> > This eliminates the use of full memory barriers.
> > Passed tier1-3 testing.
> 
> No, rejected.
> 
> Patch also must go to hotspot-dev.

CCing to hotspot-dev.  

> Are you sure this is safe? The HotSpot internal barriers are specified as being
> full two-way barriers, which these are not. Tier1 testing really isn't going to do
> it. Now, you might argue that none of the uses in HotSpot actually require
> anything stronger than acq/rel, but good luck proving that.

I was also curious why a full memory barrier is used here.
For add_and_fetch, I was thinking that there is no functional difference between the following two code snippets.
It's interesting to know that this may make a difference.  Can you elaborate on that, please?

1) without patch
.L2:
        ldxr    x2, [x1]
        add     x2, x2, x0
        stlxr   w3, x2, [x1]
        cbnz    w3, .L2
        dmb     ish
        mov     x0, x2
        ret
-----------------------------------------------
2) with patch
.L2:
        ldaxr   x2, [x1]
        add     x2, x2, x0
        stlxr   w3, x2, [x1]
        cbnz    w3, .L2
        mov     x0, x2
        ret

From felix.yang at huawei.com  Mon Nov 11 12:44:03 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Mon, 11 Nov 2019 12:44:03 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com> 
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED603699A@dggeml527-mbx.china.huawei.com>


> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Monday, November 11, 2019 8:01 PM
> To: 'Andrew Haley' <aph at redhat.com>; aarch64-port-dev at openjdk.java.net
> Cc: 'hotspot-dev at openjdk.java.net' <hotspot-dev at openjdk.java.net>
> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> > -----Original Message-----
> > From: Andrew Haley [mailto:aph at redhat.com]
> > Sent: Monday, November 11, 2019 7:17 PM
> > To: Yangfei (Felix) <felix.yang at huawei.com>;
> > aarch64-port-dev at openjdk.java.net
> > Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of
> > atomic operations
> >
> > On 11/5/19 6:20 AM, Yangfei (Felix) wrote:
> > > Please review this small improvements of aarch64 atomic operations.
> > > This eliminates the use of full memory barriers.
> > > Passed tier1-3 testing.
> >
> > No, rejected.
> >
> > Patch also must go to hotspot-dev.
> 
> CCing to hotspot-dev.
> 
> > Are you sure this is safe? The HotSpot internal barriers are specified
> > as being full two-way barriers, which these are not. Tier1 testing
> > really isn't going to do it. Now, you might argue that none of the
> > uses in HotSpot actually require anything stronger than acq/rel, but good luck
> proving that.
> 
> I was also curious about the reason why full memory barrier is used here.
> For add_and_fetch, I was thinking that there is no difference in functionality for
> the following two code snippet.
> It's interesting to know that this may make a difference.  Can you elaborate
> more on that please?
> 
> 1) without patch
> .L2:
>         ldxr    x2, [x1]
>         add     x2, x2, x0
>         stlxr   w3, x2, [x1]
>         cbnz    w3, .L2
>         dmb     ish
>         mov     x0, x2
>         ret
> -----------------------------------------------
> 2) with patch
> .L2:
>         ldaxr   x2, [x1]
>         add     x2, x2, x0
>         stlxr   w3, x2, [x1]
>         cbnz    w3, .L2
>         mov     x0, x2
>         ret

It also looks like the aarch64 port from Oracle does the same thing:
http://hg.openjdk.java.net/jdk-updates/jdk11u-dev/file/f8b2e95a1d41/src/hotspot/os_cpu/linux_arm/atomic_linux_arm.hpp

template<size_t byte_size>
struct Atomic::PlatformAdd
  : Atomic::AddAndFetch<Atomic::PlatformAdd<byte_size> >
{
  template<typename I, typename D>
  D add_and_fetch(I add_value, D volatile* dest, atomic_memory_order order) const;
};

template<>
template<typename I, typename D>
inline D Atomic::PlatformAdd<4>::add_and_fetch(I add_value, D volatile* dest,
                                               atomic_memory_order order) const {
  STATIC_ASSERT(4 == sizeof(I));
  STATIC_ASSERT(4 == sizeof(D));
#ifdef AARCH64
  D val;
  int tmp;
  __asm__ volatile(
    "1:\n\t"
    " ldaxr %w[val], [%[dest]]\n\t"
    " add %w[val], %w[val], %w[add_val]\n\t"
    " stlxr %w[tmp], %w[val], [%[dest]]\n\t"
    " cbnz %w[tmp], 1b\n\t"
    : [val] "=&r" (val), [tmp] "=&r" (tmp)
    : [add_val] "r" (add_value), [dest] "r" (dest)
    : "memory");
  return val;
#else
  return add_using_helper<int32_t>(os::atomic_add_func, add_value, dest);
#endif
}

From aph at redhat.com  Mon Nov 11 15:05:10 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 15:05:10 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
Message-ID: <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>

On 11/11/19 12:01 PM, Yangfei (Felix) wrote:

> I was also curious about the reason why full memory barrier is used
> here.  For add_and_fetch, I was thinking that there is no difference
> in functionality for the following two code snippet.  It's
> interesting to know that this may make a difference.  Can you
> elaborate more on that please?

For add_and_fetch the default atomic_memory_order is
memory_order_conservative. I'm not sure exactly what that means, but
it is stronger than SEQ_CST; it's been described as a "full barrier".

__ATOMIC_ACQ_REL for this operation translates approximately to

load
LoadLoad|LoadStore
add
StoreStore|LoadStore
store

In other words, there is nothing to prevent subsequent stores being
reordered with this store. Therefore your change does not meet the
specification for memory_order_conservative.

You could, if you wanted, only make this change for weaker memory
orderings, but AFAIK they are not used.

You could argue that AArch64 won't do such a reordering, but I'd reply
that even if AArch64 can't do such a reordering, GCC sure can.

And finally, is there any operation in HotSpot that actually requires
such strong memory semantics? Probably not, but no-one has ever been
brave enough to say so.
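
For readers following the thread, the two shapes under discussion can be
sketched with the GCC builtins that appear in the patch. This is only an
illustrative sketch, not the HotSpot source: the function names and the
full_mem_barrier() helper are hypothetical, and the helper assumes that
FULL_MEM_BARRIER amounts to a full fence such as __sync_synchronize().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HotSpot's FULL_MEM_BARRIER (assumed full fence).
static inline void full_mem_barrier() { __sync_synchronize(); }

// Shape without the patch: release read-modify-write, then a full fence.
// The trailing fence keeps later loads and stores after the update.
template <typename T>
T add_and_fetch_with_fence(T volatile* dest, T add_value) {
  assert(dest != nullptr);
  T res = __atomic_add_fetch(dest, add_value, __ATOMIC_RELEASE);
  full_mem_barrier();
  return res;
}

// Shape with the patch: acquire/release only. Nothing here orders a later
// plain store after the store half of the update, which is the gap
// described above.
template <typename T>
T add_and_fetch_acq_rel(T volatile* dest, T add_value) {
  assert(dest != nullptr);
  return __atomic_add_fetch(dest, add_value, __ATOMIC_ACQ_REL);
}
```

Both functions return the same value in a single-threaded run; the
difference is only in which reorderings of surrounding accesses remain
permitted, so an ordinary functional test cannot tell them apart.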

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Mon Nov 11 15:06:23 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 15:06:23 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED603699A@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED603699A@dggeml527-mbx.china.huawei.com>
Message-ID: <98122035-8872-c77d-5309-b68f07dcaddb@redhat.com>

On 11/11/19 12:44 PM, Yangfei (Felix) wrote:
> And looks like the aarch64 port from Oracle also did the same thing: 
> http://hg.openjdk.java.net/jdk-updates/jdk11u-dev/file/f8b2e95a1d41/src/hotspot/os_cpu/linux_arm/atomic_linux_arm.hpp

That's not the same thing at all, it's fully SEQ_CST. Which is almost
certainly enough, but still doesn't meet spec.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Mon Nov 11 16:36:38 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 16:36:38 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
Message-ID: <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>

On 11/11/19 3:05 PM, Andrew Haley wrote:
> And finally, is there any operation in HotSpot that actually requires
> such strong memory semantics? Probably not, but no-one has ever been
> brave enough to say so.

Here's a place where it really does matter.

void ShenandoahPacer::restart_with(size_t non_taxable_bytes, double tax_rate) {
  size_t initial = (size_t)(non_taxable_bytes * tax_rate) >> LogHeapWordSize;
  STATIC_ASSERT(sizeof(size_t) <= sizeof(intptr_t));
  Atomic::xchg((intptr_t)initial, &_budget);
  Atomic::store(tax_rate, &_tax_rate);
  Atomic::inc(&_epoch);

Note: the xchg is conservative, the store is plain. The xchg value should be
visible before the store.
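
A rough way to picture this ordering requirement in portable C++ is shown
below. It is a sketch under stated assumptions, not the Shenandoah code:
PacerSketch and its members are hypothetical names, the conservative xchg
is modelled as an exchange followed by a seq_cst fence, and the plain store
is modelled as a relaxed store.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of the three updates in restart_with().
struct PacerSketch {
  std::atomic<intptr_t> budget{0};
  std::atomic<double>   tax_rate{0.0};
  std::atomic<intptr_t> epoch{0};

  void restart_with(intptr_t initial, double rate) {
    assert(initial >= 0);
    // Models the conservative xchg: the update plus a trailing full fence.
    budget.exchange(initial, std::memory_order_acq_rel);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // Models the plain store. The fence above is what keeps it from
    // becoming visible ahead of the new budget.
    tax_rate.store(rate, std::memory_order_relaxed);
    // Atomic::inc is assumed to be conservative as well.
    epoch.fetch_add(1, std::memory_order_seq_cst);
  }
};
```

In the C++ memory model, a reader that sees the new tax rate through an
acquire load is then guaranteed to see the new budget too. Remove the fence
and that guarantee disappears, because a relaxed store on its own does not
publish anything.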

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From erik.osterlund at oracle.com  Mon Nov 11 17:11:28 2019
From: erik.osterlund at oracle.com (Erik Österlund)
Date: Mon, 11 Nov 2019 18:11:28 +0100
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
	operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
Message-ID: <CAE74F32-A734-48B0-B284-A1BAE6258019@oracle.com>

Hi Felix,

Would you mind pasting a link to the proposed change? I cannot determine its validity otherwise.

Thanks,
/Erik

> On 11 Nov 2019, at 13:01, Yangfei (Felix) <felix.yang at huawei.com> wrote:
> 
>> 
>> -----Original Message-----
>> From: Andrew Haley [mailto:aph at redhat.com]
>> Sent: Monday, November 11, 2019 7:17 PM
>> To: Yangfei (Felix) <felix.yang at huawei.com>;
>> aarch64-port-dev at openjdk.java.net
>> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
>> operations
>> 
>>> On 11/5/19 6:20 AM, Yangfei (Felix) wrote:
>>> Please review this small improvements of aarch64 atomic operations.
>>> This eliminates the use of full memory barriers.
>>> Passed tier1-3 testing.
>> 
>> No, rejected.
>> 
>> Patch also must go to hotspot-dev.
> 
> CCing to hotspot-dev.  
> 
>> Are you sure this is safe? The HotSpot internal barriers are specified as being
>> full two-way barriers, which these are not. Tier1 testing really isn't going to do
>> it. Now, you might argue that none of the uses in HotSpot actually require
>> anything stronger than acq/rel, but good luck proving that.
> 
> I was also curious about the reason why full memory barrier is used here.  
> For add_and_fetch, I was thinking that there is no difference in functionality for the following two code snippet.  
> It's interesting to know that this may make a difference.  Can you elaborate more on that please?  
> 
> 1) without patch
> .L2:
>        ldxr    x2, [x1]
>        add     x2, x2, x0
>        stlxr   w3, x2, [x1]
>        cbnz    w3, .L2
>        dmb     ish
>        mov     x0, x2
>        ret
> -----------------------------------------------
> 2) with patch
> .L2:
>        ldaxr   x2, [x1]
>        add     x2, x2, x0
>        stlxr   w3, x2, [x1]
>        cbnz    w3, .L2
>        mov     x0, x2
>        ret


From aph at redhat.com  Mon Nov 11 17:53:06 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 11 Nov 2019 17:53:06 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <CAE74F32-A734-48B0-B284-A1BAE6258019@oracle.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <CAE74F32-A734-48B0-B284-A1BAE6258019@oracle.com>
Message-ID: <4455d529-0f43-e6ba-d3d8-2639f4d79802@redhat.com>

On 11/11/19 5:11 PM, Erik Österlund wrote:
> Hi Felix,
> 
> Would you mind pasting a link to the proposed change? I cannot determine its validity otherwise.


Patch:

diff -r 2700c409ff10 src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
--- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Sun Nov 03 18:02:29 2019 -0500
+++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 06 14:13:00 2019 +0800
@@ -40,8 +40,7 @@
 {
   template<typename I, typename D>
   D add_and_fetch(I add_value, D volatile* dest, atomic_memory_order order) const {
-    D res = __atomic_add_fetch(dest, add_value, __ATOMIC_RELEASE);
-    FULL_MEM_BARRIER;
+    D res = __atomic_add_fetch(dest, add_value, __ATOMIC_ACQ_REL);
     return res;
   }
 };
@@ -52,8 +51,7 @@
                                                      T volatile* dest,
                                                      atomic_memory_order order) const {
   STATIC_ASSERT(byte_size == sizeof(T));
-  T res = __sync_lock_test_and_set(dest, exchange_value);
-  FULL_MEM_BARRIER;
+  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_ACQ_REL);
   return res;
 }


-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From ci_notify at linaro.org  Tue Nov 12 02:20:29 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Tue, 12 Nov 2019 02:20:29 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <789911631.346.1573525230014.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/315/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/07 pass: 5,750
Build 1: aarch64/2019/oct/09 pass: 5,747; fail: 1
Build 2: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 3: aarch64/2019/oct/14 pass: 5,753
Build 4: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 5: aarch64/2019/oct/18 pass: 5,760
Build 6: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 7: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 8: aarch64/2019/oct/28 pass: 5,766
Build 9: aarch64/2019/oct/30 pass: 5,768
Build 10: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 11: aarch64/2019/nov/04 pass: 5,769
Build 12: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 13: aarch64/2019/nov/08 pass: 5,761
Build 14: aarch64/2019/nov/11 pass: 5,762

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/07 pass: 8,683; fail: 517; error: 18
Build 1: aarch64/2019/oct/09 pass: 8,692; fail: 507; error: 21
Build 2: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 3: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 4: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 5: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 6: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 7: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 8: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 9: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 10: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 11: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 12: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 13: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 14: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/07 pass: 3,979
Build 1: aarch64/2019/oct/09 pass: 3,979
Build 2: aarch64/2019/oct/11 pass: 3,979
Build 3: aarch64/2019/oct/14 pass: 3,979
Build 4: aarch64/2019/oct/16 pass: 3,979
Build 5: aarch64/2019/oct/18 pass: 3,979
Build 6: aarch64/2019/oct/21 pass: 3,979
Build 7: aarch64/2019/oct/23 pass: 3,980
Build 8: aarch64/2019/oct/28 pass: 3,980
Build 9: aarch64/2019/oct/30 pass: 3,980
Build 10: aarch64/2019/nov/01 pass: 3,980
Build 11: aarch64/2019/nov/04 pass: 3,980
Build 12: aarch64/2019/nov/06 pass: 3,980
Build 13: aarch64/2019/nov/08 pass: 3,980
Build 14: aarch64/2019/nov/11 pass: 3,980

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.80x
Relative performance: Server critical-jOPS (nc): 9.37x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-09-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/266/results/
2019-10-08 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/280/results/
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From felix.yang at huawei.com  Tue Nov 12 02:57:37 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 12 Nov 2019 02:57:37 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036AE2@dggeml527-mbx.china.huawei.com>

> On 11/11/19 3:05 PM, Andrew Haley wrote:
> > And finally, is there any operation in HotSpot that actually requires
> > such strong memory semantics? Probably not, but no-one has ever been
> > brave enough to say so.
> 
> Here's a place where it really does matter.
> 
> void ShenandoahPacer::restart_with(size_t non_taxable_bytes, double
> tax_rate) {
>   size_t initial = (size_t)(non_taxable_bytes * tax_rate) >> LogHeapWordSize;
>   STATIC_ASSERT(sizeof(size_t) <= sizeof(intptr_t));
>   Atomic::xchg((intptr_t)initial, &_budget);
>   Atomic::store(tax_rate, &_tax_rate);
>   Atomic::inc(&_epoch);
> 
> Note: the xchg is conservative, the store is plain. The xchg value should be
> visible before the store.

Thanks for explaining this.  I see your point now.
For memory_order_conservative, it looks like ppc enforces a stronger ordering than aarch64.
ppc issues two full memory barriers: one before the loop and one after the loop.
But on aarch64, a preceding load/store can still float past the first ldxr instruction:

.L2:
        ldxr    x2, [x1]
        add     x2, x2, x0
        stlxr   w3, x2, [x1]
        cbnz    w3, .L2
        dmb     ish

So my question is: for a "two-way memory barrier", do we need another full barrier before the loop?
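
To make the question concrete, a fence on both sides of the update could be
sketched as below. This only illustrates the idea being asked about and
does not claim to be what HotSpot does or should do: the function name is
hypothetical, and __atomic_thread_fence(__ATOMIC_SEQ_CST) is assumed to be
an acceptable stand-in for a full barrier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-way variant: full fence, relaxed update, full fence.
template <typename T>
T add_and_fetch_two_way(T volatile* dest, T add_value) {
  assert(dest != nullptr);
  // Earlier loads and stores cannot move below this fence.
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
  // The update itself needs atomicity only; ordering comes from the fences.
  T res = __atomic_add_fetch(dest, add_value, __ATOMIC_RELAXED);
  // Later loads and stores cannot move above this fence.
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
  return res;
}
```

The cost is a second barrier on every call, so whether this is needed
depends on how strictly memory_order_conservative is meant to be read.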

Felix

From felix.yang at huawei.com  Tue Nov 12 07:36:57 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 12 Nov 2019 07:36:57 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036B29@dggeml527-mbx.china.huawei.com>

Hi,

  I am seeing some SIGILL JVM crashes on my aarch64 platform.
  I looked at the ISB usage, especially this one: https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2014-September/001376.html
  One of the changes adds an ISB after the native call returns:

static void rt_call(MacroAssembler* masm, address dest, int gpargs, int fpargs, int type) {
  CodeBlob *cb = CodeCache::find_blob(dest);
  if (cb) {
    __ far_call(RuntimeAddress(dest));
  } else {
    assert((unsigned)gpargs < 256, "eek!");
    assert((unsigned)fpargs < 32, "eek!");
    __ lea(rscratch1, RuntimeAddress(dest));
    __ blr(rscratch1);
    __ maybe_isb();    // <======== the ISB in question
  }
}

  The rt_call function is used in generate_native_wrapper to make the JNI call.
  I didn't see an equivalent barrier in the ppc or arm ports, so I would like to know more about why it is needed here.  Does anyone still remember?
  Also, the ISB is planted only in the else block.  I assume it is also necessary for the if block.  Correct?


Thanks for your help,
Felix

From felix.yang at huawei.com  Tue Nov 12 08:37:02 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 12 Nov 2019 08:37:02 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com> 
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Tuesday, November 12, 2019 10:58 AM
> To: 'Andrew Haley' <aph at redhat.com>; aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> > On 11/11/19 3:05 PM, Andrew Haley wrote:
> > > And finally, is there any operation in HotSpot that actually
> > > requires such strong memory semantics? Probably not, but no-one has
> > > ever been brave enough to say so.
> >
> > Here's a place where it really does matter.
> >
> > void ShenandoahPacer::restart_with(size_t non_taxable_bytes, double
> > tax_rate) {
> >   size_t initial = (size_t)(non_taxable_bytes * tax_rate) >>
> LogHeapWordSize;
> >   STATIC_ASSERT(sizeof(size_t) <= sizeof(intptr_t));
> >   Atomic::xchg((intptr_t)initial, &_budget);
> >   Atomic::store(tax_rate, &_tax_rate);
> >   Atomic::inc(&_epoch);
> >
> > Note: the xchg is conservative, the store is plain. The xchg value
> > should be visible before the store.
> 
> Thanks for explaining this.  I see your point now.
> For memory_order_conservative order, looks like that ppc enforced an order
> which is stronger than aarch64.
> ppc issues two full memory barriers: one before the loop and one after the
> loop.
> But for aarch64, the preceding load/store can still floating after the first ldxr
> instruction :
> 
> .L2:
>         ldxr    x2, [x1]
>         add     x2, x2, x0
>         stlxr   w3, x2, [x1]
>         cbnz    w3, .L2
>         dmb     ish
> 
> So my question is: for "two-way memory barrier", do we need another full
> barrier before the loop?

This has been discussed somewhere before: https://patchwork.kernel.org/patch/3575821/ 
Let's keep the current state to be safe.

Felix

From aph at redhat.com  Tue Nov 12 09:25:18 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 12 Nov 2019 09:25:18 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
Message-ID: <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>

On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
> This has been discussed somewhere before: https://patchwork.kernel.org/patch/3575821/ 
> Let's keep the current status for safe.  

Yes.

It's been interesting to see the progress of this patch. I don't think
it's the first time that someone has been tempted to change this code
to make it "more efficient".

I wonder if we could perhaps add a comment to that code so that it
doesn't happen again. I'm not sure exactly what the patch should say
beyond "do not touch". Perhaps something along the lines of "Do not
touch this code unless you have at least Black Belt, 4th Dan in memory
ordering."  :-)

More seriously, maybe simply "Note that memory_order_conservative
requires a full barrier after atomic stores. See
https://patchwork.kernel.org/patch/3575821/"
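
For illustration only, this is roughly where such a comment could sit. It is a stand-alone sketch using GCC's __sync builtins, not the actual HotSpot source; conservative_xchg is a made-up name.

```cpp
#include <cassert>

// Hypothetical stand-alone exchange with conservative ordering (sketch only).
template <typename T>
inline T conservative_xchg(T exchange_value, T volatile* dest) {
  // Note that memory_order_conservative requires a full barrier after
  // atomic stores. See https://patchwork.kernel.org/patch/3575821/
  T old_value = __sync_lock_test_and_set(dest, exchange_value);
  __sync_synchronize();  // full barrier; do not remove
  return old_value;
}
```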

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From Joshua.Zhu at arm.com  Tue Nov 12 09:31:35 2019
From: Joshua.Zhu at arm.com (Joshua Zhu (Arm Technology China))
Date: Tue, 12 Nov 2019 09:31:35 +0000
Subject: [aarch64-port-dev ] RFR: 8233948: AArch64: Incorrect mapping
 between OptoReg and VMReg for high 64 bits of Vector Register
Message-ID: <VE1PR08MB4880832B668EFD8A88419E2188770@VE1PR08MB4880.eurprd08.prod.outlook.com>

Hi,

Please review the following patch:
JBS: https://bugs.openjdk.java.net/browse/JDK-8233948
Webrev: http://cr.openjdk.java.net/~jzhu/8233948/webrev.00/

In the register definitions of aarch64.ad, each vector register is defined as 4 slots with its calling convention, ideal type, ... and its VMReg
value. These VMReg values in reg_def are used by ADLC to generate the mapping between OptoReg and VMReg: opto2vm[].

But VMReg is inconsistently treated as 2 slots for vector registers [1]. This causes an incorrect mapping between VMReg and OptoReg
for the high 64 bits of a vector register.

If we write code that accesses the high 64 bits of a vector register, in the manner of vector_calling_convention in the panama
branch [2]:
    VMReg vmreg = v0->as_VMReg();
    VMRegPair p;
    p.set_pair(vmreg->next(3), vmreg);
and then convert the VMRegPair into OptoRegs [3]:
    RegMask rm;
    OptoReg::Name reg_fst = OptoReg::as_OptoReg(p.first());
    OptoReg::Name reg_snd = OptoReg::as_OptoReg(p.second());
    tty->print("fst=%d snd=%d\n", reg_fst, reg_snd);
    for (OptoReg::Name r = reg_fst; r <= reg_snd; r++) {
      rm.Insert(r);
    }
then for V0's VMRegPair the first VMReg's value is 64 and the second is 67. After conversion by as_OptoReg(), the first OptoReg
is 124 and the second is 129. As a result 6 bits of the RegMask are set, where there should be 4 (representing 4 slots/halves).

VMReg, opto2vm[] and vm2opto[] are dumped by [4] as below for reference:
    http://cr.openjdk.java.net/~jzhu/8233948/RegDump_before_change.log

opto2vm[] has the following items:
    OptoReg: 126, VMReg: 66
    OptoReg: 127, VMReg: 67
    OptoReg: 128, VMReg: 66
    OptoReg: 129, VMReg: 67
OptoReg pairs [126, 127] and [128, 129] are both mapped to the same VMReg pair [66, 67].
vm2opto[] is then generated by traversing opto2vm[] [5]:
    VMReg: 66, OptoReg: 128
    VMReg: 67, OptoReg: 129
This is what produces the incorrect RegMask in the case above.
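
The effect can be reproduced with a small toy model, sketched below. This is not HotSpot code: build_vm2opto, mask_bits_set and buggy_opto2vm are made-up names, the entries for OptoReg 126..129 are the ones quoted from the dump, 124 -> 64 follows from the numbers above, and 125 -> 65 is an assumption made to complete V0.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Build the reverse map by walking opto2vm in increasing OptoReg order, so a
// later OptoReg silently replaces an earlier one that maps to the same VMReg.
inline std::map<int, int> build_vm2opto(const std::vector<std::pair<int, int>>& opto2vm) {
  std::map<int, int> vm2opto;
  for (const auto& entry : opto2vm) {
    vm2opto[entry.second] = entry.first;  // last writer wins
  }
  return vm2opto;
}

// Number of RegMask bits that "for (r = fst; r <= snd; r++)" would set.
inline int mask_bits_set(const std::map<int, int>& vm2opto, int vm_first, int vm_second) {
  return vm2opto.at(vm_second) - vm2opto.at(vm_first) + 1;
}

// {OptoReg, VMReg} pairs before the fix (125 -> 65 is assumed, see above).
inline std::vector<std::pair<int, int>> buggy_opto2vm() {
  return {{124, 64}, {125, 65}, {126, 66}, {127, 67}, {128, 66}, {129, 67}};
}
```

With these entries, VMReg 67 resolves to OptoReg 129 rather than 127, which is where the two extra mask bits come from.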

For floating-point registers, however, the bottom 64 bits of a NEON vector register overlap the floating-point register, and their VMReg
values and the corresponding mapping are still consistent, so the issue is not exposed there. But I think we should still fix it to keep the
code clean and avoid potential issues in the future.

After fix, the dump is:
    http://cr.openjdk.java.net/~jzhu/8233948/RegDump_after_change.log

[1] https://hg.openjdk.java.net/jdk/jdk/file/d595f1faace2/src/hotspot/cpu/aarch64/vmreg_aarch64.inline.hpp#l35
[2] https://hg.openjdk.java.net/panama/dev/file/43bc39c09590/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#l1140
[3] https://hg.openjdk.java.net/panama/dev/file/43bc39c09590/src/hotspot/share/opto/matcher.cpp#l1360
[4] http://cr.openjdk.java.net/~jzhu/8233948/dump.patch
[5] https://hg.openjdk.java.net/jdk/jdk/file/d595f1faace2/src/hotspot/share/opto/c2compiler.cpp#l59

Best Regards,
Joshua


From adinn at redhat.com  Tue Nov 12 09:42:09 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Tue, 12 Nov 2019 09:42:09 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
Message-ID: <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>

On 12/11/2019 09:25, Andrew Haley wrote:
> On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
>> This has been discussed somewhere before: https://patchwork.kernel.org/patch/3575821/ 
>> Let's keep the current status for safe.  
> 
> Yes.
> 
> It's been interesting to see the progress of this patch. I don't think
> it's the first time that someone has been tempted to change this code
> to make it "more efficient".
> 
> I wonder if we could perhaps add a comment to that code so that it
> doesn't happen again. I'm not sure exactly what the patch should say
> beyond "do not touch". Perhaps something along the lines of "Do not
> touch this code unless you have at least Black Belt, 4th Dan in memory
> ordering."  :-)
> 
> More seriously, maybe simply "Note that memory_order_conservative
> requires a full barrier after atomic stores. See
> https://patchwork.kernel.org/patch/3575821/"
Yes, that would be a help. It's particularly easy to get confused here
because we happily omit the ordering of an stlr store wrt subsequent
stores when the stlr is implementing a Java volatile write or a Java
cmpxchg.

So, it might be worth adding a rider that implementing the full
memory_order_conservative semantics is necessary because VM code relies
on the strong ordering wrt writes that the cmpxchg is required to provide.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From patrick at os.amperecomputing.com  Tue Nov 12 09:52:04 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Tue, 12 Nov 2019 09:52:04 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB609316337FABB9DE034156EA8F780@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>
 <MN2PR01MB609316337FABB9DE034156EA8F780@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <MN2PR01MB6093E9C78B79DD6C4F6692CF8F770@MN2PR01MB6093.prod.exchangelabs.com>

Ping...

Hi Aleksei,

Does the potential regression you measured still exist with the new patch, webrev.03? If the added largeLoopExitCondition check keeps your >128-char strings out of the large loop and performance still drops, then hardcoding the software prefetch hint distance and the check condition to 64 might be worth a try. I don't think it would be the right thing to do, though, given the similar logic in generate_compare_long_string_different_encoding. Thanks.

http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset 
     if (SoftwarePrefetchHintDistance >= 0) {
-      __ bind(LARGE_LOOP_PREFETCH);
+      if (remainingLimit < largeLoopExitCondition) {
+        // there could be fewer bytes left and invalid for this large loop with prefetching
+        __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
+        __ br(__ LT, NO_PREFETCH);
+      }
+      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
         __ prfm(Address(str1, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str1, 64));
         __ prfm(Address(str2, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str2, 64));

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Thursday, November 7, 2019 6:56 PM
To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Hi Aleksei,

Thanks for testing it and for the data. I only had the source of StringCompareBench.java [6]; my numbers (the diffs) are within 2%, so the -207.12% looks quite odd. I initially did not add the condition to control the br, since generate_compare_long_string_different_encoding has a similar unconditional br. By the way, the original logic allowed prefetching memory beyond the array border for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from reaching the large loop with prefetching. Further comments are welcome, thanks.

http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/

Regards
Patrick

From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
Sent: Thursday, November 7, 2019 12:53 AM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable


Hi Patrick,

I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
Cortex A73
Benchmark                                    Size  base (ns/op)  patched (ns/op)       Diff
StringCompareBench.StringCompareLL            256   14422257,98      15302300,24     -6,10%
StringCompareBench.StringCompareLL            512   27998036,21      28317818,08     -1,14%

ThunderX2
Benchmark                                    Size  base (ns/op)  patched (ns/op)       Diff
StringCompareBench.StringCompareLL            128   4265122,232      13099099,67   -207,12%
StringCompareBench.StringCompareLL            256   3539452,533      3599407,432     -1,69%

StringCompareBench.StringCompareUU            128    6899938,75      7174601,241     -3,98%
StringCompareBench.StringCompareUU            256   7654538,841      7826599,466     -2,25%

StringCompareBench.cachedStringCompareLL      128        19,673           21,242     -7,98%
StringCompareBench.cachedStringCompareLL      256        34,179           36,452     -6,65%
StringCompareBench.cachedStringCompareLL      512        59,574           64,088     -7,58%
StringCompareBench.cachedStringCompareLL     1024        110,37          118,477     -7,35%
StringCompareBench.cachedStringCompareLL  1000000    114028,907       115388,681     -1,19%

StringCompareBench.cachedStringCompareUU      128        33,752           36,922     -9,39%
StringCompareBench.cachedStringCompareUU      256        60,939           64,096     -5,18%
StringCompareBench.cachedStringCompareUU      512       115,328           118,48     -2,73%
StringCompareBench.cachedStringCompareUU     1024       239,332           242,97     -1,52%
StringCompareBench.cachedStringCompareUU  1000000    226491,096       233638,328     -3,16%
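
For reading the Diff column: the values appear to be the relative change (base - patched) / base, so a negative number means the patched build needs more ns/op. The formula is inferred from the numbers rather than stated in the report, and diff_percent below is a made-up helper.

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent; negative means the patched build is slower
// (more ns/op). Assumed formula, inferred from the numbers in the table.
inline double diff_percent(double base_ns, double patched_ns) {
  return (base_ns - patched_ns) / base_ns * 100.0;
}
```

For the ThunderX2 StringCompareLL row at size 128, this gives about -207.12, i.e. the patched run took roughly three times as long as the base run.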
It might be the case that the newly added branch is the culprit:

+      __ subs(rscratch2, cnt2, largeLoopExitCondition);
+      __ br(__ LT, NO_PREFETCH);

Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (the stub code is then only called with strings long enough to make the branch above unnecessary). Then the (properly commented) code would look like:

if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
     __ subs(rscratch2, cnt2, largeLoopExitCondition);
     __ br(__ LT, NO_PREFETCH);
}

and in this case we shouldn't see any performance penalties.
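
As a quick sanity check of that condition, here is a plain C++ sketch of the predicate on its own. needs_prefetch_guard is a made-up name, word_size stands in for HotSpot's wordSize, and the value 64 passed for largeLoopExitCondition in the usage note is only an assumed example input.

```cpp
#include <cassert>

// Hypothetical predicate mirroring the condition above: returns true when the
// stub may be entered with too few characters left for the large prefetch
// loop, so the subs/br guard still has to be emitted.
inline bool needs_prefetch_guard(int stub_threshold, int word_size, bool isLL,
                                 int largeLoopExitCondition) {
  int chars_per_word = word_size / (isLL ? 1 : 2);
  return (stub_threshold - chars_per_word) < largeLoopExitCondition;
}
```

With the old hard-coded threshold of 72, a word size of 8 and Latin1 strings, 72 - 8 = 64 is not below 64, so the guard would be skipped; with a threshold of 24 it would be emitted.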

-Aleksei

On 29/10/2019 12:58, Patrick Zhang OS wrote:

Hi,

Could you please review this patch, thanks.

JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
(this starts from .02 since there had been some internal review and updates)

Changes:

1. Split the STUB_THRESHOLD, previously hard-coded to 72, into CompareLongStringLimitLatin and CompareLongStringLimitUTF, to give more flexible control over the stub thresholds of the string_compare intrinsics, especially across different microarchitectures.

2. MacroAssembler::string_compare LL and UU shared the same threshold. UU may actually need only half of LL's threshold (counted in chars), because one character takes two bytes for UU, while for compact LL strings one character is one byte. In addition, LU/UL may need a separate threshold, as its stub function differs from the same-encoding one and the performance may vary as well.

3. In generate_compare_long_string_same_encoding, the hard-coded 72 originally ensured that at least 64 bytes were always available for the prefetch code path. Once a smaller stub threshold is set, a new condition is needed to tell whether this still holds or whether the code has to take the NO_PREFETCH branch. This change ensures correctness.

4. In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are no longer valid; cleaned up strU and strL, and the related pointer initialization to the next U (cnt1) and L (tmp2).

5. In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.

Tests:

1. For the functional check, I have run:
   - jdk jtreg tier1 tests, with default vm flags
   - hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
   - jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively
   - some specific test cases, carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4]

2. For the performance check, I have run:
   - string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively
   - SPECjbb2015.jar
   No obvious performance change has been found (since the default threshold is NOT changed by this patch).

FYI, on an Ampere eMAG system, microbenchmarks [5][6] show a consistent 1.5x perf gain for LU/UL comparison of shorter strings (<72 chars, smaller stub thresholds), and a slight improvement (5~10%) for LL/UU cases.

Refs:
[1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
[2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
[3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
[4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
[5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
[6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko

Regards
Patrick


From Joshua.Zhu at arm.com  Tue Nov 12 09:55:02 2019
From: Joshua.Zhu at arm.com (Joshua Zhu (Arm Technology China))
Date: Tue, 12 Nov 2019 09:55:02 +0000
Subject: [aarch64-port-dev ] RFR: 8233948: AArch64: Incorrect mapping
 between OptoReg and VMReg for high 64 bits of Vector Register
In-Reply-To: <VE1PR08MB4880832B668EFD8A88419E2188770@VE1PR08MB4880.eurprd08.prod.outlook.com>
References: <VE1PR08MB4880832B668EFD8A88419E2188770@VE1PR08MB4880.eurprd08.prod.outlook.com>
Message-ID: <VE1PR08MB4880227F4AEE8A866F3AB9CD88770@VE1PR08MB4880.eurprd08.prod.outlook.com>

Hi,

Please review the following patch:
JBS: https://bugs.openjdk.java.net/browse/JDK-8233948
Webrev: http://cr.openjdk.java.net/~jzhu/8233948/webrev.00/

In the register definitions of aarch64.ad, each vector register is defined as 4 slots with its calling convention, ideal type, ... and its VMReg value. These VMReg values in reg_def are used by ADLC to generate the mapping between OptoReg and VMReg: opto2vm[].

But VMReg is inconsistently treated as 2 slots for vector registers [1]. This causes an incorrect mapping between VMReg and OptoReg for the high 64 bits of a vector register.

If we write code that accesses the high 64 bits of a vector register, in the manner of vector_calling_convention in the panama branch [2]:
    VMReg vmreg = v0->as_VMReg();
    VMRegPair p;
    p.set_pair(vmreg->next(3), vmreg);
and then convert the VMRegPair into OptoRegs [3]:
    RegMask rm;
    OptoReg::Name reg_fst = OptoReg::as_OptoReg(p.first());
    OptoReg::Name reg_snd = OptoReg::as_OptoReg(p.second());
    tty->print("fst=%d snd=%d\n", reg_fst, reg_snd);
    for (OptoReg::Name r = reg_fst; r <= reg_snd; r++) {
      rm.Insert(r);
    }
then for V0's VMRegPair the first VMReg's value is 64 and the second is 67. After conversion by as_OptoReg(), the first OptoReg is 124 and the second is 129. As a result 6 bits of the RegMask are set, where there should be 4 (representing 4 slots/halves).

VMReg, opto2vm[] and vm2opto[] are dumped by [4] as below for reference:
    http://cr.openjdk.java.net/~jzhu/8233948/RegDump_before_change.log

opto2vm[] has the following items:
    OptoReg: 126, VMReg: 66
    OptoReg: 127, VMReg: 67
    OptoReg: 128, VMReg: 66
    OptoReg: 129, VMReg: 67
OptoReg pairs [126, 127] and [128, 129] are both mapped to the same VMReg pair [66, 67].
vm2opto[] is then generated by traversing opto2vm[] [5]:
    VMReg: 66, OptoReg: 128
    VMReg: 67, OptoReg: 129
This is what produces the incorrect RegMask in the case above.

For floating-point registers, however, the bottom 64 bits of a NEON vector register overlap the floating-point register, and their VMReg values and the corresponding mapping are still consistent, so the issue is not exposed there. But I think we should still fix it to keep the code clean and avoid potential issues in the future.

After fix, the dump is:
    http://cr.openjdk.java.net/~jzhu/8233948/RegDump_after_change.log

[1] https://hg.openjdk.java.net/jdk/jdk/file/d595f1faace2/src/hotspot/cpu/aarch64/vmreg_aarch64.inline.hpp#l35
[2] https://hg.openjdk.java.net/panama/dev/file/43bc39c09590/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#l1140
[3] https://hg.openjdk.java.net/panama/dev/file/43bc39c09590/src/hotspot/share/opto/matcher.cpp#l1360
[4] http://cr.openjdk.java.net/~jzhu/8233948/dump.patch
[5] https://hg.openjdk.java.net/jdk/jdk/file/d595f1faace2/src/hotspot/share/opto/c2compiler.cpp#l59

Best Regards,
Joshua

From felix.yang at huawei.com  Tue Nov 12 12:02:34 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 12 Nov 2019 12:02:34 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036C70@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Dinn [mailto:adinn at redhat.com]
> Sent: Tuesday, November 12, 2019 5:42 PM
> To: Andrew Haley <aph at redhat.com>; Yangfei (Felix)
> <felix.yang at huawei.com>; aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> On 12/11/2019 09:25, Andrew Haley wrote:
> > On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
> >> This has been discussed somewhere before:
> >> https://patchwork.kernel.org/patch/3575821/
> >> Let's keep the current status for safe.
> >
> > Yes.
> >
> > It's been interesting to see the progress of this patch. I don't think
> > it's the first time that someone has been tempted to change this code
> > to make it "more efficient".
> >
> > I wonder if we could perhaps add a comment to that code so that it
> > doesn't happen again. I'm not sure exactly what the patch should say
> > beyond "do not touch". Perhaps something along the lines of "Do not
> > touch this code unless you have at least Black Belt, 4th Dan in memory
> > ordering."  :-)
> >
> > More seriously, maybe simply "Note that memory_order_conservative
> > requires a full barrier after atomic stores. See
> > https://patchwork.kernel.org/patch/3575821/"
> Yes, that would be a help. It's particularly easy to get confused here because
> we happily omit the ordering of an stlr store wrt subsequent stores when the
> strl is implementing a Java volatile write or a Java cmpxchg.
> 
> So, it might be worth adding a rider that implementing the full
> memory_order_conservative semantics is necessary because VM code relies
> on the strong ordering wrt writes that the cmpxchg is required to provide.
> 

I also suggest we implement these functions with inline assembly here.  
For Atomic::PlatformXchg, the current code may issue two consecutive full memory barriers.
I used GCC 7.3.0 to compile the following function:

$ cat test.c
long foo(long add_value, long volatile* dest, long exchange_value)
{
  long val = __sync_lock_test_and_set(dest, exchange_value);

  __sync_synchronize();

  return val;
}

$ cat test.s
        .arch armv8-a
        .file   "test.c"
        .text
        .align  2
        .p2align 3,,7
        .global foo
        .type   foo, %function
foo:
.L2:
        ldxr    x0, [x1]
        stxr    w3, x2, [x1]
        cbnz    w3, .L2
        dmb     ish           < ========
        dmb     ish           < ========
        ret
        .size   foo, .-foo
        .ident  "GCC: (GNU) 7.3.0"
        .section        .note.GNU-stack,"",@progbits

From felix.yang at huawei.com  Tue Nov 12 12:14:48 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Tue, 12 Nov 2019 12:14:48 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com> 
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Tuesday, November 12, 2019 8:03 PM
> To: 'Andrew Dinn' <adinn at redhat.com>; Andrew Haley <aph at redhat.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> > -----Original Message-----
> > From: Andrew Dinn [mailto:adinn at redhat.com]
> > Sent: Tuesday, November 12, 2019 5:42 PM
> > To: Andrew Haley <aph at redhat.com>; Yangfei (Felix)
> > <felix.yang at huawei.com>; aarch64-port-dev at openjdk.java.net
> > Cc: hotspot-dev at openjdk.java.net
> > Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of
> > atomic operations
> >
> > On 12/11/2019 09:25, Andrew Haley wrote:
> > > On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
> > >> This has been discussed somewhere before:
> > >> https://patchwork.kernel.org/patch/3575821/
> > >> Let's keep the current status for safe.
> > >
> > > Yes.
> > >
> > > It's been interesting to see the progress of this patch. I don't
> > > think it's the first time that someone has been tempted to change
> > > this code to make it "more efficient".
> > >
> > > I wonder if we could perhaps add a comment to that code so that it
> > > doesn't happen again. I'm not sure exactly what the patch should say
> > > beyond "do not touch". Perhaps something along the lines of "Do not
> > > touch this code unless you have at least Black Belt, 4th Dan in
> > > memory ordering."  :-)
> > >
> > > More seriously, maybe simply "Note that memory_order_conservative
> > > requires a full barrier after atomic stores. See
> > > https://patchwork.kernel.org/patch/3575821/"
> > Yes, that would be a help. It's particularly easy to get confused here
> > because we happily omit the ordering of an stlr store wrt subsequent
> > stores when the strl is implementing a Java volatile write or a Java cmpxchg.
> >
> > So, it might be worth adding a rider that implementing the full
> > memory_order_conservative semantics is necessary because VM code
> > relies on the strong ordering wrt writes that the cmpxchg is required to
> provide.
> >
> 
> I also suggest we implement these functions with inline assembly here.
> For Atomic::PlatformXchg, we may issue two consecutive full memory barriers
> with the current status.
> I used GCC 7.3.0 to compile the following function:
> 
> $ cat test.c
> long foo(long add_value, long volatile* dest, long exchange_value) {
>   long val = __sync_lock_test_and_set(dest, exchange_value);
> 
>   __sync_synchronize();
> 
>   return val;
> }
> 
> $ cat test.s
>         .arch armv8-a
>         .file   "test.c"
>         .text
>         .align  2
>         .p2align 3,,7
>         .global foo
>         .type   foo, %function
> foo:
> .L2:
>         ldxr    x0, [x1]
>         stxr    w3, x2, [x1]
>         cbnz    w3, .L2
>         dmb     ish           < ========
>         dmb     ish           < ========
>         ret
>         .size   foo, .-foo
>         .ident  "GCC: (GNU) 7.3.0"
>         .section        .note.GNU-stack,"", at progbits

Also note that this differs from the following sequence: GCC emits stxr here, where the sequence below uses stlxr.

	<Access [A]>

	// atomic_op (B)
1:	ldxr	x0, [B]		// Exclusive load
	<op(B)>
	stlxr	w1, x0, [B]	// Exclusive store with release
	cbnz	w1, 1b
	dmb	ish		// Full barrier

	<Access [C]>

I think the two-way memory barrier may not be ensured for this case.  

Felix

From felix.yang at huawei.com  Tue Nov 12 14:42:55 2019
From: felix.yang at huawei.com (felix.yang at huawei.com)
Date: Tue, 12 Nov 2019 14:42:55 +0000
Subject: [aarch64-port-dev ] hg: aarch64-port/jdk8u-shenandoah/hotspot:
	8233839: aarch64: missing	memory barrier in
	NewObjectArrayStub and NewTypeArrayStub
Message-ID: <201911121442.xACEgtVb012981@aojmv0008.oracle.com>

Changeset: 09d4b646f756
Author:    fyang
Date:      2019-11-12 17:54 +0800
URL:       https://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/09d4b646f756

8233839: aarch64: missing memory barrier in NewObjectArrayStub and NewTypeArrayStub
Reviewed-by: adinn

! src/cpu/aarch64/vm/c1_Runtime1_aarch64.cpp


From aph at redhat.com  Tue Nov 12 16:04:57 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 12 Nov 2019 16:04:57 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036C70@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C70@dggeml527-mbx.china.huawei.com>
Message-ID: <58ba3a50-fe49-f231-85b2-37d8f8b136f0@redhat.com>

On 11/12/19 12:02 PM, Yangfei (Felix) wrote:
> I also suggest we implement these functions with inline assembly here.  

Please let's not. Long term it would be nice to migrate all of HotSpot
from the current inline hackery to real C++ atomics. There has been a
considerable effort to make C++ and Java memory models compatible, and
we should utilize this.

> For Atomic::PlatformXchg, we may issue two consecutive full memory
> barriers with the current status.

OK, but is this actually important? What uses it?

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aleksei.voitylov at bell-sw.com  Tue Nov 12 16:10:28 2019
From: aleksei.voitylov at bell-sw.com (Aleksei Voitylov)
Date: Tue, 12 Nov 2019 19:10:28 +0300
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB6093E9C78B79DD6C4F6692CF8F770@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>
 <MN2PR01MB609316337FABB9DE034156EA8F780@MN2PR01MB6093.prod.exchangelabs.com>
 <MN2PR01MB6093E9C78B79DD6C4F6692CF8F770@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <4a067ad4-a7e8-813e-2633-91bb380f24b5@bell-sw.com>

Hi Patrick,

First, I'm a trespasser, not a reviewer. Reviewers will need to look at
this.

On the technical side:

This additional branch in v3 is still painful. You can reduce the number
of branches in the code path for lengths 128 and 256 by using the fact that
CompareLongStringLimitLatin and CompareLongStringLimitUTF are at least
24. Then we don't have to jump to the NO_PREFETCH label, where the check
for small string size is done. Instead we can jump to the SMALL_LOOP label
(assuming the cnt2 counter is updated accordingly). In this case the
NO_PREFETCH label is not needed and we have one less branch.

A rough sketch of the affected part, based on v3, follows. This version
was checked on ThunderX2 and it looks fine perf-wise for lengths 128 and
256. I also added alignment for the small loop, which helps a bit.
Please keep in mind it's a sketch.

-Aleksei

@@ -4172,19 +4168,34 @@
     Register result = r0, str1 = r1, cnt1 = r2, str2 = r3, cnt2 = r4,
         tmp1 = r10, tmp2 = r11;
     Label SMALL_LOOP, LARGE_LOOP_PREFETCH, CHECK_LAST, DIFF2, TAIL,
-        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF,
+        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF, SMALL_LOOP_CHECK,
         DIFF_LAST_POSITION, DIFF_LAST_POSITION2;
     // exit from large loop when less than 64 bytes left to read or we're about
     // to prefetch memory behind array border
     int largeLoopExitCondition = MAX(64, SoftwarePrefetchHintDistance)/(isLL ? 1 : 2);
+    // calculate the remaining limit in chars which manages if this stub should be called,
+    // if the limit is large enough (>= largeLoopExitCondition), below large loop with prefetching
+    // can be executed at least once, and there is no need to do any extra checking at the entrance.
+    int remainingLimit = (isLL ? CompareLongStringLimitLatin : CompareLongStringLimitUTF) -
+                         (wordSize / (isLL ? 1 : 2));
     // cnt1/cnt2 contains amount of characters to compare. cnt1 can be re-used
     // update cnt2 counter with already loaded 8 bytes
-    __ sub(cnt2, cnt2, wordSize/(isLL ? 1 : 2));
+    if (SoftwarePrefetchHintDistance >= 0 && remainingLimit < largeLoopExitCondition) {
+      __ sub(cnt2, cnt2, isLL ? 24 : 12);
+    } else {
+      __ sub(cnt2, cnt2, isLL ? 8 : 4);
+    }
     // update pointers, because of previous read
     __ add(str1, str1, wordSize);
     __ add(str2, str2, wordSize);
     if (SoftwarePrefetchHintDistance >= 0) {
-      __ bind(LARGE_LOOP_PREFETCH);
+      if (remainingLimit < largeLoopExitCondition) {
+        // there could be fewer bytes left and invalid for this large loop with prefetching
+        __ subs(rscratch2, cnt2, largeLoopExitCondition - (isLL ? 16 : 8));
+        __ br(__ LT, SMALL_LOOP);
+        __ add(cnt2, cnt2, isLL ? 16 : 8);
+      }
+      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
         compare_string_16_bytes_same(DIFF, DIFF2);
@@ -4196,11 +4207,11 @@
         __ br(__ GT, LARGE_LOOP_PREFETCH);
         __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
     }
-    // less than 16 bytes left?
-    __ subs(cnt2, cnt2, isLL ? 16 : 8);
-    __ br(__ LT, TAIL);
+    __ b(SMALL_LOOP_CHECK); // check if less than 16 bytes left
+    __ align(OptoLoopAlignment);
     __ bind(SMALL_LOOP);
       compare_string_16_bytes_same(DIFF, DIFF2);
+      __ bind(SMALL_LOOP_CHECK);
       __ subs(cnt2, cnt2, isLL ? 16 : 8);
       __ br(__ GE, SMALL_LOOP);
     __ bind(TAIL);
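
For readers following the arithmetic, the stub-generation-time decision in the sketch above can be modelled in plain C++. This is only an illustration: StubFlags, kWordSize and the three helper functions are hypothetical names standing in for the HotSpot flags and constants, and the flag values used with them are made up rather than real defaults.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for the HotSpot flags referenced in the sketch above.
struct StubFlags {
  int compare_long_string_limit_latin;  // models CompareLongStringLimitLatin (chars)
  int compare_long_string_limit_utf;    // models CompareLongStringLimitUTF (chars)
  int software_prefetch_hint_distance;  // models SoftwarePrefetchHintDistance (< 0: no prefetch)
};

const int kWordSize = 8;  // assumed value of wordSize on AArch64

// Number of chars below which the large prefetching loop must be left.
int large_loop_exit_condition(const StubFlags& f, bool isLL) {
  return std::max(64, f.software_prefetch_hint_distance) / (isLL ? 1 : 2);
}

// Stub threshold in chars minus the 8 bytes that were already loaded.
int remaining_limit(const StubFlags& f, bool isLL) {
  int limit = isLL ? f.compare_long_string_limit_latin
                   : f.compare_long_string_limit_utf;
  assert(limit >= 24);  // lower bound mentioned in the message above
  return limit - kWordSize / (isLL ? 1 : 2);
}

// True when the stub has to emit the extra compare-and-branch in front of
// LARGE_LOOP_PREFETCH; false when every caller is known to have enough chars.
bool needs_entry_check(const StubFlags& f, bool isLL) {
  return f.software_prefetch_hint_distance >= 0 &&
         remaining_limit(f, isLL) < large_loop_exit_condition(f, isLL);
}
```

With a threshold of 72 and a prefetch distance of 64 the check is not emitted for LL (64 < 64 is false), whereas a threshold of 24 makes it necessary (16 < 64).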


On 12/11/2019 12:52, Patrick Zhang OS wrote:
> Ping...
>
> Hi Aleksei,
>
> Does the potential regression on the test system still exist with the new patch, webrev.03? If the added largeLoopExitCondition check excluded your >128-char strings from the large loop and still caused performance drops, hard-coding both the software prefetch hint distance and the checking condition to 64 might be worth a try, although I do not think it would be the right thing to do, compared with the similar logic in generate_compare_long_string_different_encoding. Thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset 
>      if (SoftwarePrefetchHintDistance >= 0) {
> -      __ bind(LARGE_LOOP_PREFETCH);
> +      if (remainingLimit < largeLoopExitCondition) {
> +        // there could be fewer bytes left and invalid for this large loop with prefetching
> +        __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
> +        __ br(__ LT, NO_PREFETCH);
> +      }
> +      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
>          __ prfm(Address(str1, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str1, 64));
>          __ prfm(Address(str2, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str2, 64));
>
> Regards
> Patrick
>
> -----Original Message-----
> From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
> Sent: Thursday, November 7, 2019 6:56 PM
> To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
> Hi Aleksei,
>
> Thanks for testing it and for the data. I only had the source of StringCompareBench.java [6]; my numbers (the diffs) are within 2%, so the -207.12% looks quite weird. I initially did not add the condition to control the br, since generate_compare_long_string_different_encoding has a similar unconditional br. By the way, the original logic allowed prefetching memory beyond the array border for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from reaching the large loop with prefetching. Further comments are welcome, thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/
>
> Regards
> Patrick
>
> From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Sent: Thursday, November 7, 2019 12:53 AM
> To: Patrick Zhang OS <patrick at os.amperecomputing.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
>
> Hi Patrick,
>
> I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
> Cortex A73                                   Size  base (ns/op)  patched (ns/op)      Diff
> StringCompareBench.StringCompareLL            256   14422257,98      15302300,24    -6,10%
> StringCompareBench.StringCompareLL            512   27998036,21      28317818,08    -1,14%
>
> ThunderX2                                    Size  base (ns/op)  patched (ns/op)      Diff
> StringCompareBench.StringCompareLL            128   4265122,232      13099099,67  -207,12%
> StringCompareBench.StringCompareLL            256   3539452,533      3599407,432    -1,69%
>
> StringCompareBench.StringCompareUU            128    6899938,75      7174601,241    -3,98%
> StringCompareBench.StringCompareUU            256   7654538,841      7826599,466    -2,25%
>
> StringCompareBench.cachedStringCompareLL      128        19,673           21,242    -7,98%
> StringCompareBench.cachedStringCompareLL      256        34,179           36,452    -6,65%
> StringCompareBench.cachedStringCompareLL      512        59,574           64,088    -7,58%
> StringCompareBench.cachedStringCompareLL     1024        110,37          118,477    -7,35%
> StringCompareBench.cachedStringCompareLL  1000000    114028,907       115388,681    -1,19%
>
> StringCompareBench.cachedStringCompareUU      128        33,752           36,922    -9,39%
> StringCompareBench.cachedStringCompareUU      256        60,939           64,096    -5,18%
> StringCompareBench.cachedStringCompareUU      512       115,328           118,48    -2,73%
> StringCompareBench.cachedStringCompareUU     1024       239,332           242,97    -1,52%
> StringCompareBench.cachedStringCompareUU  1000000    226491,096       233638,328    -3,16%
>
> It might be the case that the newly added branch is the culprit:
>
> +      __ subs(rscratch2, cnt2, largeLoopExitCondition);
> +      __ br(__ LT, NO_PREFETCH);
>
> Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:
>
> if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
>      __ subs(rscratch2, cnt2, largeLoopExitCondition);
>      __ br(__ LT, NO_PREFETCH);
> }
>
> and in this case we shouldn't see any performance penalties.
>
> -Aleksei
>
> On 29/10/2019 12:58, Patrick Zhang OS wrote:
>
> Hi,
>
>
>
> Could you please review this patch, thanks.
>
>
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
>
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
>
> (this starts from .02 since there had been some internal review and updates)
>
>
>
> Changes:
>
>
>
> 1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
>
>
> 2.       MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
>
>
> 3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
>
>
> 4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
>
>
> 5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
>
>
> Tests:
>
>
>
>   1.  For function check, I have run
>
>
>
> jdk jtreg tier1 tests, with default vm flags
>
>
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
>
>
> jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
>
>
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
>
>
>   2.  For performance check, I have run
>
>
>
> string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
>
>
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
>
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
>
>
> Refs:
>
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
>
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
>
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
>
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
>
> [5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
>
> [6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko
>
>
>
> Regards
>
> Patrick
>
>


From erik.osterlund at oracle.com  Tue Nov 12 17:38:11 2019
From: erik.osterlund at oracle.com (Erik Österlund)
Date: Tue, 12 Nov 2019 18:38:11 +0100
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
Message-ID: <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>

Hi Felix,

I was hoping to stay out of this conversation, but unfortunately
couldn't resist butting in.
I have to agree with you - you are absolutely right. We have a mix of 
the JMM, the C++ memory model and HotSpot's memory model, which predates 
that. The JMM and the C++ memory model are indeed quite similar now in
terms of semantics (yet there is still a choice in how to implement
them), but the old memory model used in HotSpot is not really similar.
Ideally we would have fewer memory models and just go with the one used
by C++/JMM, and then we would just have to convince ourselves that the
compiler's choice of implementation of seq_cst is compatible with the
one we use to implement the JMM in our JIT-compiled code. But it seems
to me that we are not there.

Last time I discussed this with Andrew Haley, we disagreed and didn't 
really get anywhere. Andrew wanted to use the GCC intrinsics, and I was 
arguing that we should use inline assembly as a) the memory model we are 
supporting is not the same as what the intrinsic is providing, and b) we 
are relying on the implementation of the intrinsics to emit very 
specific instruction sequences to be compatible with the memory model, 
and it would be more clear if we could see in the inline assembly that 
we indeed used exactly those instructions that we expected and not 
something unexpected, which we would only randomly find out when 
disassembling the code (ahem).

Now it looks like you have discovered that we sometimes have double 
trailing dmb ish, and sometimes lacking leading dmb ish if I am reading 
this right. That seems to make the case stronger, that by looking at the 
intrinsic calls, it's not obvious what instruction sequence will be 
emitted, and whether that is compatible with the memory model it is 
implementing or not, and you really have to disassemble it to find out 
what we actually got. And it looks like what we got is not at all what 
we wanted.

My hope is that the AArch64 port should use inline assembly as you 
suggest, so we can see that the generated code is correct, as we wait 
for the glorious future where all HotSpot code has been rewritten to 
work with seq_cst (and we are *not* there now).

Having said that, now I will try to go and hide in a corner again...

Thanks,
/Erik

On 2019-11-12 13:14, Yangfei (Felix) wrote:
>> -----Original Message-----
>> From: Yangfei (Felix)
>> Sent: Tuesday, November 12, 2019 8:03 PM
>> To: 'Andrew Dinn' <adinn at redhat.com>; Andrew Haley <aph at redhat.com>;
>> aarch64-port-dev at openjdk.java.net
>> Cc: hotspot-dev at openjdk.java.net
>> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
>> operations
>>
>>> -----Original Message-----
>>> From: Andrew Dinn [mailto:adinn at redhat.com]
>>> Sent: Tuesday, November 12, 2019 5:42 PM
>>> To: Andrew Haley <aph at redhat.com>; Yangfei (Felix)
>>> <felix.yang at huawei.com>; aarch64-port-dev at openjdk.java.net
>>> Cc: hotspot-dev at openjdk.java.net
>>> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of
>>> atomic operations
>>>
>>> On 12/11/2019 09:25, Andrew Haley wrote:
>>>> On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
>>>>> This has been discussed somewhere before:
>>>>> https://patchwork.kernel.org/patch/3575821/
>>>>> Let's keep the current status for safe.
>>>> Yes.
>>>>
>>>> It's been interesting to see the progress of this patch. I don't
>>>> think it's the first time that someone has been tempted to change
>>>> this code to make it "more efficient".
>>>>
>>>> I wonder if we could perhaps add a comment to that code so that it
>>>> doesn't happen again. I'm not sure exactly what the patch should say
>>>> beyond "do not touch". Perhaps something along the lines of "Do not
>>>> touch this code unless you have at least Black Belt, 4th Dan in
>>>> memory ordering."  :-)
>>>>
>>>> More seriously, maybe simply "Note that memory_order_conservative
>>>> requires a full barrier after atomic stores. See
>>>> https://patchwork.kernel.org/patch/3575821/"
>>> Yes, that would be a help. It's particularly easy to get confused here
>>> because we happily omit the ordering of an stlr store wrt subsequent
>>> stores when the stlr is implementing a Java volatile write or a Java cmpxchg.
>>>
>>> So, it might be worth adding a rider that implementing the full
>>> memory_order_conservative semantics is necessary because VM code
>>> relies on the strong ordering wrt writes that the cmpxchg is required to
>> provide.
>> I also suggest we implement these functions with inline assembly here.
>> For Atomic::PlatformXchg, we may issue two consecutive full memory barriers
>> with the current status.
>> I used GCC 7.3.0 to compile the following function:
>>
>> $ cat test.c
>> long foo(long add_value, long volatile* dest, long exchange_value) {
>>    long val = __sync_lock_test_and_set(dest, exchange_value);
>>
>>    __sync_synchronize();
>>
>>    return val;
>> }
>>
>> $ cat test.s
>>          .arch armv8-a
>>          .file   "test.c"
>>          .text
>>          .align  2
>>          .p2align 3,,7
>>          .global foo
>>          .type   foo, %function
>> foo:
>> .L2:
>>          ldxr    x0, [x1]
>>          stxr    w3, x2, [x1]
>>          cbnz    w3, .L2
>>          dmb     ish           < ========
>>          dmb     ish           < ========
>>          ret
>>          .size   foo, .-foo
>>          .ident  "GCC: (GNU) 7.3.0"
>>          .section        .note.GNU-stack,"",@progbits
> Also this is different from the following sequence (stxr instead of stlxr).
>
> 	<Access [A]>
>
> 	// atomic_op (B)
> 1:	ldxr	x0, [B]		// Exclusive load
> 	<op(B)>
> 	stlxr	w1, x0, [B]	// Exclusive store with release
> 	cbnz	w1, 1b
> 	dmb	ish		// Full barrier
>
> 	<Access [C]>
>
> I think the two-way memory barrier may not be ensured for this case.
>
> Felix


From Alan.Hayward at arm.com  Tue Nov 12 18:03:02 2019
From: Alan.Hayward at arm.com (Alan Hayward)
Date: Tue, 12 Nov 2019 18:03:02 +0000
Subject: [aarch64-port-dev ] RFR 8231841: AArch64: Add entry to pns output
 in help()
Message-ID: <BCE2D257-945C-4BFC-930E-D8922C59B6A6@arm.com>

Please could you review this change which adds AArch64 to the pns section of the help() output.

Bug: https://bugs.openjdk.java.net/browse/JDK-8231841
Webrev: http://cr.openjdk.java.net/~smonteith/8231841/webrev.0/


Built and ran tier1 on x86 and AArch64.


Thanks,
Alan.

From aph at redhat.com  Tue Nov 12 19:00:20 2019
From: aph at redhat.com (Andrew Haley)
Date: Tue, 12 Nov 2019 19:00:20 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
Message-ID: <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>

On 11/12/19 5:38 PM, Erik Österlund wrote:
> My hope is that the AArch64 port should use inline assembly as you suggest, so we can see that the generated code is correct, as we wait for the glorious future where all HotSpot code has been rewritten to work with seq_cst (and we are *not* there now).

I don't doubt it. :-)

But my arguments about the C++ intrinsics being well-enough defined,
at least on AArch64 Linux, have not changed, and I'm not going to
argue all that again. I'll grant you that there may well be issues on
various x86 compilers, but that isn't relevant here.

> Now it looks like you have discovered that we sometimes have double trailing dmb ish, and sometimes lacking leading dmb ish if I am reading this right. That seems to make the case stronger,

Sure, we can use inline asm if there's no other way to do it, but I
don't think that's necessary. All we need is to use

  T res;
  __atomic_exchange(dest, &exchange_value, &res, __ATOMIC_RELEASE);
  FULL_MEM_BARRIER;
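
As a self-contained illustration of the three lines above, here is a minimal sketch that compiles with GCC or Clang outside HotSpot. The function name and the FULL_MEM_BARRIER_SKETCH macro are hypothetical; the macro is defined as __sync_synchronize() on the assumption that this matches what FULL_MEM_BARRIER expands to in the AArch64 port.

```cpp
#include <cassert>

// Assumption: stand-in for HotSpot's FULL_MEM_BARRIER on linux-aarch64.
#define FULL_MEM_BARRIER_SKETCH __sync_synchronize()

// Exchange with release ordering on the store, followed by a full barrier,
// so that a single trailing barrier is requested instead of relying on
// whatever __sync_lock_test_and_set happens to emit.
template <typename T>
T xchg_release_then_fence(T volatile* dest, T exchange_value) {
  T res;
  __atomic_exchange(dest, &exchange_value, &res, __ATOMIC_RELEASE);
  FULL_MEM_BARRIER_SKETCH;
  return res;  // previous contents of *dest
}
```

Whether the compiler then emits exactly one dmb ish after an ldxr/stlxr loop still has to be confirmed by disassembling the result on the target toolchain.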

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From ci_notify at linaro.org  Tue Nov 12 19:07:37 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Tue, 12 Nov 2019 19:07:37 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 13 on AArch64
Message-ID: <738577191.460.1573585658389.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/summary/2019/316/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 5,645; fail: 2
Build 1: aarch64/2019/jul/04 pass: 5,644; fail: 2; error: 1
Build 2: aarch64/2019/jul/09 pass: 5,643; fail: 4
Build 3: aarch64/2019/jul/16 pass: 5,646; fail: 1
Build 4: aarch64/2019/jul/18 pass: 5,644; fail: 2; error: 1
Build 5: aarch64/2019/jul/20 pass: 5,645; fail: 1; error: 1
Build 6: aarch64/2019/jul/23 pass: 5,644; fail: 3
Build 7: aarch64/2019/jul/25 pass: 5,644; fail: 3
Build 8: aarch64/2019/jul/30 pass: 5,645; fail: 2
Build 9: aarch64/2019/aug/01 pass: 5,646; fail: 1
Build 10: aarch64/2019/aug/03 pass: 5,646; fail: 1
Build 11: aarch64/2019/aug/06 pass: 5,645; fail: 2
Build 12: aarch64/2019/aug/08 pass: 5,646; fail: 1
Build 13: aarch64/2019/aug/10 pass: 5,646; fail: 1
Build 14: aarch64/2019/nov/12 pass: 5,652

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 8,604; fail: 521; error: 25
Build 1: aarch64/2019/jul/04 pass: 8,601; fail: 523; error: 26
Build 2: aarch64/2019/jul/09 pass: 8,606; fail: 515; error: 29
Build 3: aarch64/2019/jul/16 pass: 8,593; fail: 531; error: 30
Build 4: aarch64/2019/jul/18 pass: 8,618; fail: 527; error: 26
Build 5: aarch64/2019/jul/20 pass: 8,619; fail: 519; error: 33
Build 6: aarch64/2019/jul/23 pass: 8,616; fail: 525; error: 30
Build 7: aarch64/2019/jul/25 pass: 8,620; fail: 528; error: 23
Build 8: aarch64/2019/jul/30 pass: 8,610; fail: 529; error: 32
Build 9: aarch64/2019/aug/01 pass: 8,620; fail: 527; error: 24
Build 10: aarch64/2019/aug/03 pass: 8,596; fail: 552; error: 23
Build 11: aarch64/2019/aug/06 pass: 8,616; fail: 528; error: 27
Build 12: aarch64/2019/aug/08 pass: 8,649; fail: 504; error: 18
Build 13: aarch64/2019/aug/10 pass: 8,647; fail: 507; error: 17
Build 14: aarch64/2019/nov/12 pass: 8,650; fail: 513; error: 16

3 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 3,962
Build 1: aarch64/2019/jul/04 pass: 3,962
Build 2: aarch64/2019/jul/09 pass: 3,962
Build 3: aarch64/2019/jul/16 pass: 3,963
Build 4: aarch64/2019/jul/18 pass: 3,964
Build 5: aarch64/2019/jul/20 pass: 3,964
Build 6: aarch64/2019/jul/23 pass: 3,964
Build 7: aarch64/2019/jul/25 pass: 3,964
Build 8: aarch64/2019/jul/30 pass: 3,964
Build 9: aarch64/2019/aug/01 pass: 3,964
Build 10: aarch64/2019/aug/03 pass: 3,964
Build 11: aarch64/2019/aug/06 pass: 3,964
Build 12: aarch64/2019/aug/08 pass: 3,964
Build 13: aarch64/2019/aug/10 pass: 3,964
Build 14: aarch64/2019/nov/12 pass: 3,964

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.54x
Relative performance: Server critical-jOPS (nc): 8.89x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 204.57

Server 204.57 / Server 2014-04-01 (71.00): 2.88x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-03 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/183/results/
2019-07-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/185/results/
2019-07-10 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/190/results/
2019-07-16 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/197/results/
2019-07-19 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/199/results/
2019-07-21 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/201/results/
2019-07-24 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/204/results/
2019-07-26 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/206/results/
2019-07-31 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/211/results/
2019-08-02 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/213/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/215/results/
2019-08-07 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/218/results/
2019-08-09 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/220/results/
2019-08-11 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/222/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/316/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/

From felix.yang at huawei.com  Wed Nov 13 02:35:39 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Wed, 13 Nov 2019 02:35:39 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036DF6@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Wednesday, November 13, 2019 3:00 AM
> To: Erik Österlund <erik.osterlund at oracle.com>; Yangfei (Felix)
> <felix.yang at huawei.com>; Andrew Dinn <adinn at redhat.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> On 11/12/19 5:38 PM, Erik Österlund wrote:
> > My hope is that the AArch64 port should use inline assembly as you suggest,
> so we can see that the generated code is correct, as we wait for the glorious
> future where all HotSpot code has been rewritten to work with seq_cst (and we
> are *not* there now).
> 
> I don't doubt it. :-)
> 
> But my arguments about the C++ intrinsics being well-enough defined, at least
> on AArch64 Linux, have not changed, and I'm not going to argue all that again.
> I'll grant you that there may well be issues on various x86 compilers, but that
> isn't relevant here.

Looks like I reignited an old discussion :- )

> > Now it looks like you have discovered that we sometimes have double
> > trailing dmb ish, and sometimes lacking leading dmb ish if I am
> > reading this right. That seems to make the case stronger,
> 
> Sure, we can use inline asm if there's no other way to do it, but I don't think
> that's necessary. All we need is to use
> 
>   T res;
>   __atomic_exchange(dest, &exchange_value, &res, __ATOMIC_RELEASE);
>   FULL_MEM_BARRIER;
> 

When we go the C++ intrinsics way, we should also handle Atomic::PlatformCmpxchg.  
When I compile the following function with GCC 4.9.3: 

long foo(long exchange_value, long volatile* dest, long compare_value)
{
  long val = __sync_val_compare_and_swap(dest, compare_value, exchange_value);
  return val;
}

I got:

.L2:
        ldaxr   x0, [x1]
        cmp     x0, x2
        bne     .L3
        stlxr   w4, x3, [x1]
        cbnz    w4, .L2
.L3:


Proposed patch:
diff -r 846fee5ea75e src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
--- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:27:06 2019 +0900
+++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:14:58 2019 +0800
@@ -52,7 +52,7 @@
                                                      T volatile* dest,
                                                      atomic_memory_order order) const {
   STATIC_ASSERT(byte_size == sizeof(T));
-  T res = __sync_lock_test_and_set(dest, exchange_value);
+  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
   FULL_MEM_BARRIER;
   return res;
 }
@@ -70,7 +70,11 @@
                               __ATOMIC_RELAXED, __ATOMIC_RELAXED);
     return value;
   } else {
-    return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
+    T value = compare_value;
+    __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
+                              __ATOMIC_RELEASE, __ATOMIC_RELAXED);
+    FULL_MEM_BARRIER;
+    return value;
   }
 }
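
For anyone who wants to try the proposed shape outside of HotSpot, here is a minimal standalone sketch of the two operations touched by the patch above. It is not the HotSpot code itself: the function names are made up for illustration, and FULL_MEM_BARRIER is assumed to be equivalent to __sync_synchronize(), which should be checked against the HotSpot sources. Only GCC/Clang __atomic builtins are used.

```cpp
// Assumption: stand-in for HotSpot's FULL_MEM_BARRIER (a full two-way fence).
#define FULL_MEM_BARRIER __sync_synchronize()

// Hypothetical name. Release-ordered exchange followed by a full barrier,
// mirroring the first hunk of the proposed patch. Returns the old value.
template <typename T>
T xchg_release_then_fence(T exchange_value, T volatile* dest) {
  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
  FULL_MEM_BARRIER;
  return res;
}

// Hypothetical name. Strong compare-and-exchange with release ordering on
// success, followed by a full barrier, mirroring the second hunk.
// Returns the value observed in *dest: equal to compare_value on success,
// the conflicting value on failure.
template <typename T>
T cmpxchg_release_then_fence(T exchange_value, T volatile* dest, T compare_value) {
  T value = compare_value;
  __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
                            __ATOMIC_RELEASE, __ATOMIC_RELAXED);
  FULL_MEM_BARRIER;
  return value;
}
```

Compiling this for AArch64 and inspecting the generated assembly is one way to check which barriers actually get emitted by a given compiler version.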

From erik.osterlund at oracle.com  Wed Nov 13 08:38:25 2019
From: erik.osterlund at oracle.com (Erik Österlund)
Date: Wed, 13 Nov 2019 09:38:25 +0100
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
Message-ID: <722749eb-12f5-16d7-f498-4147a2d32cd9@oracle.com>

Hi Andrew,

On 2019-11-12 20:00, Andrew Haley wrote:
> But my arguments about the C++ intrinsics being well-enough defined,
> at least on AArch64 Linux, have not changed, and I'm not going to
> argue all that again. I'll grant you that there may well be issues on
> various x86 compilers, but that isn't relevant here.

I also do not want to revive that discussion at this time. So I'm just 
going to note the way we think about this is... intrinsically different. 
With that said, I believe my work here is done. Intrinsic puzzle away. ;-)

/Erik

From felix.yang at huawei.com  Wed Nov 13 08:36:41 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Wed, 13 Nov 2019 08:36:41 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com> 
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Yangfei (Felix)
> Sent: Wednesday, November 13, 2019 10:36 AM
> To: 'Andrew Haley' <aph at redhat.com>; Erik Österlund
> <erik.osterlund at oracle.com>; Andrew Dinn <adinn at redhat.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> > -----Original Message-----
> > From: Andrew Haley [mailto:aph at redhat.com]
> > Sent: Wednesday, November 13, 2019 3:00 AM
> > To: Erik Österlund <erik.osterlund at oracle.com>; Yangfei (Felix)
> > <felix.yang at huawei.com>; Andrew Dinn <adinn at redhat.com>;
> > aarch64-port-dev at openjdk.java.net
> > Cc: hotspot-dev at openjdk.java.net
> > Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of
> > atomic operations
> >
> > On 11/12/19 5:38 PM, Erik Österlund wrote:
> > > My hope is that the AArch64 port should use inline assembly as you
> > > suggest,
> > so we can see that the generated code is correct, as we wait for the
> > glorious future where all HotSpot code has been rewritten to work with
> > seq_cst (and we are *not* there now).
> >
> > I don't doubt it. :-)
> >
> > But my arguments about the C++ intrinsics being well-enough defined,
> > at least on AArch64 Linux, have not changed, and I'm not going to argue all
> that again.
> > I'll grant you that there may well be issues on various x86 compilers,
> > but that isn't relevant here.
> 
> Looks like I reignited an old discussion :- )
> 
> > > Now it looks like you have discovered that we sometimes have double
> > > trailing dmb ish, and sometimes lacking leading dmb ish if I am
> > > reading this right. That seems to make the case stronger,
> >
> > Sure, we can use inline asm if there's no other way to do it, but I
> > don't think that's necessary. All we need is to use
> >
> >   T res;
> >   __atomic_exchange(dest, &exchange_value, &res, __ATOMIC_RELEASE);
> >   FULL_MEM_BARRIER;
> >
> 
> When we go the C++ intrinsics way, we should also handle
> Atomic::PlatformCmpxchg.
> When I compile the following function with GCC 4.9.3:
> 
> long foo(long exchange_value, long volatile* dest, long compare_value) {
>   long val = __sync_val_compare_and_swap(dest, compare_value, exchange_value);
>   return val;
> }
> 
> I got:
> 
> .L2:
>         ldaxr   x0, [x1]
>         cmp     x0, x2
>         bne     .L3
>         stlxr   w4, x3, [x1]
>         cbnz    w4, .L2
> .L3:
> 
> 
> Proposed patch:
> diff -r 846fee5ea75e src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
> --- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:27:06 2019 +0900
> +++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:14:58 2019 +0800
> @@ -52,7 +52,7 @@
>                                                       T volatile* dest,
>                                                       atomic_memory_order order) const {
>    STATIC_ASSERT(byte_size == sizeof(T));
> -  T res = __sync_lock_test_and_set(dest, exchange_value);
> +  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
>    FULL_MEM_BARRIER;
>    return res;
>  }
> @@ -70,7 +70,11 @@
>                                __ATOMIC_RELAXED, __ATOMIC_RELAXED);
>      return value;
>    } else {
> -    return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
> +    T value = compare_value;
> +    __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
> +                              __ATOMIC_RELEASE, __ATOMIC_RELAXED);
> +    FULL_MEM_BARRIER;
> +    return value;
>    }
>  }


Is this still not strong enough, considering that the first ldxr of the loop may be speculated?

v2 patch: 
diff -r 846fee5ea75e src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
--- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:27:06 2019 +0900
+++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 16:33:16 2019 +0800
@@ -52,7 +52,7 @@
                                                      T volatile* dest,
                                                      atomic_memory_order order) const {
   STATIC_ASSERT(byte_size == sizeof(T));
-  T res = __sync_lock_test_and_set(dest, exchange_value);
+  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
   FULL_MEM_BARRIER;
   return res;
 }
@@ -70,7 +70,12 @@
                               __ATOMIC_RELAXED, __ATOMIC_RELAXED);
     return value;
   } else {
-    return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
+    T value = compare_value;
+    FULL_MEM_BARRIER;
+    __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
+                              __ATOMIC_RELAXED, __ATOMIC_RELAXED);
+    FULL_MEM_BARRIER;
+    return value;
   }
 }
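
To make the difference from the first version easier to see, here is a standalone sketch of the v2 cmpxchg shape: a relaxed compare-and-exchange bracketed by two full barriers. The function names are hypothetical, and FULL_MEM_BARRIER is assumed to be equivalent to __sync_synchronize(); this is an illustration of the idea, not the HotSpot implementation.

```cpp
// Assumption: stand-in for HotSpot's FULL_MEM_BARRIER (a full two-way fence).
#define FULL_MEM_BARRIER __sync_synchronize()

// Hypothetical name. Relaxed strong CAS with a full barrier before and after,
// so the ordering does not depend on the store-exclusive being executed.
// Returns the value observed in *dest.
template <typename T>
T cmpxchg_two_fences(T exchange_value, T volatile* dest, T compare_value) {
  T value = compare_value;
  FULL_MEM_BARRIER;
  __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
                            __ATOMIC_RELAXED, __ATOMIC_RELAXED);
  FULL_MEM_BARRIER;
  return value;
}

// Hypothetical usage: a CAS retry loop that increments a counter and
// returns the new value.
long cas_increment(long volatile* counter) {
  long old_value = *counter;
  for (;;) {
    long seen = cmpxchg_two_fences(old_value + 1, counter, old_value);
    if (seen == old_value) {
      return old_value + 1;  // our update was installed
    }
    old_value = seen;        // lost the race; retry with the observed value
  }
}
```

Note that the leading barrier is executed even when the compare fails, which is the case the trailing-barrier-only version does not cover with a release store.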


From adinn at redhat.com  Wed Nov 13 08:55:57 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Wed, 13 Nov 2019 08:55:57 +0000
Subject: [aarch64-port-dev ] RFR 8231841: AArch64: Add entry to pns
 output in help()
In-Reply-To: <BCE2D257-945C-4BFC-930E-D8922C59B6A6@arm.com>
References: <BCE2D257-945C-4BFC-930E-D8922C59B6A6@arm.com>
Message-ID: <eca78d92-145e-8ab8-ab5a-6a848464b0fe@redhat.com>


On 12/11/2019 18:03, Alan Hayward wrote:
> Please could you review this change which adds AArch64 to the pns section of the help() output.
> 
> Bug: https://bugs.openjdk.java.net/browse/JDK-8231841
> Webrev: http://cr.openjdk.java.net/~smonteith/8231841/webrev.0/
> 
> 
> Built and ran tier1 on x86 and AArch64.
Yes, that's good to push thanks.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From aph at redhat.com  Wed Nov 13 09:00:21 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 13 Nov 2019 09:00:21 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
Message-ID: <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>

On 11/13/19 8:36 AM, Yangfei (Felix) wrote:
> Still not strong enough? considering the first of ldxr of the loop may be speculated.  

Come on now, you must have read the thread on kernel-dev you pointed me to.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Wed Nov 13 09:26:37 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Wed, 13 Nov 2019 09:26:37 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Wednesday, November 13, 2019 5:00 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>; Erik Österlund
> <erik.osterlund at oracle.com>; Andrew Dinn <adinn at redhat.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> On 11/13/19 8:36 AM, Yangfei (Felix) wrote:
> > Still not strong enough? considering the first of ldxr of the loop may be
> speculated.
> 
> Come on now, you must have read the thread on kernel-dev you pointed me to.
> 

Yes, the cmpxchg case is different here.  
So is the v2 patch in my previous mail approved?  
I will create a bug and do the necessary testing.  

Thanks,
Felix

From patrick at os.amperecomputing.com  Wed Nov 13 09:35:36 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Wed, 13 Nov 2019 09:35:36 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <4a067ad4-a7e8-813e-2633-91bb380f24b5@bell-sw.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <9248d1b3-7341-68b9-7dd1-02dfd48567a6@bell-sw.com>
 <MN2PR01MB609316337FABB9DE034156EA8F780@MN2PR01MB6093.prod.exchangelabs.com>
 <MN2PR01MB6093E9C78B79DD6C4F6692CF8F770@MN2PR01MB6093.prod.exchangelabs.com>
 <4a067ad4-a7e8-813e-2633-91bb380f24b5@bell-sw.com>
Message-ID: <MN2PR01MB6093CEF81AA8FCD02C9BCEA88F760@MN2PR01MB6093.prod.exchangelabs.com>

Many thanks for the comments and the suggested updates. I compared the number of branches taken, counted from the large-loop condition to the small-loop label. The early sub and the branch to the small loop nicely reduce the count from 2 (v3) to 1 (v4) for the StrLen=128 case. The base prefetches beyond the boundary, so the comparison might be unfair. So far my test results on Ampere eMAG systems look fine with v4; the 128 LL case is even slightly better than base. In theory the additional branch for >= 200 (192 + 8) is still there; if the perf diffs are not obvious, the reason might be that the large loop takes the majority of the execution time, while the branch's cost is minor.

LL, SoftwarePrefetchHintDistance=192
StrLen=128, base: 2 (prefetch out of boundary, the 1st br condition failed), patch.v3: 2 (to NO_PREFETCH), patch.v4: 1 (to SMALL_LOOP)
StrLen=256, base: 2, patch.v3: 3, patch.v4: 3 br + 1 b (SMALL_LOOP_CHECK)
StrLen=512, base: 6, patch.v3: 7, patch.v4: 7 br + 1 b (SMALL_LOOP_CHECK)
http://cr.openjdk.java.net/~qpzhang/8229351/webrev.04
The additional b (SMALL_LOOP_CHECK) makes the code cleaner, but want to keep the original subs and br (LT, TAIL), so keep as-is in v4.
Tested jtreg and strcmp microbenchmarks for LL/UU as smoke tests; no obvious regression. The LU/UL and other parts were not changed in v4, so the previous tests still cover them.

Regards
Patrick

-----Original Message-----
From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com> 
Sent: Wednesday, November 13, 2019 12:10 AM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Hi Patrick,

First, I'm a trespasser, not a reviewer. Reviewers will need to look at this.

On the technical side:

This additional branch in v3 is still painful. You can reduce the number of branches in the code path for lengths 128 and 256 by using the fact that CompareLongStringLimitLatin and CompareLongStringLimitUTF are at least 24. Then we don't have to jump to the NO_PREFETCH label, where the check for small string size is done. Instead we can jump to the SMALL_LOOP label (assuming the cnt2 counter is updated accordingly). In this case the NO_PREFETCH label is not needed and we have one less branch.

A rough sketch of the affected part, based on v3, looks as follows. This version was checked on ThunderX2 and it looks fine on lengths 128 and 256 perf-wise. I also added alignment for the small loop, which also helps a bit. Please keep in mind it's a sketch.

-Aleksei

@@ -4172,19 +4168,34 @@
     Register result = r0, str1 = r1, cnt1 = r2, str2 = r3, cnt2 = r4,
         tmp1 = r10, tmp2 = r11;
     Label SMALL_LOOP, LARGE_LOOP_PREFETCH, CHECK_LAST, DIFF2, TAIL,
-        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF,
+        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF, SMALL_LOOP_CHECK,
         DIFF_LAST_POSITION, DIFF_LAST_POSITION2;
     // exit from large loop when less than 64 bytes left to read or we're about
     // to prefetch memory behind array border
     int largeLoopExitCondition = MAX(64, SoftwarePrefetchHintDistance)/(isLL ? 1 : 2);
+    // calculate the remaining limit in chars which manages if this stub should be called,
+    // if the limit is large enough (>= largeLoopExitCondition), below large loop with prefetching
+    // can be executed at least once, and there is no need to do any extra checking at the entrance.
+    int remainingLimit = (isLL ? CompareLongStringLimitLatin : CompareLongStringLimitUTF) -
+                         (wordSize / (isLL ? 1 : 2));
     // cnt1/cnt2 contains amount of characters to compare. cnt1 can be re-used
     // update cnt2 counter with already loaded 8 bytes
-    __ sub(cnt2, cnt2, wordSize/(isLL ? 1 : 2));
+    if (SoftwarePrefetchHintDistance >= 0 && remainingLimit < largeLoopExitCondition) {
+      __ sub(cnt2, cnt2, isLL ? 24 : 12);
+    } else {
+      __ sub(cnt2, cnt2, isLL ? 8 : 4);
+    }
     // update pointers, because of previous read
     __ add(str1, str1, wordSize);
     __ add(str2, str2, wordSize);
     if (SoftwarePrefetchHintDistance >= 0) {
-      __ bind(LARGE_LOOP_PREFETCH);
+      if (remainingLimit < largeLoopExitCondition) {
+        // there could be fewer bytes left and invalid for this large loop with prefetching
+        __ subs(rscratch2, cnt2, largeLoopExitCondition - (isLL ? 16 : 8));
+        __ br(__ LT, SMALL_LOOP);
+        __ add(cnt2, cnt2, isLL ? 16 : 8);
+      }
+      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
         compare_string_16_bytes_same(DIFF, DIFF2);
@@ -4196,11 +4207,11 @@
         __ br(__ GT, LARGE_LOOP_PREFETCH);
         __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
     }
-    // less than 16 bytes left?
-    __ subs(cnt2, cnt2, isLL ? 16 : 8);
-    __ br(__ LT, TAIL);
+    __ b(SMALL_LOOP_CHECK); // check if less than 16 bytes left
+    __ align(OptoLoopAlignment);
     __ bind(SMALL_LOOP);
       compare_string_16_bytes_same(DIFF, DIFF2);
+      __ bind(SMALL_LOOP_CHECK);
       __ subs(cnt2, cnt2, isLL ? 16 : 8);
       __ br(__ GE, SMALL_LOOP);
     __ bind(TAIL);
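
To make the arithmetic in the sketch above concrete, here is a small standalone C++ model of how largeLoopExitCondition and remainingLimit decide whether the extra entry check is generated at all. The struct and function names are hypothetical, wordSize is assumed to be 8 as on AArch64, and the stub threshold is passed in as a plain parameter instead of being read from the CompareLongStringLimitLatin / CompareLongStringLimitUTF flags.

```cpp
#include <algorithm>

// Hypothetical helper type: the quantities the stub generator derives
// before emitting the large prefetching loop. All counts are in chars.
struct StubEntryPlan {
  int largeLoopExitCondition;  // minimum chars needed to run the large loop
  int remainingLimit;          // minimum chars left when the stub is entered
  bool needsEntryCheck;        // whether the extra subs/br pair is emitted
};

// Hypothetical function mirroring the expressions in the sketch.
// compareLongStringLimit plays the role of CompareLongStringLimitLatin (LL)
// or CompareLongStringLimitUTF (UU).
StubEntryPlan plan_stub_entry(bool isLL, int compareLongStringLimit,
                              int softwarePrefetchHintDistance) {
  const int wordSize = 8;               // assumption: 64-bit AArch64
  const int bytesPerChar = isLL ? 1 : 2;
  int exitCondition = std::max(64, softwarePrefetchHintDistance) / bytesPerChar;
  // 8 bytes have already been loaded and compared before the loop is reached.
  int remaining = compareLongStringLimit - wordSize / bytesPerChar;
  bool needsCheck = softwarePrefetchHintDistance >= 0 && remaining < exitCondition;
  return {exitCondition, remaining, needsCheck};
}
```

With a threshold of 72 and a prefetch distance of 64, the LL case yields remainingLimit = 64 and largeLoopExitCondition = 64, so no entry check is needed; with a distance of 192 the check is generated. This matches the observation in the original RFR that the hard-coded 72 guaranteed at least 64 bytes for the prefetch path.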


On 12/11/2019 12:52, Patrick Zhang OS wrote:
> Ping...
>
> Hi Aleksei,
>
> Does the potential regression on my test system still exist? with the new patch webrev.03? If the added largeLoopExitCondition condition excluded your >128 chars strings from the large loop, and still caused performance drops, maybe hacking all the software prefetch hint distance and the checking condition to hardcoded 64, can be a good try. Although I think it would not be a right thing to do, in comparison with the similar logic in generate_compare_long_string_different_encoding. Thanks. 
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset 
>      if (SoftwarePrefetchHintDistance >= 0) {
> -      __ bind(LARGE_LOOP_PREFETCH);
> +      if (remainingLimit < largeLoopExitCondition) {
> +        // there could be fewer bytes left and invalid for this large loop with prefetching
> +        __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
> +        __ br(__ LT, NO_PREFETCH);
> +      }
> +      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
>          __ prfm(Address(str1, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str1, 64));
>          __ prfm(Address(str2, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str2, 64));
>
> Regards
> Patrick
>
> -----Original Message-----
> From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On 
> Behalf Of Patrick Zhang OS
> Sent: Thursday, November 7, 2019 6:56 PM
> To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub 
> threshold of string_compare intrinsic tunable
>
> Hi Aleksei,
>
> Thanks for testing it and the data. I only had the source of StringCompareBench.java [6], my numbers (the diffs) are within 2%, while the -207.12% looks quite weird. I initially did not add the condition to control the br, since generate_ compare_long_string_different_encoding has the similar unconditional br. By the way, the original logic allowed prefetching the memory behind array border, for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from going to the large loop with prefetching. Welcome further comments, thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/
>
> Regards
> Patrick
>
> From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Sent: Thursday, November 7, 2019 12:53 AM
> To: Patrick Zhang OS <patrick at os.amperecomputing.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub 
> threshold of string_compare intrinsic tunable
>
>
> Hi Patrick,
>
> I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
> Cortex A73                                Size     base (ns/op)  patched (ns/op)  Diff
> StringCompareBench.StringCompareLL        256      14422257,98   15302300,24      -6,10%
> StringCompareBench.StringCompareLL        512      27998036,21   28317818,08      -1,14%
>
> ThunderX2                                 Size     base (ns/op)  patched (ns/op)  Diff
> StringCompareBench.StringCompareLL        128      4265122,232   13099099,67      -207,12%
> StringCompareBench.StringCompareLL        256      3539452,533   3599407,432      -1,69%
>
> StringCompareBench.StringCompareUU        128      6899938,75    7174601,241      -3,98%
> StringCompareBench.StringCompareUU        256      7654538,841   7826599,466      -2,25%
>
> StringCompareBench.cachedStringCompareLL  128      19,673        21,242           -7,98%
> StringCompareBench.cachedStringCompareLL  256      34,179        36,452           -6,65%
> StringCompareBench.cachedStringCompareLL  512      59,574        64,088           -7,58%
> StringCompareBench.cachedStringCompareLL  1024     110,37        118,477          -7,35%
> StringCompareBench.cachedStringCompareLL  1000000  114028,907    115388,681       -1,19%
>
> StringCompareBench.cachedStringCompareUU  128      33,752        36,922           -9,39%
> StringCompareBench.cachedStringCompareUU  256      60,939        64,096           -5,18%
> StringCompareBench.cachedStringCompareUU  512      115,328       118,48           -2,73%
> StringCompareBench.cachedStringCompareUU  1024     239,332       242,97           -1,52%
> StringCompareBench.cachedStringCompareUU  1000000  226491,096    233638,328       -3,16%
>
> It might be the case that the newly added branch is the culprit:
>
> +      __ subs(rscratch2, cnt2, largeLoopExitCondition);
> +      __ br(__ LT, NO_PREFETCH);
>
> Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:
>
> if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
>      __ subs(rscratch2, cnt2, largeLoopExitCondition);
>      __ br(__ LT, NO_PREFETCH);
> }
>
> and in this case we shouldn't see any performance penalties.
>
> -Aleksei
>
> On 29/10/2019 12:58, Patrick Zhang OS wrote:
>
> Hi,
>
>
>
> Could you please review this patch, thanks.
>
>
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
>
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
>
> (this starts from .02 since there had been some internal review and 
> updates)
>
>
>
> Changes:
>
>
>
> 1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
>
>
> 2.       MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
>
>
> 3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
>
>
> 4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
>
>
> 5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
>
>
> Tests:
>
>
>
>   1.  For function check, I have run
>
>
>
> jdk jtreg tier1 tests, with default vm flags
>
>
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
>
>
> jck10/api/java.lang 1609 cases and other selected modules, no new 
> failures found, with default vm flags and "-Xcomp 
> -XX:-TieredCompilation" respectively;
>
>
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
>
>
>   1.  For performance check, I have run
>
>
>
> string-density-bench/CompareToBench.java [5] and 
> StringCompareBench.java [6] respectively,
>
>
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
>
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
>
>
> Refs:
>
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
>
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: 
> String.compareTo() can read memory after string
>
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, 
> contributed by Dmitrij Pochepko
>
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize 
> string compare intrinsic
>
> [5] 
> http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, 
> contributed by Aleksey Shipilev
>
> [6] 
> http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, 
> contributed by Dmitrij Pochepko
>
>
>
> Regards
>
> Patrick
>
>


From aph at redhat.com  Wed Nov 13 09:39:05 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 13 Nov 2019 09:39:05 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
Message-ID: <9aa92f57-ea8a-7e0a-2218-bf360a13c46e@redhat.com>

On 11/13/19 9:26 AM, Yangfei (Felix) wrote:
> Yes, the cmpxchg case is different here.  
> So the v2 patch in my previous mail approved?  
> Will create a bug and do necessary testing.  

I don't know which patch is v2, but for the reasons carefully laid out
in the kernel-dev thread we don't need two full barriers. The first
version of Atomic::PlatformCmpxchg you posted is OK.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Wed Nov 13 09:46:55 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Wed, 13 Nov 2019 09:46:55 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <9aa92f57-ea8a-7e0a-2218-bf360a13c46e@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
 <9aa92f57-ea8a-7e0a-2218-bf360a13c46e@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6036F18@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Wednesday, November 13, 2019 5:39 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>; Erik Österlund
> <erik.osterlund at oracle.com>; Andrew Dinn <adinn at redhat.com>;
> aarch64-port-dev at openjdk.java.net
> Cc: hotspot-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
> operations
> 
> On 11/13/19 9:26 AM, Yangfei (Felix) wrote:
> > Yes, the cmpxchg case is different here.
> > So the v2 patch in my previous mail approved?
> > Will create a bug and do necessary testing.
> 
> I don't know which patch is v2, but for the reasons carefully laid out in the
> kernel-dev thread we don't need two full barriers. The first version of
> Atomic::PlatformCmpxchg you posted is OK.
> 

Well, I think the cmpxchg case is different: the compare in the loop may fail, and then we don't get a chance to execute the stlxr instruction.
This is explicitly discussed in that thread: https://patchwork.kernel.org/patch/3575821/
As a result, aarch64 Linux plants two barriers in that patch:

@@ -112,17 +114,20 @@  static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 	unsigned long tmp;
 	int oldval;
 
+	smp_mb();              < ========
+
 	asm volatile("// atomic_cmpxchg\n"
-"1:	ldaxr	%w1, %2\n"
+"1:	ldxr	%w1, %2\n"
 "	cmp	%w1, %w3\n"
 "	b.ne	2f\n"
-"	stlxr	%w0, %w4, %2\n"
+"	stxr	%w0, %w4, %2\n"
 "	cbnz	%w0, 1b\n"
 "2:"
 	: "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
 	: "Ir" (old), "r" (new)
 	: "cc", "memory");
 
+	smp_mb();             < ========
 	return oldval;
 }


That's why I switched to the V2 patch:

diff -r 846fee5ea75e src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp
--- a/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 10:27:06 2019 +0900
+++ b/src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp Wed Nov 13 16:33:16 2019 +0800
@@ -52,7 +52,7 @@
                                                      T volatile* dest,
                                                      atomic_memory_order order) const {
   STATIC_ASSERT(byte_size == sizeof(T));
-  T res = __sync_lock_test_and_set(dest, exchange_value);
+  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
   FULL_MEM_BARRIER;
   return res;
 }
@@ -70,7 +70,12 @@
                               __ATOMIC_RELAXED, __ATOMIC_RELAXED);
     return value;
   } else {
-    return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
+    T value = compare_value;
+    FULL_MEM_BARRIER;
+    __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
+                              __ATOMIC_RELAXED, __ATOMIC_RELAXED);
+    FULL_MEM_BARRIER;
+    return value;
   }
 }
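
For illustration only, here is a minimal standalone sketch of the two
operations the V2 patch touches. It assumes the GCC/Clang __atomic and
__sync builtins. SKETCH_FULL_MEM_BARRIER is a stand-in for HotSpot's
FULL_MEM_BARRIER (assumed here to be a plain full fence), and the function
names are hypothetical, not the HotSpot ones. The point it shows: a relaxed
compare-exchange bracketed by two full barriers stays fully fenced even when
the compare fails and no store is executed.

```cpp
#include <cassert>

// Assumption: stand-in for HotSpot's FULL_MEM_BARRIER, implemented with the
// GCC/Clang full-fence builtin.
#define SKETCH_FULL_MEM_BARRIER __sync_synchronize()

// Hypothetical helper mirroring the xchg hunk: a release exchange followed
// by a trailing full barrier.
template <typename T>
inline T sketch_xchg(T volatile* dest, T exchange_value) {
  T res = __atomic_exchange_n(dest, exchange_value, __ATOMIC_RELEASE);
  SKETCH_FULL_MEM_BARRIER;
  return res;
}

// Hypothetical helper mirroring the cmpxchg hunk: the compare-exchange itself
// is relaxed, and ordering comes from the leading and trailing full barriers.
// Returns the value observed in *dest (equal to compare_value on success).
template <typename T>
inline T sketch_cmpxchg_conservative(T volatile* dest, T compare_value,
                                     T exchange_value) {
  T value = compare_value;
  SKETCH_FULL_MEM_BARRIER;
  __atomic_compare_exchange(dest, &value, &exchange_value, /*weak*/false,
                            __ATOMIC_RELAXED, __ATOMIC_RELAXED);
  SKETCH_FULL_MEM_BARRIER;
  return value;
}

// Single-threaded sanity check of the return-value contract only; it says
// nothing about memory ordering.
inline void sketch_self_check() {
  volatile int v = 5;
  assert(sketch_cmpxchg_conservative(&v, 5, 7) == 5);  // succeeds, v becomes 7
  assert(sketch_cmpxchg_conservative(&v, 5, 9) == 7);  // fails, v stays 7
  assert(sketch_xchg(&v, 1) == 7);                     // returns the old value
  assert(v == 1);
}
```

Whether the second barrier is actually needed by any HotSpot caller is the
open question in this thread; the sketch only shows the shape of the
conservative variant.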

From Pengfei.Li at arm.com  Wed Nov 13 09:55:48 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Wed, 13 Nov 2019 09:55:48 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
	conditionally allocatable
Message-ID: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi,

JBS: https://bugs.openjdk.java.net/browse/JDK-8233743
Webrev: http://cr.openjdk.java.net/~pli/rfr/8233743/webrev.00/

This is a follow-up patch to JDK-8217909[1] to make the AArch64 register
r27 allocatable when CompressedOops and CompressedClassPointers are both
turned off.

The following changes have been made:
- Massage the RegMask(s) in reg_mask_init() at C2 initialization and
remove r27 from some of the masks conditionally to make it allocatable.
- Also make r29 conditionally reserved in a similar way.
- Make r29 allocatable for pointers as well as integers.
- Replace a use of rheapbase with rscratch1 in AArch64 ZGC.
- Revert JDK-8231754[2] which makes r27 always reserved in JVMCI.

This patch aligns with the implementation in [1] which makes the x86_64
r12 register allocatable. Please let me know if I have missed anything
for AArch64.
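
As a rough illustration of the idea (not the code in the webrev), the
conditional reservation can be modelled with a toy register mask. All names
below are hypothetical, the mask only tracks r27 and r29 and ignores every
other reserved register, and tying r29 to PreserveFramePointer is an
assumption based on the test options listed in this mail.

```cpp
#include <bitset>
#include <cassert>

// Toy model: one bit per general-purpose register r0..r31.
using RegMaskSketch = std::bitset<32>;

constexpr int kR27 = 27;  // heap base register (rheapbase)
constexpr int kR29 = 29;  // frame pointer register

// Hypothetical snapshot of the relevant VM flags.
struct VmFlagsSketch {
  bool use_compressed_oops;
  bool use_compressed_class_pointers;
  bool preserve_frame_pointer;
};

// Compute the toy "allocatable" mask once, at compiler initialization.
inline RegMaskSketch allocatable_mask_sketch(const VmFlagsSketch& flags) {
  RegMaskSketch mask;
  mask.set();  // toy simplification: start with every register allocatable
  // r27 is needed as the heap base if either compression mode is on.
  if (flags.use_compressed_oops || flags.use_compressed_class_pointers) {
    mask.reset(kR27);
  }
  // Assumption: r29 stays reserved only when the frame pointer is preserved.
  if (flags.preserve_frame_pointer) {
    mask.reset(kR29);
  }
  return mask;
}

// Sanity check of the two conditions described above.
inline void reg_mask_self_check() {
  assert(allocatable_mask_sketch({false, false, false}).test(kR27));
  assert(!allocatable_mask_sketch({true, false, false}).test(kR27));
  assert(!allocatable_mask_sketch({false, true, false}).test(kR27));
  assert(!allocatable_mask_sketch({false, false, true}).test(kR29));
}
```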

Tests:
Full jtreg with default options and extra options "-XX:-UseCompressedOops
-XX:+PreserveFramePointer". No new failures were found.

[1] https://hg.openjdk.java.net/jdk/jdk/rev/48b50573dee4
[2] https://hg.openjdk.java.net/jdk/jdk/rev/d068b1e534de

--
Thanks,
Pengfei


From aph at redhat.com  Wed Nov 13 10:38:25 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 13 Nov 2019 10:38:25 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036F18@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
 <9aa92f57-ea8a-7e0a-2218-bf360a13c46e@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036F18@dggeml527-mbx.china.huawei.com>
Message-ID: <1345fade-e1b4-42f0-c86f-9fd518431fcf@redhat.com>

On 11/13/19 9:46 AM, Yangfei (Felix) wrote:
> That's why I switched to the V2 patch:

I see.

This seems excessive. I doubt that there is any code in HotSpot that
relies on such things, especially given that we've managed with mere
sequential consistency for CMPXCHG for so long, but if you want to go
for the full Howitzer I won't try to stop you.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov 13 12:27:26 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 13 Nov 2019 12:27:26 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>

On 10/29/19 9:58 AM, Patrick Zhang OS wrote:

> 1.  Split the STUB_THRESHOLD from the hard-coded 72 to be
> CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more
> flexible control over the stub thresholds for string_compare
> intrinsics, especially for various uArchs.
> 
> 2.  MacroAssembler::string_compare LL and UU shared the same
> threshold, actually UU may only require the half (length of chars)
> of that of LL's, because one character has two-bytes for UU, while
> for compacted LL strings, one character means one byte. In addition,
> LU/UL may need a separated threshold, as the stub function is
> different from the same encoding one, and the performance may vary
> as well.
> 
> 3.  In generate_compare_long_string_same_encoding, the hard-coded 72
> was originally able to ensure that there can be always 64 bytes at
> least for the prefetch code path. However once a smaller stub
> threshold is set, a new condition is needed to tell if this would be
> still valid, or has to go to the NO_PREFETCH branch. This change can
> ensure the correctness.
> 
> 4.  In generate_compare_long_string_different_encoding, some temp
> vars for handling the last 4 characters are not valid any longer,
> cleaned up strU and strL, and related pointers initialization to the
> next U (cnt1) and L (tmp2).
> 
> 5.  In compare_string_16_x_LU, the reference to r10 (tmp1) is not
> needed, as tmpU or tmpL point to the same register.

Thank you for your patch, but I'm afraid that I have some reservations.

This patch seems to do rather a lot.

What are the thresholds you tested? How are we supposed to test with
these different thresholds? Are the thresholds bytes or characters?
Why are the different thresholds not tested in this patch?

But the more serious problem is the fact that we have different code
paths for different microarchitectures, and somehow this has to be
standard supportable software. In order to test this stuff we'll need
different test parameters for SoftwarePrefetchHintDistance,
CompareLongStringLimitLatin, CompareLongStringLimitUTF.

Bear in mind that while manufacturers are (entirely reasonably) very
keen to show their processors in the best light possible, they are not
the people who will have to support this software and debug it when it
goes wrong. So there is a fundamental conflict of interest between
support people and CPU vendors.

We already emit a great deal of in-line code in the string_compare
intrinsic, with the intention that this be as fast as possible because
we want to avoid having to call the intrinsic. So why is the intrinsic
actually faster in your case? Could we not concentrate on that?

I -- and I'm sure it's not just me -- would be tremendously grateful
if all of the AArch64 developers would concentrate on improving code
quality overall rather than tweaking stub parameters.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From daniel.daugherty at oracle.com  Wed Nov 13 14:45:36 2019
From: daniel.daugherty at oracle.com (Daniel D. Daugherty)
Date: Wed, 13 Nov 2019 09:45:36 -0500
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
Message-ID: <f93d7b5c-62fa-2143-0fe7-4a3ea8fb0f7a@oracle.com>

On 11/13/19 4:26 AM, Yangfei (Felix) wrote:
>> -----Original Message-----
>> From: Andrew Haley [mailto:aph at redhat.com]
>> Sent: Wednesday, November 13, 2019 5:00 PM
>> To: Yangfei (Felix) <felix.yang at huawei.com>; Erik Österlund
>> <erik.osterlund at oracle.com>; Andrew Dinn <adinn at redhat.com>;
>> aarch64-port-dev at openjdk.java.net
>> Cc: hotspot-dev at openjdk.java.net
>> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
>> operations
>>
>> On 11/13/19 8:36 AM, Yangfei (Felix) wrote:
>>> Still not strong enough? considering the first of ldxr of the loop may be
>> speculated.
>>
>> Come on now, you must have read the thread on kernel-dev you pointed me to.
>>
> Yes, the cmpxchg case is different here.
> So the v2 patch in my previous mail approved?
> Will create a bug and do necessary testing.

Is there a reason to not reopen this bug:

JDK-8233912 aarch64: minor improvements of atomic operations
https://bugs.openjdk.java.net/browse/JDK-8233912

Dan

>
> Thanks,
> Felix


From ci_notify at linaro.org  Thu Nov 14 02:36:08 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 14 Nov 2019 02:36:08 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <851383565.643.1573698968979.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/317/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/09 pass: 5,747; fail: 1
Build 1: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 2: aarch64/2019/oct/14 pass: 5,753
Build 3: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 4: aarch64/2019/oct/18 pass: 5,760
Build 5: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 6: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 7: aarch64/2019/oct/28 pass: 5,766
Build 8: aarch64/2019/oct/30 pass: 5,768
Build 9: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 10: aarch64/2019/nov/04 pass: 5,769
Build 11: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 12: aarch64/2019/nov/08 pass: 5,761
Build 13: aarch64/2019/nov/11 pass: 5,762
Build 14: aarch64/2019/nov/13 pass: 5,764; fail: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/09 pass: 8,692; fail: 507; error: 21
Build 1: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 2: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 3: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 4: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 5: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 6: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 7: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 8: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 9: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 10: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 11: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 12: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 13: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15
Build 14: aarch64/2019/nov/13 pass: 8,773; fail: 509; error: 21

3 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/09 pass: 3,979
Build 1: aarch64/2019/oct/11 pass: 3,979
Build 2: aarch64/2019/oct/14 pass: 3,979
Build 3: aarch64/2019/oct/16 pass: 3,979
Build 4: aarch64/2019/oct/18 pass: 3,979
Build 5: aarch64/2019/oct/21 pass: 3,979
Build 6: aarch64/2019/oct/23 pass: 3,980
Build 7: aarch64/2019/oct/28 pass: 3,980
Build 8: aarch64/2019/oct/30 pass: 3,980
Build 9: aarch64/2019/nov/01 pass: 3,980
Build 10: aarch64/2019/nov/04 pass: 3,980
Build 11: aarch64/2019/nov/06 pass: 3,980
Build 12: aarch64/2019/nov/08 pass: 3,980
Build 13: aarch64/2019/nov/11 pass: 3,980
Build 14: aarch64/2019/nov/13 pass: 3,980

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.63x
Relative performance: Server critical-jOPS (nc): 9.66x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-10-08 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/280/results/
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
2019-11-14 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/317/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From dean.long at oracle.com  Thu Nov 14 04:15:40 2019
From: dean.long at oracle.com (dean.long at oracle.com)
Date: Wed, 13 Nov 2019 20:15:40 -0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
	conditionally allocatable
In-Reply-To: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <b4de4a1c-9b25-1921-8a27-2946e7660dc2@oracle.com>

Hi Pengfei,

I took a quick look and didn't notice any problems.  Nice work! This
seems to match the x64 approach, however please get other reviews.

dl

On 11/13/19 1:55 AM, Pengfei Li (Arm Technology China) wrote:
> Hi,
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8233743
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8233743/webrev.00/
>
> This is a follow-up patch of JDK-8217909[1] to make the AArch64 register
> r27 allocatable when CompressedOops and CompressedClassPointers are both
> turned off.
>
> Below changes have been made:
> - Massage the RegMask(s) in reg_mask_init() at C2 initialization and
> remove r27 from some of the masks conditionally to make it allocatable.
> - Also make r29 conditionally reserved in this similar way.
> - Make r29 allocatable for pointers as well as integers.
> - Replace an rheapbase use to rscratch1 in AArch64 ZGC.
> - Revert JDK-8231754[2] which makes r27 always reserved in JVMCI.
>
> This patch aligns with the implementation in [1] which makes the x86_64
> r12 register allocatable. Please let me know if I have missed anything
> for AArch64.
>
> Tests:
> Full jtreg with default options and extra options "-XX:-UseCompressedOops
> -XX:+PreserveFramePointer". No new failure is found.
>
> [1] https://hg.openjdk.java.net/jdk/jdk/rev/48b50573dee4
> [2] https://hg.openjdk.java.net/jdk/jdk/rev/d068b1e534de
>
> --
> Thanks,
> Pengfei
>


From patrick at os.amperecomputing.com  Thu Nov 14 09:20:01 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Thu, 14 Nov 2019 09:20:01 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
Message-ID: <MN2PR01MB609388DD147A3BF07D2F6E348F710@MN2PR01MB6093.prod.exchangelabs.com>

Thanks for the comments; please see my answers below.

>> 1. This patch seems to do rather a lot.
Yes. It enables tuning of the stub parameters (the defaults are not changed in this patch), fixes an out-of-bounds prefetch for LL/UU, and removes some redundant instructions in the LU/UL code path.
The latter two are code-quality fixes. If splitting the patch would make the changes clearer, I'd be happy to do so.

>> 2. Are the thresholds bytes or characters?
All thresholds are (and should be) in characters. This was a little misleading before: for LL/LU/UL, the constant STUB_THRESHOLD meant chars, while for UU it could be read as bytes. With -XX:-CompactStrings, every code path goes to UU and the threshold would effectively mean bytes, which might confuse developers. This patch clarifies that, and the descriptions of the tunable options provide further guidance.
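
To make the chars-versus-bytes point concrete, here is a small sketch. The
names are hypothetical and the "length >= threshold goes to the stub" rule is
an assumption inferred from the ranges quoted in this mail, not taken from
MacroAssembler::string_compare. The same threshold in characters covers twice
as many bytes for UTF16 (UU) strings as for Latin1 (LL) strings.

```cpp
#include <cassert>

// LL strings hold Latin1 characters (1 byte), UU strings UTF16 (2 bytes).
enum class Encoding { Latin1, UTF16 };

// Number of bytes occupied by `chars` characters in the given encoding.
constexpr int bytes_for(int chars, Encoding enc) {
  return enc == Encoding::Latin1 ? chars : chars * 2;
}

// Assumed dispatch rule: lengths at or above the threshold (in characters)
// take the stub path, shorter strings stay on the inline path.
constexpr bool goes_to_stub(int chars, int threshold_chars) {
  return chars >= threshold_chars;
}

// Sanity check using the thresholds mentioned in this thread.
inline void threshold_self_check() {
  assert(bytes_for(72, Encoding::Latin1) == 72);
  assert(bytes_for(72, Encoding::UTF16) == 144);
  assert(!goes_to_stub(71, 72));   // default threshold: 71 chars stay inline
  assert(goes_to_stub(72, 72));
  assert(goes_to_stub(24, 24));    // lowered threshold: 24..71 chars use the stub
  assert(!goes_to_stub(254, 255)); // raised threshold: only >= 255 chars use it
}
```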

>> 3. How are we supposed to test with these different thresholds?
There are two jtreg tests that check the impact of SoftwarePrefetchHintDistance on the intrinsics. I have locally added non-default thresholds to them and tested with many lengths (this took days on a test system). These changes are not included in the proposed patch; maybe a follow-up patch could add them. Any advice?
hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength.java
hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java

>> 4. What are the thresholds you tested? 
First, the default threshold: the hard-coded 72 was my testing focus, since I want to avoid any negative impact on the normal state of the aarch64 port, especially for other CPU vendors.
Second, I tested two extreme thresholds, 24 and 255, which mean that shorter strings (24 to 71 chars) also go to the stub code path, or that only very long strings (>=255) do, respectively. Functional tests passed (listed in the initial email), while performance test results (with string-density-bench, StringCompareBench.java, and SPECjbb2015) vary across systems (and microarchitectures).
Third, some other non-default thresholds as a sanity check, particularly to ensure correctness.

>> 5. But the more serious problem is the fact that we have different code paths for different microarchitectures, and somehow this has to be standard supportable software. In order to test this stuff we'll need different test parameters for SoftwarePrefetchHintDistance, CompareLongStringLimitLatin, CompareLongStringLimitUTF
The STUB_THRESHOLD was introduced to control the stub code insertion and was tested on some aarch64 systems. I think making it tunable is the way to let different microarchitectures configure the optimal values for themselves. I would like to have a common threshold too, or no threshold at all, but we lack full-coverage tests across all systems. Maybe I misunderstood your point regarding "supportable"; the two new options can be left at their defaults if developers have no concerns about the string compare intrinsics.

>> 6. We already emit a great deal of in-line code in the string_compare intrinsic, with the intention that this be as fast as possible because we want to avoid having to call the intrinsic. So why is the intrinsic actually faster in your case? 
Avoid having to call the intrinsic? Per my test results with microbenchmarks like string-density-bench.jar, the LL cases can be up to 10x faster than the non-intrinsic path, and for some public benchmarks such as SPECjbb and Renaissance, where 99% of the string_compare calls are LL, the intrinsics definitely help a lot as well. If you did NOT mean "avoiding the intrinsic" completely, but only for strings shorter than 72 chars, I would have to say "it depends". The stub functions try to process 16 chars at a time, while the outer logic processes 8 bytes at a time, which is the major difference. For example, I see a consistent 1.5x speedup with lengths 24-71 for LU/UL cases; others may not, which may be the reason why we need an option here.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Wednesday, November 13, 2019 8:27 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 10/29/19 9:58 AM, Patrick Zhang OS wrote:

> 1.  Split the STUB_THRESHOLD from the hard-coded 72 to be 
> CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more 
> flexible control over the stub thresholds for string_compare 
> intrinsics, especially for various uArchs.
> 
> 2.  MacroAssembler::string_compare LL and UU shared the same 
> threshold, actually UU may only require the half (length of chars) of 
> that of LL's, because one character has two-bytes for UU, while for 
> compacted LL strings, one character means one byte. In addition, LU/UL 
> may need a separated threshold, as the stub function is different from 
> the same encoding one, and the performance may vary as well.
> 
> 3.  In generate_compare_long_string_same_encoding, the hard-coded 72 
> was originally able to ensure that there can be always 64 bytes at 
> least for the prefetch code path. However once a smaller stub 
> threshold is set, a new condition is needed to tell if this would be 
> still valid, or has to go to the NO_PREFETCH branch. This change can 
> ensure the correctness.
> 
> 4.  In generate_compare_long_string_different_encoding, some temp vars 
> for handling the last 4 characters are not valid any longer, cleaned 
> up strU and strL, and related pointers initialization to the next U 
> (cnt1) and L (tmp2).
> 
> 5.  In compare_string_16_x_LU, the reference to r10 (tmp1) is not 
> needed, as tmpU or tmpL point to the same register.

Thank you for your patch, but I'm afraid that I have some reservations.

This patch seems to do rather a lot.

What are the thresholds you tested? How are we supposed to test with these different thresholds? Are the thresholds bytes or characters?
Why are the different thresholds not tested in this patch?

But the more serious problem is the fact that we have different code paths for different microarchitectures, and somehow this has to be standard supportable software. In order to test this stuff we'll need different test parameters for SoftwarePrefetchHintDistance, CompareLongStringLimitLatin, CompareLongStringLimitUTF.

Bear in mind that while manufacturers are (entirely reasonably) very keen to show their processors in the best light possible, they are not the people who will have to support this software and debug it when it goes wrong. So there is a fundamental conflict of interest between support people and CPU vendors.

We already emit a great deal of in-line code in the string_compare intrinsic, with the intention that this be as fast as possible because we want to avoid having to call the intrinsic. So why is the intrinsic actually faster in your case? Could we not concentrate on that?

I -- and I'm sure it's not just me -- would be tremendously grateful if all of the AArch64 developers would concentrate on improving code quality overall rather than tweaking stub parameters.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From adinn at redhat.com  Thu Nov 14 09:26:45 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Thu, 14 Nov 2019 09:26:45 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
Message-ID: <ec65f381-902f-f5bb-3327-afc36f62d33e@redhat.com>

On 13/11/2019 12:27, Andrew Haley wrote:
> Thank you for your patch, but I'm afraid that I have some reservations.

I also have the same reservations.

> This patch seems to do rather a lot.
> 
> What are the thresholds you tested? How are we supposed to test with
> these different thresholds? Are the thresholds bytes or characters?
> Why are the different thresholds not tested in this patch?

I agree that we would really need some numbers in order to determine
whether to make this change. However, before we go down that path ...

> But the more serious problem is the fact that we have different code
> paths for different microarchitectures, and somehow this has to be
> standard supportable software. In order to test this stuff we'll need
> different test parameters for SoftwarePrefetchHintDistance,
> CompareLongStringLimitLatin, CompareLongStringLimitUTF.

The key word here is /supportable/. This current proposed change is the
start of a slippery slope where we can end up with a plethora of
'tuning' parameters, not just for different manufacturers but for this
year's model and then next year's model and so on. As the number and, more
importantly, combination of such parameters grows we can easily end up
in a situation where we are unable to generate a useful configuration
for all combinations of tuning parameters that meet the totality of
different application needs. Worse, we risk ending up in a situation
where we see terrible performance in the worst cases and no idea of how
we got there. This is a problem of complexity and tractability. Even if
we could in principle, given enough time, arrive at a global maximum or,
failing that, an optimal compromise that trades off competing needs the
danger is that in practice getting there can end up taking increasingly
large amounts of development and maintenance time that we don't have.
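
To put a rough number on the combinatorial growth described here, a
minimal Java sketch follows. It assumes, purely for illustration, that
each tuning flag is exercised at a handful of values; the class name
TuningMatrixSketch and the value counts are hypothetical and are not
taken from any patch.

```java
final class TuningMatrixSketch {
    // Number of distinct configurations when flag i can take valuesPerFlag[i] values.
    // multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
    static long configurations(int... valuesPerFlag) {
        long total = 1;
        for (int v : valuesPerFlag) {
            total = Math.multiplyExact(total, (long) v);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical: three flags, four candidate values each.
        System.out.println(configurations(4, 4, 4));
        // Hypothetical: the same, repeated across three microarchitectures.
        System.out.println(configurations(4, 4, 4, 3));
    }
}
```

Every extra flag multiplies the number of configurations that would need
testing, which is the tractability concern raised in this paragraph.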

So, the gains for any addition of tuning parameters need to be
significant if we are to justify the costs incurred by implementing and
maintaining them. It is not enough for such a tuning feature to optimize
a specific case, especially just for a specific architecture, by a
noticeable amount e.g. the 1.5x that you cite for your architecture. For
an improvement to be significant enough to merit the incurred support
burden the gain ought to apply to many applications or, perhaps, to a
few critical, high-value applications and needs manifestly not to risk
lowering performance in all other applications. It also ought, at the
least, to be shown not to hurt performance on other architectures and,
preferably, provide at least some other architectures with the
opportunity also to improve performance.

> I -- and I'm sure it's not just me -- would be tremendously grateful
> if all of the AArch64 developers would concentrate on improving code
> quality overall rather than tweaking stub parameters.

I can confirm that this is not just Andrew's sentiment.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From aph at redhat.com  Thu Nov 14 10:33:00 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 14 Nov 2019 10:33:00 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB609388DD147A3BF07D2F6E348F710@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
 <MN2PR01MB609388DD147A3BF07D2F6E348F710@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <2890e2c6-ed15-7f8a-5e88-83ed8c6af7ce@redhat.com>

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (not really changed any
> in this patch), fixed an out-of-boundary prefetching for LL/UU, and
> fixed some redundant instructions in LU/UL code path.  The latter
> two are code-quality-wise, if splitting the patch could make the
> changes clearer, I'd like to do.

Why do we care about out-of-boundary prefetching for LL/UU? I don't
think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little
> bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars,
> while for UU it could be explained as bytes. If specified
> -XX:-CompactStrings, all code path going to UU would make the
> threshold mean bytes, which might confuse developers. This patch can
> clarify it, and the description of tunable options can provide
> further guidance.

It must. Without some commentary both maintainers and developers are
lost. Unless there is some very strong reason, all counts must specify
units.
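
To make the units question concrete, here is a minimal Java sketch of
the char/byte relationship under compact strings, where a Latin-1 string
uses 1 byte per char and a UTF-16 string uses 2. The class and method
names are hypothetical and only illustrate the arithmetic; they do not
correspond to anything in HotSpot.

```java
final class ThresholdUnitsSketch {
    /** Encodings used by compact strings: 1 byte per char for Latin-1, 2 for UTF-16. */
    enum Coder {
        LATIN1(1), UTF16(2);

        final int bytesPerChar;

        Coder(int bytesPerChar) {
            this.bytesPerChar = bytesPerChar;
        }
    }

    // Convert a count given in characters to the equivalent count in bytes.
    static int charsToBytes(int chars, Coder coder) {
        return Math.multiplyExact(chars, coder.bytesPerChar);
    }

    // Convert a count given in bytes to whole characters (rounding down).
    static int bytesToChars(int bytes, Coder coder) {
        return bytes / coder.bytesPerChar;
    }

    public static void main(String[] args) {
        int thresholdChars = 72; // the hardcoded default mentioned in this thread
        for (Coder c : Coder.values()) {
            System.out.println(c + ": " + thresholdChars + " chars = "
                    + charsToBytes(thresholdChars, c) + " bytes");
        }
    }
}
```

A threshold of 72 read as bytes covers only 36 UTF-16 chars, so the same
number means two different cut-off points depending on the unit assumed.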

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impacts of
> SoftwarePrefetchHintDistance over the intrinsics, I have locally
> added non-default thresholds inside and tested with many lengths
> (took days on a test system). This has not been included in the
> proposed patch, maybe a follow-up one would do, any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength.java
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java

I won't accept this patch unless it is accompanied by test cases that
properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold, the hardcoded 72 is my testing focus
> since I would try best not to bring negative impacts to aarch64-port
> normal state, especially other CPU vendors.

> Second, I tested two extreme thresholds: 24 and 255, which means
> more shorter strings (24 to 71 chars) or only very long strings
> (>=255) could go to the stub code path, respectively. Function tests
> passed (listed in the initial email), while performance test results
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) could be varying with different systems (as well as
> microarchitectures).

> Third, some other non-default thresholds, as sanity check,
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.
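
As a sketch of what testing the extremes could look like, the Java
snippet below derives the lengths just below, at, and just above each
threshold. It assumes the simplified rule implied by the description
above (a length at or above the threshold goes to the stub); the real
dispatch in the intrinsic may involve further conditions. All names here
are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class StubThresholdSketch {
    // Assumed, simplified dispatch rule: lengths >= threshold take the stub path.
    static boolean takesStub(int lengthInChars, int thresholdInChars) {
        return lengthInChars >= thresholdInChars;
    }

    // For each threshold t, produce t-1, t and t+1 so both sides of the switch are exercised.
    static List<Integer> boundaryLengths(int... thresholdsInChars) {
        List<Integer> out = new ArrayList<>();
        for (int t : thresholdsInChars) {
            out.add(t - 1);
            out.add(t);
            out.add(t + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // The thresholds mentioned in this thread: 24, 72 (default) and 255.
        for (int threshold : new int[] {24, 72, 255}) {
            for (int len : boundaryLengths(threshold)) {
                System.out.println("threshold=" + threshold + " length=" + len
                        + " stub=" + takesStub(len, threshold));
            }
        }
    }
}
```

A test that compares strings of these lengths under each threshold
setting would cover the in-line path, the stub path, and the switch-over
between the two.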

>>> 5. But the more serious problem is the fact that we have different
>>> code paths for different microarchitectures, and somehow this has
>>> to be standard supportable software. In order to test this stuff
>>> we'll need different test parameters for
>>> SoftwarePrefetchHintDistance, CompareLongStringLimitLatin,
>>> CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code
> insertion, tested on some aarch64 systems. I think making it tunable
> is the way to let different microarchitectures be able to configure
> optimal ones for their own.

Well, yes. The question is whether we go down this rabbit hole or try
to find a compromise that is perhaps not quite optimal for anyone but
good enough for everyone.

> I would like to have a common threshold too, or no threshold for
> all, but lacking of full-coverage tests over all systems. Maybe I
> misunderstood you points here with regards to "supportable", the two
> new options can be kept as default if developers have no concerns on
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner
or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the
>>> string_compare intrinsic, with the intention that this be as fast
>>> as possible because we want to avoid having to call the
>>> intrinsic. So why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding intrinsic", but the strings
> shorter than 72 chars, I would have to say, "it depends". The stub
> functions try best to process every 16 chars, while the outer logic
> processes every 8 bytes, which is the major diff. For example, I can
> see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe
> others cannot, which can be reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is
an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we can plot a graph. The timings should ideally
be monotonically increasing.

And then we could see how different processors behave, and hopefully
find a decent solution for all.
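
A minimal Java sketch of such a measurement is given below, under some
stated assumptions. The length generator grows by roughly 10% per step
(and by at least 1), which yields a sequence similar to, but not
identical with, the one listed above. The timing loop is a naive
System.nanoTime() loop with hypothetical iteration counts; a proper
harness such as JMH would be preferable for real numbers, since the JIT
may optimise a loop this simple. All class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompareToTimingSketch {
    // Roughly geometric lengths below `limit`: each step grows by `ratio`, but by at least 1.
    static List<Integer> lengths(int limit, double ratio) {
        List<Integer> out = new ArrayList<>();
        for (int n = 1; n < limit; n = Math.max(n + 1, (int) Math.round(n * ratio))) {
            out.add(n);
        }
        return out;
    }

    // Two distinct String objects with equal Latin-1 contents, so compareTo scans every char.
    static String[] equalPair(int length) {
        char[] cs = new char[length];
        Arrays.fill(cs, 'a');
        return new String[] { new String(cs), new String(cs) };
    }

    // Naive average cost of one compareTo call in nanoseconds.
    static double nanosPerCall(String a, String b, int iterations) {
        int sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += a.compareTo(b);
        }
        long elapsed = System.nanoTime() - start;
        if (sink != 0) {
            throw new AssertionError("strings were expected to be equal");
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        for (int len : lengths(256, 1.1)) {
            String[] p = equalPair(len);
            nanosPerCall(p[0], p[1], 200_000); // warm-up so the code is compiled
            System.out.println(len + "\t" + nanosPerCall(p[0], p[1], 2_000_000));
        }
    }
}
```

The tab-separated output can be plotted directly; a dip or jump at the
stub threshold would show up as a break in an otherwise increasing
curve. The same loop would need UU, LU and UL variants to cover the
other cases asked about here.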

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Thu Nov 14 10:40:53 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 14 Nov 2019 10:40:53 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>

On 11/13/19 9:55 AM, Pengfei Li (Arm Technology China) wrote:
> This patch aligns with the implementation in [1] which makes the x86_64
> r12 register allocatable. Please let me know if I have missed anything
> for AArch64.

We don't generally use r27 for compressed class pointers.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From patrick at os.amperecomputing.com  Thu Nov 14 11:13:48 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Thu, 14 Nov 2019 11:13:48 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <2890e2c6-ed15-7f8a-5e88-83ed8c6af7ce@redhat.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
 <MN2PR01MB609388DD147A3BF07D2F6E348F710@MN2PR01MB6093.prod.exchangelabs.com>
 <2890e2c6-ed15-7f8a-5e88-83ed8c6af7ce@redhat.com>
Message-ID: <MN2PR01MB6093D3069D628058447F520D8F710@MN2PR01MB6093.prod.exchangelabs.com>

>> Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.
My thinking was that out-of-boundary prefetching should be prevented, and UL/LU has the same condition. If there is no need for that, we could force largeLoopExitCondition to 64 so that more cases can stay in the large loop. I don't think so.
http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4194 
    if (SoftwarePrefetchHintDistance >= 0) {
      __ bind(LARGE_LOOP_PREFETCH);
        __ prfm(Address(str1, SoftwarePrefetchHintDistance));
        __ prfm(Address(str2, SoftwarePrefetchHintDistance));
        compare_string_16_bytes_same(DIFF, DIFF2);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ sub(cnt2, cnt2, isLL ? 64 : 32);
        compare_string_16_bytes_same(DIFF, DIFF2);
-        __ subs(rscratch2, cnt2, largeLoopExitCondition);
+        __ subs(rscratch2, cnt2, 64);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ br(__ GT, LARGE_LOOP_PREFETCH);
        __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
    }

>> Do you have a theory that LU/UL cases are common? Why?
The only "theory" I have is that compare_string_16_x_LU (in the stub) is faster than the 8-byte main loop (outside the stub) (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l4997). Even outside the large loop of the stub, the small loop can be faster as well, since it is able to process more bytes with fewer instructions (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4203).

I can prepare a new patch with the updates to the tests, and plot the timings soon.

Regards
Patrick


From felix.yang at huawei.com  Thu Nov 14 12:26:15 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Thu, 14 Nov 2019 12:26:15 +0000
Subject: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
 operations
In-Reply-To: <f93d7b5c-62fa-2143-0fe7-4a3ea8fb0f7a@oracle.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED60278B7@dggeml527-mbx.china.huawei.com>
 <65e93675-a3cf-53ac-6894-bb4124c55f93@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6035952@dggeml527-mbx.china.huawei.com>
 <1f4c99ac-461c-7795-1a74-a494bdba3672@redhat.com>
 <1cc3ab16-eaab-d031-3df0-c9133de24f88@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036B49@dggeml527-mbx.china.huawei.com>
 <8b527457-c371-45ae-bb54-0a048f9ee6f8@redhat.com>
 <32ea3e22-9f7a-9aaa-c86a-79ed175a1c7b@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036C85@dggeml527-mbx.china.huawei.com>
 <83f92211-2c64-69d0-457c-c059acbccf63@oracle.com>
 <0d718a85-d669-f4b4-ae90-db1f7bb56b45@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036E65@dggeml527-mbx.china.huawei.com>
 <1325b063-cc1b-74fe-3b78-f4eb4518d116@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6036EBA@dggeml527-mbx.china.huawei.com>
 <f93d7b5c-62fa-2143-0fe7-4a3ea8fb0f7a@oracle.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED603722D@dggeml527-mbx.china.huawei.com>

.
> 
> Is there a reason to not reopen this bug:
> 
> JDK-8233912 aarch64: minor improvements of atomic operations
> https://bugs.openjdk.java.net/browse/JDK-8233912
> 
> Dan
> 

Reopened the bug and modified its problem description.
Webrev: http://cr.openjdk.java.net/~fyang/8233912/webrev.00/
The webrev also adds one comment from aph.
Passed tier1, 2 and 3 tests.  Also ran the jcstress tests.  Will do the push.

Thanks,
Felix

From ci_notify at linaro.org  Fri Nov 15 06:10:51 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Fri, 15 Nov 2019 06:10:51 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 13 on AArch64
Message-ID: <471749255.788.1573798251928.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/summary/2019/318/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/04 pass: 5,644; fail: 2; error: 1
Build 1: aarch64/2019/jul/09 pass: 5,643; fail: 4
Build 2: aarch64/2019/jul/16 pass: 5,646; fail: 1
Build 3: aarch64/2019/jul/18 pass: 5,644; fail: 2; error: 1
Build 4: aarch64/2019/jul/20 pass: 5,645; fail: 1; error: 1
Build 5: aarch64/2019/jul/23 pass: 5,644; fail: 3
Build 6: aarch64/2019/jul/25 pass: 5,644; fail: 3
Build 7: aarch64/2019/jul/30 pass: 5,645; fail: 2
Build 8: aarch64/2019/aug/01 pass: 5,646; fail: 1
Build 9: aarch64/2019/aug/03 pass: 5,646; fail: 1
Build 10: aarch64/2019/aug/06 pass: 5,645; fail: 2
Build 11: aarch64/2019/aug/08 pass: 5,646; fail: 1
Build 12: aarch64/2019/aug/10 pass: 5,646; fail: 1
Build 13: aarch64/2019/nov/12 pass: 5,652
Build 14: aarch64/2019/nov/14 pass: 5,650; fail: 2

1 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/04 pass: 8,601; fail: 523; error: 26
Build 1: aarch64/2019/jul/09 pass: 8,606; fail: 515; error: 29
Build 2: aarch64/2019/jul/16 pass: 8,593; fail: 531; error: 30
Build 3: aarch64/2019/jul/18 pass: 8,618; fail: 527; error: 26
Build 4: aarch64/2019/jul/20 pass: 8,619; fail: 519; error: 33
Build 5: aarch64/2019/jul/23 pass: 8,616; fail: 525; error: 30
Build 6: aarch64/2019/jul/25 pass: 8,620; fail: 528; error: 23
Build 7: aarch64/2019/jul/30 pass: 8,610; fail: 529; error: 32
Build 8: aarch64/2019/aug/01 pass: 8,620; fail: 527; error: 24
Build 9: aarch64/2019/aug/03 pass: 8,596; fail: 552; error: 23
Build 10: aarch64/2019/aug/06 pass: 8,616; fail: 528; error: 27
Build 11: aarch64/2019/aug/08 pass: 8,649; fail: 504; error: 18
Build 12: aarch64/2019/aug/10 pass: 8,647; fail: 507; error: 17
Build 13: aarch64/2019/nov/12 pass: 8,650; fail: 513; error: 16
Build 14: aarch64/2019/nov/14 pass: 8,651; fail: 511; error: 17

4 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/04 pass: 3,962
Build 1: aarch64/2019/jul/09 pass: 3,962
Build 2: aarch64/2019/jul/16 pass: 3,963
Build 3: aarch64/2019/jul/18 pass: 3,964
Build 4: aarch64/2019/jul/20 pass: 3,964
Build 5: aarch64/2019/jul/23 pass: 3,964
Build 6: aarch64/2019/jul/25 pass: 3,964
Build 7: aarch64/2019/jul/30 pass: 3,964
Build 8: aarch64/2019/aug/01 pass: 3,964
Build 9: aarch64/2019/aug/03 pass: 3,964
Build 10: aarch64/2019/aug/06 pass: 3,964
Build 11: aarch64/2019/aug/08 pass: 3,964
Build 12: aarch64/2019/aug/10 pass: 3,964
Build 13: aarch64/2019/nov/12 pass: 3,964
Build 14: aarch64/2019/nov/14 pass: 3,964

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.63x
Relative performance: Server critical-jOPS (nc): 9.35x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 204.57

Server 204.57 / Server 2014-04-01 (71.00): 2.88x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/185/results/
2019-07-10 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/190/results/
2019-07-16 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/197/results/
2019-07-19 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/199/results/
2019-07-21 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/201/results/
2019-07-24 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/204/results/
2019-07-26 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/206/results/
2019-07-31 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/211/results/
2019-08-02 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/213/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/215/results/
2019-08-07 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/218/results/
2019-08-09 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/220/results/
2019-08-11 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/222/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/316/results/
2019-11-15 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/318/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/

From ci_notify at linaro.org  Fri Nov 15 06:15:19 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Fri, 15 Nov 2019 06:15:19 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 8u on AArch64
Message-ID: <107109560.790.1573798519711.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk8u/openjdk-jtreg-nightly-tests/summary/2019/318/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/25 pass: 802; fail: 25; error: 11
Build 1: aarch64/2019/jul/30 pass: 787; fail: 40; error: 11
Build 2: aarch64/2019/aug/01 pass: 800; fail: 26; error: 12
Build 3: aarch64/2019/aug/04 pass: 808; fail: 30; error: 2
Build 4: aarch64/2019/aug/06 pass: 799; fail: 29; error: 12
Build 5: aarch64/2019/aug/08 pass: 830; fail: 9; error: 1
Build 6: aarch64/2019/aug/11 pass: 825; fail: 14; error: 1
Build 7: aarch64/2019/aug/13 pass: 830; fail: 9; error: 1
Build 8: aarch64/2019/aug/15 pass: 837; fail: 9; error: 1
Build 9: aarch64/2019/aug/17 pass: 837; fail: 9; error: 1
Build 10: aarch64/2019/aug/22 pass: 837; fail: 9; error: 1
Build 11: aarch64/2019/sep/10 pass: 838; fail: 13; error: 1
Build 12: aarch64/2019/sep/21 pass: 838; fail: 13; error: 1
Build 13: aarch64/2019/nov/02 pass: 843; fail: 9; error: 1
Build 14: aarch64/2019/nov/14 pass: 843; fail: 9; error: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/25 pass: 5,938; fail: 276; error: 26
Build 1: aarch64/2019/jul/30 pass: 5,942; fail: 273; error: 25
Build 2: aarch64/2019/aug/01 pass: 5,945; fail: 271; error: 24
Build 3: aarch64/2019/aug/04 pass: 5,949; fail: 270; error: 24
Build 4: aarch64/2019/aug/06 pass: 5,945; fail: 275; error: 23
Build 5: aarch64/2019/aug/08 pass: 5,953; fail: 267; error: 23
Build 6: aarch64/2019/aug/11 pass: 5,947; fail: 272; error: 25
Build 7: aarch64/2019/aug/13 pass: 5,962; fail: 258; error: 24
Build 8: aarch64/2019/aug/15 pass: 5,955; fail: 266; error: 23
Build 9: aarch64/2019/aug/17 pass: 5,951; fail: 269; error: 24
Build 10: aarch64/2019/aug/22 pass: 5,945; fail: 279; error: 20
Build 11: aarch64/2019/sep/10 pass: 5,951; fail: 273; error: 23
Build 12: aarch64/2019/sep/21 pass: 5,964; fail: 261; error: 22
Build 13: aarch64/2019/nov/02 pass: 5,956; fail: 278; error: 18
Build 14: aarch64/2019/nov/14 pass: 5,956; fail: 275; error: 21

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/25 pass: 3,116; fail: 2
Build 1: aarch64/2019/jul/30 pass: 3,116; fail: 2
Build 2: aarch64/2019/aug/01 pass: 3,116; fail: 2
Build 3: aarch64/2019/aug/04 pass: 3,116; fail: 2
Build 4: aarch64/2019/aug/06 pass: 3,116; fail: 2
Build 5: aarch64/2019/aug/08 pass: 3,116; fail: 2
Build 6: aarch64/2019/aug/11 pass: 3,116; fail: 2
Build 7: aarch64/2019/aug/13 pass: 3,116; fail: 2
Build 8: aarch64/2019/aug/15 pass: 3,116; fail: 2
Build 9: aarch64/2019/aug/17 pass: 3,116; fail: 2
Build 10: aarch64/2019/aug/22 pass: 3,116; fail: 2
Build 11: aarch64/2019/sep/10 pass: 3,116; fail: 2
Build 12: aarch64/2019/sep/21 pass: 3,116; fail: 2
Build 13: aarch64/2019/nov/02 pass: 3,116; fail: 2
Build 14: aarch64/2019/nov/14 pass: 3,116; fail: 2

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk8u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 6.73x
Relative performance: Server critical-jOPS (nc): 8.09x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk8u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 174.26

Server 174.26 / Server 2014-04-01 (71.00): 2.45x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk8u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-26 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/206/results/
2019-07-31 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/211/results/
2019-08-02 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/213/results/
2019-08-05 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/216/results/
2019-08-07 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/218/results/
2019-08-09 pass rate: 8229/8229, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/220/results/
2019-08-12 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/223/results/
2019-08-13 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/225/results/
2019-08-16 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/227/results/
2019-08-17 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/229/results/
2019-08-23 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/234/results/
2019-09-11 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/253/results/
2019-09-22 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/264/results/
2019-11-02 pass rate: 8230/8230, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/306/results/
2019-11-15 pass rate: 8231/8231, results: http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/2019/318/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk8u/jcstress-nightly-runs/

From patrick at os.amperecomputing.com  Fri Nov 15 07:51:17 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Fri, 15 Nov 2019 07:51:17 +0000
Subject: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo
 intrinsic documentation and maintenance improvement
In-Reply-To: <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
References: <fc712566-1d25-88d6-b178-652efdc00f36@bell-sw.com>
 <DB7PR08MB31153C8A293F3B4FC2371B34967F0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <86f91401-b7f6-b634-fef1-b0615b8fcde0@redhat.com>
 <e47a5afa-e9ce-1286-97da-5fb9018846cb@bell-sw.com>
 <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
Message-ID: <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>

Hi Dmitrij,

The inline documentation in your patch is a great help in understanding the string_compare stub code generation in depth, although I don't know why it was not pushed.
http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp.sdiff.html 

There is one point I don't quite understand. Could you please clarify it a little more? Thanks in advance!
In the large loop with prefetching logic, the different_encoding function uses "cnt2 - prefetchLoopExitCondition" to decide whether the first iteration should be executed, while the same_encoding function does not. Why?

I was thinking that "prefetching out of the string boundary" could be an invalid operation, but that seems to be a misunderstanding, doesn't it? The AArch64 ISA document says that the prfm instruction signals the memory system and expects "preloading the cache line containing the specified address into one or more caches" to be done, in order to speed up the memory accesses when they do occur. If this is just a hint for the coming ldrs, and safe enough, could we stop restricting the remaining iterations (2 ~ n) with largeLoopExitCondition and instead use 64 for LL/UU and 128 for LU/UL? That way more iterations can stay in the large loop (with SoftwarePrefetchHintDistance=192, for example), for better performance. Any comments?

Thanks 

4327   address generate_compare_long_string_different_encoding(bool isLU) {
4377     if (SoftwarePrefetchHintDistance >= 0) {
4378       __ subs(rscratch2, cnt2, prefetchLoopExitCondition);
4379       __ br(__ LT, NO_PREFETCH);
4380       __ bind(LARGE_LOOP_PREFETCH);                  // 64-characters loop
... ...
4395           __ subs(rscratch2, cnt2, prefetchLoopExitCondition); // <-- could we use subs(rscratch2, cnt2, 128) instead?
4396           __ br(__ GE, LARGE_LOOP_PREFETCH);
4397     } // end of 64-characters loop

4616   address generate_compare_long_string_same_encoding(bool isLL) {
4637     if (SoftwarePrefetchHintDistance >= 0) {
4638       __ bind(LARGE_LOOP_PREFETCH);
4639         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
4640         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
4641         compare_string_16_bytes_same(DIFF, DIFF2);
4642         compare_string_16_bytes_same(DIFF, DIFF2);
4643         __ sub(cnt2, cnt2, 8 * characters_in_word);
4644         compare_string_16_bytes_same(DIFF, DIFF2);
4645         __ subs(rscratch2, cnt2, largeLoopExitCondition); // rscratch2 is not used. Use subs instead of cmp in case of potentially large constants // <-- could we use subs(rscratch2, cnt2, 64) instead?
4646         compare_string_16_bytes_same(DIFF, DIFF2);
4647         __ br(__ GT, LARGE_LOOP_PREFETCH);
4648         __ cbz(cnt2, LAST_CHECK);                    // no more loads left
4649     }
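
To make the effect of the exit condition concrete, here is a minimal model of the same_encoding back-edge shown above. It is only a sketch under stated assumptions: the function and parameter names are hypothetical, counts are in characters, and 64 characters per iteration is assumed for the LL case (32 for UU), as in the jdk/jdk version of this loop.

```cpp
// Simplified model of the LARGE_LOOP_PREFETCH back-edge; not HotSpot code.
// All names are hypothetical and exist only for illustration.
//   cnt            - characters still to compare on loop entry
//   chars_per_iter - characters consumed per iteration (assumed 64 for LL, 32 for UU)
//   exit_condition - the value subtracted by subs before the br(GT)
// The caller is assumed to enter the loop only when one full iteration is
// safe, so the body runs once before the first check.
int large_loop_iterations(int cnt, int chars_per_iter, int exit_condition) {
  int iterations = 0;
  do {
    cnt -= chars_per_iter;         // models: __ sub(cnt2, cnt2, ...)
    ++iterations;
  } while (cnt > exit_condition);  // models: __ subs(...); __ br(__ GT, LARGE_LOOP_PREFETCH)
  return iterations;
}
```

Under this model, 256 Latin1 characters with an exit condition of 192 spend one iteration in the large loop, while an exit condition of 64 keeps them there for three iterations.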

Regards
Patrick

-----Original Message-----
From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Dmitry Samersoff
Sent: Sunday, May 19, 2019 11:42 PM
To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>; Andrew Haley <aph at redhat.com>; Pengfei Li (Arm Technology China) <Pengfei.Li at arm.com>
Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement

Dmitrij,

The changes look good to me.

-Dmitry

On 25.02.2019 19:52, Dmitrij Pochepko wrote:
> Hi Andrew, Pengfei,
> 
> I created webrev.02 with all your suggestions implemented:
> 
> webrev: http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/
> 
> - comments are now both in separate section and inlined into code.
> - documentation mismatch mentioned by Pengfei is fixed:
> -- SHORT_LAST_INIT label name misprint changed to correct SHORT_LAST
> -- SHORT_LOOP_TAIL block now merged with last instruction. 
> Documentation is updated respectively
> - minor other changes to layout and wording
> 
> Newly developed tests were run as sanity and they passed.
> 
> Thanks,
> Dmitrij
> 
> On 22/02/2019 6:42 PM, Andrew Haley wrote:
>> On 2/22/19 10:31 AM, Pengfei Li (Arm Technology China) wrote:
>>
>>> So personally, I still prefer to inline the comments with the 
>>> original code block to avoid this kind of inconsistency. And it 
>>> makes it easier for us to review or maintain the code together with the 
>>> doc, as we don't need to scroll back and forth. I don't know the 
>>> benefit of making the code documentation a separate part. What's 
>>> your opinion, Andrew Haley?
>> I agree with you. There's no harm having both inline and separate.
>>


From patrick at os.amperecomputing.com  Fri Nov 15 08:04:44 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Fri, 15 Nov 2019 08:04:44 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
In-Reply-To: <MN2PR01MB6093D3069D628058447F520D8F710@MN2PR01MB6093.prod.exchangelabs.com>
References: <MN2PR01MB609385B40719285BD087416A8F610@MN2PR01MB6093.prod.exchangelabs.com>
 <21d39cdf-b38d-f40e-c8b5-2d92094cdc2a@redhat.com>
 <MN2PR01MB609388DD147A3BF07D2F6E348F710@MN2PR01MB6093.prod.exchangelabs.com>
 <2890e2c6-ed15-7f8a-5e88-83ed8c6af7ce@redhat.com>
 <MN2PR01MB6093D3069D628058447F520D8F710@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <MN2PR01MB6093E0A03FE20DA3EE98CA038F700@MN2PR01MB6093.prod.exchangelabs.com>

To avoid future confusion, I am going to split the patch: I will take out the updates for generate_compare_long_string_different_encoding, which drop two redundant temp Register vars and related unused instructions, and create a new patch for your review. That part has nothing to do with the proposed option.

And I will continue working on the remaining parts according to your comments and suggestions.

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Thursday, November 14, 2019 7:14 PM
To: Andrew Haley <aph at redhat.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

>> Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.
I was thinking that out-of-boundary prefetching should be prevented, and UL/LU has the same condition. If there is no need to prevent it, we could force largeLoopExitCondition to 64 so that more cases can freely stay in the large loop. I don't think so.
http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4194 
    if (SoftwarePrefetchHintDistance >= 0) {
      __ bind(LARGE_LOOP_PREFETCH);
        __ prfm(Address(str1, SoftwarePrefetchHintDistance));
        __ prfm(Address(str2, SoftwarePrefetchHintDistance));
        compare_string_16_bytes_same(DIFF, DIFF2);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ sub(cnt2, cnt2, isLL ? 64 : 32);
        compare_string_16_bytes_same(DIFF, DIFF2);
-        __ subs(rscratch2, cnt2, largeLoopExitCondition);
+        __ subs(rscratch2, cnt2, 64);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ br(__ GT, LARGE_LOOP_PREFETCH);
        __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
    }

>> Do you have a theory that LU/UL cases are common? Why?
The only "theory" is that compare_string_16_x_LU (in the stub) is faster than the 8-byte main loop (outside the stub) (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l4997). Even when not in the large loop of the stub, the small loop can be faster as well, since it is able to process more bytes with fewer instructions (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4203).

I can prepare a new patch with the updates to the tests, and plot the timings soon.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com>
Sent: Thursday, November 14, 2019 6:33 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (none are actually changed 
> in this patch), fixes an out-of-boundary prefetch for LL/UU, and 
> removes some redundant instructions in the LU/UL code path.  The latter two 
> are code-quality changes; if splitting the patch would make the changes 
> clearer, I'd be happy to do so.

Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little 
> bit misleading: for LL/LU/UL, the const STUB_THRESHOLD meant chars, 
> while for UU it could be read as bytes. If -XX:-CompactStrings is 
> specified, every code path goes to UU, which would make the 
> threshold mean bytes and might confuse developers. This patch can 
> clarify it, and the description of tunable options can provide further 
> guidance.

It must. Without some commentary both maintainers and developers are lost. Unless there is some very strong reason, all counts must specify units.
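
The chars-versus-bytes ambiguity discussed above can be illustrated with a small sketch. The helper name is hypothetical; the only fact it relies on is that a Latin1 (L) string stores one byte per character and a UTF-16 (U) string stores two.

```cpp
// Hypothetical helper, not part of the patch: converts a threshold given in
// characters to the number of bytes it covers for each string encoding.
constexpr int threshold_chars_to_bytes(int chars, bool is_latin1) {
  return is_latin1 ? chars : chars * 2;  // Latin1: 1 byte/char, UTF-16: 2 bytes/char
}
```

For the hardcoded threshold of 72 this gives 72 bytes for a Latin1 string but 144 bytes for a UTF-16 string, which is why a count without a stated unit is easy to misread.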

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impact of 
> SoftwarePrefetchHintDistance on the intrinsics. I have locally added 
> non-default thresholds to them and tested with many lengths (it took days 
> on a test system). This has not been included in the proposed patch; 
> maybe a follow-up one would do. Any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength.java
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java

I won't accept this patch unless it is accompanied by test cases that properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold: the hardcoded 72 is my testing focus, 
> since I try my best not to bring negative impacts to the normal state of 
> aarch64-port, especially for other CPU vendors.

> Second, I tested two extreme thresholds, 24 and 255, which mean that more 
> shorter strings (24 to 71 chars) or only very long strings
> (>=255) go to the stub code path, respectively. Functional tests 
> passed (listed in the initial email), while performance test results 
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) can vary across systems (as well as 
> microarchitectures).

> Third, some other non-default thresholds, as a sanity check, 
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.

>>> 5. But the more serious problem is the fact that we have different 
>>> code paths for different microarchitectures, and somehow this has to 
>>> be standard supportable software. In order to test this stuff we'll 
>>> need different test parameters for SoftwarePrefetchHintDistance, 
>>> CompareLongStringLimitLatin, CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code insertion, and 
> was tested on some aarch64 systems. I think making it tunable is the way 
> to let different microarchitectures configure the optimal values 
> for themselves.

Well, yes. The question is whether we go down this rabbit hole or try to find a compromise that is perhaps not quite optimal for anyone but good enough for everyone.

> I would like to have a common threshold too, or no threshold at all, 
> but we lack full-coverage tests over all systems. Maybe I 
> misunderstood your point here with regard to "supportable"; the two 
> new options can be kept at their defaults if developers have no concerns about 
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the 
>>> string_compare intrinsic, with the intention that this be as fast as 
>>> possible because we want to avoid having to call the intrinsic. So 
>>> why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding the intrinsic", but rather strings 
> shorter than 72 chars, I would have to say, "it depends". The stub 
> functions try their best to process 16 chars at a time, while the outer logic 
> processes 8 bytes at a time, which is the major difference. For example, I can 
> see a consistent 1.5x speedup with lengths 24-71 for LU/UL cases; maybe 
> others cannot, which may be the reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we can plot a graph. The timings should ideally be monotonically increasing.

And then we could see how different processors behave, and hopefully find a decent solution for all.
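
A length sequence of this general shape (dense for short strings, then growing by roughly 10% per step) could be generated rather than typed by hand. The sketch below is only an assumption about how one might do that: the function name and the integer step rule are hypothetical, and it does not reproduce the exact list above.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical generator for benchmark string lengths below `limit`.
// Each step advances by one tenth of the current length (integer division),
// but always by at least 1, so short lengths are covered densely and long
// lengths sparsely.
std::vector<int> sample_lengths(int limit) {
  std::vector<int> lengths;
  for (int n = 1; n < limit; n += std::max(1, n / 10)) {
    lengths.push_back(n);
  }
  return lengths;
}
```

Calling sample_lengths(256) yields a strictly increasing list that starts at 1 and stays below 256, which is enough to plot timing against length for each encoding pair.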

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Fri Nov 15 08:33:15 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Fri, 15 Nov 2019 08:33:15 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>

Any comments on that?

I have posted some of the hs_err_log files here: http://cr.openjdk.java.net/~fyang/sigill-crashes.tar.bz2

Thanks,
Felix

From: Yangfei (Felix)
Sent: Tuesday, November 12, 2019 3:37 PM
To: aarch64-port-dev at openjdk.java.net
Subject: Question about ISB usage in the aarch64 port

Hi,

  I am witnessing some SIGILL jvm crashes on my aarch64 platform.
  I looked at the ISB usage, especially this one: https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2014-September/001376.html
  One of changes is adding one ISB after the native call returns:

static void rt_call(MacroAssembler* masm, address dest, int gpargs, int fpargs, int type) {
  CodeBlob *cb = CodeCache::find_blob(dest);
  if (cb) {
    __ far_call(RuntimeAddress(dest));
  } else {
    assert((unsigned)gpargs < 256, "eek!");
    assert((unsigned)fpargs < 32, "eek!");
    __ lea(rscratch1, RuntimeAddress(dest));
    __ blr(rscratch1);
    __ maybe_isb();    // <======== the ISB in question
  }
}

  The rt_call function is used in generate_native_wrapper to make the JNI call.
  I didn't see this barrier in the ppc or arm ports, so I would like to know more details here.  Does anyone still remember?
  Also, the ISB is planted only in the else block.  I assume it is also necessary for the if block.  Correct?


Thanks for your help,
Felix

From Pengfei.Li at arm.com  Fri Nov 15 09:15:37 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Fri, 15 Nov 2019 09:15:37 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
Message-ID: <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

> > This patch aligns with the implementation in [1] which makes the
> > x86_64
> > r12 register allocatable. Please let me know if I have missed anything
> > for AArch64.
> 
> We don't generally use r27 for compressed class pointers.

Do you mean that r27 is only used for encoding/decoding oops but not for
any klass pointers? I looked at the AArch64 code and found that it is also used in
MacroAssembler::encode_klass_not_null() if the compressed mode is not
zero-based.

--
Thanks,
Pengfei


From Joshua.Zhu at arm.com  Fri Nov 15 10:29:49 2019
From: Joshua.Zhu at arm.com (Joshua Zhu (Arm Technology China))
Date: Fri, 15 Nov 2019 10:29:49 +0000
Subject: [aarch64-port-dev ] 8233948: AArch64: Incorrect mapping between
 OptoReg and VMReg for high 64 bits of Vector Register
In-Reply-To: <VE1PR08MB4880227F4AEE8A866F3AB9CD88770@VE1PR08MB4880.eurprd08.prod.outlook.com>
References: <VE1PR08MB4880832B668EFD8A88419E2188770@VE1PR08MB4880.eurprd08.prod.outlook.com>
 <VE1PR08MB4880227F4AEE8A866F3AB9CD88770@VE1PR08MB4880.eurprd08.prod.outlook.com>
Message-ID: <VE1PR08MB4880428B5053CEC070A2483E88700@VE1PR08MB4880.eurprd08.prod.outlook.com>

Hi,

> Please review the following patch:
> JBS: https://bugs.openjdk.java.net/browse/JDK-8233948
> Webrev: http://cr.openjdk.java.net/~jzhu/8233948/webrev.00/

Please let me know if you have any comments. Thanks a lot.

Best Regards,
Joshua


From patrick at os.amperecomputing.com  Fri Nov 15 10:54:16 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Fri, 15 Nov 2019 10:54:16 +0000
Subject: [aarch64-port-dev ] RFR (trivial): 8234228: AArch64: Clean up
 redundant temp vars in generate_compare_long_string_different_encoding
Message-ID: <MN2PR01MB60936ED9EF6FAE8A493694508F700@MN2PR01MB6093.prod.exchangelabs.com>

Hi Reviewers,

This is a simple patch which cleans up some redundant temp vars and related instructions in generate_compare_long_string_different_encoding.

JBS: https://bugs.openjdk.java.net/browse/JDK-8234228 
Webrev: http://cr.openjdk.java.net/~qpzhang/8234228/webrev.01 

In generate_compare_long_string_different_encoding, the two Register vars strU and strL were used to record the pointers to the last 4 characters for the final comparisons. strU has been unused since the latest code updates, as the chars are pre-loaded (r12) by compare_string_16_x_LU early on, and strL is redundant too since the pointer is available in r11. Cleaning these up saves two add instructions and two temp vars, and replaces two sub instructions with mov.
In addition, r10 in compare_string_16_x_LU is not used, so that temp var is cleaned up too.

Tested jtreg tier1 and hotspot runtime/compiler; no new failures found.
Double-checked with the string intrinsics cases under [1]; no regression found.
Ran [2] CompareToBench LU/UL as a performance check; no regression found, and slight gains with some input sizes.

[1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string 
[2] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar 

Regards
Patrick


From aph at redhat.com  Fri Nov 15 14:49:14 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 15 Nov 2019 14:49:14 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>

On 11/15/19 9:15 AM, Pengfei Li (Arm Technology China) wrote:

>>> This patch aligns with the implementation in [1] which makes the
>>> x86_64
>>> r12 register allocatable. Please let me know if I have missed anything
>>> for AArch64.
>>
>> We don't generally use r27 for compressed class pointers.
> 
> Do you mean that r27 is only used for encoding/decoding oops but not for
> any klass pointers?

Almost always, yes.

> I looked at the AArch64 code and found that it is also used in
> MacroAssembler::encode_klass_not_null() if the compressed mode is
> not zero-based.

I see

  if (use_XOR_for_compressed_class_base) {
    if (CompressedKlassPointers::shift() != 0) {
      eor(dst, src, (uint64_t)CompressedKlassPointers::base());
      lsr(dst, dst, LogKlassAlignmentInBytes);
    } else {
      eor(dst, src, (uint64_t)CompressedKlassPointers::base());
    }
    return;
  }

  if (((uint64_t)CompressedKlassPointers::base() & 0xffffffff) == 0
      && CompressedKlassPointers::shift() == 0) {
    movw(dst, src);
    return;
  }

  ... followed by code which does use r27.

Do you ever see r27 being used? If so, I'd be interested to know how
this gets triggered and what command-line arguments you use. It's
rather inefficient.
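
The two early-return paths quoted above can be restated as a small predicate that says when the fall-through code (the part that uses r27) would be reached. This is only an illustrative model: the struct and function names are hypothetical, and the three fields merely stand in for the values the quoted code consults.

```cpp
#include <cstdint>

// Hypothetical stand-in for the inputs read by the quoted code.
struct KlassEncoding {
  bool use_xor;   // models use_XOR_for_compressed_class_base
  uint64_t base;  // models CompressedKlassPointers::base()
  int shift;      // models CompressedKlassPointers::shift()
};

// Returns true when neither early return applies, i.e. when the remaining
// code, which uses r27, would be emitted.
bool falls_through_to_r27_path(const KlassEncoding& e) {
  if (e.use_xor) {
    return false;  // the eor (and optional lsr) path returns early
  }
  if ((e.base & 0xffffffff) == 0 && e.shift == 0) {
    return false;  // the movw path returns early
  }
  return true;
}
```

In this model r27 is only needed when XOR encoding is unavailable and either the shift is non-zero or the low 32 bits of the base are not all zero.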

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From dmitrij.pochepko at bell-sw.com  Fri Nov 15 15:51:42 2019
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Fri, 15 Nov 2019 18:51:42 +0300
Subject: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo
 intrinsic documentation and maintenance improvement
In-Reply-To: <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>
References: <fc712566-1d25-88d6-b178-652efdc00f36@bell-sw.com>
 <DB7PR08MB31153C8A293F3B4FC2371B34967F0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <86f91401-b7f6-b634-fef1-b0615b8fcde0@redhat.com>
 <e47a5afa-e9ce-1286-97da-5fb9018846cb@bell-sw.com>
 <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
 <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <6843892a-4267-06cd-0af4-4618fce86396@bell-sw.com>

Hi Patrick,

My experiments back then showed that a few platforms (some of the Cortex A* 
series) behave unexpectedly slowly when dealing with overprefetch 
(probably CPU implementation specifics). So this code is a kind of 
compromise to run relatively well on all platforms I was able to test on 
(ThunderX, ThunderX2, Cortex A53, Cortex A73). That is the main reason 
for such a code structure.
It's good that you're willing to experiment and improve it, but I'm 
afraid changing largeLoopExitCondition to 64 for LL/UU and 128 for LU/UL 
will likely make some systems slower, as I experimented a lot with this. 
Let us see the performance results for the systems you have, to 
avoid a situation where one platform benefits by slowing down others. We 
could offer some help if you don't have some HW available.

Thanks,
Dmitrij

On 15/11/2019 10:51 AM, Patrick Zhang OS wrote:
> Hi Dmitrij,
>
> The inline documentation in your patch is a great help in understanding the string_compare stub code generation in depth, although I don't know why it was not pushed.
> http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp.sdiff.html
>
> There is one point I don't quite understand. Could you please clarify it a little more? Thanks in advance!
> In the large loop with prefetching logic, the different_encoding function uses "cnt2 - prefetchLoopExitCondition" to decide whether the first iteration should be executed, while the same_encoding function does not. Why?
>
> I was thinking that "prefetching out of the string boundary" could be an invalid operation, but that seems to be a misunderstanding, doesn't it? The AArch64 ISA document says that the prfm instruction signals the memory system and expects "preloading the cache line containing the specified address into one or more caches" to be done, in order to speed up the memory accesses when they do occur. If this is just a hint for the coming ldrs, and safe enough, could we stop restricting the remaining iterations (2 ~ n) with largeLoopExitCondition and instead use 64 for LL/UU and 128 for LU/UL? That way more iterations can stay in the large loop (with SoftwarePrefetchHintDistance=192, for example), for better performance. Any comments?
>
> Thanks
>
> 4327   address generate_compare_long_string_different_encoding(bool isLU) {
> 4377     if (SoftwarePrefetchHintDistance >= 0) {
> 4378       __ subs(rscratch2, cnt2, prefetchLoopExitCondition);
> 4379       __ br(__ LT, NO_PREFETCH);
> 4380       __ bind(LARGE_LOOP_PREFETCH);                  // 64-characters loop
> ... ...
> 4395           __ subs(rscratch2, cnt2, prefetchLoopExitCondition); // <-- could we use subs(rscratch2, cnt2, 128) instead?
> 4396           __ br(__ GE, LARGE_LOOP_PREFETCH);
> 4397     } // end of 64-characters loop
>
> 4616   address generate_compare_long_string_same_encoding(bool isLL) {
> 4637     if (SoftwarePrefetchHintDistance >= 0) {
> 4638       __ bind(LARGE_LOOP_PREFETCH);
> 4639         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
> 4640         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
> 4641         compare_string_16_bytes_same(DIFF, DIFF2);
> 4642         compare_string_16_bytes_same(DIFF, DIFF2);
> 4643         __ sub(cnt2, cnt2, 8 * characters_in_word);
> 4644         compare_string_16_bytes_same(DIFF, DIFF2);
> 4645         __ subs(rscratch2, cnt2, largeLoopExitCondition); // rscratch2 is not used. Use subs instead of cmp in case of potentially large constants // <-- could we use subs(rscratch2, cnt2, 64) instead?
> 4646         compare_string_16_bytes_same(DIFF, DIFF2);
> 4647         __ br(__ GT, LARGE_LOOP_PREFETCH);
> 4648         __ cbz(cnt2, LAST_CHECK);                    // no more loads left
> 4649     }
>
> Regards
> Patrick
>
> -----Original Message-----
> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Dmitry Samersoff
> Sent: Sunday, May 19, 2019 11:42 PM
> To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>; Andrew Haley <aph at redhat.com>; Pengfei Li (Arm Technology China) <Pengfei.Li at arm.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement
>
> Dmitrij,
>
> The changes look good to me.
>
> -Dmitry
>
> On 25.02.2019 19:52, Dmitrij Pochepko wrote:
>> Hi Andrew, Pengfei,
>>
>> I created webrev.02 with all your suggestions implemented:
>>
>> webrev: http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/
>>
>> - comments are now both in separate section and inlined into code.
>> - documentation mismatch mentioned by Pengfei is fixed:
>> -- SHORT_LAST_INIT label name misprint changed to correct SHORT_LAST
>> -- SHORT_LOOP_TAIL block now merged with last instruction.
>> Documentation is updated respectively
>> - minor other changes to layout and wording
>>
>> Newly developed tests were run as sanity and they passed.
>>
>> Thanks,
>> Dmitrij
>>
>> On 22/02/2019 6:42 PM, Andrew Haley wrote:
>>> On 2/22/19 10:31 AM, Pengfei Li (Arm Technology China) wrote:
>>>
>>>> So personally, I still prefer to inline the comments with the
>>>> original code block to avoid this kind of inconsistency. And it
>>>> makes it easier for us to review or maintain the code together with the
>>>> doc, as we don't need to scroll back and forth. I don't know the
>>>> benefit of making the code documentation a separate part. What's
>>>> your opinion, Andrew Haley?
>>> I agree with you. There's no harm having both inline and separate.
>>>

From ci_notify at linaro.org  Sat Nov 16 01:33:31 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sat, 16 Nov 2019 01:33:31 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <954629567.893.1573868012484.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/319/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/11 pass: 5,751; fail: 1
Build 1: aarch64/2019/oct/14 pass: 5,753
Build 2: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 3: aarch64/2019/oct/18 pass: 5,760
Build 4: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 5: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 6: aarch64/2019/oct/28 pass: 5,766
Build 7: aarch64/2019/oct/30 pass: 5,768
Build 8: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 9: aarch64/2019/nov/04 pass: 5,769
Build 10: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 11: aarch64/2019/nov/08 pass: 5,761
Build 12: aarch64/2019/nov/11 pass: 5,762
Build 13: aarch64/2019/nov/13 pass: 5,764; fail: 1
Build 14: aarch64/2019/nov/15 pass: 5,750

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/11 pass: 8,693; fail: 511; error: 18
Build 1: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 2: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 3: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 4: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 5: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 6: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 7: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 8: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 9: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 10: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 11: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 12: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15
Build 13: aarch64/2019/nov/13 pass: 8,773; fail: 509; error: 21
Build 14: aarch64/2019/nov/15 pass: 8,756; fail: 511; error: 19

3 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/11 pass: 3,979
Build 1: aarch64/2019/oct/14 pass: 3,979
Build 2: aarch64/2019/oct/16 pass: 3,979
Build 3: aarch64/2019/oct/18 pass: 3,979
Build 4: aarch64/2019/oct/21 pass: 3,979
Build 5: aarch64/2019/oct/23 pass: 3,980
Build 6: aarch64/2019/oct/28 pass: 3,980
Build 7: aarch64/2019/oct/30 pass: 3,980
Build 8: aarch64/2019/nov/01 pass: 3,980
Build 9: aarch64/2019/nov/04 pass: 3,980
Build 10: aarch64/2019/nov/06 pass: 3,980
Build 11: aarch64/2019/nov/08 pass: 3,980
Build 12: aarch64/2019/nov/11 pass: 3,980
Build 13: aarch64/2019/nov/13 pass: 3,980
Build 14: aarch64/2019/nov/15 pass: 3,981

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.74x
Relative performance: Server critical-jOPS (nc): 9.52x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x
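
As a quick check of the arithmetic, the ratio on the line above can be
recomputed from the two quoted scores. This is only a sketch: the helper
name ratio_hundredths is hypothetical and is not part of the benchmark
scripts. The scores are passed as integer hundredths (207.57 becomes 20757)
so that the rounding is exact.

```cpp
// Hypothetical helper, not taken from the benchmark harness.
// Both scores are given in hundredths; the result is the ratio in
// hundredths, rounded to the nearest integer.
constexpr long ratio_hundredths(long current_x100, long baseline_x100) {
    return (current_x100 * 100 + baseline_x100 / 2) / baseline_x100;
}

// 207.57 / 71.00 = 2.9235..., which is reported as 2.92x.
static_assert(ratio_hundredths(20757, 7100) == 292,
              "207.57 / 71.00 should round to 2.92");
```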

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-10-10 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/282/results/
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
2019-11-14 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/317/results/
2019-11-16 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/319/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From ci_notify at linaro.org  Sun Nov 17 19:23:25 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sun, 17 Nov 2019 19:23:25 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 13 on AArch64
Message-ID: <607733841.1001.1574018606160.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/summary/2019/320/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/09 pass: 5,643; fail: 4
Build 1: aarch64/2019/jul/16 pass: 5,646; fail: 1
Build 2: aarch64/2019/jul/18 pass: 5,644; fail: 2; error: 1
Build 3: aarch64/2019/jul/20 pass: 5,645; fail: 1; error: 1
Build 4: aarch64/2019/jul/23 pass: 5,644; fail: 3
Build 5: aarch64/2019/jul/25 pass: 5,644; fail: 3
Build 6: aarch64/2019/jul/30 pass: 5,645; fail: 2
Build 7: aarch64/2019/aug/01 pass: 5,646; fail: 1
Build 8: aarch64/2019/aug/03 pass: 5,646; fail: 1
Build 9: aarch64/2019/aug/06 pass: 5,645; fail: 2
Build 10: aarch64/2019/aug/08 pass: 5,646; fail: 1
Build 11: aarch64/2019/aug/10 pass: 5,646; fail: 1
Build 12: aarch64/2019/nov/12 pass: 5,652
Build 13: aarch64/2019/nov/14 pass: 5,650; fail: 2
Build 14: aarch64/2019/nov/16 pass: 5,652

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/09 pass: 8,606; fail: 515; error: 29
Build 1: aarch64/2019/jul/16 pass: 8,593; fail: 531; error: 30
Build 2: aarch64/2019/jul/18 pass: 8,618; fail: 527; error: 26
Build 3: aarch64/2019/jul/20 pass: 8,619; fail: 519; error: 33
Build 4: aarch64/2019/jul/23 pass: 8,616; fail: 525; error: 30
Build 5: aarch64/2019/jul/25 pass: 8,620; fail: 528; error: 23
Build 6: aarch64/2019/jul/30 pass: 8,610; fail: 529; error: 32
Build 7: aarch64/2019/aug/01 pass: 8,620; fail: 527; error: 24
Build 8: aarch64/2019/aug/03 pass: 8,596; fail: 552; error: 23
Build 9: aarch64/2019/aug/06 pass: 8,616; fail: 528; error: 27
Build 10: aarch64/2019/aug/08 pass: 8,649; fail: 504; error: 18
Build 11: aarch64/2019/aug/10 pass: 8,647; fail: 507; error: 17
Build 12: aarch64/2019/nov/12 pass: 8,650; fail: 513; error: 16
Build 13: aarch64/2019/nov/14 pass: 8,651; fail: 511; error: 17
Build 14: aarch64/2019/nov/16 pass: 8,663; fail: 500; error: 17

4 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/09 pass: 3,962
Build 1: aarch64/2019/jul/16 pass: 3,963
Build 2: aarch64/2019/jul/18 pass: 3,964
Build 3: aarch64/2019/jul/20 pass: 3,964
Build 4: aarch64/2019/jul/23 pass: 3,964
Build 5: aarch64/2019/jul/25 pass: 3,964
Build 6: aarch64/2019/jul/30 pass: 3,964
Build 7: aarch64/2019/aug/01 pass: 3,964
Build 8: aarch64/2019/aug/03 pass: 3,964
Build 9: aarch64/2019/aug/06 pass: 3,964
Build 10: aarch64/2019/aug/08 pass: 3,964
Build 11: aarch64/2019/aug/10 pass: 3,964
Build 12: aarch64/2019/nov/12 pass: 3,964
Build 13: aarch64/2019/nov/14 pass: 3,964
Build 14: aarch64/2019/nov/16 pass: 3,964

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk13/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.63x
Relative performance: Server critical-jOPS (nc): 9.40x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 210.67

Server 210.67 / Server 2014-04-01 (71.00): 2.97x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk13/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-10 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/190/results/
2019-07-16 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/197/results/
2019-07-19 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/199/results/
2019-07-21 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/201/results/
2019-07-24 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/204/results/
2019-07-26 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/206/results/
2019-07-31 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/211/results/
2019-08-02 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/213/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/215/results/
2019-08-07 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/218/results/
2019-08-09 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/220/results/
2019-08-11 pass rate: 10487/10488, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/222/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/316/results/
2019-11-15 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/318/results/
2019-11-17 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/2019/320/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk13/jcstress-nightly-runs/

From ci_notify at linaro.org  Sun Nov 17 19:26:51 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sun, 17 Nov 2019 19:26:51 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 11u on AArch64
Message-ID: <1239889170.1003.1574018811781.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/summary/2019/320/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 5,737; fail: 5
Build 1: aarch64/2019/aug/03 pass: 5,746; fail: 4
Build 2: aarch64/2019/aug/10 pass: 5,747; fail: 4
Build 3: aarch64/2019/aug/15 pass: 5,753; fail: 4
Build 4: aarch64/2019/aug/22 pass: 5,755; fail: 4
Build 5: aarch64/2019/sep/04 pass: 5,764; fail: 2
Build 6: aarch64/2019/sep/05 pass: 5,764; fail: 2
Build 7: aarch64/2019/sep/10 pass: 5,764; fail: 2
Build 8: aarch64/2019/sep/17 pass: 5,763; fail: 3
Build 9: aarch64/2019/sep/21 pass: 5,764; fail: 2
Build 10: aarch64/2019/oct/04 pass: 5,764; fail: 2
Build 11: aarch64/2019/oct/17 pass: 5,764; fail: 2
Build 12: aarch64/2019/oct/31 pass: 5,784; fail: 1
Build 13: aarch64/2019/nov/09 pass: 5,773; fail: 3
Build 14: aarch64/2019/nov/16 pass: 5,775; fail: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 8,407; fail: 498; error: 31
Build 1: aarch64/2019/aug/03 pass: 8,429; fail: 509; error: 18
Build 2: aarch64/2019/aug/10 pass: 8,450; fail: 485; error: 16
Build 3: aarch64/2019/aug/15 pass: 8,443; fail: 496; error: 13
Build 4: aarch64/2019/aug/22 pass: 8,446; fail: 494; error: 15
Build 5: aarch64/2019/sep/04 pass: 8,483; fail: 465; error: 10
Build 6: aarch64/2019/sep/05 pass: 8,465; fail: 479; error: 14
Build 7: aarch64/2019/sep/10 pass: 8,444; fail: 500; error: 14
Build 8: aarch64/2019/sep/17 pass: 8,462; fail: 482; error: 12
Build 9: aarch64/2019/sep/21 pass: 8,467; fail: 478; error: 13
Build 10: aarch64/2019/oct/04 pass: 8,444; fail: 498; error: 16
Build 11: aarch64/2019/oct/17 pass: 8,452; fail: 493; error: 16
Build 12: aarch64/2019/oct/31 pass: 8,468; fail: 490; error: 14
Build 13: aarch64/2019/nov/09 pass: 8,487; fail: 470; error: 16
Build 14: aarch64/2019/nov/16 pass: 8,475; fail: 484; error: 15

3 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/jul/02 pass: 3,908
Build 1: aarch64/2019/aug/03 pass: 3,908
Build 2: aarch64/2019/aug/10 pass: 3,909
Build 3: aarch64/2019/aug/15 pass: 3,909
Build 4: aarch64/2019/aug/22 pass: 3,909
Build 5: aarch64/2019/sep/04 pass: 3,910
Build 6: aarch64/2019/sep/05 pass: 3,910
Build 7: aarch64/2019/sep/10 pass: 3,910
Build 8: aarch64/2019/sep/17 pass: 3,910
Build 9: aarch64/2019/sep/21 pass: 3,910
Build 10: aarch64/2019/oct/04 pass: 3,910
Build 11: aarch64/2019/oct/17 pass: 3,910
Build 12: aarch64/2019/oct/31 pass: 3,910
Build 13: aarch64/2019/nov/09 pass: 3,910
Build 14: aarch64/2019/nov/16 pass: 3,910

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.38x
Relative performance: Server critical-jOPS (nc): 8.14x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-07-03 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/183/results/
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/215/results/
2019-08-11 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/222/results/
2019-08-16 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/227/results/
2019-08-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/234/results/
2019-09-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/247/results/
2019-09-07 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/248/results/
2019-09-11 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/253/results/
2019-09-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/260/results/
2019-09-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/264/results/
2019-10-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/277/results/
2019-10-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/290/results/
2019-11-01 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/304/results/
2019-11-10 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/313/results/
2019-11-17 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/320/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/

From patrick at os.amperecomputing.com  Mon Nov 18 03:52:26 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Mon, 18 Nov 2019 03:52:26 +0000
Subject: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo
 intrinsic documentation and maintenance improvement
In-Reply-To: <6843892a-4267-06cd-0af4-4618fce86396@bell-sw.com>
References: <fc712566-1d25-88d6-b178-652efdc00f36@bell-sw.com>
 <DB7PR08MB31153C8A293F3B4FC2371B34967F0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <86f91401-b7f6-b634-fef1-b0615b8fcde0@redhat.com>
 <e47a5afa-e9ce-1286-97da-5fb9018846cb@bell-sw.com>
 <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
 <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>
 <6843892a-4267-06cd-0af4-4618fce86396@bell-sw.com>
Message-ID: <MN2PR01MB6093853821555FE4DE0BEB628F4D0@MN2PR01MB6093.prod.exchangelabs.com>

Thanks for the information.
I am interested in the inconsistency between the same_encoding and different_encoding functions. If "overprefetch" is safe enough, why do we prevent it at the end of the large loop inside same_encoding, and why do we guard against it more strictly in different_encoding, at both the beginning and the end of the large loop?
I did not mean globally updating largeLoopExitCondition to 64/128, merely the condition at the end of the large loop inside same_encoding. Supposing the large loop is faster than the small loop (in theory), removing all "overprefetch" conditions would allow more strings to go through the large loop for better performance. Are there any other potential side effects?

Regards
Patrick

-----Original Message-----
From: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> 
Sent: Friday, November 15, 2019 11:52 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net; Dmitry Samersoff <dms at samersoff.net>; Andrew Haley <aph at redhat.com>; Pengfei Li (Arm Technology China) <Pengfei.Li at arm.com>
Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement

Hi Patrick,

My experiments back then showed that a few platforms (some of the Cortex A*
series) behave unexpectedly slowly when dealing with overprefetch (probably due to CPU implementation specifics). So this code is a compromise that runs relatively well on all platforms I was able to test on (ThunderX, ThunderX2, Cortex A53, Cortex A73). That is the main reason for this code structure.
It's good that you're willing to experiment and improve it, but I'm afraid changing largeLoopExitCondition to 64 for LL/UU and 128 for LU/UL will likely make some systems slower, as I experimented a lot with this.
Let's see the performance results for the several systems you've got, to avoid a situation where one platform benefits by slowing down others. We could offer some help if you don't have some of the HW available.

Thanks,
Dmitrij

On 15/11/2019 10:51 AM, Patrick Zhang OS wrote:
> Hi Dmitrij,
>
> The inline documentation in this patch of yours is very helpful for understanding the string_compare stub code generation in depth, although I don't know why it was not pushed.
> http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/src/hotspot/cpu
> /aarch64/stubGenerator_aarch64.cpp.sdiff.html
>
> There is one point I don't quite understand, could you please clarify it a little more? Thanks in advance!
> In the large loop with prefetching logic, the different_encoding function uses "cnt2 - prefetchLoopExitCondition" to decide whether the 1st iteration should be executed, while same_encoding does not. Why?
>
> I was thinking that "prefetching out of the string boundary" could be an invalid operation, but that seems to be a misunderstanding, doesn't it? The AArch64 ISA document says that the prfm instruction signals the memory system and expects that "preloading the cache line containing the specified address into one or more caches" would be done, in order to speed up the memory accesses when they do occur. If this is just a hint for the coming ldrs, and safe enough, could we stop restricting the remaining iterations (2 ~ n) with largeLoopExitCondition, and instead use 64 for LL/UU and 128 for LU/UL? That way more iterations can stay in the large loop (with SoftwarePrefetchHintDistance=192, for example), for better performance. Any comments?
>
> Thanks
>
> 4327   address generate_compare_long_string_different_encoding(bool isLU) {
> 4377     if (SoftwarePrefetchHintDistance >= 0) {
> 4378       __ subs(rscratch2, cnt2, prefetchLoopExitCondition);
> 4379       __ br(__ LT, NO_PREFETCH);
> 4380       __ bind(LARGE_LOOP_PREFETCH);                  // 64-characters loop
> ... ...
> 4395           __ subs(rscratch2, cnt2, prefetchLoopExitCondition); // <-- could we use subs(rscratch2, cnt2, 128) instead?
> 4396           __ br(__ GE, LARGE_LOOP_PREFETCH);
> 4397     } // end of 64-characters loop
>
> 4616   address generate_compare_long_string_same_encoding(bool isLL) {
> 4637     if (SoftwarePrefetchHintDistance >= 0) {
> 4638       __ bind(LARGE_LOOP_PREFETCH);
> 4639         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
> 4640         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
> 4641         compare_string_16_bytes_same(DIFF, DIFF2);
> 4642         compare_string_16_bytes_same(DIFF, DIFF2);
> 4643         __ sub(cnt2, cnt2, 8 * characters_in_word);
> 4644         compare_string_16_bytes_same(DIFF, DIFF2);
> 4645         __ subs(rscratch2, cnt2, largeLoopExitCondition); // rscratch2 is not used. Use subs instead of cmp in case of potentially large constants // <-- could we use subs(rscratch2, cnt2, 64) instead?
> 4646         compare_string_16_bytes_same(DIFF, DIFF2);
> 4647         __ br(__ GT, LARGE_LOOP_PREFETCH);
> 4648         __ cbz(cnt2, LAST_CHECK);                    // no more loads left
> 4649     }
>
> Regards
> Patrick
>
> -----Original Message-----
> From: hotspot-compiler-dev 
> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Dmitry 
> Samersoff
> Sent: Sunday, May 19, 2019 11:42 PM
> To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>; Andrew Haley 
> <aph at redhat.com>; Pengfei Li (Arm Technology China) 
> <Pengfei.Li at arm.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; 
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: 
> String::compareTo intrinsic documentation and maintenance improvement
>
> Dmitrij,
>
> The changes look good to me.
>
> -Dmitry
>
> On 25.02.2019 19:52, Dmitrij Pochepko wrote:
>> Hi Andrew, Pengfei,
>>
>> I created webrev.02 with all your suggestions implemented:
>>
>> webrev: http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/
>>
>> - comments are now both in separate section and inlined into code.
>> - documentation mismatch mentioned by Pengfei is fixed:
>> -- SHORT_LAST_INIT label name misprint changed to correct SHORT_LAST
>> -- SHORT_LOOP_TAIL block now merged with last instruction.
>> Documentation is updated respectively
>> - minor other changes to layout and wording
>>
>> Newly developed tests were run as sanity and they passed.
>>
>> Thanks,
>> Dmitrij
>>
>> On 22/02/2019 6:42 PM, Andrew Haley wrote:
>>> On 2/22/19 10:31 AM, Pengfei Li (Arm Technology China) wrote:
>>>
>>>> So personally, I still prefer to inline the comments with the
>>>> original code block to avoid this kind of inconsistency. It also
>>>> makes it easier for us to review or maintain the code together with
>>>> the doc, as we don't need to scroll back and forth. I don't see the
>>>> benefit of making the code documentation a separate part. What's
>>>> your opinion, Andrew Haley?
>>> I agree with you. There's no harm having both inline and separate.
>>>

From patrick at os.amperecomputing.com  Mon Nov 18 04:03:55 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Mon, 18 Nov 2019 04:03:55 +0000
Subject: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo
 intrinsic documentation and maintenance improvement
In-Reply-To: <MN2PR01MB6093853821555FE4DE0BEB628F4D0@MN2PR01MB6093.prod.exchangelabs.com>
References: <fc712566-1d25-88d6-b178-652efdc00f36@bell-sw.com>
 <DB7PR08MB31153C8A293F3B4FC2371B34967F0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <86f91401-b7f6-b634-fef1-b0615b8fcde0@redhat.com>
 <e47a5afa-e9ce-1286-97da-5fb9018846cb@bell-sw.com>
 <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
 <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>
 <6843892a-4267-06cd-0af4-4618fce86396@bell-sw.com>
 <MN2PR01MB6093853821555FE4DE0BEB628F4D0@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <MN2PR01MB609392A8FA8C9EE6C27C17BE8F4D0@MN2PR01MB6093.prod.exchangelabs.com>

>> changing largeLoopExitCondition to 64 LL/UU and 128 for LU/UL will likely make some systems slower as I experimented a lot with this.
Sorry, my second paragraph was inaccurate. It seems your experiments found some cases that ran well with the first iteration of the large loop but were better off quitting the loop and going to the small loop immediately for better performance (?). Please correct me if I misunderstood this. Thanks.

Regards
Patrick

-----Original Message-----
From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Monday, November 18, 2019 11:52 AM
To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>
Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
Subject: RE: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement

Thanks for the information.
I am interested in the inconsistency between the same_encoding and different_encoding functions. If "overprefetch" is safe enough, why do we prevent it at the end of the large loop inside same_encoding, and why do we guard against it more strictly in different_encoding, at both the beginning and the end of the large loop?
I did not mean globally updating largeLoopExitCondition to 64/128, merely the condition at the end of the large loop inside same_encoding. Supposing the large loop is faster than the small loop (in theory), removing all "overprefetch" conditions would allow more strings to go through the large loop for better performance. Are there any other potential side effects?

Regards
Patrick

-----Original Message-----
From: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>
Sent: Friday, November 15, 2019 11:52 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net; Dmitry Samersoff <dms at samersoff.net>; Andrew Haley <aph at redhat.com>; Pengfei Li (Arm Technology China) <Pengfei.Li at arm.com>
Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement

Hi Patrick,

My experiments back then showed that a few platforms (some of the Cortex A*
series) behave unexpectedly slowly when dealing with overprefetch (probably due to CPU implementation specifics). So this code is a compromise that runs relatively well on all platforms I was able to test on (ThunderX, ThunderX2, Cortex A53, Cortex A73). That is the main reason for this code structure.
It's good that you're willing to experiment and improve it, but I'm afraid changing largeLoopExitCondition to 64 for LL/UU and 128 for LU/UL will likely make some systems slower, as I experimented a lot with this.
Let's see the performance results for the several systems you've got, to avoid a situation where one platform benefits by slowing down others. We could offer some help if you don't have some of the HW available.

Thanks,
Dmitrij

On 15/11/2019 10:51 AM, Patrick Zhang OS wrote:
> Hi Dmitrij,
>
> The inline documentation in this patch of yours is very helpful for understanding the string_compare stub code generation in depth, although I don't know why it was not pushed.
> http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/src/hotspot/cpu
> /aarch64/stubGenerator_aarch64.cpp.sdiff.html
>
> There is one point I don't quite understand. Could you please clarify it a little more? Thanks in advance!
> The large loop with prefetching logic in the different_encoding function uses "cnt2 - prefetchLoopExitCondition" to decide whether the 1st iteration should be executed or not, while same_encoding does not do this. Why?
>
> I was thinking that "prefetching out of the string boundary" could be an invalid operation, but that seems to be a misunderstanding, is it not? The AArch64 ISA document says that the prfm instruction signals the memory system and expects that "preloading the cache line containing the specified address into one or more caches" would be done, in order to speed up the memory accesses when they do occur. If this is just a hint for the coming ldrs, and safe enough, could we stop restricting the remaining iterations (2 ~ n) with largeLoopExitCondition, and instead use 64 for LL/UU and 128 for LU/UL? That way more iterations can stay in the large loop (with SoftwarePrefetchHintDistance=192, for example), for better performance. Any comments?
>
> Thanks
>
> 4327   address generate_compare_long_string_different_encoding(bool isLU) {
> 4377     if (SoftwarePrefetchHintDistance >= 0) {
> 4378       __ subs(rscratch2, cnt2, prefetchLoopExitCondition);
> 4379       __ br(__ LT, NO_PREFETCH);
> 4380       __ bind(LARGE_LOOP_PREFETCH);                  // 64-characters loop
> ... ...
> 4395           __ subs(rscratch2, cnt2, prefetchLoopExitCondition); // <-- could we use subs(rscratch2, cnt2, 128) instead?
> 4396           __ br(__ GE, LARGE_LOOP_PREFETCH);
> 4397     } // end of 64-characters loop
>
> 4616   address generate_compare_long_string_same_encoding(bool isLL) {
> 4637     if (SoftwarePrefetchHintDistance >= 0) {
> 4638       __ bind(LARGE_LOOP_PREFETCH);
> 4639         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
> 4640         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
> 4641         compare_string_16_bytes_same(DIFF, DIFF2);
> 4642         compare_string_16_bytes_same(DIFF, DIFF2);
> 4643         __ sub(cnt2, cnt2, 8 * characters_in_word);
> 4644         compare_string_16_bytes_same(DIFF, DIFF2);
> 4645         __ subs(rscratch2, cnt2, largeLoopExitCondition); // rscratch2 is not used. Use subs instead of cmp in case of potentially large constants // <-- could we use subs(rscratch2, cnt2, 64) instead?
> 4646         compare_string_16_bytes_same(DIFF, DIFF2);
> 4647         __ br(__ GT, LARGE_LOOP_PREFETCH);
> 4648         __ cbz(cnt2, LAST_CHECK);                    // no more loads left
> 4649     }
>
> Regards
> Patrick
>
> -----Original Message-----
> From: hotspot-compiler-dev
> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Dmitry 
> Samersoff
> Sent: Sunday, May 19, 2019 11:42 PM
> To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>; Andrew Haley 
> <aph at redhat.com>; Pengfei Li (Arm Technology China) 
> <Pengfei.Li at arm.com>
> Cc: hotspot-compiler-dev at openjdk.java.net;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: 
> String::compareTo intrinsic documentation and maintenance improvement
>
> Dmitrij,
>
> The changes look good to me.
>
> -Dmitry
>
> On 25.02.2019 19:52, Dmitrij Pochepko wrote:
>> Hi Andrew, Pengfei,
>>
>> I created webrev.02 with all your suggestions implemented:
>>
>> webrev: http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/
>>
>> - comments are now both in separate section and inlined into code.
>> - documentation mismatch mentioned by Pengfei is fixed:
>> -- SHORT_LAST_INIT label name misprint changed to correct SHORT_LAST
>> -- SHORT_LOOP_TAIL block now merged with last instruction.
>> Documentation is updated respectively
>> - minor other changes to layout and wording
>>
>> Newly developed tests were run as sanity and they passed.
>>
>> Thanks,
>> Dmitrij
>>
>> On 22/02/2019 6:42 PM, Andrew Haley wrote:
>>> On 2/22/19 10:31 AM, Pengfei Li (Arm Technology China) wrote:
>>>
>>>> So personally, I still prefer to inline the comments with the 
>>>> original code block to avoid this kind of inconsistency. It also 
>>>> makes it easier for us to review or maintain the code together with 
>>>> the doc, as we don't need to scroll back and forth. I don't know the 
>>>> benefit of making the code documentation a separate part. What's 
>>>> your opinion, Andrew Haley?
>>> I agree with you. There's no harm having both inline and separate.
>>>

From Pengfei.Li at arm.com  Mon Nov 18 09:58:11 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Mon, 18 Nov 2019 09:58:11 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
Message-ID: <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

> I see
> 
>   if (use_XOR_for_compressed_class_base) {
>     if (CompressedKlassPointers::shift() != 0) {
>       eor(dst, src, (uint64_t)CompressedKlassPointers::base());
>       lsr(dst, dst, LogKlassAlignmentInBytes);
>     } else {
>       eor(dst, src, (uint64_t)CompressedKlassPointers::base());
>     }
>     return;
>   }
> 
>   if (((uint64_t)CompressedKlassPointers::base() & 0xffffffff) == 0
>       && CompressedKlassPointers::shift() == 0) {
>     movw(dst, src);
>     return;
>   }
> 
>   ... followed by code which does use r27.
> 
> Do you ever see r27 being used? If so, I'd be interested to know how this gets
> triggered and what command-line arguments you use. It's rather inefficient.

I think you're right. I tried hard with various VM options but still failed to
get the code after this part triggered. The worst case I've found is that
the encoding/decoding returns at the if block
  if (((uint64_t)CompressedKlassPointers::base() & 0xffffffff) == 0
      && CompressedKlassPointers::shift() == 0) { ... }

By browsing the code, I found this is caused by a metaspace reservation trick
that always tries to make AArch64 metaspace 4G-aligned. [1]
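
As a small worked illustration of why a 4G-aligned base makes the movw path sufficient: when the low 32 bits of the base are zero and the shift is zero, keeping the low 32 bits of the class pointer gives the same value as subtracting the base, provided the offset fits in 32 bits. The helper names below are hypothetical and only model the arithmetic, not the MacroAssembler code.

```cpp
#include <cassert>
#include <cstdint>

// General form of the encoding: subtract the base, then shift.
uint64_t encodeKlassGeneral(uint64_t klass, uint64_t base, int shift) {
    assert(shift >= 0 && shift < 64);
    return (klass - base) >> shift;
}

// Models the quoted `movw(dst, src)`: keep only the low 32 bits.
uint32_t encodeKlassMovw(uint64_t klass) {
    return static_cast<uint32_t>(klass);
}

// The quoted condition for taking the movw fast path.
bool movwFastPathApplies(uint64_t base, int shift) {
    return (base & 0xffffffffULL) == 0 && shift == 0;
}
```

For example, with an assumed base of 0x800000000 and a class pointer at base + 0x12345678, both encodings yield 0x12345678, so no scratch register holding the base is needed on that path.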

If we do have the confidence that r27 won't be used for class pointers, I will
remove UseCompressedClassPointers in my if condition. Another question, shall
we clean up the (almost) dead code which uses r27 for encoding/decoding class
pointers?

[1] http://hg.openjdk.java.net/jdk/jdk/file/7bdc4f073c7f/src/hotspot/share/memory/metaspace.cpp#l1048

--
Thanks,
Pengfei


From aph at redhat.com  Mon Nov 18 10:06:46 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 18 Nov 2019 10:06:46 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>

On 11/18/19 9:58 AM, Pengfei Li (Arm Technology China) wrote:
> I think you're right. I tried hard with various VM options but still failed to
> get the code after this part triggered. The worst case I've found is that
> the encoding/decoding returns at the if block
>   if (((uint64_t)CompressedKlassPointers::base() & 0xffffffff) == 0
>       && CompressedKlassPointers::shift() == 0) { ... }
> 
> By browsing the code, I found this is caused by a metaspace reservation trick
> that always tries to make AArch64 metaspace 4G-aligned. [1]
> 
> If we do have the confidence that r27 won't be used for class pointers, I will
> remove UseCompressedClassPointers in my if condition. Another question, shall
> we clean up the (almost) dead code which uses r27 for encoding/decoding class
> pointers?
> 
> [1] http://hg.openjdk.java.net/jdk/jdk/file/7bdc4f073c7f/src/hotspot/share/memory/metaspace.cpp#l1048

We should have a flag which is set if the search for nicely-aligned memory
is successful, and then you can use that flag to determine if r27 is
needed.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From Pengfei.Li at arm.com  Mon Nov 18 10:35:18 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Mon, 18 Nov 2019 10:35:18 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
Message-ID: <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

> We should have a flag which is set if the search for nicely-aligned memory is
> successful, and then you can use that flag to determine if r27 is needed.

I just found that in the current HotSpot code, UseCompressedOops must be on
for UseCompressedClassPointers to be on. See arguments.cpp [1].

If this is true, UseCompressedClassPointers cannot be used without
UseCompressedOops. So wouldn't a single condition on UseCompressedOops be
enough? But the x86_64 code which I referenced checks both conditions.
Is it because the relationship between the arguments is subject to change in
the future?

[1] http://hg.openjdk.java.net/jdk/jdk/file/7bdc4f073c7f/src/hotspot/share/runtime/arguments.cpp#l1715

--
Thanks,
Pengfei


From aph at redhat.com  Mon Nov 18 10:39:03 2019
From: aph at redhat.com (Andrew Haley)
Date: Mon, 18 Nov 2019 10:39:03 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
 <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <638df51f-f9e0-5ceb-6509-7d396f85d8ec@redhat.com>

On 11/18/19 10:35 AM, Pengfei Li (Arm Technology China) wrote:
> If this is true, UseCompressedClassPointers cannot be used without
> UseCompressedOops. So wouldn't a single condition of UseCompressedOops be
> enough?

Why do you think so? UseCompressedOops doesn't usually need r27.

> But the x86_64 code which I referenced checks both conditions.
> Is it because the relationship between the arguments is subject to change in
> the future?

I have no idea why these flags depend on each other. I'd use compressed
class pointers all the time, regardless of compressed oops.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From dmitrij.pochepko at bell-sw.com  Mon Nov 18 12:02:13 2019
From: dmitrij.pochepko at bell-sw.com (Dmitrij Pochepko)
Date: Mon, 18 Nov 2019 15:02:13 +0300
Subject: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo
 intrinsic documentation and maintenance improvement
In-Reply-To: <MN2PR01MB609392A8FA8C9EE6C27C17BE8F4D0@MN2PR01MB6093.prod.exchangelabs.com>
References: <fc712566-1d25-88d6-b178-652efdc00f36@bell-sw.com>
 <DB7PR08MB31153C8A293F3B4FC2371B34967F0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <86f91401-b7f6-b634-fef1-b0615b8fcde0@redhat.com>
 <e47a5afa-e9ce-1286-97da-5fb9018846cb@bell-sw.com>
 <399186f2-5cb7-b713-33c5-b29b6eb73eb8@samersoff.net>
 <MN2PR01MB6093B0E0C0BD01587FF96F778F700@MN2PR01MB6093.prod.exchangelabs.com>
 <6843892a-4267-06cd-0af4-4618fce86396@bell-sw.com>
 <MN2PR01MB6093853821555FE4DE0BEB628F4D0@MN2PR01MB6093.prod.exchangelabs.com>
 <MN2PR01MB609392A8FA8C9EE6C27C17BE8F4D0@MN2PR01MB6093.prod.exchangelabs.com>
Message-ID: <3427facc-eb05-d690-eebe-acca39b87d4a@bell-sw.com>


On 18/11/2019 7:03 AM, Patrick Zhang OS wrote:
>>> changing largeLoopExitCondition to 64 LL/UU and 128 for LU/UL will likely make some systems slower as I experimented a lot with this.
> Sorry, my second paragraph was inaccurate. It seems your experiments showed that some cases ran well with the first iteration of the large loop but were better off quitting the loop and going to the small loop immediately for better performance (?). Please correct me if I misunderstood this. Thanks.
>
> Regards
> Patrick

Yes. That's correct.

Thanks,
Dmitrij


From ci_notify at linaro.org  Tue Nov 19 00:52:45 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Tue, 19 Nov 2019 00:52:45 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <1998756915.1196.1574124766267.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/322/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/14 pass: 5,753
Build 1: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 2: aarch64/2019/oct/18 pass: 5,760
Build 3: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 4: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 5: aarch64/2019/oct/28 pass: 5,766
Build 6: aarch64/2019/oct/30 pass: 5,768
Build 7: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 8: aarch64/2019/nov/04 pass: 5,769
Build 9: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 10: aarch64/2019/nov/08 pass: 5,761
Build 11: aarch64/2019/nov/11 pass: 5,762
Build 12: aarch64/2019/nov/13 pass: 5,764; fail: 1
Build 13: aarch64/2019/nov/15 pass: 5,750
Build 14: aarch64/2019/nov/18 pass: 5,750; fail: 1

1 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/14 pass: 8,706; fail: 497; error: 20
Build 1: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 2: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 3: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 4: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 5: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 6: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 7: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 8: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 9: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 10: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 11: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15
Build 12: aarch64/2019/nov/13 pass: 8,773; fail: 509; error: 21
Build 13: aarch64/2019/nov/15 pass: 8,756; fail: 511; error: 19
Build 14: aarch64/2019/nov/18 pass: 8,765; fail: 504; error: 18

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/14 pass: 3,979
Build 1: aarch64/2019/oct/16 pass: 3,979
Build 2: aarch64/2019/oct/18 pass: 3,979
Build 3: aarch64/2019/oct/21 pass: 3,979
Build 4: aarch64/2019/oct/23 pass: 3,980
Build 5: aarch64/2019/oct/28 pass: 3,980
Build 6: aarch64/2019/oct/30 pass: 3,980
Build 7: aarch64/2019/nov/01 pass: 3,980
Build 8: aarch64/2019/nov/04 pass: 3,980
Build 9: aarch64/2019/nov/06 pass: 3,980
Build 10: aarch64/2019/nov/08 pass: 3,980
Build 11: aarch64/2019/nov/11 pass: 3,980
Build 12: aarch64/2019/nov/13 pass: 3,980
Build 13: aarch64/2019/nov/15 pass: 3,981
Build 14: aarch64/2019/nov/18 pass: 3,981

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.63x
Relative performance: Server critical-jOPS (nc): 9.74x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-10-12 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/284/results/
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
2019-11-14 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/317/results/
2019-11-16 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/319/results/
2019-11-19 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/322/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From Pengfei.Li at arm.com  Tue Nov 19 10:03:50 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Tue, 19 Nov 2019 10:03:50 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <638df51f-f9e0-5ceb-6509-7d396f85d8ec@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
 <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <638df51f-f9e0-5ceb-6509-7d396f85d8ec@redhat.com>
Message-ID: <DB7PR08MB3115C5D0762DA2EEB3B7DF58964C0@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

> Why do you think so? UseCompressedOops doesn't usually need r27.

If I understand correctly, your point is to allocate r27 as well for some scenarios when UseCompressedOops or UseCompressedClassPointers is on. This optimization is much more aggressive and I will try to do it carefully.

> We should have a flag which is set if the search for nicely-aligned 
> memory is successful, and then you can use that flag to determine if r27 is needed.

In which file do you think we should add the flag? Can we just check the value of CompressedKlassPointers::base() in reg_mask_init()?
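
One way to picture the check being asked about is a predicate over the values already visible in the quoted encode code. This is only a sketch: the struct, its field names and the function are hypothetical, the XOR decision is taken as a given input, and compressed oops (which may need r27 for their own base) are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the settings that select the class-pointer
// encoding path; not a HotSpot type.
struct KlassEncodingConfig {
    bool useCompressedClassPointers;
    bool useXorForBase;   // stands in for use_XOR_for_compressed_class_base
    uint64_t base;        // stands in for CompressedKlassPointers::base()
    int shift;            // stands in for CompressedKlassPointers::shift()
};

// True only when class-pointer encoding would fall through to the general
// path, which (per the quoted code) is the one that uses r27.
bool klassEncodingNeedsR27(const KlassEncodingConfig& c) {
    assert(c.shift >= 0 && c.shift < 64);
    if (!c.useCompressedClassPointers) return false;  // nothing to encode
    if (c.useXorForBase) return false;                // eor path
    if ((c.base & 0xffffffffULL) == 0 && c.shift == 0) return false;  // movw path
    return true;
}
```

If a predicate like this were evaluated once the base is known, a register-mask initialiser could consult its result instead of testing the command-line flags directly. Where such a flag should live is exactly the open question in this message.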

--
Thanks,
Pengfei

From matthias.baesken at sap.com  Wed Nov 20 08:26:17 2019
From: matthias.baesken at sap.com (Baesken, Matthias)
Date: Wed, 20 Nov 2019 08:26:17 +0000
Subject: [aarch64-port-dev ] runtime/memory/ReadFromNoaccessArea.java
	crashes on aarch64
Message-ID: <AM6PR02MB50785EEFAF6E963466727CA5934F0@AM6PR02MB5078.eurprd02.prod.outlook.com>

Hello, are you aware that the jtreg hotspot test runtime/memory/ReadFromNoaccessArea.java has been crashing on aarch64 for some days?
We have been seeing the crash since 15 November.

The test output looks like this:

stdout: [#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (vtableStubs.cpp:197), pid=6213, tid=6217
#  guarantee(masm->pc() <= s->code_end()) failed: itable #2: overflowed buffer, estimated len: 176, actual len: 180, overrun: 4
#
# JRE version: OpenJDK Runtime Environment (14.0.0.1) (build 14.0.0.1-internal+0-adhoc.openjdk.jdk)
# Java VM: OpenJDK 64-Bit Server VM (14.0.0.1-internal+0-adhoc.openjdk.jdk, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0xdcd534]  VtableStubs::bookkeeping(MacroAssembler*, outputStream*, VtableStub*, unsigned char*, unsigned char*, bool, int, int, int)+0x114
#
# CreateCoredumpOnCrash turned off, no core file dumped
#
# An error report file with more information is saved as:
# /mytestdir/jtreg_hotspot_tier1_work/JTwork/runtime/memory/ReadFromNoaccessArea/hs_err_pid6213.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
];
stderr: []
exitValue = 1

java.lang.RuntimeException: 'SIGSEGV' missing from stdout/stderr

         at jdk.test.lib.process.OutputAnalyzer.shouldContain(OutputAnalyzer.java:187)
         at ReadFromNoaccessArea.main(ReadFromNoaccessArea.java:74)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
         at java.base/java.lang.reflect.Method.invoke(Method.java:564)
         at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
         at java.base/java.lang.Thread.run(Thread.java:833)

JavaTest Message: Test threw exception: java.lang.RuntimeException: 'SIGSEGV' missing from stdout/stderr

JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: 'SIGSEGV' missing from stdout/stderr
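
For reference, the numbers in the guarantee() message above work out to exactly one instruction: the stub was estimated at 176 bytes, 180 bytes were emitted, and A64 instructions are a fixed 4 bytes each. Below is a minimal sketch of that arithmetic. StubBuffer, overrun() and kA64InstructionBytes are made-up names for illustration only, not HotSpot's own code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the size check behind the guarantee() message.
// None of these names come from HotSpot; they only illustrate the arithmetic.
struct StubBuffer {
    std::uintptr_t code_begin;  // first byte reserved for the stub
    int estimated_len;          // bytes reserved from the size estimate
    std::uintptr_t code_end() const { return code_begin + estimated_len; }
};

// Bytes emitted past the reserved region (0 if the stub fits).
// The guarantee in the log requires pc <= code_end, i.e. overrun == 0.
inline int overrun(const StubBuffer& s, std::uintptr_t pc) {
    assert(s.estimated_len >= 0);
    return pc > s.code_end() ? static_cast<int>(pc - s.code_end()) : 0;
}

// A64 instructions are a fixed 4 bytes, so an overrun maps to whole instructions.
constexpr int kA64InstructionBytes = 4;

inline int overrun_in_instructions(const StubBuffer& s, std::uintptr_t pc) {
    return overrun(s, pc) / kA64InstructionBytes;
}
```

With estimated_len = 176 and a pc 180 bytes past code_begin, overrun() gives 4 and overrun_in_instructions() gives 1, which matches "overrun: 4" in the log.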


Thread info from hs_err

---------------  T H R E A D  ---------------

Current thread (0x0000ffffa4028800):  JavaThread "main" [_thread_in_vm, id=6217, stack(0x0000ffffa9f3b000,0x0000ffffaa13b000)]

Stack: [0x0000ffffa9f3b000,0x0000ffffaa13b000],  sp=0x0000ffffaa137f40,  free space=2035k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xdcd534]  VtableStubs::bookkeeping(MacroAssembler*, outputStream*, VtableStub*, unsigned char*, unsigned char*, bool, int, int, int)+0x114
V  [libjvm.so+0xdce1fc]  VtableStubs::create_itable_stub(int)+0x584
V  [libjvm.so+0xdccbd0]  VtableStubs::find_stub(bool, int)+0x1f0
V  [libjvm.so+0x4e165c]  CompiledIC::set_to_megamorphic(CallInfo*, Bytecodes::Code, bool&, Thread*)+0x74
V  [libjvm.so+0xb62358]  SharedRuntime::handle_ic_miss_helper_internal(Handle, CompiledMethod*, frame const&, methodHandle, Bytecodes::Code, CallInfo&, bool&, Thread*)+0x1e0
V  [libjvm.so+0xb63144]  SharedRuntime::handle_ic_miss_helper(JavaThread*, Thread*)+0x6c4
V  [libjvm.so+0xb63380]  SharedRuntime::handle_wrong_method_ic_miss(JavaThread*)+0x38
v  ~RuntimeStub::ic_miss_stub
j  jdk.internal.module.ModuleBootstrap.boot()Ljava/lang/ModuleLayer;+1323 java.base@14.0.0.1-internal
j  java.lang.System.initPhase2(ZZ)I+0 java.base@14.0.0.1-internal
v  ~StubRoutines::call_stub
V  [libjvm.so+0x712b60]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x368
V  [libjvm.so+0x711460]  JavaCalls::call_static(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0xf8
V  [libjvm.so+0xd5a2dc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x8f4
V  [libjvm.so+0x7a23b0]  JNI_CreateJavaVM+0x78
C  [libjli.so+0x48d0]  JavaMain+0x70
C  [libjli.so+0x885c]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7060]  start_thread+0xb0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v  ~RuntimeStub::ic_miss_stub
J 41 c1 jdk.internal.module.ModuleBootstrap$2.hasNext()Z java.base@14.0.0.1-internal (30 bytes) @ 0x0000ffff8cbf12e0 [0x0000ffff8cbf1080+0x0000000000000260]
j  java.lang.Module.implAddOpensToAllUnnamed(Ljava/util/Iterator;)V+47 java.base@14.0.0.1-internal
j  java.lang.System$2.addOpensToAllUnnamed(Ljava/lang/Module;Ljava/util/Iterator;)V+2 java.base@14.0.0.1-internal
j  jdk.internal.module.ModuleBootstrap.addIllegalAccess(Ljava/lang/module/ModuleFinder;Ljava/util/Map;Ljava/util/Map;Ljava/lang/ModuleLayer;Z)V+573 java.base@14.0.0.1-internal
j  jdk.internal.module.ModuleBootstrap.boot()Ljava/lang/ModuleLayer;+1323 java.base@14.0.0.1-internal
j  java.lang.System.initPhase2(ZZ)I+0 java.base@14.0.0.1-internal
v  ~StubRoutines::call_stub


Best regards, Matthias


From ci_notify at linaro.org  Thu Nov 21 03:05:25 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Thu, 21 Nov 2019 03:05:25 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <1620776588.1768.1574305526541.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/324/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/16 pass: 5,753; fail: 1
Build 1: aarch64/2019/oct/18 pass: 5,760
Build 2: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 3: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 4: aarch64/2019/oct/28 pass: 5,766
Build 5: aarch64/2019/oct/30 pass: 5,768
Build 6: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 7: aarch64/2019/nov/04 pass: 5,769
Build 8: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 9: aarch64/2019/nov/08 pass: 5,761
Build 10: aarch64/2019/nov/11 pass: 5,762
Build 11: aarch64/2019/nov/13 pass: 5,764; fail: 1
Build 12: aarch64/2019/nov/15 pass: 5,750
Build 13: aarch64/2019/nov/18 pass: 5,750; fail: 1
Build 14: aarch64/2019/nov/20 pass: 5,752

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/16 pass: 8,702; fail: 509; error: 17
Build 1: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 2: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 3: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 4: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 5: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 6: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 7: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 8: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 9: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 10: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15
Build 11: aarch64/2019/nov/13 pass: 8,773; fail: 509; error: 21
Build 12: aarch64/2019/nov/15 pass: 8,756; fail: 511; error: 19
Build 13: aarch64/2019/nov/18 pass: 8,765; fail: 504; error: 18
Build 14: aarch64/2019/nov/20 pass: 8,768; fail: 504; error: 19

2 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/16 pass: 3,979
Build 1: aarch64/2019/oct/18 pass: 3,979
Build 2: aarch64/2019/oct/21 pass: 3,979
Build 3: aarch64/2019/oct/23 pass: 3,980
Build 4: aarch64/2019/oct/28 pass: 3,980
Build 5: aarch64/2019/oct/30 pass: 3,980
Build 6: aarch64/2019/nov/01 pass: 3,980
Build 7: aarch64/2019/nov/04 pass: 3,980
Build 8: aarch64/2019/nov/06 pass: 3,980
Build 9: aarch64/2019/nov/08 pass: 3,980
Build 10: aarch64/2019/nov/11 pass: 3,980
Build 11: aarch64/2019/nov/13 pass: 3,980
Build 12: aarch64/2019/nov/15 pass: 3,981
Build 13: aarch64/2019/nov/18 pass: 3,981
Build 14: aarch64/2019/nov/20 pass: 3,981

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 8.14x
Relative performance: Server critical-jOPS (nc): 9.52x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 207.57

Server 207.57 / Server 2014-04-01 (71.00): 2.92x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-10-15 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/287/results/
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
2019-11-14 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/317/results/
2019-11-16 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/319/results/
2019-11-19 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/322/results/
2019-11-21 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/324/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From Xiaohong.Gong at arm.com  Thu Nov 21 08:28:16 2019
From: Xiaohong.Gong at arm.com (Xiaohong Gong (Arm Technology China))
Date: Thu, 21 Nov 2019 08:28:16 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
Message-ID: <VE1PR08MB5008E08B6935A17666327503F54E0@VE1PR08MB5008.eurprd08.prod.outlook.com>

Hi Felix,

I ran into a similar SIGILL on the aarch64 platform as well. Here is the related JBS issue, which contains the resolution: https://bugs.openjdk.java.net/browse/JDK-8234321
Hope this helps!

Thanks,
Xiaohong Gong

From aph at redhat.com  Thu Nov 21 10:03:58 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Nov 2019 10:03:58 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
Message-ID: <9ce8d22e-875e-ca04-8ca0-ee7d269e4754@redhat.com>

On 11/15/19 8:33 AM, Yangfei (Felix) wrote:

>   I am witnessing some SIGILL jvm crashes on my aarch64 platform.
>   I looked at the ISB usage, especially this one: https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2014-September/001376.html
>   One of changes is adding one ISB after the native call returns:

Yes, I did that.

> static void rt_call(MacroAssembler* masm, address dest, int gpargs, int fpargs, int type) {
>   CodeBlob *cb = CodeCache::find_blob(dest);
>   if (cb) {
>     __ far_call(RuntimeAddress(dest));
>   } else {
>     assert((unsigned)gpargs < 256, "eek!");
>     assert((unsigned)fpargs < 32, "eek!");
>     __ lea(rscratch1, RuntimeAddress(dest));
>     __ blr(rscratch1);
>     __ maybe_isb();    // <======== the ISB after the native call returns
>   }
> }
> 
>
>   The rt_call function is used in generate_native_wrapper to make the
>   JNI call.
>   As I didn't see the barrier for the ppc or arm port.  I would like
>   to know more details here.  Does anyone still remember?

What question are you asking? The ISB is there because the callout
might run concurrently with a safepoint, during which time the code
cache may be changed by some other thread. While we are in native code
safepoints can run in other threads without us knowing.

>   Also the ISB is planted only in the else block.  I assume this is
>   also necessary for the if block.  Correct?

No. The if block is for calls within the AArch64 Java runtime, so we
stay in Java, and there shouldn't be any ISB needed. Any part of the
Java runtime that loads or generates code does its own ISB.
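
A rough sketch of the rule described above, for readers who want it spelled out. This is not the HotSpot code: CodeCacheRange and plan_rt_call() are hypothetical names, and the instruction names are returned as plain strings purely for illustration. It only models the decision "ISB after a call that leaves generated code, none after a call that stays inside the code cache".

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the code cache address range; not HotSpot's API.
struct CodeCacheRange {
    std::uintptr_t low;
    std::uintptr_t high;  // exclusive upper bound
    bool contains(std::uintptr_t addr) const { return addr >= low && addr < high; }
};

// Returns the instruction sequence a call site would get under the rule
// sketched in this thread: an ISB only after a call that leaves generated code.
inline std::vector<std::string> plan_rt_call(const CodeCacheRange& cache,
                                             std::uintptr_t dest) {
    assert(cache.low < cache.high);
    if (cache.contains(dest)) {
        // Target is inside the code cache: we stay in the Java runtime,
        // which is assumed to do its own ISB wherever it loads or generates code.
        return {"far_call"};
    }
    // Target is native code: a safepoint may change the code cache while this
    // thread is away, so resynchronise the instruction stream on return.
    return {"lea", "blr", "isb"};
}
```

For a destination inside the range this yields just the far call; for any other destination the sequence ends with the ISB.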

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Thu Nov 21 11:47:05 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Thu, 21 Nov 2019 11:47:05 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <9ce8d22e-875e-ca04-8ca0-ee7d269e4754@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <9ce8d22e-875e-ca04-8ca0-ee7d269e4754@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820ED6042867@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Thursday, November 21, 2019 6:04 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
> 
> On 11/15/19 8:33 AM, Yangfei (Felix) wrote:
> >
> >   The rt_call function is used in generate_native_wrapper to make the
> >   JNI call.
> >   As I didn't see the barrier for the ppc or arm port.  I would like
> >   to know more details here.  Does anyone still remember?
> 
> What question are you asking? The ISB is there because the callout might run
> concurrently with a safepoint, during which time the code cache may be
> changed by some other thread. While we are in native code safepoints can run
> in other threads without us knowing.

I didn't find this barrier for the ppc or arm port.
My question: is it necessary to plant an instruction barrier in the same place for those ports?
Please let me know if I missed anything here.

> 
> >   Also the ISB is planted only in the else block.  I assume this is
> >   also necessary for the if block.  Correct?
> 
> No. The if block is for calls within the AArch64 Java runtime, so we stay in Java,
> and there shouldn't be any ISB needed. Any part of the Java runtime that loads
> or generates code does its own ISB.

I see.  Does that mean the isb in LIR_Assembler::rt_call should be planted in the else block? 
I mean: 

diff -r dc45ed0ab083 src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp
--- a/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp       Wed Nov 13 15:16:45 2019 -0800
+++ b/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp       Thu Nov 21 19:25:00 2019 +0800
@@ -2906,12 +2906,12 @@
   } else {
     __ mov(rscratch1, RuntimeAddress(dest));
     __ blr(rscratch1);
+    __ maybe_isb();
   }

   if (info != NULL) {
     add_call_info_here(info);
   }
-  __ maybe_isb();
 }


Thanks,
Felix

From aph at redhat.com  Thu Nov 21 14:04:07 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Nov 2019 14:04:07 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB3115C5D0762DA2EEB3B7DF58964C0@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
 <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <638df51f-f9e0-5ceb-6509-7d396f85d8ec@redhat.com>
 <DB7PR08MB3115C5D0762DA2EEB3B7DF58964C0@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <399b921f-2110-3ceb-24b9-e346e7733ab5@redhat.com>

On 11/19/19 10:03 AM, Pengfei Li (Arm Technology China) wrote:
>> We should have a flag which is set if the search for nicely-aligned 
>> memory is successful, and then you can use that flag to determine if r27 is needed.
> In which file do you think we should add the flag? Can we just check the value of CompressedKlassPointers::base() in reg_mask_init() ?

I would call from the #ifdef AARCH64 code that allocates the memory into
a static method Assembler::setCompressedBaseAndScale().

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Thu Nov 21 14:09:52 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Nov 2019 14:09:52 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6042867@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <9ce8d22e-875e-ca04-8ca0-ee7d269e4754@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6042867@dggeml527-mbx.china.huawei.com>
Message-ID: <931032a7-5548-f847-e271-df5e4acf81ed@redhat.com>

On 11/21/19 11:47 AM, Yangfei (Felix) wrote:
>> -----Original Message-----
>> From: Andrew Haley [mailto:aph at redhat.com]
>> Sent: Thursday, November 21, 2019 6:04 PM
>> To: Yangfei (Felix) <felix.yang at huawei.com>;
>> aarch64-port-dev at openjdk.java.net
>> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
>>
>> On 11/15/19 8:33 AM, Yangfei (Felix) wrote:
>>>
>>>   The rt_call function is used in generate_native_wrapper to make the
>>>   JNI call.
>>>   As I didn't see the barrier for the ppc or arm port.  I would like
>>>   to know more details here.  Does anyone still remember?
>>
>> What question are you asking? The ISB is there because the callout might run
>> concurrently with a safepoint, during which time the code cache may be
>> changed by some other thread. While we are in native code safepoints can run
>> in other threads without us knowing.
> 
> I didn't find this barrier for the ppc or arm port.  
> My question: is it necessary to plant an instruction barrier in the same place for those ports?

Probably. Some code was discussed recently to do pipeline
flushing for x86 and others. I can't find it right now...

>>>   Also the ISB is planted only in the else block.  I assume this is
>>>   also necessary for the if block.  Correct?
>>
>> No. The if block is for calls within the AArch64 Java runtime, so
>> we stay in Java, and there shouldn't be any ISB needed. Any part of
>> the Java runtime that loads or generates code does its own ISB.
> 
> I see.  Does that mean the isb in LIR_Assembler::rt_call should be planted in the else block? 
> I mean: 
> 
> diff -r dc45ed0ab083 src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp
> --- a/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp       Wed Nov 13 15:16:45 2019 -0800
> +++ b/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp       Thu Nov 21 19:25:00 2019 +0800
> @@ -2906,12 +2906,12 @@
>    } else {
>      __ mov(rscratch1, RuntimeAddress(dest));
>      __ blr(rscratch1);
> +    __ maybe_isb();
>    }
> 
>    if (info != NULL) {
>      add_call_info_here(info);
>    }
> -  __ maybe_isb();
>  }

Quite possibly, but I wouldn't touch it without a very good reason and
lots of analysis.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Thu Nov 21 15:33:04 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 21 Nov 2019 15:33:04 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <931032a7-5548-f847-e271-df5e4acf81ed@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <9ce8d22e-875e-ca04-8ca0-ee7d269e4754@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820ED6042867@dggeml527-mbx.china.huawei.com>
 <931032a7-5548-f847-e271-df5e4acf81ed@redhat.com>
Message-ID: <3c756be2-4b0e-e4a4-e356-9fa673e97f81@redhat.com>

On 11/21/19 2:09 PM, Andrew Haley wrote:
> Quite possibly, but I wouldn't touch it without a very good reason and
> lots of analysis.

And I have to confess that there are probably unnecessary ISBs. It's
what we in England call "belt and braces", or more politely "defence
in depth".

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From ci_notify at linaro.org  Fri Nov 22 02:31:03 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Fri, 22 Nov 2019 02:31:03 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK 11u on AArch64
Message-ID: <1268109911.1912.1574389864399.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/summary/2019/325/summary.html
 
-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/aug/03 pass: 5,746; fail: 4
Build 1: aarch64/2019/aug/10 pass: 5,747; fail: 4
Build 2: aarch64/2019/aug/15 pass: 5,753; fail: 4
Build 3: aarch64/2019/aug/22 pass: 5,755; fail: 4
Build 4: aarch64/2019/sep/04 pass: 5,764; fail: 2
Build 5: aarch64/2019/sep/05 pass: 5,764; fail: 2
Build 6: aarch64/2019/sep/10 pass: 5,764; fail: 2
Build 7: aarch64/2019/sep/17 pass: 5,763; fail: 3
Build 8: aarch64/2019/sep/21 pass: 5,764; fail: 2
Build 9: aarch64/2019/oct/04 pass: 5,764; fail: 2
Build 10: aarch64/2019/oct/17 pass: 5,764; fail: 2
Build 11: aarch64/2019/oct/31 pass: 5,784; fail: 1
Build 12: aarch64/2019/nov/09 pass: 5,773; fail: 3
Build 13: aarch64/2019/nov/16 pass: 5,775; fail: 1
Build 14: aarch64/2019/nov/21 pass: 5,775; fail: 1

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/aug/03 pass: 8,429; fail: 509; error: 18
Build 1: aarch64/2019/aug/10 pass: 8,450; fail: 485; error: 16
Build 2: aarch64/2019/aug/15 pass: 8,443; fail: 496; error: 13
Build 3: aarch64/2019/aug/22 pass: 8,446; fail: 494; error: 15
Build 4: aarch64/2019/sep/04 pass: 8,483; fail: 465; error: 10
Build 5: aarch64/2019/sep/05 pass: 8,465; fail: 479; error: 14
Build 6: aarch64/2019/sep/10 pass: 8,444; fail: 500; error: 14
Build 7: aarch64/2019/sep/17 pass: 8,462; fail: 482; error: 12
Build 8: aarch64/2019/sep/21 pass: 8,467; fail: 478; error: 13
Build 9: aarch64/2019/oct/04 pass: 8,444; fail: 498; error: 16
Build 10: aarch64/2019/oct/17 pass: 8,452; fail: 493; error: 16
Build 11: aarch64/2019/oct/31 pass: 8,468; fail: 490; error: 14
Build 12: aarch64/2019/nov/09 pass: 8,487; fail: 470; error: 16
Build 13: aarch64/2019/nov/16 pass: 8,475; fail: 484; error: 15
Build 14: aarch64/2019/nov/21 pass: 8,489; fail: 497; error: 13

4 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/aug/03 pass: 3,908
Build 1: aarch64/2019/aug/10 pass: 3,909
Build 2: aarch64/2019/aug/15 pass: 3,909
Build 3: aarch64/2019/aug/22 pass: 3,909
Build 4: aarch64/2019/sep/04 pass: 3,910
Build 5: aarch64/2019/sep/05 pass: 3,910
Build 6: aarch64/2019/sep/10 pass: 3,910
Build 7: aarch64/2019/sep/17 pass: 3,910
Build 8: aarch64/2019/sep/21 pass: 3,910
Build 9: aarch64/2019/oct/04 pass: 3,910
Build 10: aarch64/2019/oct/17 pass: 3,910
Build 11: aarch64/2019/oct/31 pass: 3,910
Build 12: aarch64/2019/nov/09 pass: 3,910
Build 13: aarch64/2019/nov/16 pass: 3,910
Build 14: aarch64/2019/nov/21 pass: 3,910

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdk11u/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 7.38x
Relative performance: Server critical-jOPS (nc): 8.20x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 210.67

Server 210.67 / Server 2014-04-01 (71.00): 2.97x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdk11u/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-08-04 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/215/results/
2019-08-11 pass rate: 10488/10488, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/222/results/
2019-08-16 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/227/results/
2019-08-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/234/results/
2019-09-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/247/results/
2019-09-07 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/248/results/
2019-09-11 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/253/results/
2019-09-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/260/results/
2019-09-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/264/results/
2019-10-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/277/results/
2019-10-18 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/290/results/
2019-11-01 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/304/results/
2019-11-10 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/313/results/
2019-11-17 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/320/results/
2019-11-22 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/2019/325/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdk11u/jcstress-nightly-runs/

From Pengfei.Li at arm.com  Fri Nov 22 08:45:47 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Fri, 22 Nov 2019 08:45:47 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <399b921f-2110-3ceb-24b9-e346e7733ab5@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <DB7PR08MB3115D5D4659DC9A7886F2A44964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <355b7629-2ef0-316e-8d75-0593a8a50dc8@redhat.com>
 <DB7PR08MB311548BD55E8D8F07D3E97DF964D0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <638df51f-f9e0-5ceb-6509-7d396f85d8ec@redhat.com>
 <DB7PR08MB3115C5D0762DA2EEB3B7DF58964C0@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <399b921f-2110-3ceb-24b9-e346e7733ab5@redhat.com>
Message-ID: <DB7PR08MB3115AC735AAABBEEA51B439496490@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

> > In which file do you think we should add the flag? Can we just check the
> value of CompressedKlassPointers::base() in reg_mask_init() ?
> 
> I would call from the #ifdef AARCH64 code that allocates the memory into a
> static method Assembler::setCompressedBaseAndScale().

Thanks for your suggestion. I did try setting a flag from the metaspace reservation code, but I am now switching back to my other approach. My justification is below.

The #ifdef code block which allocates metaspace is actually used by both AARCH64 and AIX. Of course, we can add AArch64-specific logic inside it with AARCH64_ONLY(), but that doesn't cover all scenarios in which r27 isn't used. In klass pointer encoding and decoding, we have a special path called use_XOR_for_compressed_class_base, where the metaspace may not fit nicely but r27 still isn't used. [1]

Regarding your suggestion of setting the compressed base and shift values in the AArch64 assembler, it can solve the problem of covering the use_XOR_for_compressed_class_base path. But we would have to do it in Metaspace::set_narrow_klass_base_and_shift(), where the base and shift are finally determined, and introduce a new "#ifdef AARCH64 ... #endif" code block in HotSpot shared code.

In my approach, I added a method in aarch64.ad to check the base and shift in reg_mask_init(), and moved the logic of use_XOR_for_compressed_class_base there from the MacroAssembler constructor. I know my implementation has a drawback: the logic of my new method may get out of sync with the encoding/decoding logic if someone changes the MacroAssembler code without noticing my code. So I also added a few lines of comments to keep this from happening. See my updated webrev below.

http://cr.openjdk.java.net/~pli/rfr/8233743/webrev.01/
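
For readers without the webrev at hand, the idea can be sketched in standalone C++. This is only a rough model and not the code in the webrev: the struct, the function names and the one-bit-per-register mask are all hypothetical, and only two klass-pointer paths are modeled (the XOR path and the case where the low 32 bits of the base are zero with a zero shift). Everything else, including the compressed oops side, is conservatively treated as needing r27.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the compressed class pointer settings. These names
// are illustrative and are not HotSpot declarations.
struct KlassEncoding {
  uint64_t base;     // models CompressedKlassPointers::base()
  int      shift;    // models CompressedKlassPointers::shift()
  bool     use_xor;  // models use_XOR_for_compressed_class_base
};

// Returns false only for the two paths modeled here that do not touch r27.
// Any other combination is assumed (conservatively) to need r27.
bool klass_coding_needs_r27(const KlassEncoding& e) {
  if (e.use_xor) {
    return false;  // XOR path: base is applied as an immediate
  }
  if ((e.base & 0xffffffffULL) == 0 && e.shift == 0) {
    return false;  // low 32 bits of base are zero and there is no shift
  }
  return true;
}

// Hypothetical register mask with one bit per general-purpose register.
// r27 is removed from the candidates only when the check above says so.
uint32_t allocatable_mask(uint32_t candidates, const KlassEncoding& e) {
  const uint32_t r27_bit = 1u << 27;
  return klass_coding_needs_r27(e) ? (candidates & ~r27_bit) : candidates;
}
```

The drawback mentioned above shows up directly in such a model: klass_coding_needs_r27() has to be kept in step by hand with whatever paths the encode/decode code really takes.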

Please let me know if you have any further suggestions or disagreements.

[1] http://hg.openjdk.java.net/jdk/jdk/file/fcd74557a9cc/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l3918

--
Thanks,
Pengfei


From ci_notify at linaro.org  Sat Nov 23 01:36:11 2019
From: ci_notify at linaro.org (ci_notify at linaro.org)
Date: Sat, 23 Nov 2019 01:36:11 +0000 (UTC)
Subject: [aarch64-port-dev ] JTREG, JCStress,
 SPECjbb2015 and Hadoop/Terasort results for OpenJDK JDK on AArch64
Message-ID: <1328467443.2022.1574472971999.JavaMail.javamailuser@localhost>

This is a summary of the JTREG test results
===========================================
 
The build and test results are cycled every 15 days.
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/summary/2019/326/summary.html
 
-------------------------------------------------------------------------------
client-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,780; fail: 19; not run: 90

-------------------------------------------------------------------------------
client-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,495; fail: 670; error: 23

-------------------------------------------------------------------------------
client-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

-------------------------------------------------------------------------------
release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/18 pass: 5,760
Build 1: aarch64/2019/oct/21 pass: 5,716; fail: 43; error: 1
Build 2: aarch64/2019/oct/23 pass: 5,760; fail: 1
Build 3: aarch64/2019/oct/28 pass: 5,766
Build 4: aarch64/2019/oct/30 pass: 5,768
Build 5: aarch64/2019/nov/01 pass: 5,768; fail: 1
Build 6: aarch64/2019/nov/04 pass: 5,769
Build 7: aarch64/2019/nov/06 pass: 5,766; fail: 2
Build 8: aarch64/2019/nov/08 pass: 5,761
Build 9: aarch64/2019/nov/11 pass: 5,762
Build 10: aarch64/2019/nov/13 pass: 5,764; fail: 1
Build 11: aarch64/2019/nov/15 pass: 5,750
Build 12: aarch64/2019/nov/18 pass: 5,750; fail: 1
Build 13: aarch64/2019/nov/20 pass: 5,752
Build 14: aarch64/2019/nov/22 pass: 5,752; fail: 1

1 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/18 pass: 8,694; fail: 522; error: 17
Build 1: aarch64/2019/oct/21 pass: 8,705; fail: 512; error: 18
Build 2: aarch64/2019/oct/23 pass: 8,712; fail: 505; error: 18
Build 3: aarch64/2019/oct/28 pass: 8,711; fail: 509; error: 18
Build 4: aarch64/2019/oct/30 pass: 8,723; fail: 504; error: 19
Build 5: aarch64/2019/nov/01 pass: 8,774; fail: 506; error: 18
Build 6: aarch64/2019/nov/04 pass: 8,777; fail: 509; error: 17
Build 7: aarch64/2019/nov/06 pass: 8,775; fail: 507; error: 19
Build 8: aarch64/2019/nov/08 pass: 8,774; fail: 510; error: 17
Build 9: aarch64/2019/nov/11 pass: 8,777; fail: 509; error: 15
Build 10: aarch64/2019/nov/13 pass: 8,773; fail: 509; error: 21
Build 11: aarch64/2019/nov/15 pass: 8,756; fail: 511; error: 19
Build 12: aarch64/2019/nov/18 pass: 8,765; fail: 504; error: 18
Build 13: aarch64/2019/nov/20 pass: 8,768; fail: 504; error: 19
Build 14: aarch64/2019/nov/22 pass: 8,769; fail: 507; error: 18

3 fatal errors were detected; please follow the link above for more detail.

-------------------------------------------------------------------------------
release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2019/oct/18 pass: 3,979
Build 1: aarch64/2019/oct/21 pass: 3,979
Build 2: aarch64/2019/oct/23 pass: 3,980
Build 3: aarch64/2019/oct/28 pass: 3,980
Build 4: aarch64/2019/oct/30 pass: 3,980
Build 5: aarch64/2019/nov/01 pass: 3,980
Build 6: aarch64/2019/nov/04 pass: 3,980
Build 7: aarch64/2019/nov/06 pass: 3,980
Build 8: aarch64/2019/nov/08 pass: 3,980
Build 9: aarch64/2019/nov/11 pass: 3,980
Build 10: aarch64/2019/nov/13 pass: 3,980
Build 11: aarch64/2019/nov/15 pass: 3,981
Build 12: aarch64/2019/nov/18 pass: 3,981
Build 13: aarch64/2019/nov/20 pass: 3,981
Build 14: aarch64/2019/nov/22 pass: 3,981

-------------------------------------------------------------------------------
server-release/hotspot
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 5,787; fail: 18; not run: 90

-------------------------------------------------------------------------------
server-release/jdk
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 8,476; fail: 686; error: 27

-------------------------------------------------------------------------------
server-release/langtools
-------------------------------------------------------------------------------
Build 0: aarch64/2018/oct/15 pass: 3,970; fail: 5

Previous results can be found here: 
 
  http://openjdk.linaro.org/jdkX/openjdk-jtreg-nightly-tests/index.html
 

SPECjbb2015 composite regression test completed
===============================================

This test measures the relative performance of the server
compiler running the SPECjbb2015 composite tests and compares
the performance against the baseline performance of the server
compiler taken on 2016-11-21.

In accordance with [1], the SPECjbb2015 tests are run on a system
which is not production ready and does not meet all the
requirements for publishing compliant results. The numbers below
shall be treated as non-compliant (nc) and are for experimental
purposes only.

Relative performance: Server max-jOPS (nc): 8.24x
Relative performance: Server critical-jOPS (nc): 9.93x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/SPECjbb2015-results/

[1] http://www.spec.org/fairuse.html#Academic

Regression test Hadoop-Terasort completed
=========================================

This test measures the performance of the server and client compilers
running Hadoop sorting a 1GB file using Terasort and compares
the performance against the baseline performance of the Zero interpreter
and against the baseline performance of the server compiler
on 2014-04-01.

Relative performance: Zero: 1.0, Server: 213.86

Server 213.86 / Server 2014-04-01 (71.00): 3.01x

Details of the test setup and historical results may be found here:

    http://openjdk.linaro.org/jdkX/hadoop-terasort-benchmark-results/

This is a summary of the jcstress test results
==============================================
 
The build and test results are cycled every 15 days.
 
2019-10-17 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/289/results/
2019-10-19 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/291/results/
2019-10-22 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/294/results/
2019-10-23 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/296/results/
2019-10-29 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/301/results/
2019-10-31 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/303/results/
2019-11-02 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/305/results/
2019-11-05 pass rate: 10489/10489, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/308/results/
2019-11-07 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/310/results/
2019-11-12 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/315/results/
2019-11-14 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/317/results/
2019-11-16 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/319/results/
2019-11-19 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/322/results/
2019-11-21 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/324/results/
2019-11-23 pass rate: 10490/10490, results: http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/2019/326/results/
 
For detailed information on the test output please refer to: 
 
  http://openjdk.linaro.org/jdkX/jcstress-nightly-runs/

From felix.yang at huawei.com  Mon Nov 25 11:33:18 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Mon, 25 Nov 2019 11:33:18 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8233466: aarch64: remove
 unnecessary load of mdo when profiling return and parameters type
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820EDAA21F3E@dggeml527-mbx.china.huawei.com>

Ping?   Any comments?

Thanks,
Felix

From: Yangfei (Felix)
Sent: Thursday, November 7, 2019 9:17 AM
To: hotspot-runtime-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
Subject: RFR(XS): 8233466: aarch64: remove unnecessary load of mdo when profiling return and parameters type

Hi,

   Please review the following patch:

      Bug: https://bugs.openjdk.java.net/browse/JDK-8233466

Webrev: http://cr.openjdk.java.net/~fyang/8233466/webrev.00/


When profiling return and parameter types from the interpreter on the aarch64 platform, 'mdp' is loaded by test_method_data_pointer, which is called by profile_return_type and profile_parameters_type.

So it is not necessary to load the mdo before calling __ profile_return_type or __ profile_parameters_type.


Passed tier1-3 testing.


Thanks,

Felix

From nick.gasson at arm.com  Tue Nov 26 09:02:46 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Tue, 26 Nov 2019 17:02:46 +0800
Subject: [aarch64-port-dev ] runtime/memory/ReadFromNoaccessArea.java
 crashes on aarch64
In-Reply-To: <AM6PR02MB50785EEFAF6E963466727CA5934F0@AM6PR02MB5078.eurprd02.prod.outlook.com>
References: <AM6PR02MB50785EEFAF6E963466727CA5934F0@AM6PR02MB5078.eurprd02.prod.outlook.com>
Message-ID: <19d990fc-7b62-bba2-da95-9abedfb13d37@arm.com>

Hi Matthias,

> Hello, are you aware that the   jtreg hotspot test  runtime/memory/ReadFromNoaccessArea.java   crashes   on aarch64 for some days ?
> We notice  the crash since 15. November .

Thanks, I've made a Jira ticket to track this:

https://bugs.openjdk.java.net/browse/JDK-8234794

It's been failing since the fix for JDK-8231610.


Nick

From nick.gasson at arm.com  Tue Nov 26 09:25:03 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Tue, 26 Nov 2019 17:25:03 +0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
Message-ID: <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>

Hi Andrew,

> 
> I see
> 
>    if (use_XOR_for_compressed_class_base) {
>      if (CompressedKlassPointers::shift() != 0) {
>        eor(dst, src, (uint64_t)CompressedKlassPointers::base());
>        lsr(dst, dst, LogKlassAlignmentInBytes);
>      } else {
>        eor(dst, src, (uint64_t)CompressedKlassPointers::base());
>      }
>      return;
>    }
> 
>    if (((uint64_t)CompressedKlassPointers::base() & 0xffffffff) == 0
>        && CompressedKlassPointers::shift() == 0) {
>      movw(dst, src);
>      return;
>    }
> 
>    ... followed by code which does use r27.
> 
> Do you ever see r27 being used? If so, I'd be interested to know how
> this gets triggered and what command-line arguments you use. It's
> rather inefficient.
> 

Oddly enough the test case runtime/memory/ReadFromNoaccessArea.java now 
hits this. I see:

CompressedKlassPointers::base() => 0xffff0b4b5000
CompressedKlassPointers::shift() => 3

The itable stub calls MacroAssembler::load_klass() twice which then 
calls the above decode_klass_not_null() with dst==src if 
UseCompressedClassPointers is true. So we do the saving/restoring 
rheapbase dance twice which blows up the size of the itable stub beyond 
the estimated 152B max size.

The key is that this test passes -XX:HeapBaseMinAddress=33G. That in 
conjunction with the recent changes to where the CDS archive is loaded 
hits this code path (I don't see this with -Xshare:off).


Thanks,
Nick

From nick.gasson at arm.com  Tue Nov 26 10:34:32 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Tue, 26 Nov 2019 18:34:32 +0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
Message-ID: <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>

(Not related to the original RFR.)

> 
> The itable stub calls MacroAssembler::load_klass() twice which then
> calls the above decode_klass_not_null() with dst==src if
> UseCompressedClassPointers is true. So we do the saving/restoring
> rheapbase dance twice which blows up the size of the itable stub beyond
> the estimated 152B max size.
> 

Actually I don't think we need to call load_klass twice on AArch64? The
compiled code doesn't use callee save registers so we should have plenty
spare to use as temporaries. I.e. could we do the following?

--- a/src/hotspot/cpu/aarch64/vtableStubs_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vtableStubs_aarch64.cpp
@@ -175,6 +175,7 @@ VtableStub* VtableStubs::create_itable_stub(int itable_index) {
   const Register holder_klass_reg   = r16; // declaring interface klass (DECC)
   const Register resolved_klass_reg = rmethod; // resolved interface klass (REFC)
   const Register temp_reg           = r11;
+  const Register temp_reg2          = r15;
   const Register icholder_reg       = rscratch2;

   Label L_no_such_interface;
@@ -193,7 +194,7 @@ VtableStub* VtableStubs::create_itable_stub(int itable_index) {
   __ lookup_interface_method(// inputs: rec. class, interface
                              recv_klass_reg, resolved_klass_reg, noreg,
                              // outputs:  scan temp. reg1, scan temp. reg2
-                             recv_klass_reg, temp_reg,
+                             temp_reg2, temp_reg,
                              L_no_such_interface,
                              /*return_method=*/false);

@@ -201,7 +202,7 @@ VtableStub* VtableStubs::create_itable_stub(int itable_index) {
   start_pc = __ pc();

   // Get selected method from declaring class and itable index
-  __ load_klass(recv_klass_reg, j_rarg0);   // restore recv_klass_reg
+  //__ load_klass(recv_klass_reg, j_rarg0);   // restore recv_klass_reg
   __ lookup_interface_method(// inputs: rec. class, interface, itable index
                              recv_klass_reg, holder_klass_reg, itable_index,
                              // outputs: method, scan temp. reg


Thanks,
Nick

From gnu.andrew at redhat.com  Wed Nov 27 05:31:21 2019
From: gnu.andrew at redhat.com (Andrew John Hughes)
Date: Wed, 27 Nov 2019 05:31:21 +0000
Subject: [aarch64-port-dev ] [RFR] [8u] 8u242-b01 Upstream Sync
Message-ID: <646b15f6-a7bc-b5c5-a502-83fb3df9f54d@redhat.com>

Webrevs: https://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/

Merge changesets:

http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/corba/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/jaxp/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/jaxws/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/jdk/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/hotspot/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/langtools/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/nashorn/merge.changeset
http://cr.openjdk.java.net/~andrew/shenandoah-8/u242-b01/root/merge.changeset

Changes in aarch64-shenandoah-jdk8u242-b01:
  - S8010500: [parfait] Possible null pointer dereference at
hotspot/src/share/vm/opto/loopnode.hpp
  - S8067429: java.lang.VerifyError: Inconsistent stackmap frames at
branch target
  - S8073154: NULL-pointer dereferencing in LIR_OpProfileType::print_instr
  - S8077707: jdk9 b58 cannot run any graphical application on Win 8
with JAWS running
  - S8132249: Clean up JAB debugging code
  - S8133951: Zero interpreter asserts in stubRoutines.cpp
  - S8134739: compiler/loopopts/superword/TestVectorizationWithInvariant
crashes in loop opts
  - S8209835: Aarch64: elide barriers on all volatile operations
  - S8212071: Need to set the FreeType LCD Filter to reduce fringing.
  - S8230238: Add another regression test for JDK-8134739
  - S8230813: Add JDK-8010500 to
compiler/loopopts/superword/TestFuzzPreLoop.java bug list
  - S8231398: Add time tracing for gc log rotation at safepoint cleanup
  - S8231988: Unexpected test result caused by C2
IdealLoopTree::do_remove_empty_loop

Main issues of note:
* 8209835 is already upstream but is part of this tag.
* 8073154 change to src/share/vm/c1/c1_LIR.cpp was already included in
an earlier form as part of "Implement type profiling in C1." [0]. The merge
conflict was resolved to use the 8u upstream version.

diffstat for root
 b/.hgtags |    3 +++
 1 file changed, 3 insertions(+)

diffstat for corba
 b/.hgtags |    3 +++
 1 file changed, 3 insertions(+)

diffstat for jaxp
 b/.hgtags |    3 +++
 1 file changed, 3 insertions(+)

diffstat for jaxws
 b/.hgtags |    3 +++
 1 file changed, 3 insertions(+)

diffstat for langtools
 b/.hgtags                                            |    3
 b/src/share/classes/com/sun/tools/javac/jvm/Gen.java |   19 ++-
 b/test/tools/javac/BranchToFewerDefines.java         |  111 +++++++++++++++++++
 3 files changed, 128 insertions(+), 5 deletions(-)

diffstat for nashorn
 b/.hgtags |    3 +++
 1 file changed, 3 insertions(+)

diffstat for jdk
 b/.hgtags                                                          |    3
 b/src/share/native/sun/font/freetypeScaler.c                       |    3
 b/src/windows/native/sun/bridge/AccessBridgeATInstance.cpp         |    2
 b/src/windows/native/sun/bridge/AccessBridgeJavaEntryPoints.cpp    |    2
 b/src/windows/native/sun/bridge/AccessBridgeJavaVMInstance.cpp     |    2
 b/src/windows/native/sun/bridge/AccessBridgeWindowsEntryPoints.cpp |    1
 b/src/windows/native/sun/bridge/JavaAccessBridge.cpp               |   51 ++--------
 b/src/windows/native/sun/bridge/JavaAccessBridge.h                 |    2
 b/src/windows/native/sun/bridge/WinAccessBridge.cpp                |    4
 9 files changed, 26 insertions(+), 44 deletions(-)

diffstat for hotspot
 b/.hgtags                                                |    3
 b/src/share/vm/c1/c1_LIR.cpp                             |    8 -
 b/src/share/vm/opto/loopTransform.cpp                    |    9 +
 b/src/share/vm/opto/loopnode.hpp                         |    1
 b/src/share/vm/opto/superword.cpp                        |   26 +++-
 b/src/share/vm/runtime/safepoint.cpp                     |    1
 b/src/share/vm/runtime/stubRoutines.cpp                  |    4
 b/test/compiler/loopopts/TestRemoveEmptyLoop.java        |   53 +++++++++
 b/test/compiler/loopopts/superword/TestFuzzPreLoop.java  |   65 +++++++++++
 b/test/compiler/print/TestProfileReturnTypePrinting.java |   68 +++++++++++
 b/test/runtime/RedefineTests/test8178870.sh              |   87 +++++++++++++++
 11 files changed, 312 insertions(+), 13 deletions(-)

Successfully built on x86, x86_64, s390, s390x, ppc, ppc64, ppc64le &
aarch64.

Ok to push?

[0]
https://hg.openjdk.java.net/aarch64-port/jdk8u-shenandoah/hotspot/rev/050fe4f6976ab67316

Thanks,
-- 
Andrew :)

Senior Free Java Software Engineer
Red Hat, Inc. (http://www.redhat.com)

PGP Key: ed25519/0xCFDA0F9B35964222 (hkp://keys.gnupg.net)
Fingerprint = 5132 579D D154 0ED2 3E04  C5A0 CFDA 0F9B 3596 4222
https://keybase.io/gnu_andrew


From aph at redhat.com  Wed Nov 27 10:54:36 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 Nov 2019 10:54:36 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
Message-ID: <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>

On 11/26/19 9:25 AM, Nick Gasson wrote:
> Oddly enough the test case runtime/memory/ReadFromNoaccessArea.java now 
> hits this. I see:
> 
> CompressedKlassPointers::base() => 0xffff0b4b5000
> CompressedKlassPointers::shift() => 3

This is bad. Can you have a look at the allocation code to see why the search
for an appropriate address range fails?

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Wed Nov 27 10:56:59 2019
From: aph at redhat.com (Andrew Haley)
Date: Wed, 27 Nov 2019 10:56:59 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
Message-ID: <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>

On 11/26/19 10:34 AM, Nick Gasson wrote:
> Actually I don't think we need to call load_klass twice on AArch64? The
> compiled code doesn't use callee save registers so we should have plenty
> spare to use as temporaries. I.e. could we do the following?

I guess. It'd be nicer to fix CDS on AArch64 so that it doesn't cause
performance regressions.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From boris.ulasevich at bell-sw.com  Wed Nov 27 12:55:18 2019
From: boris.ulasevich at bell-sw.com (Boris Ulasevich)
Date: Wed, 27 Nov 2019 15:55:18 +0300
Subject: [aarch64-port-dev ] RFR(S) 8234891: AArch64: Fix build failure
	after JDK-8234387
Message-ID: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>

Hi,

Please review the fix in aarch64.ad to address the build issue "Ideal 
node missing: CmpOp" raised after a recent change in C2. The intuitive 
correction of the operand name case, CmpOp->cmpOp, fixes the build but 
leads to an unworkable JVM. Removing the match rule works well, and the 
jdk/hotspot tests are OK.

http://bugs.openjdk.java.net/browse/JDK-8234891
http://cr.openjdk.java.net/~bulasevich/8234891/webrev.00

ARM32 build fails too. I will fix the problem in arm32.ad file separately.

thanks,
Boris

From vladimir.x.ivanov at oracle.com  Wed Nov 27 13:23:57 2019
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Wed, 27 Nov 2019 16:23:57 +0300
Subject: [aarch64-port-dev ] RFR(S) 8234891: AArch64: Fix build failure
	after JDK-8234387
In-Reply-To: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>
References: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>
Message-ID: <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>

The fix looks good and trivial.

Best regards,
Vladimir Ivanov

On 27.11.2019 15:55, Boris Ulasevich wrote:
> Hi,
> 
> Please review the fix in aarch64.ad to address the build issue "Ideal 
> node missing: CmpOp" raised after recent change in C2. The intuitive 
> operand name case correction CmpOp->cmpOp fixes the build, but leads to 
> unworkable jvm. Removing the match rule works good and jdk/hotspot tests 
> are Ok.
> 
> http://bugs.openjdk.java.net/browse/JDK-8234891
> http://cr.openjdk.java.net/~bulasevich/8234891/webrev.00
> 
> ARM32 build fails too. I will fix the problem in arm32.ad file separately.
> 
> thanks,
> Boris

From stuart.monteith at linaro.org  Wed Nov 27 16:06:44 2019
From: stuart.monteith at linaro.org (Stuart Monteith)
Date: Wed, 27 Nov 2019 16:06:44 +0000
Subject: [aarch64-port-dev ] RFR(S) 8234891: AArch64: Fix build failure
 after JDK-8234387
In-Reply-To: <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>
References: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>
 <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>
Message-ID: <CAEGA6kbYiHBkrQb5vTcjPmP1J4n146nGQWoonm=8a_Tg1_6tSg@mail.gmail.com>

Thanks Boris - looks good to me.
Please ask me or my fellow Arm engineers if you should need any help
testing in future.

On Wed, 27 Nov 2019 at 13:26, Vladimir Ivanov
<vladimir.x.ivanov at oracle.com> wrote:
>
> The fix looks good and trivial.
>
> Best regards,
> Vladimir Ivanov
>
> On 27.11.2019 15:55, Boris Ulasevich wrote:
> > Hi,
> >
> > Please review the fix in aarch64.ad to address the build issue "Ideal
> > node missing: CmpOp" raised after recent change in C2. The intuitive
> > operand name case correction CmpOp->cmpOp fixes the build, but leads to
> > unworkable jvm. Removing the match rule works good and jdk/hotspot tests
> > are Ok.
> >
> > http://bugs.openjdk.java.net/browse/JDK-8234891
> > http://cr.openjdk.java.net/~bulasevich/8234891/webrev.00
> >
> > ARM32 build fails too. I will fix the problem in arm32.ad file separately.
> >
> > thanks,
> > Boris

From patrick at os.amperecomputing.com  Thu Nov 28 03:11:48 2019
From: patrick at os.amperecomputing.com (Patrick Zhang OS)
Date: Thu, 28 Nov 2019 03:11:48 +0000
Subject: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub
 threshold of string_compare intrinsic tunable
Message-ID: <MN2PR01MB6093FB8BC624CE2F22D3C2A38F470@MN2PR01MB6093.prod.exchangelabs.com>

Hi Andrew,

I collected the timings and did a comparison; please see the spreadsheet in [1].
Following the comments from Dmitrij in another thread, and having reconsidered the concerns that you and Andrew Dinn raised, I revised the patch to drop the tunable flags and the extra over-prefetch checking for LL/UU, then updated the shared STUB_THRESHOLD for UU/LU/UL respectively, according to the experimental data (but note that I only have one aarch64 system, so the coverage might be limited). 
Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8229351 
Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.05

1. LL is the most common use case, especially for shorter strings. I will not change it, as the tests cannot consistently produce clearly positive data. So this part is the same as what Dmitrij worked on; nothing changed.
2. UU used the same threshold of 72 (chars) as LL, which meant 144 bytes for UU and 72 bytes for LL (with -XX:+CompactStrings by default). I updated it from 72 to 36 so the limit is now fair to UU. The test shows (see figure [2]) ~10% perf gains on average for [36, 71] characters; other lengths are unchanged.
3. LU/UL: updated the threshold from 72 (chars) to 24 (chars). According to the algorithm in generate_compare_long_string_different_encoding, 24 is the minimum length that can take advantage of the compare_string_16_x_LU function to process 16 chars (32 bytes) in a loop, which can be faster than the outer function that processes 8 bytes per iteration of its main loop. See figure [3]; the perf gains are up to 60% (the secondary axis at the right).
4. Added "align(OptoLoopAlignment);" for the main loops in the stub code, per earlier suggestions from Aleksei.
5. Updated the two relevant test cases under test/hotspot/jtreg/compiler/intrinsics/string with additional string lengths that better cover the cases addressed in this patch.
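
The threshold arithmetic in items 2 and 3 can be checked with a small C++ sketch. The identifiers below are illustrative only and do not come from the patch; the numbers are the ones quoted in the list, assuming 1 byte per Latin-1 char and 2 bytes per UTF-16 char.

```cpp
#include <cassert>

// Illustrative names only; none of these identifiers come from the patch.
enum class Encoding { LL, UU, LU, UL };

// Stub threshold in characters for each encoding pair, using the values
// quoted in the list above (LL stays at 72, UU becomes 36, LU/UL become 24).
constexpr int stub_threshold_chars(Encoding e) {
  return e == Encoding::LL ? 72
       : e == Encoding::UU ? 36
       : 24;
}

// Size in bytes of `chars` characters: 1 byte each for Latin-1, 2 for UTF-16.
constexpr int size_in_bytes(int chars, bool latin1) {
  return latin1 ? chars : chars * 2;
}

// Old shared threshold: 72 chars is 72 bytes for LL but 144 bytes for UU.
static_assert(size_in_bytes(72, true) == 72, "LL: 72 chars are 72 bytes");
static_assert(size_in_bytes(72, false) == 144, "UU: 72 chars are 144 bytes");

// New UU threshold: 36 chars cover the same 72 bytes as the LL threshold.
static_assert(size_in_bytes(stub_threshold_chars(Encoding::UU), false) ==
              size_in_bytes(stub_threshold_chars(Encoding::LL), true),
              "UU and LL thresholds match in bytes");

// One 16-char step on the UTF-16 side is 32 bytes, as stated in item 3.
static_assert(size_in_bytes(16, false) == 32, "16 UTF-16 chars are 32 bytes");
```

All of the checks are static_asserts, so compiling the snippet is enough to confirm the numbers.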

More about the figures in the spreadsheet and the JPG files (same content):
The lengths are the scatter points [4] suggested by Andrew. The main axis (at left) shows the times (multiplied by a constant, so don't read the absolute values as ns/op); the blue dots show the base trend, the orange dots belong to the patch, and the yellow dots (secondary axis) stand for the perf gains (patch vs base). For example, in [3], with the patch, the orange curve (patch) becomes "monotonically increasing" over [24, 71] chars, which is better than the shape of the blue curve (base) and is what I (we) want.

Tests: jtreg tier1 all, hotspot_all, string-density-bench.jar, no regression found.

[1] http://cr.openjdk.java.net/~qpzhang/8229351/8229351-strcmp-perf.xlsx 
[2] http://cr.openjdk.java.net/~qpzhang/8229351/perf-strcmp-UU.JPG 
[3] http://cr.openjdk.java.net/~qpzhang/8229351/perf-strcmp-LU.JPG 
[4] 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Friday, November 15, 2019 4:05 PM
To: Andrew Haley <aph at redhat.com>; Andrew Dinn <adinn at redhat.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

To avoid future confusion, I am going to split the patch: I will take out the updates for generate_compare_long_string_different_encoding, which drop two redundant temp Register vars and related unused instructions, and then create a new one for your review. It has nothing to do with the proposed option. 

And I will continue working on the remaining parts according to your comments and suggestions.

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Thursday, November 14, 2019 7:14 PM
To: Andrew Haley <aph at redhat.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

>> Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.
I was thinking that out-of-boundary prefetching should be prevented, and UL/LU has the same condition. If there is no need, we could force largeLoopExitCondition to 64 so that more cases can freely stay in the large loop. I don't think so.
http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4194 
    if (SoftwarePrefetchHintDistance >= 0) {
      __ bind(LARGE_LOOP_PREFETCH);
        __ prfm(Address(str1, SoftwarePrefetchHintDistance));
        __ prfm(Address(str2, SoftwarePrefetchHintDistance));
        compare_string_16_bytes_same(DIFF, DIFF2);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ sub(cnt2, cnt2, isLL ? 64 : 32);
        compare_string_16_bytes_same(DIFF, DIFF2);
-        __ subs(rscratch2, cnt2, largeLoopExitCondition);
+        __ subs(rscratch2, cnt2, 64);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ br(__ GT, LARGE_LOOP_PREFETCH);
        __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
    }

>> Do you have a theory that LU/UL cases are common? Why?
The only "theory" is that compare_string_16_x_LU (in the stub) is faster than the 8-byte main loop (outside the stub) (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l4997). Even outside the large loop of the stub, the small loop can be faster as well, since it is able to process more bytes with fewer instructions (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4203).

I can prepare a new patch with the updates to the tests, and plot the timings soon afterwards.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com>
Sent: Thursday, November 14, 2019 6:33 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (not really changed any 
> in this patch), fixed an out-of-boundary prefetching for LL/UU, and 
> fixed some redundant instructions in LU/UL code path.  The latter two 
> are code-quality-wise, if splitting the patch could make the changes 
> clearer, I'd like to do.

Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little 
> bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars, 
> while for UU it could be explained as bytes. If specified 
> -XX:-CompactStrings, all code path going to UU would make the 
> threshold mean bytes, which might confuse developers. This patch can 
> clarify it, and the description of tunable options can provide further 
> guidance.

It must. Without some commentary both maintainers and developers are lost. Unless there is some very strong reason, all counts must specify units.
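
To make the units point concrete, here is a minimal sketch of converting a threshold given in characters to bytes for each string encoding. It relies only on the fact that compact strings store Latin-1 text at 1 byte per char and UTF-16 text at 2 bytes per char. The class, enum, and method names are hypothetical and are not taken from the patch; 72 is the default threshold mentioned in this thread.

```java
// Hypothetical helper, not part of HotSpot: shows why a count without a
// unit is ambiguous once both Latin-1 and UTF-16 strings are in play.
final class ThresholdUnits {
    enum Coder { LATIN1, UTF16 }

    // Latin-1 strings use 1 byte per char, UTF-16 strings use 2.
    static int bytesPerChar(Coder coder) {
        return coder == Coder.LATIN1 ? 1 : 2;
    }

    // Convert a threshold expressed in chars to the byte count it covers.
    static int charsToBytes(int chars, Coder coder) {
        return chars * bytesPerChar(coder);
    }

    public static void main(String[] args) {
        int thresholdChars = 72; // default threshold discussed in the thread
        for (Coder coder : Coder.values()) {
            System.out.println(coder + ": " + thresholdChars + " chars = "
                    + charsToBytes(thresholdChars, coder) + " bytes");
        }
    }
}
```

The same number therefore covers twice as many bytes for a UTF-16 string as for a Latin-1 string, which is why an option description needs to state its unit.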

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impacts of 
> SoftwarePrefetchHintDistance over the intrinsics, I have locally added 
> non-default thresholds inside and tested with many lengths (took days 
> on a test system). This has not been included in the proposed patch, 
> maybe a follow-up one would do, any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength
> .java
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentL
> ength.java

I won't accept this patch unless it is accompanied by test cases that properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold, the hardcoded 72 is my testing focus 
> since I would try best not to bring negative impacts to aarch64-port 
> normal state, especially other CPU vendors.

> Second, I tested two extreme thresholds: 24 and 255, which means more 
> shorter strings (24 to 71 chars) or only very long strings
> (>=255) could go to the stub code path, respectively. Function tests 
> passed (listed in the initial email), while performance test results 
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) could be varying with different systems (as well as 
> microarchitectures).

> Third, some other non-default thresholds, as sanity check, 
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.

>>> 5. But the more serious problem is the fact that we have different 
>>> code paths for different microarchitectures, and somehow this has to 
>>> be standard supportable software. In order to test this stuff we'll 
>>> need different test parameters for SoftwarePrefetchHintDistance, 
>>> CompareLongStringLimitLatin, CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code insertion, 
> tested on some aarch64 systems. I think making it tunable is the way 
> to let different microarchitectures be able to configure optimal ones 
> for their own.

Well, yes. The question is whether we go down this rabbit hole or try to find a compromise that is perhaps not quite optimal for anyone but good enough for everyone.

> I would like to have a common threshold too, or no threshold for all, 
> but lacking of full-coverage tests over all systems. Maybe I 
> misunderstood you points here with regards to "supportable", the two 
> new options can be kept as default if developers have no concerns on 
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the 
>>> string_compare intrinsic, with the intention that this be as fast as 
>>> possible because we want to avoid having to call the intrinsic. So 
>>> why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding intrinsic", but the strings 
> shorter than 72 chars, I would have to say, "it depends". The stub 
> functions try best to process every 16 chars, while the outer logic 
> processes every 8 bytes, which is the major diff. For example, I can 
> see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe 
> others cannot, which can be reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we can plot a graph. The timings should ideally be monotonically increasing.

And then we could see how different processors behave, and hopefully find a decent solution for all.
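
The shape check described above can be sketched as follows, assuming the per-length timings have already been collected by some benchmark. Only the list of lengths comes from this message; the class name, the method names, and the sample timings in main are hypothetical. Equal neighbouring timings are treated as acceptable, so the check is really "non-decreasing".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports where a timing curve dips as length grows.
final class TimingShape {
    // String lengths (in chars) suggested in the thread.
    static final int[] LENGTHS = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25,
        28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117,
        129, 142, 156, 171, 189, 207, 228, 251
    };

    // Returns every length whose timing is lower than the previous one.
    static List<Integer> dips(int[] lengths, double[] timings) {
        if (lengths.length != timings.length) {
            throw new IllegalArgumentException("one timing per length expected");
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < timings.length; i++) {
            if (timings[i] < timings[i - 1]) {
                result.add(lengths[i]);
            }
        }
        return result;
    }

    // True when the curve never goes down as the length increases.
    static boolean isNonDecreasing(int[] lengths, double[] timings) {
        return dips(lengths, timings).isEmpty();
    }

    public static void main(String[] args) {
        // Made-up timings for illustration only, not measured data.
        int[] lengths = {24, 28, 30, 34};
        double[] timings = {10.0, 11.5, 11.0, 12.0};
        System.out.println("dips at lengths: " + dips(lengths, timings));
    }
}
```

Listing the offending lengths, rather than returning only a boolean, makes it easier to see whether a dip lines up with a threshold such as 72.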

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From nick.gasson at arm.com  Thu Nov 28 07:50:32 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Thu, 28 Nov 2019 15:50:32 +0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
Message-ID: <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>

Hi Andrew,

>>
>> CompressedKlassPointers::base() => 0xffff0b4b5000
>> CompressedKlassPointers::shift() => 3
> 
> This is bad. Can you have a look at the allocation code to see why the search
> for an appropriate address range fails?
> 

We have a loop in Metaspace::allocate_metaspace_compressed_klass_ptrs 
that searches for a 4G aligned location for the compressed class space 
on AArch64, but this search is not done if CDS is in use and the archive 
was loaded successfully, because in that case the class space has 
already been mapped (i.e. `metaspace_rs.is_reserved()' is true).

Previously it was only possible to map the CDS archive at 0x800000000. 
The compressed class base is set to the start of this region which 
happens to be 4G aligned so our MacroAssembler::load_klass optimisation 
applies and we emit the short code sequence.

With the recent change in 8231610, if the CDS archive cannot be mapped 
at that address (e.g. because of ASLR or because the heap is mapped 
there) then the CDS archive will be relocated to an arbitrary address 
decided by mmap. That's where the oddly-aligned compressed klass base 
above comes from. This causes MacroAssembler::load_klass to emit the 
inefficient sequence which then overflows the buffer for the itable stub 
(the worst-case size estimate there is wrong, which needs to be fixed 
separately).

A minimal way to reproduce this is:

$ java -XX:HeapBaseMinAddress=33G -Xshare:on -Xlog:cds=debug -version
...
[0.050s][info ][cds] CDS archive was created with max heap size = 128M, 
and the following configuration:
[0.050s][info ][cds]     narrow_klass_base = 0x0000fffec7507000, 
narrow_klass_shift = 3
...
#  guarantee(masm->pc() <= s->code_end()) failed: itable #2: overflowed 
buffer, estimated len: 180, actual len: 184, overrun: 4


I suggest we move the 4G-aligned search from 
allocate_metaspace_compressed_klass_ptrs into its own function that can 
then be called from MetaspaceShared::reserve_shared_space when 
requested_address==NULL (i.e. the fallback path when mmap at 0x800000000 
fails). If you're happy with this I'll make a patch for review?
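
For reference, the 4G-alignment property discussed above can be checked with a mask on the low 32 bits. This is only an illustrative sketch, not HotSpot code; the class and method names are hypothetical, and the three addresses are the ones quoted in this thread.

```java
// Hypothetical helper: tests whether an address is a multiple of 4G (2^32).
final class KlassBaseAlignment {
    static final long FOUR_G = 1L << 32;

    // An address is 4G aligned when its low 32 bits are all zero.
    static boolean isAligned4G(long address) {
        return (address & (FOUR_G - 1)) == 0;
    }

    public static void main(String[] args) {
        long[] bases = {
            0x800000000L,        // the fixed CDS mapping address
            0xffff0b4b5000L,     // relocated base reported earlier in the thread
            0x0000fffec7507000L  // narrow_klass_base from the log above
        };
        for (long base : bases) {
            System.out.printf("0x%x 4G aligned: %b%n", base, isAligned4G(base));
        }
    }
}
```

Only 0x800000000 passes the check; both relocated bases have non-zero low 32 bits.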


Thanks,
Nick

From boris.ulasevich at bell-sw.com  Thu Nov 28 08:42:43 2019
From: boris.ulasevich at bell-sw.com (Boris Ulasevich)
Date: Thu, 28 Nov 2019 11:42:43 +0300
Subject: [aarch64-port-dev ] RFR(S) 8234891: AArch64: Fix build failure
 after JDK-8234387
In-Reply-To: <CAEGA6kbYiHBkrQb5vTcjPmP1J4n146nGQWoonm=8a_Tg1_6tSg@mail.gmail.com>
References: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>
 <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>
 <CAEGA6kbYiHBkrQb5vTcjPmP1J4n146nGQWoonm=8a_Tg1_6tSg@mail.gmail.com>
Message-ID: <6e5d8aec-538e-20c7-a035-b04ff7e8691f@bell-sw.com>

Thank you!

On 27.11.2019 19:06, Stuart Monteith wrote:
> Thanks Boris - looks good to me.
> Please ask me or my fellow Arm engineers if you should need any help
> testing in future.
> 
> On Wed, 27 Nov 2019 at 13:26, Vladimir Ivanov
> <vladimir.x.ivanov at oracle.com> wrote:
>>
>> The fix looks good and trivial.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 27.11.2019 15:55, Boris Ulasevich wrote:
>>> Hi,
>>>
>>> Please review the fix in aarch64.ad to address the build issue "Ideal
>>> node missing: CmpOp" raised after recent change in C2. The intuitive
>>> operand name case correction CmpOp->cmpOp fixes the build, but leads to
>>> unworkable jvm. Removing the match rule works good and jdk/hotspot tests
>>> are Ok.
>>>
>>> http://bugs.openjdk.java.net/browse/JDK-8234891
>>> http://cr.openjdk.java.net/~bulasevich/8234891/webrev.00
>>>
>>> ARM32 build fails too. I will fix the problem in arm32.ad file separately.
>>>
>>> thanks,
>>> Boris

From aph at redhat.com  Thu Nov 28 09:36:24 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 28 Nov 2019 09:36:24 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
Message-ID: <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>

See also "8220351: Cross-modifying code". That scheme is used by other ports
but not AArch64.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Thu Nov 28 10:03:18 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 28 Nov 2019 10:03:18 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
 <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
Message-ID: <52a223ea-13d3-d7ef-72c7-bd0f590c12be@redhat.com>

On 11/28/19 7:50 AM, Nick Gasson wrote:
> Hi Andrew,
> 
>>>
>>> CompressedKlassPointers::base() => 0xffff0b4b5000
>>> CompressedKlassPointers::shift() => 3
>>
>> This is bad. Can you have a look at the allocation code to see why the search
>> for an appropriate address range fails?
> 
> We have a loop in Metaspace::allocate_metaspace_compressed_klass_ptrs 
> that searches for a 4G aligned location for the compressed class space 
> on AArch64, but this search is not done if CDS is in use and the archive 
> was loaded successfully, because in that case the class space has 
> already been mapped (i.e. `metaspace_rs.is_reserved()' is true).

Right. At the time I wrote that code, CDS was not much used by anything, so
I thought of it as a marginal use case.

> Previously it was only possible to map the CDS archive at 0x800000000. 
> The compressed class base is set to the start of this region which 
> happens to be 4G aligned so our MacroAssembler::load_klass optimisation 
> applies and we emit the short code sequence.
> 
> With the recent change in 8231610, if the CDS archive cannot be mapped 
> at that address (e.g. because of ASLR or because the heap is mapped 
> there) then the CDS archive will be relocated to an arbitrary address 
> decided by mmap. That's where the oddly-aligned compressed klass base 
> above comes from. This causes MacroAssembler::load_klass to emit the 
> inefficient sequence which then overflows the buffer for the itable stub 
> (the worst-case size estimate there is wrong, which needs to be fixed 
> separately).

Correcting the stub size is a minor tidy-up which does not really need
its own Bug ID.

> A minimal way to reproduce this is:
> 
> $ java -XX:HeapBaseMinAddress=33G -Xshare:on -Xlog:cds=debug -version
> ...
> [0.050s][info ][cds] CDS archive was created with max heap size = 128M, 
> and the following configuration:
> [0.050s][info ][cds]     narrow_klass_base = 0x0000fffec7507000, 
> narrow_klass_shift = 3
> ...
> #  guarantee(masm->pc() <= s->code_end()) failed: itable #2: overflowed 
> buffer, estimated len: 180, actual len: 184, overrun: 4
> 
> 
> I suggest we move the 4G-aligned search from 
> allocate_metaspace_compressed_klass_ptrs into its own function that can 
> then be called from MetaspaceShared::reserve_shared_space when 
> requested_address==NULL (i.e. the fallback path when mmap at 0x800000000 
> fails). If you're happy with this I'll make a patch for review?

Yes, that sounds excellent. We really need it to avoid compressed
class pointers becoming an expensive option.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From nick.gasson at arm.com  Thu Nov 28 10:18:38 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Thu, 28 Nov 2019 18:18:38 +0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <52a223ea-13d3-d7ef-72c7-bd0f590c12be@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
 <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
 <52a223ea-13d3-d7ef-72c7-bd0f590c12be@redhat.com>
Message-ID: <044ce397-96c7-8b01-a6ed-d3ea2546749a@arm.com>

On 28/11/2019 18:03, Andrew Haley wrote:
>> (the worst-case size estimate there is wrong, which needs to be fixed
>> separately).
> 
> Correcting the stub size is a minor tidy-up which does not really need
> its own Bug ID.
> 

OK, but I'd like to also try removing the second call to __ load_klass 
in VtableStubs::create_itable_stub as that will shave a few instructions 
even in the normal case. I'll recalculate the size estimate when I do that.


Thanks,
Nick

From aph at redhat.com  Thu Nov 28 10:28:49 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 28 Nov 2019 10:28:49 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <044ce397-96c7-8b01-a6ed-d3ea2546749a@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
 <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
 <52a223ea-13d3-d7ef-72c7-bd0f590c12be@redhat.com>
 <044ce397-96c7-8b01-a6ed-d3ea2546749a@arm.com>
Message-ID: <f1ba5a8e-daaa-5165-44ad-a9ef3c7a72c8@redhat.com>

On 11/28/19 10:18 AM, Nick Gasson wrote:
> OK, but I'd like to also try removing the second call to __ load_klass 
> in VtableStubs::create_itable_stub as that will shave a few instructions 
> even in the normal case. I'll recalculate the size estimate when I do that.

OK. But beware of spending time on things that don't really matter. There's
a risk in making any change.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Thu Nov 28 11:50:16 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Thu, 28 Nov 2019 11:50:16 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Thursday, November 28, 2019 5:36 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
> 
> See also "8220351: Cross-modifying code". That scheme is used by other ports
> but not AArch64.

Thanks for this helpful information.  
BTW: should we change aarch64 to use this scheme too?  

Felix

From adinn at redhat.com  Thu Nov 28 13:32:03 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Thu, 28 Nov 2019 13:32:03 +0000
Subject: [aarch64-port-dev ] RFR(S) 8234891: AArch64: Fix build failure
	after JDK-8234387
In-Reply-To: <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>
References: <847197ac-5969-900c-840a-61c475b19d84@bell-sw.com>
 <abea05bf-b100-83e5-0b33-cc44c16fe42c@oracle.com>
Message-ID: <0960f240-6801-48c2-9664-c7509e90f4a5@redhat.com>

On 27/11/2019 13:23, Vladimir Ivanov wrote:
> The fix looks good and trivial.

Yes, the patch is good. The CmpOp matches are not needed and perhaps
never were.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From adinn at redhat.com  Thu Nov 28 13:43:09 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Thu, 28 Nov 2019 13:43:09 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8233466: aarch64: remove
 unnecessary load of mdo when profiling return and parameters type
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820EDAA21F3E@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820EDAA21F3E@dggeml527-mbx.china.huawei.com>
Message-ID: <d5dd8664-5a6c-0517-f052-5eea02a06990@redhat.com>

Hi Felix,

On 25/11/2019 11:33, Yangfei (Felix) wrote:
> Ping?   Any comments?

Yes, that load into mdp is redundant. x86 omits the load and so should
AArch64. The patch is good.


regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From aph at redhat.com  Thu Nov 28 14:02:28 2019
From: aph at redhat.com (Andrew Haley)
Date: Thu, 28 Nov 2019 14:02:28 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
Message-ID: <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>

On 11/28/19 11:50 AM, Yangfei (Felix) wrote:
>> -----Original Message-----
>> From: Andrew Haley [mailto:aph at redhat.com]
>> Sent: Thursday, November 28, 2019 5:36 PM
>> To: Yangfei (Felix) <felix.yang at huawei.com>;
>> aarch64-port-dev at openjdk.java.net
>> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
>>
>> See also "8220351: Cross-modifying code". That scheme is used by other ports
>> but not AArch64.
> 
> Thanks for this helpful information.  
> BTW: should we change aarch64 to use this scheme too?  

Not unless we have a reason. I had a look and there seemed to be no advantage.

By the way, did you find the source of your original problem?

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From felix.yang at huawei.com  Fri Nov 29 02:41:27 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Fri, 29 Nov 2019 02:41:27 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8233466: aarch64: remove
 unnecessary load of mdo when profiling return and parameters type
In-Reply-To: <d5dd8664-5a6c-0517-f052-5eea02a06990@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820EDAA21F3E@dggeml527-mbx.china.huawei.com>
 <d5dd8664-5a6c-0517-f052-5eea02a06990@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820EDAA24A9E@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Dinn [mailto:adinn at redhat.com]
> Sent: Thursday, November 28, 2019 9:43 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> hotspot-runtime-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
> Subject: Re: RFR(XS): 8233466: aarch64: remove unnecessary load of mdo when
> profiling return and parameters type
> 
> Hi Felix,
> 
> On 25/11/2019 11:33, Yangfei (Felix) wrote:
> > Ping?   Any comments?
> 
> Yes, that load into mdp is redundant. x86 omits the load and so should AArch64.
> The patch is good.
> 
Hi Andrew, 

  Thanks for reviewing.  Pushed: http://hg.openjdk.java.net/jdk/jdk/rev/fc216dcef2bb

Felix

From Pengfei.Li at arm.com  Fri Nov 29 03:41:56 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Fri, 29 Nov 2019 03:41:56 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
 <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>
Message-ID: <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew,

I just caught up with your discussion with Nick.

> I guess. It'd be nicer to fix CDS on AArch64 so that it doesn't cause
> performance regressions.

The 4G alignment search may still fail after the fix. Regarding my second webrev, do you agree with checking the base and shift values in a function in aarch64.ad? See my last email [1] for a detailed explanation.

[1] https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2019-November/008278.html

--
Thanks,
Pengfei


From Pengfei.Li at arm.com  Fri Nov 29 03:56:50 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Fri, 29 Nov 2019 03:56:50 +0000
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for x86_64
	and AArch64
Message-ID: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi,

Please help review this small fix for 64-bit client build.

Webrev: http://cr.openjdk.java.net/~pli/rfr/8234791/webrev.00/
JBS: https://bugs.openjdk.java.net/browse/JDK-8234791

The current 64-bit client VM build fails because errors occur when dumping
the CDS archive. In JDK 12, we enabled "Default CDS Archives"[1], which
runs "java -Xshare:dump" after linking the JDK image. But for a Client VM
build on 64-bit platforms, the ergonomic flag UseCompressedOops is not
set.[2] This causes the VM to exit when checking the flags for dumping the
shared archive.[3]

This change removes the "#if defined" macro to make the shared archive dump
succeed in the 64-bit client build. By tracking the history of the macro,
I found it was initially added as "#ifndef COMPILER1"[4] 10 years ago,
when C1 did not have good support for compressed oops, and was modified to
its current shape[5] in the implementation of tiered compilation. It should
be safe to remove today.

This patch also fixes another client build issue on AArch64.

[1] http://openjdk.java.net/jeps/341
[2] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l1694
[3] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l3551
[4] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/323bd24c6520#l11.7
[5] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/d5d065957597#l86.56

--
Thanks,
Pengfei


From nick.gasson at arm.com  Fri Nov 29 06:40:23 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Fri, 29 Nov 2019 14:40:23 +0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
 <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>
 <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <f1d7906d-9a1e-1c35-5909-02397c651e2b@arm.com>

On 29/11/2019 11:41, Pengfei Li (Arm Technology China) wrote:
> 
> The 4G alignment search may still fail after the fix. Regarding to my second webrev, do you agree that checking the base and shift values in a function in aarch64.ad? See my last email [1] for detail explanations.
> 

How about we exit with a fatal error if we can't find a suitably aligned 
region? Then we can remove the code in decode_klass_non_null that uses 
R27 and this patch is much simpler. That code path is poorly tested at 
the moment so it seems risky to leave it in. With a hard error at least 
users will report it to us so we can fix it.

Thanks,
Nick

From nick.gasson at arm.com  Fri Nov 29 06:56:47 2019
From: nick.gasson at arm.com (Nick Gasson)
Date: Fri, 29 Nov 2019 14:56:47 +0800
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
 <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
Message-ID: <69910cf8-84ee-2048-796f-452d43adaaf9@arm.com>

On 28/11/2019 22:02, Andrew Haley wrote:
>> BTW: should we change aarch64 to use this scheme too?
> 
> Not unless we have a reason. I had a look and there seemed to be no advantage.
> 

I don't think it helps on AArch64: that 
OrderAccess::cross_modifying_fence() is only called when a thread is 
about to return from the safepoint handler. But it's possible for a 
safepoint with code patching to happen in the background while a thread 
is in native code, in which case we still need to do an ISB when 
returning to Java.

I'm not sure how other ports that need a serialising instruction handle 
this?


Thanks,
Nick

From felix.yang at huawei.com  Fri Nov 29 08:06:02 2019
From: felix.yang at huawei.com (Yangfei (Felix))
Date: Fri, 29 Nov 2019 08:06:02 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
 <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
Message-ID: <DA41BE1DDCA941489001C7FBD7A8820EDAA24B0A@dggeml527-mbx.china.huawei.com>

> -----Original Message-----
> From: Andrew Haley [mailto:aph at redhat.com]
> Sent: Thursday, November 28, 2019 10:02 PM
> To: Yangfei (Felix) <felix.yang at huawei.com>;
> aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
> 
> On 11/28/19 11:50 AM, Yangfei (Felix) wrote:
> >> -----Original Message-----
> >> From: Andrew Haley [mailto:aph at redhat.com]
> >> Sent: Thursday, November 28, 2019 5:36 PM
> >> To: Yangfei (Felix) <felix.yang at huawei.com>;
> >> aarch64-port-dev at openjdk.java.net
> >> Subject: Re: [aarch64-port-dev ] Question about ISB usage in the
> >> aarch64 port
> >>
> >> See also "8220351: Cross-modifying code". That scheme is used by
> >> other ports but not AArch64.
> >
> > Thanks for this helpful information.
> > BTW: should we change aarch64 to use this scheme too?
> 
> Not unless we have a reason. I had a look and there seemed to be no
> advantage.
> 
> By the way, did you find the source of your original problem?

Not yet.  It triggers randomly which makes it hard to narrow down the root cause.  
Suggestions are welcome  :-)  


Thanks,
Felix

From adinn at redhat.com  Fri Nov 29 09:19:12 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Fri, 29 Nov 2019 09:19:12 +0000
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for
	x86_64 and AArch64
In-Reply-To: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <3e79b1e4-c548-dc3d-075f-e04e496d3863@redhat.com>

Hi Pengfei,

On 29/11/2019 03:56, Pengfei Li (Arm Technology China) wrote:

> Please help review this small fix for 64-bit client build.
> 
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8234791/webrev.00/
> JBS: https://bugs.openjdk.java.net/browse/JDK-8234791
> 
> Current 64-bit client VM build fails because errors occurred in dumping
> the CDS archive. In JDK 12, we enabled "Default CDS Archives"[1] which
> runs "java -Xshare:dump" after linking the JDK image. But for Client VM
> build on 64-bit platforms, the ergonomic flag UseCompressedOops is not
> set.[2] This leads to VM exits in checking the flags for dumping the
> shared archive.[3]
> 
> This change removes the "#if defined" macro to make shared archive dump
> successful in 64-bit client build. By tracking the history of the macro,
> I found it is initially added as "#ifndef COMPILER1"[4] 10 years ago
> when C1 did not have a good support of compressed oops and modified to
> current shape[5] in the implementation of tiered compilation. It should
> be safe to be removed today.
> 
> This patch also fixes another client build issue on AArch64.
> 
> [1] http://openjdk.java.net/jeps/341
> [2] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l1694
> [3] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l3551
> [4] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/323bd24c6520#l11.7
> [5] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/d5d065957597#l86.56
Your explanation sounds correct and the change to arguments.cpp looks good.

Can you explain why you have modified sharedRuntime_aarch64.cpp to
include nativeInst_aarch64.hpp? I don't see any other change in the
source file that would make this necessary.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From aph at redhat.com  Fri Nov 29 09:53:14 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 29 Nov 2019 09:53:14 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <DA41BE1DDCA941489001C7FBD7A8820EDAA24B0A@dggeml527-mbx.china.huawei.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
 <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24B0A@dggeml527-mbx.china.huawei.com>
Message-ID: <09b127f2-95bc-5d53-9b02-a2e8e23c0deb@redhat.com>

On 11/29/19 8:06 AM, Yangfei (Felix) wrote:
> Not yet.  It triggers randomly which makes it hard to narrow down the root cause.  
> Suggestions are welcome  :-)  

I'd set things up to deoptimize and recompile continually, thrashing
the life out of the code cache. Run many Java threads. If the problem
really is recompilation you'll see it.

That is always my recommendation: if you have a bug to diagnose, do
everything you can to make the bug worse.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Fri Nov 29 09:59:18 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 29 Nov 2019 09:59:18 +0000
Subject: [aarch64-port-dev ] Question about ISB usage in the aarch64 port
In-Reply-To: <69910cf8-84ee-2048-796f-452d43adaaf9@arm.com>
References: <DA41BE1DDCA941489001C7FBD7A8820ED6037478@dggeml527-mbx.china.huawei.com>
 <496dd418-02a5-6566-4f22-76d87f263926@redhat.com>
 <DA41BE1DDCA941489001C7FBD7A8820EDAA24958@dggeml527-mbx.china.huawei.com>
 <d596806f-bec4-dade-cb05-746f295ceefa@redhat.com>
 <69910cf8-84ee-2048-796f-452d43adaaf9@arm.com>
Message-ID: <7ca12202-6b65-9121-9e9d-1f3c6001124f@redhat.com>

On 11/29/19 6:56 AM, Nick Gasson wrote:

> I don't think it helps on AArch64: that 
> OrderAccess::cross_modifying_fence() is only called when a thread is 
> about to return from the safepoint handler. But it's possible for a 
> safepoint with code patching to happen in the background while a thread 
> is in native code, in which case we still need to do an ISB when 
> returning to Java.

Indeed we do.

> I'm not sure how other ports that need a serialising instruction handle 
> this?

PPC requirements are very similar to ours. I would have thought they
already had something before this patch, or they would surely have had
some problems. In any case, I haven't studied the code transitions
that are covered by this patch.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From Pengfei.Li at arm.com  Fri Nov 29 10:01:37 2019
From: Pengfei.Li at arm.com (Pengfei Li (Arm Technology China))
Date: Fri, 29 Nov 2019 10:01:37 +0000
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for
	x86_64 and AArch64
In-Reply-To: <3e79b1e4-c548-dc3d-075f-e04e496d3863@redhat.com>
References: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <3e79b1e4-c548-dc3d-075f-e04e496d3863@redhat.com>
Message-ID: <DB7PR08MB311513D3C81A3BAB1ECABF0A96460@DB7PR08MB3115.eurprd08.prod.outlook.com>

Hi Andrew Dinn,

> Your explanation sounds correct and the change to arguments.cpp looks
> good.
> 
> Can you explain why you have modified sharedRuntime_aarch64.cpp to
> include nativeInst_aarch64.hpp? I don't see any other change in the source
> file that would make this necessary.

Thanks for review. There is another build error below after I fixed arguments.cpp.

For target hotspot_variant-client_libjvm_objs_sharedRuntime_aarch64.o:
....../src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp:2836:22: error: 'NativeInstruction' has not been declared
__ add(r20, r20, NativeInstruction::instruction_size);

We see that sharedRuntime_aarch64.cpp uses NativeInstruction but doesn't include nativeInst_aarch64.hpp.
There is no error in Server VM build because the header file is included indirectly from some C2 file.
But for Client VM build where C2 files are not in, this error occurs.

--
Thanks,
Pengfei


From aph at redhat.com  Fri Nov 29 10:07:38 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 29 Nov 2019 10:07:38 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
 <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>
 <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <8a0ae655-8544-a4fc-7551-d7634ebdaaa8@redhat.com>

On 11/29/19 3:41 AM, Pengfei Li (Arm Technology China) wrote:
> The 4G alignment search may still fail after the fix.

It may, but that is very unlikely.

> Regarding to my second webrev, do you agree that checking the base and shift values in a function in aarch64.ad? See my last email [1] for detail explanations.
> 
> [1] https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2019-November/008278.html

Not really, no. A method should be called exactly once from the code
that does the memory allocation, and then set a flag to be read
thereafter. It is not ideal to do it from the MacroAssembler
constructor, because Assembler instances are created with very high
frequency. I don't understand why you can't simply do what I suggested.

You say

> But we have to do it in Metaspace::set_narrow_klass_base_and_shift()
> where the base and shift are finally determined and introduce new
> code block of "#ifdef AARCH64 #endif" in HotSpot shared code.

So do that, or perhaps introduce an overridable function in
AbstractAssembler which does nothing on other ports. But don't keep
executing the same logic again and again. Once base and shift are set
they never change.
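
A minimal sketch of the compute-once pattern described above, under stated
assumptions: KlassDecodeConfig, Mode and SketchAssembler are hypothetical
names, not HotSpot classes, and the three-way classification is a
simplification rather than the real decode logic. The only point is that
classification runs once, where base and shift are fixed, and each assembler
instance merely reads the stored result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the klass decode configuration; not a HotSpot class.
class KlassDecodeConfig {
 public:
  enum class Mode { ZeroBased, FourGigAligned, NeedsScratchRegister };

  // Intended to be constructed exactly once, by the code that fixes base and shift.
  KlassDecodeConfig(uint64_t base, int shift)
      : _base(base), _shift(shift), _mode(classify(base)) {}

  uint64_t base() const { return _base; }
  int shift() const { return _shift; }
  Mode mode() const { return _mode; }  // cheap read, no recomputation
  bool needs_scratch_register() const {
    return _mode == Mode::NeedsScratchRegister;
  }

 private:
  // Simplified classification (an assumption, not the real HotSpot logic).
  static Mode classify(uint64_t base) {
    const uint64_t four_gig = uint64_t(1) << 32;
    if (base == 0) return Mode::ZeroBased;
    if ((base & (four_gig - 1)) == 0) return Mode::FourGigAligned;
    return Mode::NeedsScratchRegister;
  }

  const uint64_t _base;
  const int _shift;
  const Mode _mode;
};

// Hypothetical assembler: created very often, so its constructor only
// stores a reference and never re-runs the classification.
class SketchAssembler {
 public:
  explicit SketchAssembler(const KlassDecodeConfig& config) : _config(config) {}
  bool decode_needs_scratch_register() const {
    return _config.needs_scratch_register();
  }

 private:
  const KlassDecodeConfig& _config;
};
```

However many SketchAssembler objects are created, classify() has run only
once per configuration.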

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Fri Nov 29 10:10:07 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 29 Nov 2019 10:10:07 +0000
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <f1d7906d-9a1e-1c35-5909-02397c651e2b@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <34081dab-0049-dda7-dead-3b2934f8c41b@arm.com>
 <97c0b309-e911-ff9f-4cea-b6f00bd4962d@redhat.com>
 <DB7PR08MB31153A896B02FC1F39C498C196460@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <f1d7906d-9a1e-1c35-5909-02397c651e2b@arm.com>
Message-ID: <70f2fb20-f72d-07f6-b8ff-7d1e467c3c12@redhat.com>

On 11/29/19 6:40 AM, Nick Gasson wrote:
> On 29/11/2019 11:41, Pengfei Li (Arm Technology China) wrote:
>> The 4G alignment search may still fail after the fix. Regarding to my second webrev, do you agree that checking the base and shift values in a function in aarch64.ad? See my last email [1] for detail explanations.
>>
> How about we exit with a fatal error if we can't find a suitably aligned 
> region? Then we can remove the code in decode_klass_non_null that uses 
> R27 and this patch is much simpler. That code path is poorly tested at 
> the moment so it seems risky to leave it in. With a hard error at least 
> users will report it to us so we can fix it.

That is starting to sound very attractive. With a 64-bit address space I'm
finding it very hard to imagine a scenario in which we don't find a
suitable address. I think AOT-compiled code would still be OK, because it
generates different code, but we'd have to do some testing.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From aph at redhat.com  Fri Nov 29 10:11:33 2019
From: aph at redhat.com (Andrew Haley)
Date: Fri, 29 Nov 2019 10:11:33 +0000
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for
	x86_64 and AArch64
In-Reply-To: <DB7PR08MB311513D3C81A3BAB1ECABF0A96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <3e79b1e4-c548-dc3d-075f-e04e496d3863@redhat.com>
 <DB7PR08MB311513D3C81A3BAB1ECABF0A96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <6af3f9d1-f1e3-7f03-6056-fd0c36af65b7@redhat.com>

On 11/29/19 10:01 AM, Pengfei Li (Arm Technology China) wrote:
> Thanks for review. There is another build error below after I fixed arguments.cpp.
> 
> For target hotspot_variant-client_libjvm_objs_sharedRuntime_aarch64.o:
> ....../src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp:2836:22: error: 'NativeInstruction' has not been declared
> __ add(r20, r20, NativeInstruction::instruction_size);
> 
> We see that sharedRuntime_aarch64.cpp uses NativeInstruction but doesn't include nativeInst_aarch64.hpp.
> There is no error in Server VM build because the header file is included indirectly from some C2 file.
> But for Client VM build where C2 files are not in, this error occurs.

OK.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


From adinn at redhat.com  Fri Nov 29 10:20:40 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Fri, 29 Nov 2019 10:20:40 +0000
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for
	x86_64 and AArch64
In-Reply-To: <DB7PR08MB311513D3C81A3BAB1ECABF0A96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <3e79b1e4-c548-dc3d-075f-e04e496d3863@redhat.com>
 <DB7PR08MB311513D3C81A3BAB1ECABF0A96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <bcea3df2-8a97-ad13-d3d4-a208a50b6330@redhat.com>

Hi Pengfei,

On 29/11/2019 10:01, Pengfei Li (Arm Technology China) wrote:
>> Can you explain why you have modified sharedRuntime_aarch64.cpp to
>> include nativeInst_aarch64.hpp? I don't see any other change in the source
>> file that would make this necessary.
> 
> Thanks for review. There is another build error below after I fixed arguments.cpp.
> 
> For target hotspot_variant-client_libjvm_objs_sharedRuntime_aarch64.o:
> ....../src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp:2836:22: error: 'NativeInstruction' has not been declared
> __ add(r20, r20, NativeInstruction::instruction_size);
> 
> We see that sharedRuntime_aarch64.cpp uses NativeInstruction but doesn't include nativeInst_aarch64.hpp.
> There is no error in Server VM build because the header file is included indirectly from some C2 file.
> But for Client VM build where C2 files are not in, this error occurs.
Ok, in that case the patch is good to push.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From adinn at redhat.com  Fri Nov 29 11:57:01 2019
From: adinn at redhat.com (Andrew Dinn)
Date: Fri, 29 Nov 2019 11:57:01 +0000
Subject: [aarch64-port-dev ] 8233948: AArch64: Incorrect mapping between
 OptoReg and VMReg for high 64 bits of Vector Register
In-Reply-To: <VE1PR08MB4880428B5053CEC070A2483E88700@VE1PR08MB4880.eurprd08.prod.outlook.com>
References: <VE1PR08MB4880832B668EFD8A88419E2188770@VE1PR08MB4880.eurprd08.prod.outlook.com>
 <VE1PR08MB4880227F4AEE8A866F3AB9CD88770@VE1PR08MB4880.eurprd08.prod.outlook.com>
 <VE1PR08MB4880428B5053CEC070A2483E88700@VE1PR08MB4880.eurprd08.prod.outlook.com>
Message-ID: <d6e9d79a-87b7-ed97-1dc3-bef1aaf861a3@redhat.com>

Hi Joshua,

Thanks for looking into this and suggesting the required cleanup.

On 15/11/2019 10:29, Joshua Zhu (Arm Technology China) wrote:
>> Please review the following patch:
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8233948
>> Webrev: http://cr.openjdk.java.net/~jzhu/8233948/webrev.00/
> 
> Please let me know if any comments. Thanks a lot.
I think this is a good start but there is more work to do to clean up
method RegisterSaver::save_live_registers defined in file
sharedRuntime_aarch64.cpp. It would be good to do that clean up as part
of this patch so it is all consistent.

So, the first step is to add a couple of extra enum constants in
FloatRegisterImpl:

 128 class FloatRegisterImpl: public AbstractRegisterImpl {
 129  public:
 130   enum {
 131     number_of_registers = 32,
 132     max_slots_per_register = 4,
         save_slots_per_register = 2,
         extra_save_slots_per_register = 2

The 2 new tags are needed because sharedRuntime_aarch64.cpp normally
only saves 2 slots per register but it occasionally needs to save all 4.

The first bit of code in sharedRuntime_aarch64.cpp that needs fixing is
this enum:

 100   enum layout {
 101                 fpu_state_off = 0,
 102                 fpu_state_end = fpu_state_off+FPUStateSizeInWords-1,
 103                 // The frame sender code expects that rfp will be in
 104                 // the "natural" place and will override any oopMap
 105                 // setting for it. We must therefore force the layout
 106                 // so that it agrees with the frame sender code.
 107                 r0_off = fpu_state_off+FPUStateSizeInWords,
 108                 rfp_off = r0_off + 30 * 2,
 109                 return_off = rfp_off + 2,      // slot for return address
 110                 reg_save_size = return_off + 2};

This information defines the layout of the data normally saved to stack
(i.e. 2 slots per fp reg). These values should really be computed using
the enum values you added to the definitions for RegisterImpl and
FloatRegisterImpl.

FPUStateSizeInWords is actually defined in assembler.hpp.  It doesn't
really need to be there but we put it there to follow the logic for x86
where the amount of saved state is more complicated. The AArch64
definition at assembler.hpp:607 is this:

 607   const int FPUStateSizeInWords = 32 * 2;

So, that can now be redefined as

 607   const int FPUStateSizeInWords =
           FloatRegisterImpl::number_of_registers *
           FloatRegisterImpl::save_slots_per_register;

We then need to redefine the code at lines 108 - 110 to use the enum values:

 108                 rfp_off = r0_off +
                       (RegisterImpl::number_of_registers - 2) *
                       RegisterImpl::max_slots_per_register,
 109                 return_off = rfp_off +
                       RegisterImpl::max_slots_per_register, // slot for return address
 110                 reg_save_size = return_off +
                       RegisterImpl::max_slots_per_register};

Finally, we can edit method save_live_registers at the point where it
allows space for the extra vector register content. That needs to be
updated to use the relevant constants:

 116   if (save_vectors) {
 117     // Save upper half of vector registers
 118     int vect_words = FloatRegisterImpl::number_of_registers *
                          FloatRegisterImpl::extra_save_slots_per_register;
 119     additional_frame_words += vect_words;

Could you prepare a new webrev with these extra changes in and check it
is ok?
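
One way to sanity-check the redefinitions of FPUStateSizeInWords and the
layout enum is a standalone sketch that rebuilds them from the enum constants
and asserts, at compile time, that they still equal the literals they replace.
The two structs below are cut-down stand-ins, not the HotSpot classes, and the
RegisterImpl values (32 registers, 2 slots per register) are assumptions that
are not stated in this thread.

```cpp
#include <cassert>

// Cut-down stand-in; both values are assumptions for this sketch.
struct RegisterImpl {
  enum {
    number_of_registers = 32,
    max_slots_per_register = 2
  };
};

// Cut-down stand-in using the constants proposed above.
struct FloatRegisterImpl {
  enum {
    number_of_registers = 32,
    max_slots_per_register = 4,
    save_slots_per_register = 2,
    extra_save_slots_per_register = 2
  };
};

const int FPUStateSizeInWords =
    FloatRegisterImpl::number_of_registers *
    FloatRegisterImpl::save_slots_per_register;

enum layout {
  fpu_state_off = 0,
  fpu_state_end = fpu_state_off + FPUStateSizeInWords - 1,
  r0_off = fpu_state_off + FPUStateSizeInWords,
  rfp_off = r0_off +
            (RegisterImpl::number_of_registers - 2) *
            RegisterImpl::max_slots_per_register,
  return_off = rfp_off + RegisterImpl::max_slots_per_register,
  reg_save_size = return_off + RegisterImpl::max_slots_per_register
};

// The symbolic definitions must match the literals they replace.
static_assert(FPUStateSizeInWords == 32 * 2, "FPU state size changed");
static_assert(rfp_off == r0_off + 30 * 2, "rfp_off changed");
static_assert(return_off == rfp_off + 2, "return_off changed");
static_assert(reg_save_size == return_off + 2, "reg_save_size changed");
```

If the assumed RegisterImpl values differ from the real ones, the
static_asserts fail at compile time instead of silently changing the frame
layout.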

Also, could you report what testing you did before and after your change
(other than checking the dump output). You will probably need to repeat
it to ensure these extra changes are ok.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill


From ioi.lam at oracle.com  Thu Nov 28 08:19:36 2019
From: ioi.lam at oracle.com (Ioi Lam)
Date: Thu, 28 Nov 2019 00:19:36 -0800
Subject: [aarch64-port-dev ] RFR(M): 8233743: AArch64: Make r27
 conditionally allocatable
In-Reply-To: <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
References: <DB7PR08MB3115B98F12ACC996FE0EE4FA96760@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <93b1dff3-b760-e305-1422-d21178e9ed75@redhat.com>
 <DB7PR08MB311558A967C30C11C10CAADE96700@DB7PR08MB3115.eurprd08.prod.outlook.com>
 <28d32c06-9a8e-6f4b-8425-3f07a4fbfe82@redhat.com>
 <cb73db58-b00b-2e12-d64f-ae7671aa7cbe@arm.com>
 <6a7e7eb1-f3a5-d361-45ab-a4b7004683ec@redhat.com>
 <60b30e79-78e6-cda9-a813-f4f1a58f7501@arm.com>
Message-ID: <f7aa728b-a540-67b9-39d3-8a39a3318a09@oracle.com>


On 11/27/19 11:50 PM, Nick Gasson wrote:
> Hi Andrew,
>
>>>
>>> CompressedKlassPointers::base() => 0xffff0b4b5000
>>> CompressedKlassPointers::shift() => 3
>>
>> This is bad. Can you have a look at the allocation code to see why 
>> the search
>> for an appropriate address range fails?
>>
>
> We have a loop in Metaspace::allocate_metaspace_compressed_klass_ptrs 
> that searches for a 4G aligned location for the compressed class space 
> on AArch64, but this search is not done if CDS is in use and the 
> archive was loaded successfully, because in that case the class space 
> has already been mapped (i.e. `metaspace_rs.is_reserved()' is true).
>
> Previously it was only possible to map the CDS archive at 0x800000000. 
> The compressed class base is set to the start of this region which 
> happens to be 4G aligned so our MacroAssembler::load_klass 
> optimisation applies and we emit the short code sequence.
>
> With the recent change in 8231610, if the CDS archive cannot be mapped 
> at that address (e.g. because of ASLR or because the heap is mapped 
> there) then the CDS archive will be relocated to an arbitrary address 
> decided by mmap. That's where the oddly-aligned compressed klass base 
> above comes from. This causes MacroAssembler::load_klass to emit the 
> inefficient sequence which then overflows the buffer for the itable 
> stub (the worst-case size estimate there is wrong, which needs to be 
> fixed separately).
>
> A minimal way to reproduce this is:
>
> $ java -XX:HeapBaseMinAddress=33G -Xshare:on -Xlog:cds=debug -version
> ...
> [0.050s][info ][cds] CDS archive was created with max heap size = 
> 128M, and the following configuration:
> [0.050s][info ][cds]     narrow_klass_base = 0x0000fffec7507000, 
> narrow_klass_shift = 3
> ...
> #  guarantee(masm->pc() <= s->code_end()) failed: itable #2: 
> overflowed buffer, estimated len: 180, actual len: 184, overrun: 4
>
>
> I suggest we move the 4G-aligned search from 
> allocate_metaspace_compressed_klass_ptrs into its own function that 
> can then be called from MetaspaceShared::reserve_shared_space when 
> requested_address==NULL (i.e. the fallback path when mmap at 
> 0x800000000 fails). If you're happy with this I'll make a patch for 
> review?
>

You can also force CDS archive relocation with 
-XX:+UnlockDiagnosticVMOptions -XX:ArchiveRelocationMode=1. That way you 
can test the behavior with the default heap settings.
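
For reference, the 4G-alignment property that Nick's description turns on can
be sketched as below. The helper names are hypothetical and this is not the
HotSpot search loop; it only shows the check, applied to the addresses quoted
in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant name: 4G = 2^32 bytes.
const uint64_t kFourGig = uint64_t(1) << 32;

// True if the low 32 bits of the address are all zero.
inline bool is_4g_aligned(uint64_t addr) {
  return (addr & (kFourGig - 1)) == 0;
}

// Round up to the next 4G boundary (returns addr if already aligned).
// Overflow near the top of the 64-bit range is ignored in this sketch.
inline uint64_t align_up_4g(uint64_t addr) {
  return (addr + kFourGig - 1) & ~(kFourGig - 1);
}
```

Under this check 0x800000000 is 4G-aligned, while the relocated bases
0xffff0b4b5000 and 0x0000fffec7507000 are not.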

Thanks
- Ioi


>
> Thanks,
> Nick


From ioi.lam at oracle.com  Sat Nov 30 01:02:29 2019
From: ioi.lam at oracle.com (Ioi Lam)
Date: Fri, 29 Nov 2019 17:02:29 -0800
Subject: [aarch64-port-dev ] RFR(S): 8234791: Fix Client VM build for
	x86_64 and AArch64
In-Reply-To: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
References: <DB7PR08MB3115121B5DBB949E8A96A2FF96460@DB7PR08MB3115.eurprd08.prod.outlook.com>
Message-ID: <0b33fa95-ff15-b628-1891-f990f239e60f@oracle.com>

Hi Pengfei,

I have cc-ed hotspot-compiler-dev at openjdk.java.net.

Please do not push the patch until someone from hotspot-compiler-dev has 
looked at it.

Many people are away due to Thanksgiving in the US.

Thanks
- Ioi

On 11/28/19 7:56 PM, Pengfei Li (Arm Technology China) wrote:
> Hi,
>
> Please help review this small fix for 64-bit client build.
>
> Webrev: http://cr.openjdk.java.net/~pli/rfr/8234791/webrev.00/
> JBS: https://bugs.openjdk.java.net/browse/JDK-8234791
>
> Current 64-bit client VM build fails because errors occurred in dumping
> the CDS archive. In JDK 12, we enabled "Default CDS Archives"[1] which
> runs "java -Xshare:dump" after linking the JDK image. But for Client VM
> build on 64-bit platforms, the ergonomic flag UseCompressedOops is not
> set.[2] This leads to VM exits in checking the flags for dumping the
> shared archive.[3]
>
> This change removes the "#if defined" macro to make shared archive dump
> successful in 64-bit client build. By tracking the history of the macro,
> I found it is initially added as "#ifndef COMPILER1"[4] 10 years ago
> when C1 did not have a good support of compressed oops and modified to
> current shape[5] in the implementation of tiered compilation. It should
> be safe to be removed today.
>
> This patch also fixes another client build issue on AArch64.
>
> [1] http://openjdk.java.net/jeps/341
> [2] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l1694
> [3] http://hg.openjdk.java.net/jdk/jdk/file/981a55672786/src/hotspot/share/runtime/arguments.cpp#l3551
> [4] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/323bd24c6520#l11.7
> [5] http://hg.openjdk.java.net/jdk8/jdk8/hotspot/rev/d5d065957597#l86.56
>
> --
> Thanks,
> Pengfei
>