RFR: 8302850: Consider implementing a C1 clone intrinsic that uses ArrayCopyNode for primitive arrays

Claes Redestad redestad at openjdk.org
Thu Feb 1 11:29:02 UTC 2024


On Thu, 1 Feb 2024 05:53:23 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

> Adding C1 intrinsic for primitive array clone invocations for aarch64 and x86 architectures.
> 
> The intrinsic includes a change to avoid zeroing the newly allocated array because its contents are copied over within the same intrinsic with arraycopy. This means that the performance of primitive array clone exceeds that of primitive array copy. As an example, here are the microbenchmark results on darwin/aarch64:
> 
> 
> $ make test TEST="micro:java.lang.ArrayClone" MICRO="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
> Benchmark                 (size)  Mode  Cnt    Score    Error  Units
> ArrayClone.byteArraycopy       0  avgt   15    3.476 ±  0.018  ns/op
> ArrayClone.byteArraycopy      10  avgt   15    3.740 ±  0.017  ns/op
> ArrayClone.byteArraycopy     100  avgt   15    7.124 ±  0.010  ns/op
> ArrayClone.byteArraycopy    1000  avgt   15   39.301 ±  0.106  ns/op
> ArrayClone.byteClone           0  avgt   15    3.478 ±  0.008  ns/op
> ArrayClone.byteClone          10  avgt   15    3.562 ±  0.007  ns/op
> ArrayClone.byteClone         100  avgt   15    5.888 ±  0.206  ns/op
> ArrayClone.byteClone        1000  avgt   15   25.762 ±  0.203  ns/op
> ArrayClone.intArraycopy        0  avgt   15    3.199 ±  0.016  ns/op
> ArrayClone.intArraycopy       10  avgt   15    4.521 ±  0.008  ns/op
> ArrayClone.intArraycopy      100  avgt   15   17.429 ±  0.039  ns/op
> ArrayClone.intArraycopy     1000  avgt   15  178.432 ±  0.777  ns/op
> ArrayClone.intClone            0  avgt   15    3.406 ±  0.016  ns/op
> ArrayClone.intClone           10  avgt   15    4.272 ±  0.006  ns/op
> ArrayClone.intClone          100  avgt   15   13.110 ±  0.122  ns/op
> ArrayClone.intClone         1000  avgt   15  113.196 ± 13.400  ns/op
> 
> 
> It also includes an optimization to avoid instantiating the array copy stub in scenarios like this.
> 
> I ran the hotspot compiler tests successfully, limited to C1 compilation, on darwin/aarch64, linux/x86_64 and linux/686. E.g.
> 
> 
> $ make test TEST="hotspot_compiler" JTREG="JAVA_OPTIONS=-XX:TieredStopAtLevel=1"
> ...
>    TEST                                              TOTAL  PASS  FAIL ERROR
>    jtreg:test/hotspot/jtreg:hotspot_compiler          1234  1234     0     0
> 
> 
> One question I had is what to do about non-primitive object arrays, see my [question](https://bugs.openjdk.org/browse/JDK-8302850?focusedId=14634879&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14634879) on the issue. @cl4es any thoughts?
> 
> Thanks @rwestrel for his help shaping this up :)

This looks very much like what I had in mind: implement a C1 clone intrinsic by reusing existing arraycopy code as much as possible, without going to too much trouble to do so. While the benefits will typically be limited to startup/warmup, OpenJDK itself and many others run short-running processes with `-XX:TieredStopAtLevel=1`, where this might result in a significant overall speed-up. Nice work!

Sorry for erroneously referencing C2's `ArrayCopyNode` in the RFE summary. Feel free to rephrase it as "Implement C1 clone intrinsic that reuses arraycopy code for primitive arrays".

Regarding your question on object arrays, I have no strong opinion, except that if it's much more work then it's probably not worth it. Object array clone is less used than the primitive clones, e.g. `java.util.Arrays.copyOfRange(U[], int, int, Class<? extends T[]>)` does not use `clone`, unlike its primitive array relatives. Recent changes to such APIs made use of primitive array clone more prominent, which in turn made the C1 cost difference between clone and arraycopy more apparent.
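
For readers skimming the thread, here is a minimal sketch of the two copy idioms being compared. It is illustration only: the class and method names are hypothetical, and this is not the source of the `ArrayClone` microbenchmark. The point is that with `clone` the allocation and the copy form a single operation, so an intrinsic can skip zeroing the new array, whereas the explicit form allocates a zeroed array first and then overwrites it.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the PR or the JDK benchmarks.
public class CloneVsArraycopy {

    // Allocation and copy are expressed as one operation.
    static int[] copyWithClone(int[] src) {
        return src.clone();
    }

    // Allocation (zeroed by the language semantics) followed by an explicit copy.
    static int[] copyWithArraycopy(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        int[] a = copyWithClone(src);
        int[] b = copyWithArraycopy(src);
        // Both copies have the same contents as the source...
        System.out.println(Arrays.equals(a, src) && Arrays.equals(b, src));
        // ...and both are distinct arrays, not aliases of the source.
        System.out.println(a != src && b != src);
    }
}
```

Both methods return equal, independent copies; any performance difference between them depends on the JIT tier and platform, as in the numbers quoted above.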

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17667#issuecomment-1921104618


More information about the hotspot-compiler-dev mailing list