[9] RFR(S): 6941938: Improve array equals intrinsic on SPARC

Tobias Hartmann tobias.hartmann at oracle.com
Tue Apr 19 12:35:43 UTC 2016


Hi,

please review the following enhancement:
https://bugs.openjdk.java.net/browse/JDK-6941938

MacroAssembler::array_equals() currently emits a 4-byte array comparison loop on SPARC but we can do a 8-byte comparison if the array addresses are doubleword aligned. The intrinsic is used for byte/char Arrays.equals() and String.equals().

I added the method MacroAssembler::array_equals_loop() that loads and compares array elements of size 'byte_width' until the elements are not equal or we reached the end of the arrays. Instead of first comparing trailing characters if the array sizes are not a multiple of 'byte_width', we simply read over the end of the arrays, bail out and compare the remaining elements by shifting out the garbage bits.

We use a width of 8 bytes if the array addresses are doubleword aligned, or fall back to 4 bytes if they are only word aligned (which is always guaranteed). I also refactored MacroAssembler::string_compare() to use load_sized_value().

Performance evaluation showed an improvement of up to 65% with the microbenchmarks on a SPARC M7. I used a slightly modified version of the "EqualsBench" [1] from Aleksey's compact string microbenchmark suite [2] that exercises the intrinsic on two equal/unequal char/byte arrays. Larger benchmarks (SPECjbb, SPECjvm) show no significant difference in performance.

I evaluated the following three versions of the patch.

-- Basic --
http://cr.openjdk.java.net/~thartmann/6941938/webrev.basic/
The basic version implements the algorithm described above. Microbenchmark evaluation [3] shows a great improvement for all array sizes except for size 16 where the performance suddenly drops: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_basic.png

I figured out that this is due to caching. It seems that with 8 byte steps, the processor notices later in the loop that we are going to access the subsequent array elements, causing the reads to trigger costly cache misses that degrade overall performance. With 4 byte reads, fetching is triggered earlier and the subsequent reads are very fast. I executed the same benchmark on a SPARC T4 and there is no performance problem. Version "prefetching" tries to improve this.

There is also a regression for sizes 1 and 2 if the arrays are not equal because the baseline version explicitly checks those sizes before the loop: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_basic.png
Version "small" tries to improve this.

-- Prefetching --
http://cr.openjdk.java.net/~thartmann/6941938/webrev.prefetch/
This version adds explicit prefetching before the loop to make sure the first array elements are always in the cache. This helps to avoid the cache misses with 16 bytes: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_prefetch.png

However, prefetching negatively affects overall performance in the other cases (see [4]). Zoltan suggested that this may be because software prefetching temporarily disables or affects the hardware prefetcher. I also experimented with adding incremental prefetching to the loop body but this does not improve performance.

-- Small --
http://cr.openjdk.java.net/~thartmann/6941938/webrev.small/
This version includes special handling of arrays of size <= 4 by emitting a single load without a loop and the corresponding checks. Performance evaluation showed that this does not pay off because the additional check induces an overhead for small arrays > 4 elements (see sheet "small arrays").

The numbers can be found here:
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/microbenchmark_results_6941938.xlsx

I would prefer to use the basic version because explicit prefetching is highly dependent on the processor and behavior may change. I also noticed that we don't use prefetching for arraycopy and other intrinsics on SPARC. 

What do you think?

Tested with all hotspot tests on RBT (-Xmixed/-Xcomp). Links are attached to the bug.

Thanks,
Tobias

[1] https://bugs.openjdk.java.net/secure/attachment/58697/EqualsBench.java
[2] http://cr.openjdk.java.net/~shade/density/string-density-bench.zip
[3] Microbenchmark results for the "basic" implementation
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_equal_basic.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_basic.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_basic.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_basic.png
[4] Microbenchmark results for the "prefetching" implementation
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_equal_prefetch.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_prefetch.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_prefetch.png
http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_prefetch.png


More information about the hotspot-compiler-dev mailing list