[9] RFR(S): 6941938: Improve array equals intrinsic on SPARC

Wed Apr 27 07:48:24 UTC 2016

I think next should be greaterEqual (branch from loop when limit+8 == 0) to avoid reading 8 bytes beyond array.

+  // Bail out if we reached the end (but still do the comparison)
+  br(Assembler::positive, false, Assembler::pn, Lremaining);

An other trick you can is use xorcc instead of cmp.
Then you need only one srlx and compare it with zero (may be use cmp_zero_and_br()).

Thanks,
Vladimir

On 4/26/16 4:25 AM, Tobias Hartmann wrote:
> Thanks, Chris!
>
> On 25.04.2016 21:29, Christian Thalinger wrote:
>>
>>> On Apr 25, 2016, at 4:11 AM, Tobias Hartmann <tobias.hartmann at oracle.com <mailto:tobias.hartmann at oracle.com>> wrote:
>>>
>>> Hi Mikael,
>>>
>>> On 25.04.2016 09:59, Mikael Gerdin wrote:
>>>> Hi Tobias
>>>>
>>>> On 2016-04-25 08:38, Tobias Hartmann wrote:
>>>>> Hi,
>>>>>
>>>>> I executed a complete hs-comp PIT (all hotspot tests with -Xcomp/-Xmixed) with the alignment checks mentioned below and it turned out that the arrays are always 8-byte aligned (results are attached to the bug).
>>>>
>>>> I think that if you disable compressed klass pointers (or compressed oops) then I suspect that array contents are not always 8 byte aligned.
>>>>
>>>> With compressed oops enabled the mark word + compressed class + alength
>>>> add up to 16 bytes and thus the array contents are guarateed to be 8 byte aligned.
>>>> If compressed klass pointers are enabled then the array header will become 20 bytes and I don't know how the contents are laid out in that case.
>>>
>>> Thanks for the suggestion! I've run some tests (-testset hotspot) with "-XX:-UseCompressedOops -XX:-UseCompressedClassPointers" but it seems that the arrays are still always 8-byte aligned.
>>
>> See arrayOop.hpp:
>>
>>   // Returns the offset of the first element.
>>   static int base_offset_in_bytes(BasicType type) {
>>     return header_size(type) * HeapWordSize;
>>   }
>
> I verified that Mikael's claim is right:
> With UseCompressedClassPointers the header size is 16 bytes: mark oop (8) + compressed klass (4) + length (4).
> Without UseCompressedClassPointers the header size is 20 bytes: mark oop (8) + klass (8) + length (4).
>
> However, the header size is always aligned to HeapWordSize:
>
> static int header_size_in_bytes() {
>   size_t hs = align_size_up(length_offset_in_bytes() + sizeof(int), HeapWordSize);
>
> and therefore without UseCompressedClassPointers the header size is actually 24 bytes. On 64 bit systems, the first array element is always 8-byte aligned.
>
> Since we don't support 32-bit SPARC anymore, I wonder if it's okay to just remove the alignment check completely? This would simplify the code a lot (we don't need the "array_equals_loop" method) and improve performance:
> http://cr.openjdk.java.net/~thartmann/6941938/webrev.basic.01/
>
> Thanks,
> Tobias
>
>>>
>>> Best regards,
>>> Tobias
>>>
>>>>
>>>> /Mikael
>>>>
>>>>>
>>>>> If there are no objections, I would like to push the basic version:
>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.basic/
>>>>>
>>>>> If unaligned performance turns out to be a problem, we can still improve the intrinsic.
>>>>>
>>>>> Thanks,
>>>>> Tobias
>>>>>
>>>>> On 20.04.2016 15:31, Tobias Hartmann wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> On 20.04.2016 03:46, John Rose wrote:
>>>>>>> So I started looking at your code and my inner SPARC junkie took over.
>>>>>>>
>>>>>>> This is what happened:
>>>>>>>   http://cr.openjdk.java.net/~jrose/draft/sparc/6941938/
>>>>>>
>>>>>> Thanks a lot for having a look!
>>>>>>
>>>>>>> Perhaps there are some ideas that might be helpful:
>>>>>>> - The rampdown logic can lose a couple of instructions by using xorcc and movr.
>>>>>>
>>>>>> Right, this simplifies the code a bit:
>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.00/
>>>>>>
>>>>>> I did some experiments but surprisingly it seems that this does not improve but slightly degrade performance. See benchmark results on page "webrev.00" of [1]. Any idea why that is?
>>>>>>
>>>>>>> - Perhaps the 32-bit version only makes sense for sizes of 16 bytes or less?
>>>>>>
>>>>>> I tried this already with an explicit check and branch (see webrev.small [2]) and the benchmarks showed a regression for small array sizes because of the additional check. I also evaluated the "shared check" you proposed:
>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.01
>>>>>>
>>>>>> Unfortunately, this leads to a regression as well. See page "webrev.01" of [1].
>>>>>>
>>>>>>> - It's possible to work with 64-bit loads in more cases (both-odd and one-odd).
>>>>>>
>>>>>> Yes, I thought about this when implementing the intrinsic but assumed that those cases are rare and it's sufficient to fall back to the 4-byte loop. I added runtime checks for misalignment [3] to the intrinsic and executed some tests (Microbenchmarks, JPRT and Nashorn + Octane). It seems that the arrays are always 8 byte aligned and misalignment is really rare. I would therefore like to avoid the additional complexity of the skewed loop.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Thanks,
>>>>>> Tobias
>>>>>>
>>>>>> [1] http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/microbenchmark_results_6941938_T4.xlsx
>>>>>> [2] http://cr.openjdk.java.net/~thartmann/6941938/webrev.small/
>>>>>> [3] Runtime alignment checks:
>>>>>>
>>>>>> bind(Lunaligned);
>>>>>> Label next;
>>>>>> xor3(ary1, ary2, tmp);
>>>>>> and3(tmp, 7, tmp);
>>>>>> br_null_short(tmp, Assembler::pn, next);
>>>>>> STOP("One array is unaligned!");
>>>>>> should_not_reach_here();
>>>>>> bind(next);
>>>>>> STOP("Both arrays are unaligned!");
>>>>>>
>>>>>>> On the other hand, what you wrote is nice and simple.
>>>>>>>
>>>>>>> HTH
>>>>>>> — John
>>>>>>>
>>>>>>> P.S. On the otherest hand, I wish we could cover the mismatch intrinsic, and more
>>>>>>> versions of misalignment, still with vectorization, as with the arraycopy stubs.
>>>>>>> But that's neither nice nor simple.
>>>>>>>
>>>>>>>> On Apr 19, 2016, at 10:13 AM, Tobias Hartmann <tobias.hartmann at oracle.com <mailto:tobias.hartmann at oracle.com>> wrote:
>>>>>>>>
>>>>>>>> Thanks, Vladimir!
>>>>>>>>
>>>>>>>> On 19.04.2016 18:06, Vladimir Kozlov wrote:
>>>>>>>>> Very good. Go with basic. We can do SPU special improvements later if needed.
>>>>>>>>
>>>>>>>> Okay, I'll push the basic version.
>>>>>>>>
>>>>>>>> For reference, here are the results on a SPARC T4:
>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_equal_basic_T4.png
>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_basic_T4.png
>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_basic_T4.png
>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_basic_T4.png
>>>>>>>>
>>>>>>>>> "I also noticed that we don't use prefetching for arraycopy and other intrinsics on SPARC."
>>>>>>>>> We do have arraycopy code for it but by default we don't use it:
>>>>>>>>>  product(uintx,  ArraycopySrcPrefetchDistance, 0,
>>>>>>>>>  product(uintx,  ArraycopyDstPrefetchDistance, 0,
>>>>>>>>>
>>>>>>>>> Experiments back then did not show improvement on JBB benchmarks but some workloads may have benefit. that is why we keep the code.
>>>>>>>>
>>>>>>>> Right but currently it's not possible to enable prefetching, because "ArraycopySrcPrefetchDistanceConstraintFunc" enforces the prefetch distance to be 0:
>>>>>>>>
>>>>>>>> java -XX:ArraycopySrcPrefetchDistance=42 -version
>>>>>>>> ArraycopySrcPrefetchDistance (42) must be 0
>>>>>>>> Error: Could not create the Java Virtual Machine.
>>>>>>>> Error: A fatal exception has occurred. Program will exit
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tobias
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Vladimir
>>>>>>>>>
>>>>>>>>> On 4/19/16 5:35 AM, Tobias Hartmann wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> please review the following enhancement:
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-6941938
>>>>>>>>>>
>>>>>>>>>> MacroAssembler::array_equals() currently emits a 4-byte array comparison loop on SPARC but we can do a 8-byte comparison if the array addresses are doubleword aligned. The intrinsic is used for byte/char Arrays.equals() and String.equals().
>>>>>>>>>>
>>>>>>>>>> I added the method MacroAssembler::array_equals_loop() that loads and compares array elements of size 'byte_width' until the elements are not equal or we reached the end of the arrays. Instead of first comparing trailing characters if the array sizes are not a multiple of 'byte_width', we simply read over the end of the arrays, bail out and compare the remaining elements by shifting out the garbage bits.
>>>>>>>>>>
>>>>>>>>>> We use a width of 8 bytes if the array addresses are doubleword aligned, or fall back to 4 bytes if they are only word aligned (which is always guaranteed). I also refactored MacroAssembler::string_compare() to use load_sized_value().
>>>>>>>>>>
>>>>>>>>>> Performance evaluation showed an improvement of up to 65% with the microbenchmarks on a SPARC M7. I used a slightly modified version of the "EqualsBench" [1] from Aleksey's compact string microbenchmark suite [2] that exercises the intrinsic on two equal/unequal char/byte arrays. Larger benchmarks (SPECjbb, SPECjvm) show no significant difference in performance.
>>>>>>>>>>
>>>>>>>>>> I evaluated the following three versions of the patch.
>>>>>>>>>>
>>>>>>>>>> -- Basic --
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.basic/
>>>>>>>>>> The basic version implements the algorithm described above. Microbenchmark evaluation [3] shows a great improvement for all array sizes except for size 16 where the performance suddenly drops: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_basic.png
>>>>>>>>>>
>>>>>>>>>> I figured out that this is due to caching. It seems that with 8 byte steps, the processor notices later in the loop that we are going to access the subsequent array elements, causing the reads to trigger costly cache misses that degrade overall performance. With 4 byte reads, fetching is triggered earlier and the subsequent reads are very fast. I executed the same benchmark on a SPARC T4 and there is no performance problem. Version "prefetching" tries to improve this.
>>>>>>>>>>
>>>>>>>>>> There is also a regression for sizes 1 and 2 if the arrays are not equal because the baseline version explicitly checks those sizes before the loop: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_basic.png
>>>>>>>>>> Version "small" tries to improve this.
>>>>>>>>>>
>>>>>>>>>> -- Prefetching --
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.prefetch/
>>>>>>>>>> This version adds explicit prefetching before the loop to make sure the first array elements are always in the cache. This helps to avoid the cache misses with 16 bytes: http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_prefetch.png
>>>>>>>>>>
>>>>>>>>>> However, prefetching negatively affects overall performance in the other cases (see [4]). Zoltan suggested that this may be because software prefetching temporarily disables or affects the hardware prefetcher. I also experimented with adding incremental prefetching to the loop body but this does not improve performance.
>>>>>>>>>>
>>>>>>>>>> -- Small --
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/webrev.small/
>>>>>>>>>> This version includes special handling of arrays of size <= 4 by emitting a single load without a loop and the corresponding checks. Performance evaluation showed that this does not pay off because the additional check induces an overhead for small arrays > 4 elements (see sheet "small arrays").
>>>>>>>>>>
>>>>>>>>>> The numbers can be found here:
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/microbenchmark_results_6941938.xlsx
>>>>>>>>>>
>>>>>>>>>> I would prefer to use the basic version because explicit prefetching is highly dependent on the processor and behavior may change. I also noticed that we don't use prefetching for arraycopy and other intrinsics on SPARC.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> Tested with all hotspot tests on RBT (-Xmixed/-Xcomp). Links are attached to the bug.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Tobias
>>>>>>>>>>
>>>>>>>>>> [1] https://bugs.openjdk.java.net/secure/attachment/58697/EqualsBench.java
>>>>>>>>>> [2] http://cr.openjdk.java.net/~shade/density/string-density-bench.zip
>>>>>>>>>> [3] Microbenchmark results for the "basic" implementation
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_equal_basic.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_basic.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_basic.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_basic.png
>>>>>>>>>> [4] Microbenchmark results for the "prefetching" implementation
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_equal_prefetch.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/byte_unequal_prefetch.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_equal_prefetch.png
>>>>>>>>>> http://cr.openjdk.java.net/~thartmann/6941938/benchmarks/char_unequal_prefetch.png
>>