Hooking up the array mismatch stub as an intrinsic in the template interpreter

Fri Apr 15 15:06:27 UTC 2016

> On 15 Apr 2016, at 15:39, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> 
> 
> On 4/15/16 4:29 PM, Vladimir Ivanov wrote:
>> An idea for how to avoid interpreter changes.
>> 
>> The interpreter can't benefit from "intrinsifiable" methods directly, but if
>> you create a wrapper and call it instead [1], the JIT compilers can take
>> care of stand-alone versions for you. The interpreter will work with
>> them as if they were ordinary Java methods.
> 
> ... or even add such logic directly into the JVM: for methods marked w/ @HotSpotIntrinsicCandidate (or better with some new annotation, since most intrinsics depend on the context they are invoked in) create an intrinsified stand-alone version.
> 

Very interesting, some good lateral thinking here!

Thanks,
Paul.
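
For readers following the wrapper idea in [1] below, here is a minimal
sketch of the shape of that change in plain Java. It is not the JDK code:
WrapperSketch, mismatch and mismatch0 are made-up names, the JDK-internal
annotations (@ForceInline, @HotSpotIntrinsicCandidate) appear only in
comments because they are not usable outside java.base, and the return
value is simplified to -1 for "no mismatch" rather than the convention the
real vectorizedMismatch uses.

```java
// Hypothetical stand-in for the patched ArraysSupport; all names are illustrative.
final class WrapperSketch {

    // Plays the role of the @ForceInline wrapper in the quoted patch:
    // callers use this method, which only delegates.
    static int mismatch(byte[] a, byte[] b, int length) {
        return mismatch0(a, b, length);
    }

    // Plays the role of the private @HotSpotIntrinsicCandidate target.
    // Simplified scalar loop: returns the first differing index, or -1.
    private static int mismatch0(byte[] a, byte[] b, int length) {
        for (int i = 0; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] y = {1, 2, 9, 4};
        System.out.println(mismatch(x, y, 4));         // prints 2
        System.out.println(mismatch(x, x.clone(), 4)); // prints -1
    }
}
```

The point of the split is only structural: the public entry point stays an
ordinary Java method, and the private target is the one a JIT could replace
with a stand-alone intrinsified version.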

> Best regards,
> Vladimir Ivanov
> 
>> 
>> The only missing case is the early startup phase, when everything is
>> interpreted, but we can add special logic in the JVM to eagerly
>> compile such methods (either during startup or on the first invocation),
>> which would be much simpler than adding intrinsics specifically for the
>> interpreter.
>> 
>> [1]
>> diff --git a/src/java.base/share/classes/java/util/ArraysSupport.java b/src/java.base/share/classes/java/util/ArraysSupport.java
>> --- a/src/java.base/share/classes/java/util/ArraysSupport.java
>> +++ b/src/java.base/share/classes/java/util/ArraysSupport.java
>> @@ -26,6 +26,7 @@
>> 
>>  import jdk.internal.HotSpotIntrinsicCandidate;
>>  import jdk.internal.misc.Unsafe;
>> +import jdk.internal.vm.annotation.ForceInline;
>> 
>>  /**
>>   * Utility methods to find a mismatch between two primitive arrays.
>> @@ -106,8 +107,16 @@
>>       * compliment of the number of remaining pairs of elements to be checked in
>>       * the tail of the two arrays.
>>       */
>> +    @ForceInline
>> +    static int vectorizedMismatch(Object a, long aOffset,
>> +                                  Object b, long bOffset,
>> +                                  int length,
>> +                                  int log2ArrayIndexScale) {
>> +        return vectorizedMismatch0(a, aOffset, b, bOffset, length, log2ArrayIndexScale);
>> +    }
>> +
>>      @HotSpotIntrinsicCandidate
>> -    static int vectorizedMismatch(Object a, long aOffset,
>> +    private static int vectorizedMismatch0(Object a, long aOffset,
>>                                    Object b, long bOffset,
>>                                    int length,
>>                                    int log2ArrayIndexScale) {
>> 
>> On 4/15/16 4:07 PM, Paul Sandoz wrote:
>>> 
>>>> On 15 Apr 2016, at 14:12, Coleen Phillimore
>>>> <coleen.phillimore at oracle.com> wrote:
>>>> 
>>>> 
>>>> I don't know why we'd add even more assembly code to the
>>>> interpreter.  Why doesn't the JIT optimize this function instead?
>>>> Does adding a stub in the interpreter prevent the JIT from
>>>> inlining this function, since it's not invocation counted?
>>>> 
>>> 
>>> I have updated the webrev with C1 support [1] and determined,
>>> eyeballing generated code, that the stub call gets inlined for C1 and
>>> C2 and appears unaffected by the wiring up of that same stub in the
>>> template interpreter.
>>> 
>>> A stub was added and wired up to C2 with the intention to wire that up
>>> to C1, and possibly to the interpreter. One reason for the latter was
>>> the performance results presented in the last email
>>> (potentially ~200x over the current approach, and ~35x improvement
>>> over the original Java code). Does that matter? Would you be concerned
>>> about that?
>>> 
>>> Array equality is quite a fundamental operation so i was concerned
>>> about such a regression in the interpreter.
>>> 
>>> Another reason for the latter, which i may be off base on here, is it
>>> might make it easier to consolidate the intrinsics added for compact
>>> string equality/comparison to this more general mismatch functionality.
>>> 
>>> —
>>> 
>>> Regarding the changes to C1 in [1]: as with the CRC intrinsics, i
>>> added the _vectorizedMismatch intrinsic to the set of intrinsics that
>>> preserve state and can trap. Is that correct? Also i am not sure if
>>> the 32-bit part is correct.
>>> 
>>> Thanks,
>>> Paul.
>>> 
>>> [1]
>>> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151268-int-c1-mismatch/webrev/
>>> 
>>> (Note: this is still incomplete; i need to appropriately update all
>>> CPU-based code.)
>>> 
>>> Benchmark              (lastNEQ)   (n)  Mode  Cnt     Score    Error  Units
>>> # Baseline
>>> # VM options: -XX:TieredStopAtLevel=1
>>> ByteArray.base_equals      false  1024  avgt   10  1190.177 ± 21.387  ns/op
>>> ByteArray.base_equals       true  1024  avgt   10  1191.767 ± 35.196  ns/op
>>> 
>>> # Before patch
>>> # VM options: -XX:TieredStopAtLevel=1 -XX:-SpecialArraysEquals -XX:-UseVectorizedMismatchIntrinsic
>>> ByteArray.jdk_equals       false  1024  avgt   10   208.014 ±  5.224  ns/op
>>> ByteArray.jdk_equals        true  1024  avgt   10   218.271 ± 10.749  ns/op
>>> 
>>> # After patch
>>> # VM options: -XX:TieredStopAtLevel=1 -XX:-SpecialArraysEquals -XX:+UseVectorizedMismatchIntrinsic
>>> ByteArray.jdk_equals       false  1024  avgt   10    70.097 ±  2.321  ns/op
>>> ByteArray.jdk_equals        true  1024  avgt   10    72.284 ±  1.578  ns/op
>>> 
>>> 
>>> 
>>>> thanks,
>>>> Coleen
>>>> 
>>>> 
>>>> On 4/14/16 10:53 AM, Paul Sandoz wrote:
>>>>> Hi,
>>>>> 
>>>>> I hooked up the array mismatch stub to the interpreter. With a bit
>>>>> of cargo culting of the CRC work and some lldb debugging [*], it
>>>>> appears to work and pass tests.
>>>>> 
>>>>> Can someone have a quick look to see if i am on the right track here:
>>>>> 
>>>>> 
>>>>> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151268-int-c1-mismatch/webrev/
>>>>> 
>>>>> 
>>>>> 
>>>>> Here are some quick numbers running using -Xint for byte[] equality:
>>>>> 
>>>>> Benchmark              (lastNEQ)   (n)  Mode  Cnt       Score      Error  Units
>>>>> # Baseline
>>>>> # VM options: -Xint
>>>>> ByteArray.base_equals      false  1024  avgt   10   16622.453 ±  498.475  ns/op
>>>>> ByteArray.base_equals       true  1024  avgt   10   16889.244 ±  439.895  ns/op
>>>>> 
>>>>> # Before patch
>>>>> # VM options: -Xint -XX:-UseVectorizedMismatchIntrinsic
>>>>> ByteArray.jdk_equals       false  1024  avgt   10  106436.195 ± 3657.508  ns/op
>>>>> ByteArray.jdk_equals        true  1024  avgt   10  103306.001 ± 2723.130  ns/op
>>>>> 
>>>>> # After patch
>>>>> # VM options: -Xint -XX:+UseVectorizedMismatchIntrinsic
>>>>> ByteArray.jdk_equals       false  1024  avgt   10     448.764 ±   18.977  ns/op
>>>>> ByteArray.jdk_equals        true  1024  avgt   10     448.657 ±   22.656  ns/op
>>>>> 
>>>>> 
>>>>> 
>>>>> The next step is to wire up C1.
>>>>> 
>>>>> Further steps would be to substitute some of the intrinsics added/used
>>>>> for compact strings with mismatch, then evaluate the performance.
>>>>> 
>>>>> Thanks,
>>>>> Paul.
>>>>> 
>>>>> [*] Stubs to be used as intrinsics in the template interpreter need
>>>>> to be created during the initial stage of generation, otherwise the
>>>>> stub address is null which leads to a SEGV that’s hard to track down.
>>>> 
>>>
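
Since the thread leans on array equality being expressed through mismatch,
a small sketch of that relationship may help. It uses the public
java.util.Arrays.mismatch (Java 9+) rather than the internal
vectorizedMismatch, and the class and helper names are made up. The
makePair helper reflects only a guess at what the benchmark's (lastNEQ)
and (n) parameters mean, namely two n-element arrays that are either
identical or differ in the final element; the benchmark source is not
shown in this thread.

```java
import java.util.Arrays;

// Hypothetical illustration; not the benchmark or JDK implementation.
final class EqualsViaMismatch {

    // Equality expressed as "same length and no mismatch found".
    static boolean equalsViaMismatch(byte[] a, byte[] b) {
        if (a == b) {
            return true;
        }
        if (a == null || b == null) {
            return false;
        }
        return a.length == b.length && Arrays.mismatch(a, b) < 0;
    }

    // Assumed reading of (lastNEQ): the arrays differ only in the last
    // element when lastNEQ is true. Requires n > 0.
    static byte[][] makePair(int n, boolean lastNEQ) {
        byte[] a = new byte[n];
        Arrays.fill(a, (byte) 7);
        byte[] b = a.clone();
        if (lastNEQ) {
            b[n - 1] ^= 1; // flip one bit so only the final element differs
        }
        return new byte[][] {a, b};
    }

    public static void main(String[] args) {
        byte[][] same = makePair(1024, false);
        byte[][] diff = makePair(1024, true);
        System.out.println(equalsViaMismatch(same[0], same[1])); // prints true
        System.out.println(equalsViaMismatch(diff[0], diff[1])); // prints false
        System.out.println(Arrays.mismatch(diff[0], diff[1]));   // prints 1023
    }
}
```

Under that reading both cases scan the full array before returning, which
would be consistent with the true/false rows in the tables above being so
close to each other.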