Hooking up the array mismatch stub as an intrinsic in the template interpreter

Fri Apr 15 13:39:41 UTC 2016

On 4/15/16 4:29 PM, Vladimir Ivanov wrote:
> An idea for how to avoid interpreter changes.
>
> The interpreter can't benefit from "intrinsifiable" methods directly,
> but if you create a wrapper and call it instead [1], the JIT compilers
> can take care of stand-alone versions for you. The interpreter will
> work with them as if they were ordinary Java methods.

... or even add such logic directly into the JVM: for methods marked with
@HotSpotIntrinsicCandidate (or, better, with some new annotation, since
most intrinsics depend on the context they are invoked in), create an
intrinsified stand-alone version.
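
For readers without the JDK sources at hand, the wrapper idea can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class name, the simplified byte[] signature and the -1 "no mismatch" return value are hypothetical, and the JDK-internal annotations appear only in comments because they are not accessible to ordinary code. The real vectorizedMismatch takes Object/offset pairs and has different return semantics.

```java
// Sketch only: a plain-Java stand-in for the wrapper pattern in patch [1].
// Names and signatures here are hypothetical, not the JDK's.
final class MismatchWrapperSketch {

    // Entry point that callers use. In the patch this method carries
    // @ForceInline so compiled callers inline through to the candidate.
    static int mismatch(byte[] a, byte[] b, int length) {
        return mismatch0(a, b, length);
    }

    // Stand-in for the method marked @HotSpotIntrinsicCandidate. The
    // interpreter would invoke a stand-alone compiled version of this
    // like any ordinary Java method. Simplified: returns the index of
    // the first differing element, or -1 if the prefixes are equal.
    private static int mismatch0(byte[] a, byte[] b, int length) {
        for (int i = 0; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] y = {1, 2, 9, 4};
        System.out.println(mismatch(x, y, 4)); // first differing index
        System.out.println(mismatch(x, x, 4)); // identical arrays
    }
}
```

The point of the split is that only the private method needs special treatment from the VM; everything that calls the wrapper stays ordinary Java.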

Best regards,
Vladimir Ivanov

>
> The only missing case is the early startup phase, when everything is
> interpreted, but we can add special logic in the JVM to eagerly
> compile such methods (either during startup or on the first invocation),
> which would be much simpler than adding intrinsics specifically for the
> interpreter.
>
> [1]
> diff --git a/src/java.base/share/classes/java/util/ArraysSupport.java b/src/java.base/share/classes/java/util/ArraysSupport.java
> --- a/src/java.base/share/classes/java/util/ArraysSupport.java
> +++ b/src/java.base/share/classes/java/util/ArraysSupport.java
> @@ -26,6 +26,7 @@
>
>   import jdk.internal.HotSpotIntrinsicCandidate;
>   import jdk.internal.misc.Unsafe;
> +import jdk.internal.vm.annotation.ForceInline;
>
>   /**
>    * Utility methods to find a mismatch between two primitive arrays.
> @@ -106,8 +107,16 @@
>        * compliment of the number of remaining pairs of elements to be checked in
>        * the tail of the two arrays.
>        */
> +    @ForceInline
> +    static int vectorizedMismatch(Object a, long aOffset,
> +                                  Object b, long bOffset,
> +                                  int length,
> +                                  int log2ArrayIndexScale) {
> +        return vectorizedMismatch0(a, aOffset, b, bOffset, length, log2ArrayIndexScale);
> +    }
> +
>       @HotSpotIntrinsicCandidate
> -    static int vectorizedMismatch(Object a, long aOffset,
> +    private static int vectorizedMismatch0(Object a, long aOffset,
>                                     Object b, long bOffset,
>                                     int length,
>                                     int log2ArrayIndexScale) {
>
> On 4/15/16 4:07 PM, Paul Sandoz wrote:
>>
>>> On 15 Apr 2016, at 14:12, Coleen Phillimore
>>> <coleen.phillimore at oracle.com> wrote:
>>>
>>>
>>> I don't know why we'd add even more assembly code to the
>>> interpreter. Why doesn't the JIT optimize this function instead?
>>> Does adding a stub in the interpreter prevent the JIT from
>>> inlining this function, since it's not invocation counted?
>>>
>>
>> I have updated the webrev with C1 support [1] and determined,
>> eyeballing generated code, that the stub call gets inlined for C1 and
>> C2 and appears unaffected by the wiring up of that same stub in the
>> template interpreter.
>>
>> A stub was added and wired up to C2 with the intention to wire that up
>> to C1, and possibly to the interpreter. One reason for the latter was
>> the performance results presented in the last email (potentially
>> ~200x over the current approach, and ~35x improvement over the
>> original Java code). Does that matter? Would you be concerned
>> about that?
>>
>> Array equality is quite a fundamental operation, so I was concerned
>> about such a regression in the interpreter.
>>
>> Another reason for the latter, which I may be off base on, is that it
>> might make it easier to consolidate the intrinsics added for compact
>> string equality/comparison into this more general mismatch functionality.
>>
>> —
>>
>> Regarding the changes to C1 in [1]: as for the CRC intrinsics, I
>> added the _vectorizedMismatch intrinsic to the set of intrinsics that
>> preserve state and can trap. Is that correct? Also, I am not sure if
>> the 32-bit part is correct.
>>
>> Thanks,
>> Paul.
>>
>> [1]
>> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151268-int-c1-mismatch/webrev/
>>
>> (Note: this is still incomplete; I need to appropriately update all
>> CPU-based code.)
>>
>> Benchmark              (lastNEQ)   (n)  Mode  Cnt     Score    Error  Units
>> # Baseline
>> # VM options: -XX:TieredStopAtLevel=1
>> ByteArray.base_equals      false  1024  avgt   10  1190.177 ± 21.387  ns/op
>> ByteArray.base_equals       true  1024  avgt   10  1191.767 ± 35.196  ns/op
>>
>> # Before patch
>> # VM options: -XX:TieredStopAtLevel=1 -XX:-SpecialArraysEquals -XX:-UseVectorizedMismatchIntrinsic
>> ByteArray.jdk_equals       false  1024  avgt   10   208.014 ±  5.224  ns/op
>> ByteArray.jdk_equals        true  1024  avgt   10   218.271 ± 10.749  ns/op
>>
>> # After patch
>> # VM options: -XX:TieredStopAtLevel=1 -XX:-SpecialArraysEquals -XX:+UseVectorizedMismatchIntrinsic
>> ByteArray.jdk_equals       false  1024  avgt   10    70.097 ±  2.321  ns/op
>> ByteArray.jdk_equals        true  1024  avgt   10    72.284 ±  1.578  ns/op
>>
>>
>>
>>> thanks,
>>> Coleen
>>>
>>>
>>> On 4/14/16 10:53 AM, Paul Sandoz wrote:
>>>> Hi,
>>>>
>>>> I hooked up the array mismatch stub to the interpreter. With a bit
>>>> of cargo-culting from the CRC work and some lldb debugging [*], it
>>>> appears to work and pass tests.
>>>>
>>>> Can someone have a quick look to see if I am on the right track here:
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151268-int-c1-mismatch/webrev/
>>>>
>>>>
>>>>
>>>> Here are some quick numbers running using -Xint for byte[] equality:
>>>>
>>>> Benchmark              (lastNEQ)   (n)  Mode  Cnt       Score      Error  Units
>>>> # Baseline
>>>> # VM options: -Xint
>>>> ByteArray.base_equals      false  1024  avgt   10   16622.453 ±  498.475  ns/op
>>>> ByteArray.base_equals       true  1024  avgt   10   16889.244 ±  439.895  ns/op
>>>>
>>>> # Before patch
>>>> # VM options: -Xint -XX:-UseVectorizedMismatchIntrinsic
>>>> ByteArray.jdk_equals       false  1024  avgt   10  106436.195 ± 3657.508  ns/op
>>>> ByteArray.jdk_equals        true  1024  avgt   10  103306.001 ± 2723.130  ns/op
>>>>
>>>> # After patch
>>>> # VM options: -Xint -XX:+UseVectorizedMismatchIntrinsic
>>>> ByteArray.jdk_equals       false  1024  avgt   10     448.764 ±   18.977  ns/op
>>>> ByteArray.jdk_equals        true  1024  avgt   10     448.657 ±   22.656  ns/op
>>>>
>>>>
>>>>
>>>> The next step is to wire up C1.
>>>>
>>>> Further steps would be to replace some of the intrinsics added/used
>>>> for compact strings with mismatch, then evaluate the performance.
>>>>
>>>> Thanks,
>>>> Paul.
>>>>
>>>> [*] Stubs to be used as intrinsics in the template interpreter need
>>>> to be created during the initial stage of generation; otherwise the
>>>> stub address is null, which leads to a SEGV that's hard to track down.
>>>
>>
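
As a cross-check on the "~200x" and "~35x" figures quoted in the thread, the ratios can be recomputed from the -Xint rows (lastNEQ = false) of the benchmark output. The sketch below is only arithmetic on the reported averages; the class and method names are hypothetical, and the error bars are ignored.

```java
// Sketch only: recompute the interpreter speed-up ratios from the
// average times (ns/op) reported in the -Xint benchmark table.
final class SpeedupSketch {

    // Ratio of two average times; values above 1 mean "after" is faster.
    static double speedup(double beforeNsPerOp, double afterNsPerOp) {
        return beforeNsPerOp / afterNsPerOp;
    }

    public static void main(String[] args) {
        double afterPatch = 448.764;     // jdk_equals, intrinsic enabled
        double beforePatch = 106436.195; // jdk_equals, intrinsic disabled
        double baseline = 16622.453;     // base_equals, original Java loop

        System.out.printf("over the current approach: %.1fx%n",
                speedup(beforePatch, afterPatch));
        System.out.printf("over the original Java code: %.1fx%n",
                speedup(baseline, afterPatch));
    }
}
```

Both ratios come out somewhat above the rounded figures in the message, which is consistent with those figures being stated as approximate.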