Adding an intrinsic to the interpreter

Wed Sep 30 01:29:22 UTC 2015

I thought these special-case interpreter entries prevent the compiler
from counting the invocations, which causes the methods not to be
inlined by the compiler. That's why we removed generate_accessor_entry
and generate_empty_entry.

Coleen

On 9/16/15 5:36 AM, Paul Sandoz wrote:
> Hi,
>
> Here is a quick and dirty patch:
>
>    http://cr.openjdk.java.net/~psandoz/tmp/interpreter-unsafe-getLong-intrinsic/webrev/
>
> It wires up getLong and also getLongUnaligned (if unaligned access is supported) to an intrinsic. It seems to work, but I am not entirely sure I did things correctly regarding the generating method.
>
> Benchmark with results is here:
>
>    http://cr.openjdk.java.net/~psandoz/tmp/interpreter-unsafe-getLong-intrinsic/LongAccess.java
>
> When the intrinsic is enabled the costs are reduced.
>
> Here are some benchmark results run against the lexico patch comparing array equals:
>
> # VM options: -XX:+UnlockDiagnosticVMOptions -XX:+UseUnsafeInterpreterIntrinsics -Xint
> Benchmark              (lastNEQ)   (n)  Mode  Cnt      Score      Error  Units
> LongArray.base_equals       true     1  avgt   10    203.886 ±    3.971  ns/op
> LongArray.base_equals       true  1024  avgt   10  15695.233 ±  204.514  ns/op
> LongArray.jdk_equals        true     1  avgt   10    860.569 ±   14.528  ns/op
> LongArray.jdk_equals        true  1024  avgt   10  65302.751 ± 1129.216  ns/op
> ByteArray.base_equals       true     1  avgt   10    210.963 ±    2.743  ns/op
> ByteArray.base_equals       true  1024  avgt   10  14883.093 ±  387.772  ns/op
> ByteArray.jdk_equals        true     1  avgt   10    277.830 ±    5.126  ns/op
> ByteArray.jdk_equals        true  1024  avgt   10   8935.940 ±  121.070  ns/op
>
> # VM options: -XX:-UnlockDiagnosticVMOptions -XX:-UseUnsafeInterpreterIntrinsics -Xint
> Benchmark              (lastNEQ)   (n)  Mode  Cnt       Score       Error  Units
> LongArray.base_equals       true     1  avgt   10     212.514 ±    23.749  ns/op
> LongArray.base_equals       true  1024  avgt   10   16191.692 ±   717.162  ns/op
> LongArray.jdk_equals        true     1  avgt   10    1057.496 ±   102.620  ns/op
> LongArray.jdk_equals        true  1024  avgt   10  355476.908 ± 12577.777  ns/op
> ByteArray.base_equals       true     1  avgt   10     200.575 ±     3.199  ns/op
> ByteArray.base_equals       true  1024  avgt   10   14907.001 ±   297.510  ns/op
> ByteArray.jdk_equals        true     1  avgt   10     270.780 ±     2.692  ns/op
> ByteArray.jdk_equals        true  1024  avgt   10   44466.436 ±   623.087  ns/op
>
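> The cost is reduced (LongArray.jdk_equals at n=1024 drops from roughly 355,000 ns/op to roughly 65,000 ns/op), and in the case of bytes jdk_equals becomes faster than base_equals once the array length gets large enough (roughly 8,900 ns/op versus 14,900 ns/op at n=1024).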
>
> I am not sure we can really make more improvements to reduce the cost per element without going further up the stack, and that defeats the purpose of not pushing specialisations down into the VM. And I suspect that, given the high cost of making invocations in the interpreter, such differences are likely to be less of a concern in real-world cases where C1/C2 kick in.
>
> My conclusion is we have a potential tweak we can use if necessary.
>
>
> On 15 Sep 2015, at 06:42, John Rose <john.r.rose at oracle.com> wrote:
>
>> On Sep 14, 2015, at 11:35 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>> Thanks. Those patches provide a useful guide to the changes required. If I take the plunge I would prefer to tackle getLong, as a quicker hack, rather than vectorizedMismatch in terms of the generated machine code.
>> I agree this is worth a first try.  Try to intrinsify the smaller bits first.
>> The interpreter has math intrinsics (AbstractInterpreter::java_lang_math_sqrt / vmIntrinsics::_dsqrt).
>> The enum in AbsInterp predates the vmIntrinsics enum, and there is duplication between them.
>>
> Thanks, yes, I see the mapping.
>
>
>> If we add new special cases to AbstractInterpreter, they might just vector through the Method::_intrinsic_id slot to a leaf C function.
>> (Perhaps different distinct leaf-function signatures get distinct MethodKind values.)
>> The extra indirections (via a function pointer table indexed by intrinsic_id) are noise in the interpreter.
>> The existing hardwired math functions could (in principle) be treated this way, as an additional cleanup.
>>
> I am not really following all of that. How can we wire up to a C function rather than generating machine code?
>
> Paul.
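
On the question of wiring up to a C function rather than generating machine code: one way to picture John's suggestion of a function pointer table indexed by intrinsic_id is the rough sketch below. It is not HotSpot code. Every name in it (IntrinsicId, LeafFn, kLeafTable, the cut-down Method struct, call_intrinsic) is made up for illustration, and it assumes a single leaf signature of (base, offset) -> long.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical intrinsic ids; a stand-in for the real vmIntrinsics enum.
enum IntrinsicId { kNone = 0, kGetLong, kGetLongUnaligned, kIntrinsicCount };

// One leaf signature shared by every entry in the table (an assumption).
using LeafFn = int64_t (*)(const void* base, int64_t offset);

// Leaf C function: reads 8 bytes at base + offset.
// memcpy keeps the read well defined even when the address is unaligned.
static int64_t leaf_get_long(const void* base, int64_t offset) {
  int64_t value;
  std::memcpy(&value, static_cast<const char*>(base) + offset, sizeof value);
  return value;
}

// Table indexed by intrinsic id; nullptr means "no leaf function".
static const LeafFn kLeafTable[kIntrinsicCount] = {
  nullptr,        // kNone
  leaf_get_long,  // kGetLong
  leaf_get_long,  // kGetLongUnaligned
};

// Cut-down stand-in for a method that carries an intrinsic id slot.
struct Method {
  IntrinsicId intrinsic_id;
};

// One generic entry: look up the leaf function and call it.
// Returns false when the method has no leaf function, so the caller
// can fall back to the normal interpreter entry.
bool call_intrinsic(const Method& m, const void* base, int64_t offset,
                    int64_t* result) {
  assert(result != nullptr);
  LeafFn fn = (m.intrinsic_id < kIntrinsicCount) ? kLeafTable[m.intrinsic_id]
                                                 : nullptr;
  if (fn == nullptr) return false;
  *result = fn(base, offset);
  return true;
}
```

With something along these lines, only the one generic entry would need generated code per leaf signature. The table lookup adds an indirect call, which should be small next to the cost of an interpreted invocation. It would not by itself address the invocation counting concern above.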