Adding an intrinsic to the interpreter
Coleen Phillimore
coleen.phillimore at oracle.com
Wed Sep 30 01:29:22 UTC 2015
I thought these special-case interpreter entries prevent the
invocations from being counted for the compiler, so those methods
don't get inlined by the compiler. That's why we removed
generate_accessor_entry and generate_empty_entry.
Coleen
On 9/16/15 5:36 AM, Paul Sandoz wrote:
> Hi,
>
> Here is a quick and dirty patch:
>
> http://cr.openjdk.java.net/~psandoz/tmp/interpreter-unsafe-getLong-intrinsic/webrev/
>
> It wires up getLong and also getLongUnaligned (if unaligned access is supported) to an intrinsic. It seems to work, but I am not entirely sure I did things correctly regarding the generating method.
>
> Benchmark with results is here:
>
> http://cr.openjdk.java.net/~psandoz/tmp/interpreter-unsafe-getLong-intrinsic/LongAccess.java
>
> When the intrinsic is enabled, the costs are reduced.
>
> Here are some benchmark results, run against the lexico patch, comparing array equality:
>
> # VM options: -XX:+UnlockDiagnosticVMOptions -XX:+UseUnsafeInterpreterIntrinsics -Xint
> Benchmark             (lastNEQ)   (n)  Mode  Cnt       Score        Error  Units
> LongArray.base_equals      true     1  avgt   10     203.886 ±      3.971  ns/op
> LongArray.base_equals      true  1024  avgt   10   15695.233 ±    204.514  ns/op
> LongArray.jdk_equals       true     1  avgt   10     860.569 ±     14.528  ns/op
> LongArray.jdk_equals       true  1024  avgt   10   65302.751 ±   1129.216  ns/op
> ByteArray.base_equals      true     1  avgt   10     210.963 ±      2.743  ns/op
> ByteArray.base_equals      true  1024  avgt   10   14883.093 ±    387.772  ns/op
> ByteArray.jdk_equals       true     1  avgt   10     277.830 ±      5.126  ns/op
> ByteArray.jdk_equals       true  1024  avgt   10    8935.940 ±    121.070  ns/op
>
> # VM options: -XX:-UnlockDiagnosticVMOptions -XX:-UseUnsafeInterpreterIntrinsics -Xint
> Benchmark             (lastNEQ)   (n)  Mode  Cnt       Score        Error  Units
> LongArray.base_equals      true     1  avgt   10     212.514 ±     23.749  ns/op
> LongArray.base_equals      true  1024  avgt   10   16191.692 ±    717.162  ns/op
> LongArray.jdk_equals       true     1  avgt   10    1057.496 ±    102.620  ns/op
> LongArray.jdk_equals       true  1024  avgt   10  355476.908 ±  12577.777  ns/op
> ByteArray.base_equals      true     1  avgt   10     200.575 ±      3.199  ns/op
> ByteArray.base_equals      true  1024  avgt   10   14907.001 ±    297.510  ns/op
> ByteArray.jdk_equals       true     1  avgt   10     270.780 ±      2.692  ns/op
> ByteArray.jdk_equals       true  1024  avgt   10   44466.436 ±    623.087  ns/op
>
> The cost is reduced, and in the case of bytes, jdk_equals becomes faster than base_equals once the array is large enough (8935.940 vs 14883.093 ns/op at n = 1024 with the intrinsic enabled).
>
> I am not sure we can reduce the per-element cost much further without going further up the stack, and that would defeat the purpose of not pushing specialisations down into the VM. I also suspect that, given the high cost of invocations in the interpreter, such differences are likely to matter less in real-world cases where C1/C2 kick in.
>
> My conclusion is we have a potential tweak we can use if necessary.
>
>
> On 15 Sep 2015, at 06:42, John Rose <john.r.rose at oracle.com> wrote:
>
>> On Sep 14, 2015, at 11:35 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>> Thanks. Those patches provide a useful guide to the changes required. If I take the plunge, I would prefer to tackle getLong, as a quicker hack, rather than vectorizedMismatch, in terms of the generated machine code.
>> I agree this is worth a first try. Try to intrinsify the smaller bits first.
>> The interpreter has math intrinsics (AbstractInterpreter::java_lang_math_sqrt / vmIntrinsics::_dsqrt).
>> The enum in AbsInterp predates the vmIntrinsics enum, and there is duplication between them.
>>
> Thanks, yes, I see the mapping.
>
>
>> If we add new special cases to AbstractInterpreter, they might just vector through the Method::_intrinsic_id slot to a leaf C function.
>> (Perhaps different distinct leaf-function signatures get distinct MethodKind values.)
>> The extra indirections (via a function pointer table indexed by intrinsic_id) are noise in the interpreter.
>> The existing hardwired math functions could (in principle) be treated this way, as an additional cleanup.
>>
> I am not really following all of that. How can we wire up to a C function rather than generating machine code?
>
> Paul.
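Paul's last question asks how an interpreter entry could call a C function instead of emitting specialised machine code. What follows is a minimal C++ sketch of one possible reading of John's suggestion, not HotSpot code. Every name in it (IntrinsicId, the leaf functions, the two tables, the simplified Method struct with its invocation_count field, and the *_entry functions) is hypothetical. The idea it shows is that only one generic entry per leaf-function signature is needed, and that entry picks the actual C function by indexing a function-pointer table with the method's intrinsic id. Presumably a real interpreter entry would be a small generated stub that moves arguments off the expression stack and makes a leaf call. Here it is just an ordinary C++ function. The entries also bump a counter before dispatching, as a nod to Coleen's concern that special entries hide invocations from the compiler. Whether that alone would satisfy HotSpot's compilation policy is not something this sketch shows.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// HYPOTHETICAL intrinsic ids: a stand-in for the role vmIntrinsics plays,
// not its real contents.
enum IntrinsicId { kNoIntrinsic, kDsqrt, kDabs, kGetLong, kIntrinsicCount };

// One function-pointer type per distinct leaf signature. In the scheme
// sketched in the thread, each signature would get its own MethodKind
// and its own generic entry.
using DoubleToDouble = double (*)(double);
using AddressToLong = std::int64_t (*)(const void*);

// Plain leaf functions: no Java frames, no allocation, no callbacks.
static double leaf_dsqrt(double x) { return std::sqrt(x); }
static double leaf_dabs(double x) { return std::fabs(x); }
static std::int64_t leaf_get_long(const void* addr) {
  std::int64_t v;
  std::memcpy(&v, addr, sizeof v);  // byte copy, so unaligned addresses are fine
  return v;
}

// Tables indexed by intrinsic id; slots with no leaf of that signature stay null.
static const DoubleToDouble kDoubleLeaves[kIntrinsicCount] = {
    nullptr, leaf_dsqrt, leaf_dabs, nullptr};
static const AddressToLong kLongLeaves[kIntrinsicCount] = {
    nullptr, nullptr, nullptr, leaf_get_long};

// HYPOTHETICAL, heavily simplified stand-in for a method descriptor.
struct Method {
  IntrinsicId intrinsic_id;
  long invocation_count;
};

// Generic entry for every double -> double intrinsic: count the call,
// then vector through the table using the method's intrinsic id.
double double_to_double_entry(Method& m, double arg) {
  ++m.invocation_count;
  DoubleToDouble fn = kDoubleLeaves[m.intrinsic_id];
  assert(fn != nullptr && "no double->double leaf for this intrinsic id");
  return fn(arg);
}

// Generic entry for every address -> long intrinsic (e.g. a getLong-style load).
std::int64_t address_to_long_entry(Method& m, const void* addr) {
  ++m.invocation_count;
  AddressToLong fn = kLongLeaves[m.intrinsic_id];
  assert(fn != nullptr && "no address->long leaf for this intrinsic id");
  return fn(addr);
}
```

For example, `Method sq{kDsqrt, 0};` followed by `double_to_double_entry(sq, 16.0)` returns 4.0 and leaves `sq.invocation_count` at 1. Under these assumptions, adding another intrinsic with an existing signature means filling one more table slot rather than generating another entry, which is the extra indirection John describes as noise in the interpreter.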