RFR (S): CR 8014447: Object.hashCode intrinsic breaks inline caches
Aleksey Shipilev
aleksey.shipilev at oracle.com
Thu Sep 26 03:52:28 PDT 2013
Thanks Vladimir!
Correct me if I'm wrong, but I think we hit neither of the corner cases
you outlined; see below.
On 09/26/2013 02:16 AM, Vladimir Kozlov wrote:
> (receiver_count > 0 && profile.morphism() == 1 &&
> profile.receiver(0)->as_klass()->is_java_lang_Object())
>
> to use the hashCode intrinsic without delay, because profiling shows only
> one receiver. But on the other hand, Object::hashCode() is a native method
> which we can't inline (that is why we have the intrinsic), so the current
> (02) code should work. But, please, confirm.
It feels wrong to special-case the hashCode intrinsic in otherwise general
code. The benchmarks I've shown before clearly show that hashCode is
inlined when we have the monomorphic j.l.O::hashCode() call. You are right
that this is because it is native. This is the chunk from the inline tree
for the monomorphic call:
@ 14 org.sample.HashCodeProfBench::virt_000 (39 bytes) inline (hot)
  @ 26 java.lang.Object::hashCode (0 bytes) native method
   \-> TypeProfile (208040/208040 counts) = java/lang/Object
  @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
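For reference, this is roughly the shape of the benchmark (a minimal sketch
in current JMH syntax; the class and method names below are my
reconstruction, not the exact HashCodeProfBench source):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class HashCodeMonoSketch {
    // Every receiver is a plain Object, so the call site profiles
    // monomorphic as java/lang/Object -- exactly the case where C2
    // can use the hashCode intrinsic without an inline cache.
    Object receiver = new Object();

    @Benchmark
    public int virtMono() {
        return receiver.hashCode(); // virtual call, single receiver type
    }
}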
> The problem is with bimorphic and polymorphic call sites where one of the
> recorded types is j.l.O and its percentage is significant. You need to use
> the intrinsic on the corresponding generated branch where it is used. And
> it could be tricky, because call_generator() is called again recursively
> for each branch and we should return the intrinsic without delay.
I think it is implicitly taken care of, because we miss the type profile
on that branch and naturally fall back to the intrinsic code? This is the
chunk of the inline tree for 90% j.l.O + 10% j.l.I:
@ 14 org.sample.HashCodeProfBench::virt_010 (39 bytes) inline (hot)
  @ 26 java.lang.Object::hashCode (0 bytes) native method
  @ 26 java.lang.Integer::hashCode (8 bytes) inline (hot)
   \-> TypeProfile (16203/162030 counts) = java/lang/Integer
   \-> TypeProfile (145827/162030 counts) = java/lang/Object
    @ 4 java.lang.Integer::hashCode (2 bytes) inline (hot)
  @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
Do you want me to make these mechanics more explicit?
> You need to check virtual and static cases.
So I benchmarked the mixed profiles and bimorphic calls for both static and
virtual hashCode calls. These are the results (the number in each benchmark
name is the percentage of Integer objects, all the others are plain
Objects; a sketch of the setup follows the baseline analysis below):
baseline:
HashCodeProfBench.stat_000: 3.3 +- 0.2 ns/op
HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
HashCodeProfBench.virt_000: 4.9 +- 0.1 ns/op
HashCodeProfBench.virt_010: 5.2 +- 0.1 ns/op
HashCodeProfBench.virt_020: 5.7 +- 0.1 ns/op
HashCodeProfBench.virt_030: 6.2 +- 0.1 ns/op
HashCodeProfBench.virt_040: 6.6 +- 0.1 ns/op
HashCodeProfBench.virt_050: 7.1 +- 0.1 ns/op
HashCodeProfBench.virt_060: 7.5 +- 0.1 ns/op
HashCodeProfBench.virt_070: 8.0 +- 0.1 ns/op
HashCodeProfBench.virt_080: 8.4 +- 0.1 ns/op
HashCodeProfBench.virt_090: 8.9 +- 0.1 ns/op
HashCodeProfBench.virt_100: 9.3 +- 0.1 ns/op
The static case is not affected at all, even in the current code (that's
OK, because the type profile is not gathered for the first argument). The
virtual case gradually degrades as we get more and more Integers in the
type profile, and we go through the slow path in the intrinsic.
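As promised above, here is a sketch of how such a mixed population could be
set up (again my reconstruction in current JMH syntax; the real
HashCodeProfBench apparently has one method per ratio, I am using @Param
for brevity):

import org.openjdk.jmh.annotations.*;
import java.util.Random;

@State(Scope.Thread)
public class HashCodeMixSketch {
    static final int SIZE = 1024;

    @Param({"0", "10", "50", "90", "100"}) // percent of Integers in the mix
    int integerPercent;

    Object[] objs;
    int idx;

    @Setup
    public void setup() {
        Random r = new Random(42);
        objs = new Object[SIZE];
        for (int i = 0; i < SIZE; i++) {
            // integerPercent of the slots hold Integers, the rest are plain
            // Objects; this is what skews the receiver type profile.
            objs[i] = (r.nextInt(100) < integerPercent)
                    ? Integer.valueOf(i)
                    : new Object();
        }
    }

    @Benchmark
    public int virtMixed() {
        idx = (idx + 1) & (SIZE - 1); // cheap wrap-around over the array
        return objs[idx].hashCode();  // call site sees the mixed types
    }
}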
patched:
HashCodeProfBench.stat_000: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
HashCodeProfBench.virt_000: 3.9 +- 0.1 ns/op
HashCodeProfBench.virt_010: 3.7 +- 0.1 ns/op
HashCodeProfBench.virt_020: 3.5 +- 0.1 ns/op
HashCodeProfBench.virt_030: 3.1 +- 0.1 ns/op
HashCodeProfBench.virt_040: 3.1 +- 0.1 ns/op
HashCodeProfBench.virt_050: 3.1 +- 0.1 ns/op
HashCodeProfBench.virt_060: 2.4 +- 0.1 ns/op
HashCodeProfBench.virt_070: 2.3 +- 0.1 ns/op
HashCodeProfBench.virt_080: 2.1 +- 0.1 ns/op
HashCodeProfBench.virt_090: 2.0 +- 0.1 ns/op
HashCodeProfBench.virt_100: 2.8 +- 0.1 ns/op
Note we get the boost (as shown before), and the boost also improves as we
go for more Integers (that is because Integer.hashCode() is dramatically
simpler).
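For scale, Integer.hashCode() essentially just returns the boxed value, so
once inlined it is a single field load with no mark word access and no
runtime call. A paraphrase of its shape (not the verbatim JDK source):

// Mirrors the shape of java.lang.Integer.hashCode().
class IntHashSketch {
    private final int value;

    IntHashSketch(int value) { this.value = value; }

    @Override
    public int hashCode() {
        return value; // the boxed int itself is the hash, nothing else to do
    }
}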
The interesting code generation quirk is that the patched virt_000 runs
~20% faster than before. The disassembly suggests the intrinsic does the
type check against the wide class pointer, while the inline cache (?) does
it against the narrow pointer, saving the decode.
Thanks,
-Aleksey.