RFR (S): CR 8014447: Object.hashCode intrinsic breaks inline caches
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Sep 26 08:03:09 PDT 2013
Thank you, Aleksey, for verifying and testing these cases.
Based on this information I think your code is good.
Regards,
Vladimir
On 9/26/13 3:52 AM, Aleksey Shipilev wrote:
> Thanks Vladimir!
>
> Correct me if I'm wrong, but I think we hit neither of the corner cases
> you outlined, see below.
>
> On 09/26/2013 02:16 AM, Vladimir Kozlov wrote:
>> (receiver_count > 0 && profile.morphism() == 1 &&
>> profile.receiver(0)->as_klass()->is_java_lang_Object())
>>
>> to use the hashCode intrinsic without delay because profiling shows
>> only one receiver. But on the other hand, Object::hashCode() is a
>> native method which we can't inline (that is why we have the
>> intrinsic), so the current (02) code should work. But, please, confirm.
>
> It feels wrong to special-case the hashCode intrinsic in otherwise
> general code. The benchmarks I've shown before clearly show that
> hashCode is inlined when we have the monomorphic j.l.O::hc() call. You
> are right, that is because it is native. This is the chunk from the
> inline tree for the monomorphic call:
>
> @ 14 org.sample.HashCodeProfBench::virt_000 (39 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) native method
> \-> TypeProfile (208040/208040 counts) = java/lang/Object
> @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
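For illustration, a minimal sketch of what such a monomorphic call site looks like (class and method names here are hypothetical, not the actual JMH benchmark source):

```java
// Hypothetical sketch of a monomorphic Object.hashCode() call site:
// every receiver the profiler observes is exactly java.lang.Object,
// so the JIT can expand the native call into the hashCode intrinsic.
public class MonomorphicHashCode {
    static final Object[] RECEIVERS = new Object[1024];
    static {
        for (int i = 0; i < RECEIVERS.length; i++) {
            RECEIVERS[i] = new Object(); // only j.l.Object instances
        }
    }

    static int sumHashes() {
        int sum = 0;
        for (Object o : RECEIVERS) {
            sum += o.hashCode(); // monomorphic virtual call
        }
        return sum;
    }
}
```

Identity hash codes are assigned lazily but are stable afterwards, so repeated calls to sumHashes() return the same value.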
>
>
>> The problem is bimorphic and polymorphic call sites where one of the
>> recorded types is j.l.O and its percentage is significant. You need to
>> use the intrinsic on the corresponding generated branch where it is
>> used. And it could be tricky because call_generator() is called again
>> recursively for each branch, and we should return the intrinsic
>> without delay.
>
> I think it is implicitly taken care of, because we miss the type
> profile on that branch and naturally fall back to the intrinsic code?
> This is the chunk of the inline tree for 90% j.l.O + 10% j.l.I:
>
> @ 14 org.sample.HashCodeProfBench::virt_010 (39 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) native method
> @ 26 java.lang.Integer::hashCode (8 bytes) inline (hot)
> \-> TypeProfile (16203/162030 counts) = java/lang/Integer
> \-> TypeProfile (145827/162030 counts) = java/lang/Object
> @ 4 java.lang.Integer::hashCode (2 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
>
> Do you want to make this mechanism more explicit?
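A hedged sketch of the mixed-profile shape under discussion (names are hypothetical; a 90%/10% split of j.l.Object and j.l.Integer receivers at a single call site):

```java
// Hypothetical bimorphic call site: ~90% of receivers are plain
// java.lang.Object, ~10% are java.lang.Integer. The type profile then
// records both klasses; the Object branch can still take the intrinsic.
public class BimorphicHashCode {
    static final Object[] RECEIVERS = new Object[1000];
    static {
        for (int i = 0; i < RECEIVERS.length; i++) {
            RECEIVERS[i] = (i % 10 == 0) ? Integer.valueOf(i)  // 10% Integers
                                         : new Object();       // 90% Objects
        }
    }

    static int sumHashes() {
        int sum = 0;
        for (Object o : RECEIVERS) {
            sum += o.hashCode(); // bimorphic: Integer.hashCode() or intrinsic
        }
        return sum;
    }
}
```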
>
>> You need to check virtual and static cases.
>
> So I did the mixed profiles and bimorphic calls for both static and
> virtual hashcodes. These are the results (the number in each test name
> is the percentage of Integer objects; all the rest are plain Objects):
>
> baseline:
> HashCodeProfBench.stat_000: 3.3 +- 0.2 ns/op
> HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_000: 4.9 +- 0.1 ns/op
> HashCodeProfBench.virt_010: 5.2 +- 0.1 ns/op
> HashCodeProfBench.virt_020: 5.7 +- 0.1 ns/op
> HashCodeProfBench.virt_030: 6.2 +- 0.1 ns/op
> HashCodeProfBench.virt_040: 6.6 +- 0.1 ns/op
> HashCodeProfBench.virt_050: 7.1 +- 0.1 ns/op
> HashCodeProfBench.virt_060: 7.5 +- 0.1 ns/op
> HashCodeProfBench.virt_070: 8.0 +- 0.1 ns/op
> HashCodeProfBench.virt_080: 8.4 +- 0.1 ns/op
> HashCodeProfBench.virt_090: 8.9 +- 0.1 ns/op
> HashCodeProfBench.virt_100: 9.3 +- 0.1 ns/op
>
> The static case is not affected at all, even in the current code
> (that's OK, because the type profile is not gathered for the first
> argument). The virtual case degrades gradually as we get more and more
> Integers in the type profile and go through the slow path in the
> intrinsic.
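For context, the virtual and static paths are comparable because for a plain java.lang.Object (hashCode not overridden) the virtual call returns the identity hash. Whether the stat_* benchmarks actually dispatch through System.identityHashCode() is my assumption from the naming, not something shown in this thread:

```java
// Illustration: a plain Object's hashCode() is its identity hash, so
// the virtual call and the static System.identityHashCode() must agree.
// (That the stat_* benchmarks use identityHashCode is an assumption.)
public class IdentityHashDemo {
    static boolean virtualEqualsStatic(Object o) {
        return o.hashCode() == System.identityHashCode(o);
    }
}
```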
>
> patched:
> HashCodeProfBench.stat_000: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_000: 3.9 +- 0.1 ns/op
> HashCodeProfBench.virt_010: 3.7 +- 0.1 ns/op
> HashCodeProfBench.virt_020: 3.5 +- 0.1 ns/op
> HashCodeProfBench.virt_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_060: 2.4 +- 0.1 ns/op
> HashCodeProfBench.virt_070: 2.3 +- 0.1 ns/op
> HashCodeProfBench.virt_080: 2.1 +- 0.1 ns/op
> HashCodeProfBench.virt_090: 2.0 +- 0.1 ns/op
> HashCodeProfBench.virt_100: 2.8 +- 0.1 ns/op
>
> Note that we get the boost (as shown before), and the boost also
> improves as we move toward more Integers (that's because
> Integer.hashCode() is dramatically simpler).
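Integer.hashCode() is specified to return the wrapped primitive value, i.e. essentially a single field load, versus the identity-hash slow path for a plain Object; a one-line check:

```java
// Integer.hashCode() simply returns the boxed int value (as specified
// in the JDK), which is far cheaper than computing an identity hash.
public class IntegerHashDemo {
    static int hash(Integer i) {
        return i.hashCode(); // equal to i.intValue()
    }
}
```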
>
> An interesting code generation quirk is that the patched virt_000 runs
> ~20% faster than before. The disassembly suggests the intrinsic does
> the type check against the wide class pointer, while the inline cache
> (?) does it against the narrow pointer, saving the decode.
>
> Thanks,
> -Aleksey.
>