RFR (S): CR 8014447: Object.hashCode intrinsic breaks inline caches
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Sep 26 08:03:09 PDT 2013
Thank you, Aleksey, for verifying and testing these cases.
Based on this information I think your code is good.
Regards,
Vladimir
On 9/26/13 3:52 AM, Aleksey Shipilev wrote:
> Thanks Vladimir!
>
> Correct me if I'm wrong, but I think we hit neither of the corner cases
> you outlined, see below.
>
> On 09/26/2013 02:16 AM, Vladimir Kozlov wrote:
>> (receiver_count > 0 && profile.morphism() == 1 &&
>> profile.receiver(0)->as_klass()->is_java_lang_Object())
>>
>> to use the hashCode intrinsic without delay because profiling shows
>> only one receiver. But on the other hand, Object::hashCode() is a
>> native method which we can't inline (that is why we have the
>> intrinsic), so the current (02) code should work. But, please, confirm.
>
> It feels wrong to special-case the hashCode intrinsic in otherwise
> general code. The benchmarks I've shown before clearly show that
> hashCode is inlined when we have the monomorphic j.l.O::hc() call. You
> are right, that is because it is native. This is the chunk from the
> inline tree for the monomorphic call:
>
> @ 14 org.sample.HashCodeProfBench::virt_000 (39 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) native method
> \-> TypeProfile (208040/208040 counts) = java/lang/Object
> @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
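For illustration, a minimal sketch of what such a monomorphic call site looks like (class and method names here are hypothetical, not the actual JMH benchmark source):

```java
// Hypothetical sketch of a monomorphic Object.hashCode() call site:
// every receiver the profiler observes is exactly java.lang.Object,
// so the JIT can expand the native call into the hashCode intrinsic.
public class MonomorphicHashCode {
    static final Object[] RECEIVERS = new Object[1024];
    static {
        for (int i = 0; i < RECEIVERS.length; i++) {
            RECEIVERS[i] = new Object(); // only j.l.Object instances
        }
    }

    static int sumHashes() {
        int sum = 0;
        for (Object o : RECEIVERS) {
            sum += o.hashCode(); // monomorphic virtual call
        }
        return sum;
    }
}
```

Identity hash codes are assigned lazily but are stable afterwards, so repeated calls to sumHashes() return the same value.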
>
>
>> The problem is bimorphic and polymorphic call sites where one of the
>> recorded types is j.l.O and its percentage is significant. You need to
>> use the intrinsic on the corresponding generated branch where it is
>> used. And it could be tricky because call_generator() is called again
>> recursively for each branch, and we should return the intrinsic
>> without delay.
>
> I think it is implicitly taken care of, because we miss the type
> profile on that branch and naturally fall back to the intrinsic code?
> This is the chunk of the inline tree for 90% j.l.O + 10% j.l.I:
>
> @ 14 org.sample.HashCodeProfBench::virt_010 (39 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) native method
> @ 26 java.lang.Integer::hashCode (8 bytes) inline (hot)
> \-> TypeProfile (16203/162030 counts) = java/lang/Integer
> \-> TypeProfile (145827/162030 counts) = java/lang/Object
> @ 4 java.lang.Integer::hashCode (2 bytes) inline (hot)
> @ 26 java.lang.Object::hashCode (0 bytes) (intrinsic)
>
> Do you want to make this mechanism more explicit?
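A hedged sketch of the mixed-profile shape under discussion (names are hypothetical; a 90%/10% split of j.l.Object and j.l.Integer receivers at a single call site):

```java
// Hypothetical bimorphic call site: ~90% of receivers are plain
// java.lang.Object, ~10% are java.lang.Integer. The type profile then
// records both klasses; the Object branch can still take the intrinsic.
public class BimorphicHashCode {
    static final Object[] RECEIVERS = new Object[1000];
    static {
        for (int i = 0; i < RECEIVERS.length; i++) {
            RECEIVERS[i] = (i % 10 == 0) ? Integer.valueOf(i)  // 10% Integers
                                         : new Object();       // 90% Objects
        }
    }

    static int sumHashes() {
        int sum = 0;
        for (Object o : RECEIVERS) {
            sum += o.hashCode(); // bimorphic: Integer.hashCode() or intrinsic
        }
        return sum;
    }
}
```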
>
>> You need to check virtual and static cases.
>
> So I did the mixed profiles and bimorphic calls for both static and
> virtual hashcodes. These are the results (the number in each test name
> is the percentage of Integer objects; all the rest are plain Objects):
>
> baseline:
> HashCodeProfBench.stat_000: 3.3 +- 0.2 ns/op
> HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_000: 4.9 +- 0.1 ns/op
> HashCodeProfBench.virt_010: 5.2 +- 0.1 ns/op
> HashCodeProfBench.virt_020: 5.7 +- 0.1 ns/op
> HashCodeProfBench.virt_030: 6.2 +- 0.1 ns/op
> HashCodeProfBench.virt_040: 6.6 +- 0.1 ns/op
> HashCodeProfBench.virt_050: 7.1 +- 0.1 ns/op
> HashCodeProfBench.virt_060: 7.5 +- 0.1 ns/op
> HashCodeProfBench.virt_070: 8.0 +- 0.1 ns/op
> HashCodeProfBench.virt_080: 8.4 +- 0.1 ns/op
> HashCodeProfBench.virt_090: 8.9 +- 0.1 ns/op
> HashCodeProfBench.virt_100: 9.3 +- 0.1 ns/op
>
> The static case is not affected at all, even in the current code
> (that's OK, because the type profile is not gathered for the first
> argument). The virtual case degrades gradually as we get more and more
> Integers in the type profile and go through the slow path in the
> intrinsic.
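For context, the virtual and static paths are comparable because for a plain java.lang.Object (hashCode not overridden) the virtual call returns the identity hash. Whether the stat_* benchmarks actually dispatch through System.identityHashCode() is my assumption from the naming, not something shown in this thread:

```java
// Illustration: a plain Object's hashCode() is its identity hash, so
// the virtual call and the static System.identityHashCode() must agree.
// (That the stat_* benchmarks use identityHashCode is an assumption.)
public class IdentityHashDemo {
    static boolean virtualEqualsStatic(Object o) {
        return o.hashCode() == System.identityHashCode(o);
    }
}
```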
>
> patched:
> HashCodeProfBench.stat_000: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_010: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_020: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_060: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_070: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_080: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_090: 3.1 +- 0.1 ns/op
> HashCodeProfBench.stat_100: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_000: 3.9 +- 0.1 ns/op
> HashCodeProfBench.virt_010: 3.7 +- 0.1 ns/op
> HashCodeProfBench.virt_020: 3.5 +- 0.1 ns/op
> HashCodeProfBench.virt_030: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_040: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_050: 3.1 +- 0.1 ns/op
> HashCodeProfBench.virt_060: 2.4 +- 0.1 ns/op
> HashCodeProfBench.virt_070: 2.3 +- 0.1 ns/op
> HashCodeProfBench.virt_080: 2.1 +- 0.1 ns/op
> HashCodeProfBench.virt_090: 2.0 +- 0.1 ns/op
> HashCodeProfBench.virt_100: 2.8 +- 0.1 ns/op
>
> Note that we get the boost (as shown before), and the boost also
> improves as we move toward more Integers (that's because
> Integer.hashCode() is dramatically simpler).
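Integer.hashCode() is specified to return the wrapped primitive value, i.e. essentially a single field load, versus the identity-hash slow path for a plain Object; a one-line check:

```java
// Integer.hashCode() simply returns the boxed int value (as specified
// in the JDK), which is far cheaper than computing an identity hash.
public class IntegerHashDemo {
    static int hash(Integer i) {
        return i.hashCode(); // equal to i.intValue()
    }
}
```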
>
> An interesting code generation quirk is that the patched virt_000 runs
> ~20% faster than before. The disassembly suggests the intrinsic does
> the type check against the wide class pointer, while the inline cache
> (?) does it against the narrow pointer, saving the decode.
>
> Thanks,
> -Aleksey.
>