Request for review (L): 7071653: JSR 292: call site change notification should be pushed not pulled

Christian Thalinger christian.thalinger at oracle.com
Mon Aug 8 11:12:06 PDT 2011


On Aug 8, 2011, at 4:55 PM, Vladimir Kozlov wrote:

> Christian,
> 
> Should we put the "skip bytecode quickening" code under a flag so it only runs when invokedynamic is enabled?  Or is put_code zero only in the invokedynamic case?

No, it doesn't buy us anything.  The new checking code is only executed the first time, since the bytecodes are quickened right after that.  And in the case where a putfield isn't quickened, the call to resolve_get_put is very expensive anyway.
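
To make the new control flow concrete, here is a much-simplified, hypothetical model (plain Java, not the actual HotSpot code) of the decision patch_bytecode now makes:

    // Hedged sketch: a putfield is only rewritten to its quick form
    // once a non-zero put_code has been filled into its constant pool
    // cache entry; the entry for CallSite.target never gets one.
    class QuickeningModel {
        static final int PUT_CODE_NONE = 0;

        static void executePutfield(int cachedPutCode) {
            if (cachedPutCode == PUT_CODE_NONE) {
                // Slow path on every execution: resolve_get_put runs
                // and can deoptimize nmethods that depend on the old
                // CallSite target.
                System.out.println("slow path: resolve and notify dependents");
            } else {
                System.out.println("quickened fast path");
            }
        }

        public static void main(String[] args) {
            executePutfield(PUT_CODE_NONE); // e.g. CallSite.target
            executePutfield(0xCC);          // some hypothetical quickened code
        }
    }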

> 
> On 8/8/11 6:56 AM, Christian Thalinger wrote:
>>> Why do you use ld_ptr() on SPARC to load from the cache, but movl() (only 32-bit) on x86 and x64?
>> 
>> Good question.  I took the code from TemplateTable::resolve_cache_and_index without thinking about it and that one uses ld_ptr.
>> 
>> _indices in ConstantPoolCacheEntry is defined as intx:
>> 
>>   volatile intx     _indices;  // constant pool index & rewrite bytecodes
>> 
>> and bytecodes 1 and 2 are in the upper 16 bits of the lower 32-bit word:
>> 
>> // bit number |31                0|
>> // bit length |-8--|-8--|---16----|
>> // --------------------------------
>> // _indices   [ b2 | b1 |  index  ]
>> 
>> Loading 32 bits on LE gives you the right bits, but on BE it does not.  I think that's the reason for the "optimization" on x64.
> 
> I don't like this "optimization", but I understand why we are using it.  Add a comment (especially in the x64 file).

I factored reading the bytecode into InterpreterMacroAssembler::get_cache_and_index_and_bytecode_at_bcp, since the same code is used twice in TemplateTable, and added the comment there.
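
For reference, a tiny stand-alone illustration (plain Java, made-up values) of packing and unpacking the low 32-bit word of _indices:

    // Hedged sketch of the [ b2 | b1 | index ] layout shown above.
    class IndicesLayout {
        public static void main(String[] args) {
            int b2 = 0xB4, b1 = 0xB3, index = 0x002A; // hypothetical values
            int word = (b2 << 24) | (b1 << 16) | index;
            // Unpacking again.  The LE/BE problem only exists in native
            // code: a 32-bit load at the base address of a 64-bit intx
            // picks up the low word on LE but the high word on BE.
            System.out.printf("b1 = 0x%02x, b2 = 0x%02x, index = 0x%04x%n",
                              (word >>> 16) & 0xFF,
                              (word >>> 24) & 0xFF,
                              word & 0xFFFF);
        }
    }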

> 
>>> 
>>> I am concerned about the use of a short branch in the new code in templateTable_sparc.cpp:
>>> 
>>> cmp_and_br_short(..., L_patch_done);  // don't patch
>>> 
>>> There is a __ stop() call which generates a lot of code, so the label L_patch_done could be far away.
>> 
>> Yeah, I thought I'd give it a try and see if it works.  cmp_and_br_short should assert if the branch displacement is too large, right?
>> 
> 
> Yes, it will assert, but maybe only in some worst case which we do not test.  For example, try running a 64-bit fastdebug VM on SPARC + compressed oops + VerifyOops.

That works.

> 
>>> 
>>> 
>>> Why did you add new #includes to ciEnv.cpp and nmethod.cpp?  What code needs them?  Nothing else is changed in these files.
>> 
>> Both files use dependencies and I got linkage errors on Linux while working on the fix (because of inline methods).  It seems that the include is not required in ciEnv.cpp because ciEnv.hpp already includes it.  I missed that.  But nmethod.cpp needs it because nmethod.hpp only declares class Dependencies.
>> 
> 
> OK.
> 
>> 
>>> 
>>> Why did you not leave "volatile" call site inlining with a guard?  You did not explain why a virtual call is fine for it.
>> 
>> The spec of MutableCallSite says:
>> 
>> "For target values which will be frequently updated, consider using a volatile call site instead."
>> 
>> And VolatileCallSite says:
>> 
>> "A VolatileCallSite is a CallSite whose target acts like a volatile variable. An invokedynamic instruction linked to a VolatileCallSite sees updates to its call site target immediately, even if the update occurs in another thread. There may be a performance penalty for such tight coupling between threads.
>> 
>> Unlike MutableCallSite, there is no syncAll operation on volatile call sites, since every write to a volatile variable is implicitly synchronized with reader threads.
>> 
>> In other respects, a VolatileCallSite is interchangeable with MutableCallSite."
>> 
>> Since VolatileCallSite really should only be used when you know the target changes very often, we don't do optimizations for this case.  Obviously this is just a guess at how people will use VolatileCallSite, but I think for now it is a safe bet.
>> 
> 
> Thank you for explaining it.
> 
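
For the archives, here is a small self-contained example (hypothetical method names inc/dec) of the usage difference described above:

    import java.lang.invoke.*;

    // Hedged sketch: MutableCallSite for rarely changing targets
    // (compiled callers may inline the target and rely on
    // deoptimization when it changes), VolatileCallSite for targets
    // that change very often.
    public class CallSiteKinds {
        static int inc(int x) { return x + 1; }
        static int dec(int x) { return x - 1; }

        public static void main(String[] args) throws Throwable {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType mt = MethodType.methodType(int.class, int.class);
            MethodHandle inc = lookup.findStatic(CallSiteKinds.class, "inc", mt);
            MethodHandle dec = lookup.findStatic(CallSiteKinds.class, "dec", mt);

            MutableCallSite stable = new MutableCallSite(inc);
            MethodHandle invoker = stable.dynamicInvoker();
            System.out.println((int) invoker.invokeExact(41)); // 42
            stable.setTarget(dec);
            // Publish the new target to all threads.
            MutableCallSite.syncAll(new MutableCallSite[] { stable });
            System.out.println((int) invoker.invokeExact(41)); // 40

            VolatileCallSite busy = new VolatileCallSite(inc);
            System.out.println((int) busy.dynamicInvoker().invokeExact(1)); // 2
            busy.setTarget(dec); // visible immediately, no syncAll needed
            System.out.println((int) busy.dynamicInvoker().invokeExact(1)); // 0
        }
    }
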
>> Additionally I had to do two small changes because the build was broken on some configurations:
>> 
>> -  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : NULL;
>> +  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : (klassOop) NULL;
>> 
>> and
>> 
>> -      MutexLockerEx ccl(CodeCache_lock, thread);
>> +      MutexLockerEx ccl(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>> 
>> I updated the webrev.
> 
> Good.

Thanks.

-- Christian

> 
> Vladimir
> 
>> 
>> -- Christian
>> 
>>> 
>>> 
>>> Vladimir
>>> 
>>> On 8/5/11 6:32 AM, Christian Thalinger wrote:
>>>> http://cr.openjdk.java.net/~twisti/7071653
>>>> 
>>>> 7071653: JSR 292: call site change notification should be pushed not pulled
>>>> Reviewed-by:
>>>> 
>>>> Currently every speculatively inlined method handle call site has a
>>>> guard that compares the current target of the CallSite object to the
>>>> inlined one.  This per-invocation overhead can be removed if the
>>>> notification is changed from pulled to pushed (i.e. deoptimization).
>>>> 
>>>> I had to change the logic in TemplateTable::patch_bytecode to skip
>>>> bytecode quickening for putfield instructions when the put_code
>>>> written to the constant pool cache is zero.  This is required so that
>>>> every execution of a putfield to CallSite.target calls out to
>>>> InterpreterRuntime::resolve_get_put to do the deoptimization of
>>>> dependent compiled methods.
>>>> 
>>>> I also had to change the dependency machinery to understand
>>>> dependencies other than class hierarchy ones.  DepChange became the
>>>> super-type of two new dependencies, KlassDepChange and
>>>> CallSiteDepChange.
>>>> 
>>>> Tested with JRuby tests and benchmarks, hand-written testcases, JDK
>>>> tests and vm.mlvm tests.
>>>> 
>>>> Here is the speedup for the JRuby fib benchmark (the first run is
>>>> with JDK 7 b147, the second with 7071653).  Since the CallSite
>>>> targets don't change during the runtime of this benchmark, we can
>>>> see the performance benefit of eliminating the guard:
>>>> 
>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>   0.883000   0.000000   0.883000 (  0.854000)
>>>>   0.715000   0.000000   0.715000 (  0.715000)
>>>>   0.712000   0.000000   0.712000 (  0.712000)
>>>>   0.713000   0.000000   0.713000 (  0.713000)
>>>>   0.713000   0.000000   0.713000 (  0.712000)
>>>> 
>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>   0.772000   0.000000   0.772000 (  0.742000)
>>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>>   0.621000   0.000000   0.621000 (  0.621000)
>>>>   0.622000   0.000000   0.622000 (  0.622000)
>>>>   0.622000   0.000000   0.622000 (  0.621000)
>>>> 
>> 


