Request for review (L): 7071653: JSR 292: call site change notification should be pushed not pulled

Mon Aug 8 12:08:52 PDT 2011

On Aug 8, 2011, at 11:52 AM, Vladimir Kozlov wrote:

> Christian Thalinger wrote:
>> On Aug 8, 2011, at 4:55 PM, Vladimir Kozlov wrote:
>> 
>>> Christian,
>>> 
>>> Should we put "skip bytecode quickening" code under flag to do this only when invoke dynamic is enabled? Or put_code is zero only in invoke dynamic case?
>> 
>> No, it doesn't buy us anything.  The new checking code is only executed the first time as the bytecodes are quickened right after that.  And in the case where a putfield isn't quickened and we call resolve_get_put it gets very expensive anyway.
> 
> You lost me here. New code in resolve_get_put() is executed only for putfield to 
> CallSite.target. But new code in patch_bytecode() skips quickening for all 
> putfield bytecodes. My question is: can you narrow skipping quickening only for 
> putfield to CallSite.target? Or you are saying that there is no performance 
> difference between executing _aputfield vs _fast_aputfield?

It only skips quickening if put_code is zero, which is only done for CallSite.target.  All the others proceed as they used to.

tom

> 
> Vladimir
> 
>> 
>>> On 8/8/11 6:56 AM, Christian Thalinger wrote:
>>>>> Why on sparc you use ld_ptr() to load from cache but on X86 and X64 you use movl() (only 32  bit)?
>>>> Good question.  I took the code from TemplateTable::resolve_cache_and_index without thinking about it and that one uses ld_ptr.
>>>> 
>>>> _indices in CosntantPoolCacheEntry is defined as intx:
>>>> 
>>>>  volatile intx     _indices;  // constant pool index&  rewrite bytecodes
>>>> 
>>>> and bytecode 1 and 2 are in the upper 16-bit of the lower 32-bit word:
>>>> 
>>>> // bit number |31                0|
>>>> // bit length |-8--|-8--|---16----|
>>>> // --------------------------------
>>>> // _indices   [ b2 | b1 |  index  ]
>>>> 
>>>> Loading 32-bit on LE gives you the right bits but on BE it does not.  I think that's the reason for the "optimization" on x64.
>>> I don't like this "optimization" but I understand why we using it. Add a comment (especially in x64 file).
>> 
>> I factored reading the bytecode into InterpreterMacroAssembler::get_cache_and_index_and_bytecode_at_bcp since the same code is used twice in TemplateTable and added the comment there.
>> 
>>>>> I am concern about using next short branch in new code in templateTable_sparc.cpp:
>>>>> 
>>>>> cmp_and_br_short(..., L_patch_done);  // don't patch
>>>>> 
>>>>> There is __ stop() call which generates a lot of code so that label L_patch_done could be far.
>>>> Yeah, I thought I give it a try if it works.  cmp_and_br_short should assert if the branch displacement is too far, right?
>>>> 
>>> Yes, it will assert but may be only in some worst case which we do not test. For example, try to run 64 bit fastdebug VM on Sparc + compressed oops + VerifyOops.
>> 
>> That works.
>> 
>>>>> 
>>>>> Why you added new #include into ciEnv.cpp and nmethod.cpp, what code needs it? Nothing else is changed in these files.
>>>> Both files use dependencies and I got linkage errors on Linux while working on the fix (because of inline methods).  It seems that the include is not required in ciEnv.cpp because ciEnv.hpp already includes it.  I missed that.  But nmethod.cpp needs it because nmethod.hpp only declares class Dependencies.
>>>> 
>>> OK.
>>> 
>>>>> Why you did not leave "volatile" call site inlining with guard? You did not explain why virtual call is fine for it.
>>>> The spec of MutableCallSite says:
>>>> 
>>>> "For target values which will be frequently updated, consider using a volatile call site instead."
>>>> 
>>>> And VolatileCallSite says:
>>>> 
>>>> "A VolatileCallSite is a CallSite whose target acts like a volatile variable. An invokedynamic instruction linked to a VolatileCallSite sees updates to its call site target immediately, even if the update occurs in another thread. There may be a performance penalty for such tight coupling between threads.
>>>> 
>>>> Unlike MutableCallSite, there is no syncAll operation on volatile call sites, since every write to a volatile variable is implicitly synchronized with reader threads.
>>>> 
>>>> In other respects, a VolatileCallSite is interchangeable with MutableCallSite."
>>>> 
>>>> Since VolatileCallSite really should only be used when you know the target changes very often we don't do optimizations for this case.  Obviously this is just a guess how people will use VolatileCallSite but I think for now this is a safe bet.
>>>> 
>>> Thank you for explaining it.
>>> 
>>>> Additionally I had to do two small changes because the build was broken on some configurations:
>>>> 
>>>> -  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : NULL;
>>>> +  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : (klassOop) NULL;
>>>> 
>>>> and
>>>> 
>>>> -      MutexLockerEx ccl(CodeCache_lock, thread);
>>>> +      MutexLockerEx ccl(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>>>> 
>>>> I updated the webrev.
>>> Good.
>> 
>> Thanks.
>> 
>> -- Christian
>> 
>>> Vladimir
>>> 
>>>> -- Christian
>>>> 
>>>>> 
>>>>> Vladimir
>>>>> 
>>>>> On 8/5/11 6:32 AM, Christian Thalinger wrote:
>>>>>> http://cr.openjdk.java.net/~twisti/7071653
>>>>>> 
>>>>>> 7071653: JSR 292: call site change notification should be pushed not pulled
>>>>>> Reviewed-by:
>>>>>> 
>>>>>> Currently every speculatively inlined method handle call site has a
>>>>>> guard that compares the current target of the CallSite object to the
>>>>>> inlined one.  This per-invocation overhead can be removed if the
>>>>>> notification is changed from pulled to pushed (i.e. deoptimization).
>>>>>> 
>>>>>> I had to change the logic in TemplateTable::patch_bytecode to skip
>>>>>> bytecode quickening for putfield instructions when the put_code
>>>>>> written to the constant pool cache is zero.  This is required so that
>>>>>> every execution of a putfield to CallSite.target calls out to
>>>>>> InterpreterRuntime::resolve_get_put to do the deoptimization of
>>>>>> depending compiled methods.
>>>>>> 
>>>>>> I also had to change the dependency machinery to understand other
>>>>>> dependencies than class hierarchy ones.  DepChange got the super-type
>>>>>> of two new dependencies, KlassDepChange and CallSiteDepChange.
>>>>>> 
>>>>>> Tested with JRuby tests and benchmarks, hand-written testcases, JDK
>>>>>> tests and vm.mlvm tests.
>>>>>> 
>>>>>> Here is the speedup for the JRuby fib benchmark (first is JDK 7 b147,
>>>>>> second with 7071653).  Since the CallSite targets don't change during
>>>>>> the runtime of this benchmark we can see the performance benefit of
>>>>>> eliminating the guard:
>>>>>> 
>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>  0.883000   0.000000   0.883000 (  0.854000)
>>>>>>  0.715000   0.000000   0.715000 (  0.715000)
>>>>>>  0.712000   0.000000   0.712000 (  0.712000)
>>>>>>  0.713000   0.000000   0.713000 (  0.713000)
>>>>>>  0.713000   0.000000   0.713000 (  0.712000)
>>>>>> 
>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>  0.772000   0.000000   0.772000 (  0.742000)
>>>>>>  0.624000   0.000000   0.624000 (  0.624000)
>>>>>>  0.621000   0.000000   0.621000 (  0.621000)
>>>>>>  0.622000   0.000000   0.622000 (  0.622000)
>>>>>>  0.622000   0.000000   0.622000 (  0.621000)
>>>>>> 
>> 
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev