RFR(XXS): 8227117: normal interpreter table is not restored after single stepping with TLH

Fri Jul 5 17:38:37 UTC 2019

On 7/4/19 5:38 AM, David Holmes wrote:
> Hi Erik,
>
> On 4/07/2019 6:08 pm, Erik Österlund wrote:
>> Hi David,
>>
>> When you run without TLH, this copying mechanism is used to 
>> synchronize the safepoint while JavaThreads are running. The 
>> interpreter doesn't emit any polls then. Instead it clobbers the 
>> dispatch table. JavaThreads will be reading from the dispatch table 
>> while it is being (non-atomically) modified. That could crash. For 
>> example with the Solaris + studio + SPARC - TLH configuration, the 
>> compiler will almost certainly emit a memcpy (this transformation has 
>> been observed in practice), the memcpy will use BIS instructions 
>> (observed in practice) for performance, with out-of-thin-air values 
>> (observed in practice), and the JavaThreads will occasionally crash 
>> during safepoint synchronization due to said out-of-thin-air values.
>>
>> So I guess the problem might be larger back when TLH was not default. 
>> But this seems conceptually wrong.
>
> I always thought there were two dispatch tables and we simply switched 
> between them - not copied anything!

Based on my search back into the TeamWare repos, it looks like we have
always copied the table...

Dan

>
> David
>
>> /Erik
>>
>> On 2019-07-04 09:17, David Holmes wrote:
>>> Hi Erik,
>>>
>>> On 4/07/2019 5:10 pm, Erik Österlund wrote:
>>>> Hi Dan,
>>>>
>>>> Thanks for picking this up. The change looks good.
>>>>
>>>> However, when reviewing this, I looked at the code for actually 
>>>> restoring the table (ignore/notice safepoints). It copies the 
>>>> dispatch table for the interpreter. There is a comment stating it 
>>>> is important the copying is atomic for MT-safety, and I can 
>>>> definitely see why. However, the copying the line after that 
>>>> comment is in fact not atomic.
>>>
>>> Is it assuming "atomicity" by virtue of executing at a safepoint?
>>>
>>> David
>>> -----
>>>
>>>> Here is the copying code in templateInterpreter.cpp:
>>>>
>>>> static inline void copy_table(address* from, address* to, int size) {
>>>>    // Copy non-overlapping tables. The copy has to occur word wise 
>>>> for MT safety.
>>>>    while (size-- > 0) *to++ = *from++;
>>>> }
>>>>
>>>> Copying using a loop of non-volatile loads and stores can and 
>>>> definitely will on some compilers turn into memcpy calls instead as 
>>>> the compiler (correctly) considers that an equivalent 
>>>> transformation. And memcpy does not guarantee atomicity. Indeed on 
>>>> some platforms it is not atomic. On some platforms it will even 
>>>> enjoy out-of-thin-air values. Perhaps Copy::disjoint_words_atomic() 
>>>> would be a better choice for atomic word copying. If not, at the 
>>>> very least we should use Atomic::load/store here.
>>>>
>>>> Having said that, the fix for that issue seems like a separate RFE, 
>>>> because it has been sitting there for a lot longer than TLH has 
>>>> been around.
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>> On 2019-07-04 04:04, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> Robbin recently discovered this issue with Thread Local 
>>>>> Handshakes. Since
>>>>> he's not available at the moment, I'm handling the issue:
>>>>>
>>>>>      JDK-8227117 normal interpreter table is not restored after 
>>>>> single stepping with TLH
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8227117
>>>>>
>>>>> When using Thread Local Handshakes, the normal interpreter table is
>>>>> not restored after single stepping. This issue is caused by the
>>>>> VM_ChangeSingleStep VM-op relying on SafepointSynchronize::end() to
>>>>> restore the normal interpreter table for the "off" case.
>>>>>
>>>>> Prior to Thread Local Handshakes, this was a valid assumption to 
>>>>> make.
>>>>> SafepointSynchronize::end() has been refactored into
>>>>> disarm_safepoint() and it only calls Interpreter::ignore_safepoints()
>>>>> on the global safepoint branch. That matches up with the call to
>>>>> Interpreter::notice_safepoints() that is also on the global safepoint
>>>>> branch.
>>>>>
>>>>> The solution is for the VM_ChangeSingleStep VM-op for the "off" case
>>>>> to call Interpreter::ignore_safepoints() directly.
>>>>>
>>>>> Here's the webrev URL:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8227117-webrev/0_for_jdk14/
>>>>>
>>>>> The fix is just a small addition to VM_ChangeSingleStep::doit():
>>>>>
>>>>>     if (_on) {
>>>>>       Interpreter::notice_safepoints();
>>>>> +  } else {
>>>>> +    Interpreter::ignore_safepoints();
>>>>>     }
>>>>>
>>>>> Everything else is just new logging support for future debugging of
>>>>> interpreter table management and single stepping.
>>>>>
>>>>> Tested this fix with Mach5 Tier[1-3] on the standard Oracle 
>>>>> platforms.
>>>>> Mach5 Tier[4-6] on standard Oracle platforms is running now.
>>>>>
>>>>> Thanks, in advance, for questions, comments or suggestions.
>>>>>
>>>>> Dan
>>>>>
>>