RFR(S): 8153892: Handle unsafe access error directly in signal handler instead of going through a stub

Thu Apr 28 15:44:46 UTC 2016

On 4/28/2016 5:41 AM, David Holmes wrote:
> Hi Mikael,
>
> On 28/04/2016 1:54 AM, Mikael Vidstedt wrote:
>>
>>
>> On 4/27/2016 12:24 AM, Thomas Stüfe wrote:
>>> Hi Mikael,
>>>
>>> On Tue, Apr 26, 2016 at 8:35 PM, Mikael Vidstedt
>>> <<mailto:mikael.vidstedt at oracle.com>mikael.vidstedt at oracle.com> wrote:
>>>
>>>
>>>
>>>     On 4/12/2016 2:15 AM, Thomas Stüfe wrote:
>>>>     Hi Mikael, David,
>>>>
>>>>     On Tue, Apr 12, 2016 at 2:29 AM, David Holmes
>>>> <<mailto:david.holmes at oracle.com>david.holmes at oracle.com> wrote:
>>>>
>>>>         On 11/04/2016 10:57 AM, David Holmes wrote:
>>>>
>>>>             Hi Mikael,
>>>>
>>>>             I think we need to be able to answer the question as to
>>>>             why the stubbed
>>>>             and stubless forms of this code exist to ensure that
>>>>             converting all
>>>>             platforms to the same form is appropriate.
>>>>
>>>>
>>>>         The more I look at this the more the stubs make no sense :)
>>>>         AIII a stub is generated when we need runtime code that may
>>>>         be different to that which we could write directly for
>>>>         compiling at build time - ie to use CPU specific features of
>>>>         the actual CPU. But I see nothing here that suggests any such
>>>>         usage.
>>>>
>>>>         So I agree with removing the stubs.
>>>>
>>>>             I'm still going through this but my initial reaction is
>>>>             to wonder why we
>>>>             don't use the same form of handle_unsafe_access on all
>>>>             platforms and
>>>>             always pass in npc? (That seems to be the only difference
>>>>             in code that
>>>>             otherwise seems platform independent.)
>>>>
>>>>
>>>>         Futher to this and Thomas's comments I think
>>>>         handle_unsafe_access(thread, pc, npc) can be defined in
>>>>         shared code (where? not sure). Further, if we always pass in
>>>>         npc then we don't need to pass in pc as it is unused (seems
>>>>         unused in original code too for sparc).
>>>>
>>>>
>>>>     I agree. We commonized ucontext_set_pc for all Posix platforms,
>>>>     so we can make a common function "handle_unsafe_access(thread,
>>>>     npc)" and inside use os::Posix::ucontext_set_pc to modify the
>>>>     context. Then we can get rid of the special handling in the
>>>>     signal handlers inside os_aix_ppc.cpp and os_linux_ppc.cpp (for
>>>>     both the compiled and the interpreted case).
>>>
>>>     There is definitely room for unification and simplification here.
>>>     Right now the signal handling code is, sadly, different on all the
>>>     different platforms, despite the fact that in many cases it should
>>>     be similar or the exact same. That said, as much as a
>>>     refactoring/rewrite of the signal handler code is needed, it will
>>>     very quickly turn into a much larger effort...
>>>
>>>     In this specific case, it would probably make more sense to pass
>>>     in the full context to the handle_unsafe_access method and have it
>>>     do whatever it feels is necessary to update it. However, a lot of
>>>     the signal handler code assumes that a "stub" variable gets set up
>>>     and only at the end of the main signal handler function does the
>>>     actual context get updated. Changing how that works only for this
>>>     specific case is obviously not a good idea, which means it's back
>>>     to the full scale refactoring and out of scope for the bug fix.
>>>
>>>     So to me the fact that the method prototypes differ depending on
>>>     the exact platform is just a reflection of how the contexts
>>>     differ. In lack of the full context the handler method needs to
>>>     take whatever parts of the context is needed to do it's job. I
>>>     could of course change the handler method to only take a single
>>>     "next_pc" argument, but to me that feels like putting a key part
>>>     of the logic which handles the unsafe access (specifically, the
>>>     part which calculates the next pc) in the wrong place - IMHO that
>>>     should really be tightly coupled with the rest of the logic needed
>>>     to handle an unsafe access (updating the thread state etc.), and
>>>     hence I feel that it really belongs in the handle_unsafe_access
>>>     method itself. Happy to hear your thoughts, but I hope we can
>>>     agree that the suggested fix, even in its current state, is still
>>>     significantly better than what is there now.
>>>
>>>
>>>     Unless somebody has a better suggestion, I'm going to be moving
>>>     the implementations of the handle_unsafe_access methods to
>>>     sharedRuntime (instead of stubRoutines) and will send out a new
>>>     webrev shortly.
>>>
>>>
>>> I am unhappy with the fact that we factor unsafe handling out for x86
>>> and sparc but do it inline for ppc. I know that was done before your
>>> change as well but would be happy if that could be improved. I would
>>> prefer either one of:
>>
>> Fully agree - this is an example of the more general problem of logic
>> which is /almost/ the same across different platforms, but which has
>> been effectively copy/pasted and drifted apart over time.
>>
>>>
>>> 1) flatten out the coding into the signal handlers like it is done in
>>> os_linux_ppc.cpp and os_aix_ppc.cpp or
>>> 2) add a StubRoutines::ppc64::handle_unsafe_access() for the ppc case
>>>
>>> I would actually prefer (1) even though this would multiply the code
>>> out for all os cases into <os_cpu>; we are only talking about 1-2
>>> lines of additional coding, and it would improve the readability of
>>> the signal handlers.
>>>
>>> But this is only my personal opinion, and I do not have strong
>>> emotions. I agree with you that a full cleanup of the signal coding is
>>> out of scope for this issue.
>>
>> I spent yesterday going back and forth on the various alternatives and
>> the only thing I can say with certainty now is that apart from
>> refactoring the whole thing everything else is ugly... For example, I
>> agree that consistency is an important goal here, but since there's
>> little to no consistency there today it's really hard to make a relevant
>> dent in it. :(
>>
>> Flattening it out is an alternative (and a good one), but that is not
>> something I'm willing to do as part of this change because only
>> flattening this specific case/return will actually add to the
>> inconstency... So ultimately yesterday I chose to do something closer to
>> your alternative 2). Is it still ugly? Yes; lipstick on pig and all of
>> that. Have a look at it and see how you feel about it. I try to keep in
>> mind that what is there today is (more) broken. :)
>>
>> Webrev:
>>
>> http://cr.openjdk.java.net/~mikael/webrevs/8153892/webrev.02/hotspot/webrev/ 
>>
>
> Now I see this in code form I really don't understand why next_pc is 
> passed in, unused and then returned ??

Given that the full context can't easily be passed in and updated (which 
I still think is the right, long term way of doing this), I chose to do 
it this way instead. It is a signal to a caller that *only* calling 
handle_unsafe_access is not enough, there's more to it in that the 
context also needs to be updated. I see it as a way to make sure the 
actual next_pc calculation and update is not forgotten, and make it 
clear that it goes hand in hand with updating the thread state. It's 
obviously not perfect, but I do feel like it ever so slightly helps 
clarify how the access fault needs to be handled.

>
> Otherwise in src/share/vm/runtime/sharedRuntime.cpp in the comment 
> block - capitals after periods please :)

Fixed, I'll not post a new webrev for it though :)

Cheers,
Mikael

>
> Stub removal seems fine.
>
> Thanks,
> David
>
>> Incremental from webrev.01:
>>
>> http://cr.openjdk.java.net/~mikael/webrevs/8153892/webrev.02.incr/hotspot/webrev/ 
>>
>>
>> Cheers,
>> Mikael
>>
>>>
>>>
>>>     Cheers,
>>>     Mikael
>>>
>>>>
>>>>
>>>>         BTW I found this comment somewhat unfathomable (both now and
>>>>         in original code):
>>>>
>>>>         +   // pc is the instruction which we must emulate
>>>>         +   // doing a no-op is fine:  return garbage from the load
>>>>
>>>>         but finally realized that it means that after the load that
>>>>         raised the signal the native code proceeds normally but the
>>>>         value apparently loaded is just garbage/arbitrary, and the
>>>>         only sign something went wrong is the setting of the pending
>>>>         unsafe-access-error bit. This would be a potential source of
>>>>         bugs I think, except that when we hit the Java level, we
>>>>         throw the exception and so never actually "return" the
>>>>         garbage value. But it does mean we would have to be careful
>>>>         if calling the unsafe routines from native code.
>>>>
>>>>
>>>>     I admit I do not understand fully how the
>>>>     _special_runtime_exit_condition flag is processed later, at least
>>>>     not for all cases: If I have a java method A using
>>>>     sun.misc.unsafe, which gets compiled, the sun.misc.unsafe
>>>>     intrinsic gets inlined into that method. So, the whole method A
>>>>     gets marked as "has unsafe access"? So, any SIGBUS happening
>>>>     inside this method - which may be larger than the inlined
>>>>     sun.misc.unsafe call - will yield an InternalError? And when is
>>>>     the flag checked if that method A is called from another java
>>>>     method B?
>>>>
>>>>     Sorry if the questions are stupid, I am not a JIT expert, but I
>>>>     try to understand how much can happen between the SIGBUS and the
>>>>     InternalError getting thrown.
>>>
>>>     No questions are stupid here. As you may have seen in the other
>>>     thread, I filed JDK-8154592[1] to cover making the handling of the
>>>     faults synchronous. Hope that helps.
>>>
>>>
>>> Thank you!
>>>
>>> Kind Regards, Thomas
>>>
>>>     Cheers,
>>>     Mikael
>>>
>>>     [1] https://bugs.openjdk.java.net/browse/JDK-8154592
>>>
>>>
>>>>
>>>>     Thanks, Thomas
>>>>
>>>>         Thanks,
>>>>         David
>>>>
>>>>
>>>>             Thanks,
>>>>             David
>>>>
>>>>             On 9/04/2016 8:33 AM, Mikael Vidstedt wrote:
>>>>
>>>>
>>>>                 Please review:
>>>>
>>>>                 Bug: https://bugs.openjdk.java.net/browse/JDK-8153892
>>>>                 Webrev:
>>>> http://cr.openjdk.java.net/~mikael/webrevs/8153892/webrev.01/hotspot/webrev/
>>>> <http://cr.openjdk.java.net/%7Emikael/webrevs/8153892/webrev.01/hotspot/webrev/>
>>>>
>>>>
>>>>
>>>>                 * Note: this is patch 2 in a set of 3 all aiming to
>>>>                 clean up and unify
>>>>                 the unsafe memory getters/setters, along with the
>>>>                 handling of unsafe
>>>>                 access errors. The other two issues are:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8153890 -
>>>>                 Handle unsafe access
>>>>                 error as an asynchronous exception
>>>> https://bugs.openjdk.java.net/browse/JDK-8150921 -
>>>>                 Update Unsafe
>>>>                 getters/setters to use double-register variants
>>>>
>>>>
>>>>                 * Summary (copied from the bug description)
>>>>
>>>>
>>>>                 In certain cases, such as accessing a region of a
>>>>                 memory mapped file
>>>>                 which has been truncated on unix-style operating
>>>>                 systems, a SIGBUS
>>>>                 signal will be raised and the VM will process it in
>>>>                 the signal handler.
>>>>
>>>>                 How the signal is processed differs depending on the
>>>>                 operating system
>>>>                 and/or CPU architecture, with two major alternatives:
>>>>
>>>>                 * "stubless"
>>>>
>>>>                 Do the necessary thread state updates directly in the
>>>>                 signal handler,
>>>>                 and modify the context so that the signal handler
>>>>                 returns to the place
>>>>                 where the execution should continue
>>>>
>>>>                 * Using a stub
>>>>
>>>>                 Update the context so that when the signal handler
>>>>                 returns the thread
>>>>                 will continue execution in a generated stub, which in
>>>>                 turn will call
>>>>                 some native code in the VM to update the thread state
>>>>                 and figure out
>>>>                 where execution should continue. The stub will then
>>>>                 jump to that new
>>>>                 place.
>>>>
>>>>
>>>>                 It should be noted that the work of updating the
>>>>                 thread state is very
>>>>                 small - it's setting a flag or two in the thread
>>>>                 structure, and figures
>>>>                 out where the next instruction starts. It should also
>>>>                 be noted that the
>>>>                 generated stubs today are broken, because they do not
>>>>                 preserve all the
>>>>                 live registers over the call into the VM. There are
>>>>                 two ways to address
>>>>                 this:
>>>>
>>>>                 * Preserve all the necessary registers
>>>>
>>>>                 This would mean implementing, in macro assembly, the
>>>>                 necessary logic for
>>>>                 preserving all the live registers, including, but not
>>>>                 limited to,
>>>>                 floating point registers, flag registers, etc. It
>>>>                 quickly becomes
>>>>                 obvious that this platform specific and error prone.
>>>>
>>>>                 * Leverage the fact that the operating system already
>>>>                 does this as part
>>>>                 of the signal handling
>>>>
>>>>                 Do the necessary work in the signal handler instead,
>>>>                 removing the need
>>>>                 for the stub alltogether
>>>>
>>>>                 As mentioned, on some platforms the latter model is
>>>>                 already in use. It
>>>>                 is dramatically easier and all platforms should be
>>>>                 updated to do it the
>>>>                 same way.
>>>>
>>>>
>>>>                 * Testing
>>>>
>>>>                 Just as mentioned in the RFR for JDK-8153890, a new
>>>>                 test was developed
>>>>                 to test this code path:
>>>>
>>>> http://cr.openjdk.java.net/~mikael/webrevs/8150921/MappedTruncated.java 
>>>>
>>>> <http://cr.openjdk.java.net/%7Emikael/webrevs/8150921/MappedTruncated.java>
>>>>
>>>>                 In fact, it was when running this test I found the
>>>>                 register preservation
>>>>                 issue. JPRT also passes. Much like JDK-8153890 I
>>>>                 wanted to get some
>>>>                 feedback on this before running additional tests.
>>>>
>>>>
>>>>                 Cheers,
>>>>                 Mikael
>>>>
>>>>
>>>
>>>
>>