RFR(s): 8065895: Synchronous signals during error reporting may terminate or hang VM process
David Holmes
david.holmes at oracle.com
Thu Nov 27 09:01:05 UTC 2014
On 27/11/2014 5:36 PM, Thomas Stüfe wrote:
> Unfortunately, I cannot test it, as I have no ARM environment. The best
> I can come up with without testing is this:
> http://stackoverflow.com/questions/16081618/programmatically-cause-undefined-instruction-exception
The issue is how to handle this? Put ifdefs for ARM in the open code?
Revert to your per-platform solution? Some other variation? Or do we
just not care if we can't trigger SIGILL on ARM? Though I'd like to hear
from the AARCH64 folk too.
David
> Kind regards, Thomas
>
> On Thu, Nov 27, 2014 at 6:18 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
> On 27/11/2014 10:49 AM, David Holmes wrote:
>
> On 26/11/2014 11:33 PM, Thomas Stüfe wrote:
>
> Hi David,
>
> here you go:
> http://cr.openjdk.java.net/~__stuefe/webrevs/8065895/webrev.__02/
> <http://cr.openjdk.java.net/~stuefe/webrevs/8065895/webrev.02/>
>
> Reverted SIGILL-generating function back to its original
> form, plus the
> folding of the 000 case.
>
>
> Thanks Thomas! While we are awaiting a second reviewer I will
> test this
> out internally. It may take a day or two sorry.
>
>
> Unfortunately, on ARM (emulator), the SIGILL test generated a SEGV
> instead:
>
> will jump to PC 0xb6fb1000, which should cause a SIGILL.
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0xb6fb2000, pid=13095, tid=3060999280
>
> If I read the ARM architecture manual correctly all zeroes will map
> to a conditional AND instruction (Ref A8.6.12 AND(register))
>
> David
>
>
> David
>
> I only can guess what your closed platforms are, but if it
> is ARM, I
> believe opcodes 0-31 are undefined. For ia64, 0 is undefined
> as well.
>
> Kind regards, Thomas
>
>
> On Wed, Nov 26, 2014 at 1:02 PM, David Holmes
> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> <mailto:david.holmes at oracle.__com
> <mailto:david.holmes at oracle.com>>> wrote:
>
> On 26/11/2014 9:37 PM, Thomas Stüfe wrote:
>
> Hi David,
> ...
>
> - In debug.cpp for the SIGILL can you
> define the
> all zero
> case as a
> default so we only need to add
> platform specific
> definitions when
> all zeroes doesn't work. I really
> hate seeing all
> that CPU
> selection
> in shared code. :(
>
>
> Agreed and fixed, moved the CPU-specific
> sections into
> CPU-specific files.
>
>
> I'd really like to see a way to share the
> all-zeroes case
> so that we
> don't need to add platform specific code
> unnecessarily.
>
>
> sooo.. back to the original code then, just with
> the #ifdef,
> just with
> the zero-cases all folded in into the #else path?
> Or do you
> prefer
> something else?
>
>
> Elsewhere there is a pattern of defining per-platform
> values that
> can override the shared definition. eg:
>
> #ifndef HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
> Foo XXX = ...; //shared/default initalization
> #endif
>
> but this assumes a platform specific header has already
> been
> included that can do:
>
> #define HAS_SPECIAL_PLATFORM_VALUE_____FOR_XXXX
> Foo XXX = ... ; // platform specific initialization
>
> But that is not the case for debug.hpp.
>
> So I guess folding the zero-case into the else path is
> the best we
> can do. However I'm assuming the zero case will work
> for our
> internal platforms ... if it doesn't then we'd have to
> pollute the
> shared code with info for the closed platforms. :(
>
> David
> -----
>
>
> - Style nit: please use i++ rather
> than i ++
>
>
> Fixed.
>
> Aside: we should eradicate the use of
> sigprocmask and
> replace with
> the thread specific version.
>
>
> Agree. Though I never saw any errors
> stemming from the
> use of
> sigprocmask(). According to POSIX,
> sigprocmask() is
> undefined in
> multithreaded environment, and I guess
> most OSes just
> default to
> pthread_sigmask.
>
>
> Yes "probably" works okay but I hate to see us
> using
> something with
> undefined semantics. That's future clean up
> though.
>
>
> We (SAP JVM) already use pthread_sigmask() /
> thr_sigsetmask()
> instead of
> sigprocmask. Works fine. We can port this to the
> OpenJDK.
>
> Getting back to the "thinking more
> about this" ...
> If a
> synchronous
> signal is blocked at the time it is
> generated
> then it
> should remain
> pending on the thread (POSIX spec)
> but that
> doesn't tell us
> what the
> thread will then do - retry the faulting
> instruction? Become
> unschedulable? So I can easily
> imagine that a hang
> or process
> termination may result.
>
>
> This is exactly what happens, but it is
> actually
> covered by
> POSIX, see
> doc on pthread_sigmask: "If any of the
> SIGFPE, SIGILL,
> SIGSEGV, or
> SIGBUS signals are generated while they
> are blocked,
> the result is
> undefined, unless the signal was generated
> by the
> /kill/()
>
>
> <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/kill.html
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/kill.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/kill.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/kill.html>>>>
> function, the /sigqueue/()
>
>
> <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/sigqueue.html
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>>
>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/sigqueue.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html>
>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/sigqueue.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/sigqueue.html>>>>
>
> function, or the /raise/()
>
>
> <http://pubs.opengroup.org/______onlinepubs/009695399/______functions/raise.html
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>>
>
>
>
> <http://pubs.opengroup.org/____onlinepubs/009695399/____functions/raise.html
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html>
>
> <http://pubs.opengroup.org/__onlinepubs/009695399/__functions/raise.html
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/raise.html>>>>
> function."
>
>
> Thanks - I managed to miss that part even
> though I found
> the other
> part about the signal handling function
> returning. :(
>
>
> It is well hidden, I found it by accident :) To me
> it looks like
> they
> kept it intentionally vague, to not block platforms
> where those
> signals
> could be somehow dealt with automatically? Hard to
> see though
> how this
> would work.
>
>
>
> In reality, process usually aborts
> abnormally with the
> default
> action
> for the signal, e.g. printing out "Illegal
> Instruction". On
> MacOS, we
> hang (until the Watcherthread finally
> kills the VM).
> On old
> AIXes, we
> die without a trace.
>
> This also can be easily tried out by
> removing SIGILL
> from the
> list of
> signals in vmError_<os>.cpp and executing:
>
> java -XX:ErrorHandlerTest=14
> -XX:TestCrashInErrorHandler=15
>
> which will crash first with a SIGSEGV,
> then in error
> handling with a
> secondary SIGILL. This will interrupt
> error reporting
> and kill
> or hang
> the process.
>
>
> In that sense unblocking those
> signals whilst
> handling the
> initial
> signal may well allow the error
> reporting process
> to continue
> further. But I'm unclear exactly how
> this plays
> out:
>
> - synchronous signal encountered
> - crash_handler invoked
>
> - VMError::report_and_die executes
> - secondary signal encountered
>
> - crash_handler invoked again
>
>
> almost: not again, different signal
> handler now. First
> signal was
> handled by "JVM_handle_<os>_signal()"
>
>
> Ah missed that - thanks - not that it makes much
> difference :)
>
>
> I just like nitpicking :)
>
> - VMError::report_and_die executes
> again and
> sees the
> recursion and
> returns (ignoring abort due to
> excessive recursive
> errors)
>
>
> No..
>
> Is that right? So we actually return
> from the
> crash_handler?
>
>
> Oh, but we dont return.
> VMError::report_and_die()
> will just
> create a new
> frame and re-execute VMError::report() of
> the first
> VMError object.
> Which then will continue with the next
> STEP. We never
> return,
> for each
> secondary error signal a new frame is created.
>
>
> I had trouble tracing through exactly what
> might happen
> on the
> recursive call to report_and_die. I see know
> that report
> comes from:
>
> staticBufferStream sbs(buffer, O_BUFLEN,
> &log);
> first_error->report(&sbs);
> first_error->_current_step = 0;
> // reset
> current_step
> first_error->_current_step_______info =
> ""; // reset
> current_step
>
> string
>
> so the second time through we will call report and
> _current_step
> should indicate where to start executing from.
>
>
> Exactly. There is also a catch, in that the stack
> usage goes
> up. Not
> endlessly, it is limited by the number of error
> reporting steps.
> The more stack VmError::report() does cost, the
> less well this
> works,
> especially in stack overflow scenarios.
>
> Which is why we extended SafeFetch and enabled it
> for the use
> in the
> error handler, which will be one of the the next
> patches I'd
> like to
> port to the OpenJDK, once this one is thru.
>
>
> This all happens in VMError::report_and_die:
> -> first error ? anchor VMError object in
> a static
> variable and
> execute
> VMError::report()
> -> secondary error?
> -> different thread? just sleep forever
> -> same thread? new frame, re-enter
> VMError::report(). Once
> done, abort.
>
> I always found that rather neat, but in
> fact that is
> not our
> invention
> but Sun's :) Anyway, my fix does not
> change this
> behaviour for
> better or
> worse, it only makes it usable for more cases.
>
> Because this puts us in undefined
> territory
> according to POSIX:
>
> "The behavior of a process is
> undefined after it
> returns
> normally
> from a signal-catching function for a
> SIGBUS,
> SIGFPE,
> SIGILL, or
> SIGSEGV signal that was not generated
> by kill(),
> sigqueue(), or
> raise()."
>
> true, but we dont return...
>
> On top of that you also have the
> issue that error
> reporting
> does a
> whole bunch of things that are not
> async-signal-safe so we can
> easily encounter hangs or aborts.
>
> But we're dying anyway so I guess
> none of this
> really
> matters. If
> re-enabling these signals allows
> error reporting to
> progress further
> in some cases then that is a win.
>
>
> Actually, this covers a lot of cases,
> mostly because
> SIGSEGV during
> error reporting is common, so if the
> original error
> was not
> SIGSEGV, but
> e.g. SIGILL, this would always result in
> broken hs-err
> files.
>
> The back story is that at SAP, we rely
> heavily on the
> hs-err
> files. They
> are our main tool for support, because
> working with
> cores is
> often not
> feasible. So, we put a lot of work in
> making error
> reporting
> reliable
> across all platforms. This is also covered
> by many
> tests which
> crash the
> VM in exciting ways and check the hs-err
> files for
> completeness.
>
>
> OK. Modulo the cpu specific SIGILL part
> everything else
> seems fine.
>
> Great. just tell me how you want that part.
>
> Kind regards, Thomas
>
> Thanks,
> David
>
>
>
More information about the hotspot-runtime-dev
mailing list