RFR(s-ish): 8191101: Show register content in hs-err file on assert

Tue Feb 13 10:55:21 UTC 2018

Hi David,

I would like to take up work on this item again. I am sorry for the long
delay, I had a number of things to do which were more pressing. So, please
let me borrow your brain again :)

In your last mail you voiced concerns, I understand them. This is one of
those patches which are technically not that difficult but there are
aesthetic choices to be made.

Short summary: this patch causes a segfault when an assert is happening
just for the sake of grabbing the ucontext and storing it away somewhere.
For that to work, we need to physically copy the ucontext_t structure which
is handed down to the signal handler - because that structure lives on the
signal handling thread stack, which is about to be unrolled again when we
return from the signal handler to continue normal assert processing.

According to Posix, the form of ucontext_t is undefined - it looks like a
simple structure and often is, but sometimes it is not self-contained. On
almost all OSes one can just copy this structure (memcpy(...,
sizeof(ucontext_t)) and that will work. But on some few platforms, extra
steps are needed: On Linux ppc64, one needs fix up some internal pointer
members after the structure is copied for the usual "ucontext_get_pc()" etc
to work. And on MacOS, this structure is not self contained but has
pointers to sub structures containing the machine register contents, and
these sub structures need to be copied too.

The question is where to do these platform specifics:

a) one could write one global function, "copy_ucontext", and in that
function handle Linux PPC64 and MacOS with #ifdefs. That one greatly
reduces the number of changed lines at the expense of two #ifdefs in shared
code
b) one could separate  "copy_ucontext" one a <os-cpu> level. Similar to the
way we handle "ucontext_get_xx" functions today, we would have one
"copy_ucontext()"
function per os-cpu combination. Most of them would just do a
memcpy(), and Linux
PPC64 and MacOS would do more. So, no #ifdefs in shared coding but a much
larger change.
c) Like (b) but smaller: factor out common coding from all these
similar "copy_ucontext()"
functions in one function in a shared file. This is what I did in my
original patch proposal.

I originally choose (c) but am not very happy with it. I can live with any
of the three variants. Slight preference for (a) actually, since I think
two #ifdefs are still the lesser of evils and the patch would get much
smaller.

Then the question is where to place the shared coding. You objected against
my choice of os_posix.cpp/hpp. I did this because ucontext_t is Posix, even
if the concrete form of a context structure is platform dependent. But I am
open to other suggestions - I honestly do not care. Should I put that
coding in os.hpp/cpp?

----

About the safe_memcpy() stuff: that was another choice I did originally, to
harden the coding which copies the context structure. The problem is that
physically copying the context structure is undefined by Posix. So the size
of a context structure could in theory change between OS runtime versions
and differ from what we had as sizeof(ucontext_t) on our build machine. If
the memcpy fails, we would get another sigsegv, which would replace the
assert() we are trying to display. So, the user still would get an hs-err
file, but instead of the assert he would see a sigsegv, which would be
confusing.

To counter that, I wrote a function called "safe_memcpy()" which copies a
chunk of memory safely using SafeFetch. If that fails, we do not get a
segfault but may continue with the original assert, albeit without context
structure. I admit I may be overthinking this. The only other argument for
such a "safe_memcpy()" function I could give is that it is nice to have: I
use such a function in our port in a number of places in error handling,
where we want to print out information from potentially unsafe territory
but behave gracefully if that did not work.

So if you think the introduction of safe_memcpy() is over engineered I'd
relent and remove it.

----

I'll address your remaining points inline:

On Wed, Nov 29, 2017 at 5:46 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Hi Thomas,
>

<snip>

> Your bug synopsis and preamble only mentions assert so I was going to
> comment that:
>
> !   diagnostic_pd(bool, ShowRegistersOnAssert,
>
> should be a develop flag not diagnostic but then I saw you do this for
> guarantees and all the other abort points too! In which case use of
> "assert" in the flag name and CAN_SHOW_REGISTERS_ON_ASSERT, is misleading -
> how about s/assert/abort/ ?
>
>
I can do this. Do you even think we should this for guarantee do? I usually
do not see many failing guarantees, so I do not have many emotions here.
However, you are right: if we do it for guarantee, we would have to rename
the flag and also add a test for the guarantee() case as mentioned below.

> ---
>
> I can't really comment on the safe* functionality - I'm not sure what that
> is all about. ??
>
> ---
>
> src/hotspot/os/posix/os_posix.cpp
>
> Style nit:  if (copy) -> if (copy != NULL)
>
> +   ::free(uc); // [sic] see os::Posix::copy_context(): the ucontext
> should have been copied with raw ::malloc,
>
> I don't understand the comment - should it be referring to
> os::Posix::default_copy_ucontext? Shouldn't the issue with os::malloc use
> be documented in os::Posix::default_copy_ucontext? And the comment here
> just needs to say:
>
> ::free(uc);  // Use raw free to match the raw malloc
>

Your suggestion is better, I'll update the comment.

>
> ---
>
> src/hotspot/os/posix/os_posix.hpp
>
> +   // Copy an ucontext.
>
> an -> a
>
> +   // (which is most of them).Do
>
> Space before Do
>
>
Will fix.

---
>
> src/hotspot/os_cpu/linux_ppc/os_linux_ppc.cpp
>
> Style nit: if (uc) -> if (uc != NULL)
>
>
Will fix.

---
> src/hotspot/share/utilities/debug.hpp
>
>   38 #endif
>   39
>   40 #ifdef CAN_SHOW_REGISTERS_ON_ASSERT
>
> You can delete these three lines as you know you can show registers on
> assert.
>
>
Will fix.

 42   #define TOUCH_ASSERT_POISON (*(char*)g_assert_poison = 'X');
>
> Why the casting? Why not declare g_assert_poison as intptr_t* and write an
> int into it?
>
>
No reason really, I'll do as you suggest.

> Nit: usually macros don't include the trailing ; so that in the code they
> look like statements. (Though I think the use of macro and conditional
> compilation is overkill here.)
>
> ---
>
>  test/hotspot/jtreg/runtime/ErrorHandling/ShowRegistersOnAssertTest.java
>
> I think we need to be able to test a guarantee failure in a product build,
> else how do we know this works okay in that scenario? Do we have something
> in place to allow that?
>
>
Okay. If we do this for guarantee cases too, I will testt guarantees.

---
>
> Thanks,
> David
> -----
>
>
Once we agreed on the points above, I'll post a new patch. I will also add
support for Windows, which should not be that difficult.

Thanks and Kind Regards, Thomas

>
>
> The only difference to before is that I restored the ShouldNotReachHere()
>> at the end of VMError::controlled_crash(). Before that assert, I added a
>> hopefully clearer message than before to enable my test to scan for this
>> message. I also changed the jtreg test to scan for this new message.
>>
>> Note: in the jtreg test, I test that assert supression works with
>> -XX:SupressErrorAt. But I just give it the file
>> (-XX:SuppressErrorAt=/vmError.cpp)
>> and omit the line number, because I do not want the test to stop working
>> when someone adds lines in front of VMError::controlled_crash(). But that
>> means that the ShouldNotReachHere() at the end of controlled_crash() does
>> not fire either, because now all asserts in vmError.cpp are disabled. So,
>> I
>> need this new message.
>>
>> Thanks, Thomas
>>
>> On Mon, Nov 27, 2017 at 11:33 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
>> wrote:
>>
>> Ping... Could I please have reviews? Thank you!
>>> ..Thomas
>>>
>>> On Wed, Nov 22, 2017 at 7:01 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>>
>>>> may I please have reviews for the following enhancement:
>>>>
>>>> issue: https://bugs.openjdk.java.net/browse/JDK-8191101
>>>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8191101-
>>>> show-registers-on-assert/webrev.00/webrev/
>>>>
>>>> (Patch looks big but lot of it is os_cpu fluff.)
>>>>
>>>> Prior email discussion at: http://mail.openjdk.java.net/p
>>>> ipermail/hotspot-runtime-dev/2017-October/025018.html
>>>>
>>>> ----
>>>>
>>>> Basically, this adds the ability to show register values in assert
>>>> situations into the error report. This can be useful in certain corner
>>>> cases, e.g. if you want to know the value of some local variables or how
>>>> deep your stack currently runs.
>>>>
>>>> This works by triggering a fault right when an assert happens and
>>>> squirreling the context away for later error reporting.
>>>>
>>>> When an assert happens, we touch a poison page. To preserve the context
>>>> as best as possible, we want to avoid running too much code after the
>>>> assert condition has been evaluated, so this is done in a very simple
>>>> way:
>>>> directly in the assert macro, right after the condition is evaluated, we
>>>> dereference the content of a global poison page pointer. In my tests,
>>>> even
>>>> with slowdebug, this only spoils one register, rax on x64.
>>>>
>>>> In the signal handler, we recognize the assertion poison fault by the
>>>> faulting address. We disable the poison page and store the ucontext
>>>> away.
>>>> Then we just return from signal handling. Poison page is now disarmed,
>>>> the
>>>> load from it is retried and now goes through. Normal assertion handling
>>>> is
>>>> then resumed - so, things like -XX:SuppressAt are unaffected and work
>>>> fine.
>>>>
>>>> Then, when an error report is generated due to this assert, we now also
>>>> use the stored context. So now we get registers and instructions at the
>>>> assert point.
>>>>
>>>> Right now, this is implemented on (non-zero) Linux, though other posix
>>>> platforms should be no problem. Have not yet thought deeply about
>>>> windows.
>>>> It is tested and - in debug - switched on by default on linux x86, ppc,
>>>> s390.
>>>>
>>>> If implemented, it can be switched on and off with
>>>> -XX:+ShowRegistersOnAssert. This is a failsafe, in case the mechanism
>>>> does
>>>> not work and we want to have clean asserts.
>>>>
>>>> To test this, do a java -XX:ErrorHandlerTest=1
>>>> -XX:+ShowRegistersOnAssert
>>>> with a not-product VM. On Linux x64, ppc and s390 we should now see the
>>>> register output in the hs-err file:
>>>>
>>>>    4 #  Internal Error (/shared/projects/openjdk/jdk-
>>>> hs/source/src/hotspot/share/utilities/vmError.cpp:1660), pid=4022,
>>>> tid=4023
>>>>    5 #  assert(str == NULL) failed: expected null
>>>>    6 #
>>>>    7 #
>>>> ...
>>>>   59 Registers:
>>>>   60 RAX=0x00007f736c8f7000, RBX=0x0000000000000000,
>>>> RCX=0x0000000000000000, RDX=0xffffffffffb5ded4
>>>>   61 RSP=0x00007f736c8d8ce0, RBP=0x00007f736c8d8d30,
>>>> RSI=0x0000000000000001, RDI=0x0000000000000001
>>>>   62 R8 =0x0000000000000040, R9 =0x0000000000000001,
>>>> R10=0x0000000000efb028, R11=0x00040788a244f42a
>>>>   63 R12=0x0000000000000000, R13=0x00007fff3de46dbf,
>>>> R14=0x00007f736c8d99c0, R15=0x0000000000000000
>>>>   64 RIP=0x00007f736aec5529, EFLAGS=0x0000000000010202,
>>>> CSGSFS=0x002b000000000033, ERR=0x0000000000000006
>>>>   65   TRAPNO=0x000000000000000e
>>>> ...
>>>>   72
>>>>   73 Instructions: (pc=0x00007f736aec5529)
>>>>   74 0x00007f736aec5509:   8d 05 31 21 4a 00 48 01 d0 ff e0 48 83 7d c8
>>>> 00
>>>>   75 0x00007f736aec5519:   0f 84 7f 03 00 00 48 8d 05 82 fc b1 00 48 8b
>>>> 00
>>>>   76 0x00007f736aec5529:   c6 00 58 e8 cb 17 62 ff 84 c0 74 11 48 8d 3d
>>>> 2f
>>>>   77 0x00007f736aec5539:   1f 4a 00 b8 00 00 00 00 e8 ed 17 62 ff 48 8d
>>>> 0d
>>>>   78
>>>>
>>>> ---
>>>>
>>>> Implementation notes:
>>>>
>>>> - when handling the poison fault, we need to copy the context away from
>>>> the signal handler stack. For posix, this means copying the ucontext_t.
>>>> This is undefined territory. On most platforms, this simply means
>>>> copying
>>>> the ucontex_t as a flat structure. On some platforms more is needed,
>>>> e.g.
>>>> on linux ppc, we need to patch up the context after copying (the
>>>> context is
>>>> not position independent), and on MacOS, the context is not
>>>> self-contained
>>>> but contains pointers to sub structures which need to be copied too and
>>>> whose size is unknown at compile time. Because of these platform
>>>> dependencies, I factored out the copying of ucontext_t to
>>>> os::Posix::copy_ucontext and its implementations are os_cpu specific.
>>>>
>>>> - As an added precaution, when copying the context, we use a safe
>>>> version
>>>> of memcpy (os::safe_memcpy) which I added to copy from potentially
>>>> invalid
>>>> memory regions. The reason is that we have seen on some Unices - e.g.
>>>> hpux
>>>> - that the size of the ucontext_t structure at runtime may be different
>>>> from the build machine, so we tread carefully. os::safe_memcpy() uses
>>>> SafeFetch to copy a range of memory.
>>>>
>>>> Risk:
>>>>
>>>> If this does not work, asserts will become segfaults, which can be
>>>> confusing. But the feature can be disabled with
>>>> -XX:+ShowRegistersOnAssert
>>>> and for now on most platforms is disabled by default.
>>>>
>>>> ---
>>>>
>>>> Limitations:
>>>> - as it is now implemented, this is a one-shot mechanism and only works
>>>> for the first assert.
>>>> - -XX:SuppressAt=... is not affected and works fine. However, if the
>>>> first assert is suppressed, follow-up asserts will not show register
>>>> values.
>>>> - When multiple threads run into an assert, we may or may not see
>>>> register values depending on which thread is the first of finishing the
>>>> poison page handler.
>>>>
>>>> I do not think these limitations are severe. They can be solved, but at
>>>> the cost of added complexity, which I preferred not to add.
>>>>
>>>> Thank you,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>