RFR(m): 8185712: [windows] Multiple issues with the native symbol decoder

Wed Aug 16 15:52:24 UTC 2017

On Wed, Aug 16, 2017 at 5:45 PM, Ioi Lam <ioi.lam at oracle.com> wrote:

>
>
> On 8/16/17 8:12 AM, Thomas Stüfe wrote:
>
> Hi Ioi,
>
> On Wed, Aug 16, 2017 at 4:43 PM, Ioi Lam <ioi.lam at oracle.com> wrote:
>
>> Hi Thomas,
>>
>> I am planning to review this but there's a lot of changes so it would
>> take me a little while :-(
>>
>>
> thank you!
>
> I currently wonder whether this is too massive and if I should split this
> up. At least the centralized dbghelp.dll handling could easily moved into
> an own patch, that would make this patch easier to digest. What do you
> think?
>
> I think having multiple smaller patches would be better. I would do a
> subset of our nightly tests before pushing, and then we can let our nightly
> test run for a couple of weeks to make sure no issues come up. Then we can
> try the next patch, and so forth.
>
>
Okay, I will do this then. I withdraw this patch and will chop it up in
digestible chunks.

> Given this is an area that's not used often, and must perform correctly
> when "used" (e.g., where a VM crash state needs to be correctly logged), I
> think we need to take extra precautions.
>

Sure, I agree.

Note that this code is not completely new. It is based on code we use in
our VM since many years. Our error reporting is very robust and quite a bit
enhanced (as I wrote in another mail thread, for our support the hs-err
file is a very important tool because of the difficulty of getting core
dumps from customer sites). Our VMs get used in the field, so at least the
underlying techniques are well tested, if not in this particular form.

> I hope this wouldn't mean too much extra work for you.
>
>
>
>> One quick question - what kind of testing has been done on these changes?
>> My main worry is (1) stability during normal execution, (2) impacts to
>> hs_err file generation.
>>
>>
> I built windows x86 and x64, slodebug and release. Ran gtests manually.
> Our nightly builds ran (TCK, hotspot jtreg) with no problems I could
> attribute to my patch.
>
> I also wrote a small selftest (SymbolEngine::selftest()) atop of the ones
> I added to test_os.cpp which lives in symbol_api.cpp and stresses the
> SymbolEngine class from the inside. I did not add this to the webrev. For
> one because the webrev was already large and also because this test was not
> a gtest but a selftest inside the VM and I did not think you guys would
> accept this.
>
> If the code size is not too big, maybe we can add it to whitebox? That way
> we can start it using jtreg. It would be kind of like the various
> <Class>::verify() routines in the JVM, and can be restricted in non-product
> builds if necessary.
>

That would be a possibility. I have to look at how this is done.

Thanks, Thomas

> Thanks
> - Ioi
>
>
>
>> I can see the improvements in the hs_err call stack, but would there be
>> any regression in corner cases? How can we test for that?
>
>
> I think the code with my patch is way more stable than before, that was
> the aim of this patch. One could write loads of regression tests for corner
> cases.
>
> We actually do this at SAP with our own fork - we have extensive tests for
> the hs-err file generation. But from experience I know that these kind of
> tests are difficult to get 100% solid. For instance, one thing we do at SAP
> is to the VM crash in many colorful ways and check the callstacks for
> validity. However, this is fragile and depends on many factors - how the
> compiler optimizes, whether we build with debug infos etc. I hesitate to
> add tests like this to the OpenJDK which causes sporadic work for
> maintainers, everyone would hate me :)
>
> Thanks, Thomas
>
>
>> Thanks
>>
>> - Ioi
>>
>>
>>
>> On 8/13/17 8:14 AM, Thomas Stüfe wrote:
>>
>>> Dear all,
>>>
>>> May I please have reviews for the following patch.
>>>
>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8185712
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8185712-windows-i
>>> mprove-native-symbol-resolver/webrev.00/webrev/
>>>
>>> This fix makes the native symbol decoder on Windows more robust. This
>>> increases the chance of getting useful error files. It also refactors the
>>> code and contains some useful new features.
>>>
>>> ---
>>>
>>> Native symbol resolution on windows is done via the group of "SymXX" APIs
>>> exported from dbghelp.dll. Currently this is done via a layer of
>>> abstraction in decoder.cpp, and a "WindowsDecoder" class in
>>> decoder_windows.cpp. This class can be instantianted twice, so there can
>>> exist two objects of this class.
>>>
>>> The bugs:
>>>
>>> 1) Functions from dbghelp.dll are not threadsafe; calls to them must be
>>> synchronized. But dbghelp.dll could be used by the two live
>>> WindowsDecoder
>>> objects in parallel from different threads. In addition to that,
>>> dbghelp.dll functions are used from within os_windows.cpp, when writing
>>> minidumps.
>>>
>>> 2) the SymXX APIs need to be initialized by a call to SymInitialize().
>>> This
>>> can only happen once. The way it is now, each of the two WindowsDecoder
>>> objects will call SymInitialize. The second object to do this will fail
>>> and
>>> hence not be usable. In practice this means if in the VM someone called
>>> e.g. dll_from_address_name() - which will initialize and use the shared
>>> WindowsDecoder object - and then the VM crashes, the hs-err file will
>>> contain no useful stack because that would require the second
>>> WindowsDecoder object, which would get initialized after the shared one,
>>> and its initialization would fail.
>>>
>>> 3) Initialization dependencies:
>>>    - when building the pdb search path (WindowsDecoder::initialize),
>>> Arguments::get_java_home() is used to deduce the jdk bin directory
>>> containing the jdk shared objects. This will crash if invoked before the
>>> system properties are set.
>>>    - Decoder::decode() calls (via the Monitor lock) Thread::current().
>>> This
>>> means the code will not work during initialization and where
>>> Thread::current is not set (e.g. in an unattached thread). Admittedly a
>>> bit
>>> theoretical, as this only affects the shared WindowsDecoder object, which
>>> is not used during error reporting.
>>>
>>> 4) WindowsDecoder::initialize(), pdb search path: Code uses MAX_PATH in
>>> various places, which is wrong. NTFS paths can be longer than that.
>>>    - when calling SymGetSearchPath, code assumes a maximum path size of
>>> MAX_PATH for the *combined* size of all directory names, which may be way
>>> too small. Truncation is not handled, code will silently fail if output
>>> buffer is too small, resulting in the existing pdb path to be overwritten
>>> instead of being preserved.
>>>    - GetModuleFileName(): similar problem, output buffer is MAX_PATH len,
>>> which may be too small. Truncation will lead to wrong results.
>>>    - Throughout the function "strncat" is used to assemble the search
>>> path,
>>> but is used wrong and does not guard against buffer overflows (if that
>>> was
>>> the intent - otherwise, why not just use strcat?). The last argument to
>>> strncat should be the remaining space in the destination buffer, not the
>>> length of the source string.
>>>
>>> 5) WindowsDecoder::decode(): We call SymGetSymFromAddr64() using an
>>> output
>>> buffer of size MAX_PATH - which makes no sense at all, as we are
>>> retrieving
>>> a symbol name, not a file name. Truncation is not handled. This is
>>> actually
>>> dangerous, because SymGetSymFromAddr64() handles truncations sloppily, it
>>> will return success and fill the output buffer completely, so on
>>> truncation
>>> the symbol name will not be zero terminated.
>>>
>>> In addition to the bugs, there are a number of things which could be done
>>> better or simpler:
>>>
>>> a) Setting the search path could be simplified and made more useful if we
>>> would just add the directory of every loaded module to the search path
>>> (currently we just add the two jdk bin directories, which also exposes us
>>> to initialization dependency, see (3)). This would be more useful as a
>>> common convention is to put pdb files beside binaries, and that way we
>>> would catch all those. Including, but not limited to, our own jdk pdb
>>> files.
>>>
>>> b) Dlls can be loaded and unloaded, and it would be nice to have an
>>> updated
>>> pdb search path - e.g. in case a late-loaded third party JNI library
>>> crashes. So, it would be nice to run (a) whenever a library is loaded or
>>> unloaded, and for the process to be fast if nothing changed (e.g. if the
>>> new DLL was loaded from a directory which is already part of the search
>>> path).
>>>
>>> c) It would be nice to have file name and line number in the callstack,
>>> too.
>>>
>>> d) As pointed out in JDK-8144855, the function
>>> Decoder::can_decode_C_frame_in_vm() is not necessary. This should be
>>> handled in WindowsDecoder instead.
>>>
>>> ---
>>>
>>> What I did in this fix:
>>>
>>> - I pulled out dbghelp.dll handling into an own centralized singleton
>>> (DbgHelpLoader). This class takes care of loading the Dll and synchonizes
>>> access to all its functions, thus solving (1). DbgHelpLoader now replaces
>>> all places where before the DLL was loaded manually.
>>>
>>> - Atop of DbgHelpLoader there is another new singleton class,
>>> "SymbolEngine", which wraps the life cycle of the symbol APIs. This
>>> solves
>>> problem (2). In addition to that, this class is the new interface to the
>>> SymXX APIs.
>>>
>>> - this means that the existing code in decoder.cpp, which instantiates
>>> two
>>> Decode objects and synchronizes access to one of them, makes no sense for
>>> Windows. That whole layer is now bypassed in Windows - the static
>>> Decoder::xxx() functions are now directly implemented in
>>> decoder_windows.cpp and access the SymbolEngine singleton without any
>>> added
>>> layers inbetween.
>>>
>>> - SymbolEngine contains a new improved way to assemble the pdb search
>>> path
>>> as described in (a) and (b). We iterate loaded modules, extract
>>> directories
>>> and add them to the search path. Implementation takes care to not do
>>> unnecessary work (avoiding changing the path if list of loaded modules is
>>> unchanged, or if only Dlls from known directories were loaded).
>>>
>>> - We now recalculate the pdb seach path if a DLL was loaded (via
>>> os::dll_load). That makes it possible to debug third party dlls which do
>>> not reside in jdk directories and were loaded after the SymbolEngine was
>>> initialized.
>>>
>>> - When decoding a symbol from an unknown address and the decode fails, we
>>> attempt to rebuild the pdb search path and attempt again.
>>>
>>> - Care was taken for this code to be robust. It may be called
>>> pre-initialization, post-shutdown or under error conditions (low memory,
>>> out of stack etc), so no VM infrastructure was used and memory use is
>>> frugal (dynamic buffers allocated off the stack and reused were
>>> possible).
>>>
>>> - I did away with the "Decoder::can_decode_C_frame_in_vm()" code. See
>>> argumentation in JDK-8144855. To me, it makes no sense. The function is a
>>> workaround the problem that, when faced with stripped binaries which only
>>> have debug symbols for public functions, symbol info may be confusing
>>> (public symbols + large offsets). But this is not done consistently, and
>>> most programmers know not to trust symbols with large offsets.
>>>
>>> - Finally, we now have source info in the callstack :)
>>>
>>> Stack: [0x00840000,0x00890000],  sp=0x0088f884,  free space=318k
>>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>>> j=interpreted, Vv=VM code, C=native code)
>>> V  [jvm.dll+0xa26903]  VMError::controlled_crash+0x2a3
>>> (vmerror.cpp:1683)
>>> V  [jvm.dll+0xa2664e]  VMError::test_error_handler+0xe
>>> (vmerror.cpp:1632)
>>> V  [jvm.dll+0x6af796]  JNI_CreateJavaVM_inner+0x196  (jni.cpp:3988)
>>> V  [jvm.dll+0x6af5c2]  JNI_CreateJavaVM+0x52  (jni.cpp:4036)
>>> V  [jvm.dll+0x2cdfc]  init_jvm+0xdc  (gtestmain.cpp:94)
>>> V  [jvm.dll+0x2d18a]  JVMInitializerListener::OnTestStart+0x7a
>>>   (gtestmain.cpp:114)
>>> V  [jvm.dll+0x829b]  testing::internal::TestEventRe
>>> peater::OnTestStart+0x5b
>>>   (gtest.cc:2979)
>>> V  [jvm.dll+0x6874]  testing::TestInfo::Run+0x54  (gtest.cc:2312)
>>> V  [jvm.dll+0x6d8f]  testing::TestCase::Run+0xbf  (gtest.cc:2445)
>>> V  [jvm.dll+0xbe9e]  testing::internal::UnitTestImpl::RunAllTests+0x25e
>>>   (gtest.cc:4316)
>>> V  [jvm.dll+0x2a410]
>>>   testing::internal::HandleSehExceptionsInMethodIfSupported<te
>>> sting::internal::UnitTestImpl,bool>+0x40
>>>   (gtest.cc:2063)
>>> V  [jvm.dll+0x2493e]
>>>   testing::internal::HandleExceptionsInMethodIfSupported<testi
>>> ng::internal::UnitTestImpl,bool>+0x5e
>>>   (gtest.cc:2114)
>>> V  [jvm.dll+0xb0e9]  testing::UnitTest::Run+0xe9  (gtest.cc:3929)
>>> V  [jvm.dll+0x2d0bf]  RUN_ALL_TESTS+0xf  (gtest.h:2289)
>>> V  [jvm.dll+0x2cc76]  runUnitTestsInner+0x1f6  (gtestmain.cpp:249)
>>> V  [jvm.dll+0x2c958]  runUnitTests+0x48  (gtestmain.cpp:319)
>>> C  [gtestLauncher.exe+0x1011]  main+0x11  (gtestlauncher.cpp:32)
>>> C  [gtestLauncher.exe+0x1182]  __tmainCRTStartup+0x122  (crtexe.c:555)
>>> C  [KERNEL32.DLL+0x138f4]
>>> C  [ntdll.dll+0x65de3]
>>> C  [ntdll.dll+0x65dae]
>>>
>>> .. which is implemented for Windows, but pipes are laid to plug in other
>>> platforms as well (Decoder::get_source_info())
>>>
>>> ----
>>>
>>> Tell me what you think.
>>>
>>> Thanks, and Kind Regards, Thomas
>>>
>>
>>
>
>