RFR(M): 8247515: OSX pc_to_symbol() lookup does not work with core files
Kevin Walls
kevin.walls at oracle.com
Fri Jul 24 09:17:19 UTC 2020
Thanks Chris - all sounds good to me. Thanks for all the MachO insights...
On 23/07/2020 22:27, Chris Plummer wrote:
> Just a minor update of some new findings (no new code change). The DB
> hash table being used by default will overwrite an existing entry, not
> duplicate it, and this is indeed what was happening. The second entry
> added was the one with a 0 offset. When I enable the R_NOOVERWRITE
> flag, it stops the overwrite and that also fixes the problem, but
> that's only because the entry with offset 0 comes last. The fix I've
> done is better since it avoids the offet 0 entry altogether, so even
> if it came first it would not get used.
>
> thanks,
>
> Chris
>
> On 7/22/20 10:25 PM, Chris Plummer wrote:
>> Hi Kevin,
>>
>> Thanks for the review. Unfortunately there was yet another bug which
>> I have now addressed. Although testing with mach5 went fine, when I
>> tried with a local build I had issues. SA couldn't even get past an
>> early part of the core file handling which involves doing some
>> adjustments related to CDS. It needs to look up a couple of hotspot
>> symbols by name to do this, and get their values (such as
>> _SharedBaseAddress). Although the symbol -> address lookup seemed to
>> work, the values retrieved from the address were garbage. After some
>> debugging I noticed the 3 symbols being looked up all had the same
>> address. Then I noticed this address was at offset 0 of the libjvm
>> segment. After a lot more debugging I found the problem. These
>> symbols were actually in the symbol table twice, once with a proper
>> offset and once with an offset of 0. I'm not sure why the ones with
>> an offset of 0 were there (other than they originated in the mach-o
>> symbol table).
>>
>> The reason this didn't always happen is because SA takes all the
>> symbols it finds and adds them to a hash table for fast symbol ->
>> address lookup. If a symbol is in there twice, it's a tossup which
>> you'll get. It could change from build to build in fact. The trigger
>> for my local build was probably how I ran configure, which likely is
>> not the same as mach5, although I'm unsure if this just gave me the
>> unlucky hashing, or if in fact it resulted in the entries with offset
>> 0. In any case, the fix is to ignore entries with offset 0. Here's
>> the updated webrev:
>>
>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.03/index.html
>>
>> All the changes since webrev.02 are in build_symtab() in symtab.c.
>> Besides ignoring entries with offset 0 to fix the bug, I also did
>> some cleanup. There used to be two loops to iterate over the symbols.
>> There wasn't really a good reason for this, so now there is just one.
>> Also, it no longer adds entries with a file offset 0, an offset into
>> the string section of 0, or an empty string. By doing this the size
>> of the libjvm symbol table went from about 240k entries to 90k. Since
>> it was originally allocated at it's full potential size, it's now
>> reallocate to the smaller size after symbol table processing is over.
>>
>> thanks,
>>
>> Chris
>>
>> On 7/22/20 2:45 AM, Kevin Walls wrote:
>>>
>>> Thanks Chris, yes looks good, I like that we check the library
>>> bounds before calling nearest_symbol.
>>>
>>> --
>>> Kevin
>>>
>>>
>>> On 21/07/2020 21:05, Chris Plummer wrote:
>>>> Hi Serguei and Kevin,
>>>>
>>>> The webrev has been updated:
>>>>
>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.02/index.html
>>>> https://bugs.openjdk.java.net/browse/JDK-8247515
>>>>
>>>> Two issues were addressed:
>>>>
>>>> (1) Code in symbol_for_pc() assumed the caller had first checked to
>>>> make sure that the symbol is in a library, where-as some callers
>>>> assume NULL will be returned if the symbol is not in a library.
>>>> This is the case for pstack for example, which first blindly does a
>>>> pc to symbol lookup, and only if that returns null does it then
>>>> check if the pc is in the codecache or interpreter. The logic in
>>>> symbol_for_pc() assumed that if the pc was greater than the start
>>>> address of the last library in the list, then it must be in that
>>>> library. So in stack traces the frames for compiled or interpreted
>>>> pc's showed up as the last symbol in the last library, plus some
>>>> very large offset. The fix is to now track the size of libraries so
>>>> we can do a proper range check.
>>>>
>>>> (2) There are issues with finding system libraries. See [1]
>>>> JDK-8249779. Because of this I disabled support for trying to
>>>> locate them. This was done in ps_core.c, and there are
>>>> "JDK-8249779" comment references in the code where I did this. The
>>>> end result of this is that PMap for core files won't show system
>>>> libraries, and PStack for core files won't show symbols for
>>>> addresses in system libraries. Note that currently support for PMap
>>>> and PStack is disabled for OSX, but I will shortly send out a
>>>> review to enable them for OSX core files as part of the fix for [2]
>>>> JDK-8248882.
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8249779
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8248882
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>> On 7/14/20 5:46 PM, serguei.spitsyn at oracle.com wrote:
>>>>> Okay, I'll wait for new webrev version to review.
>>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>
>>>>> On 7/14/20 17:40, Chris Plummer wrote:
>>>>>> Hi Serguie,
>>>>>>
>>>>>> Thanks for reviewing. This webrev is in limbo right now as I
>>>>>> discovered some issues that Kevin and I have been discussing off
>>>>>> line. One is that the code assumes the caller has first checked
>>>>>> to make sure that the symbol is in a library, where-as the actual
>>>>>> callers assume NULL will be returned if the symbol is not in a
>>>>>> library. The end result is that we end up returning a symbol,
>>>>>> even for address in the code cache or interpreter. So in stack
>>>>>> traces these frame show up as the last symbol in the last
>>>>>> library, plus some very large offset. I have a fix for that were
>>>>>> now I track the size of each library. But there is another issue
>>>>>> with the code that tries to discover all libraries in the core
>>>>>> file. It misses a very large number of system libraries. We
>>>>>> understand why, but are not sure of the solution. I might just
>>>>>> change to code to only worry about user libraries (JDK libs and
>>>>>> other JNI libs).
>>>>>>
>>>>>> Some comments below also.
>>>>>>
>>>>>> On 7/14/20 4:37 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>> I like the suggestion from Kevin below.
>>>>>> I'm not sure which suggestion since I updated the webrev based on
>>>>>> his initial suggestion.
>>>>>>>
>>>>>>> I have a couple of minor comments so far.
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c.frames.html
>>>>>>> 313 if (!lib->next || lib->next->base >= addr) {
>>>>>>> I wonder if the check above has to be:
>>>>>>> 313 if (!lib->next || lib->next->base > addr) {
>>>>>> Yes, > would be better, although this goes away with my fix that
>>>>>> tracks the size of each library.
>>>>>>>
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c.frames.html
>>>>>>> 417 if (offset_from_sym >= 0) { // ignore symbols that comes
>>>>>>> after "offset"
>>>>>>> Replace: comes => come
>>>>>> Ok.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> Chris
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Serguei
>>>>>>>
>>>>>>>
>>>>>>> On 7/8/20 03:23, Kevin Walls wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Sure -- I was thinking lowest_offset_from_sym initialising
>>>>>>>> starting at a high positive integer (that would now be
>>>>>>>> PTRDIFF_MAX I think) to save a comparison with e.g. -1, you can
>>>>>>>> just check if the new offset is less than lowest_offset_from_sym
>>>>>>>>
>>>>>>>> With the ptrdiff_t change you made, this all looks good to me
>>>>>>>> however you decide. 8-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/07/2020 21:17, Chris Plummer wrote:
>>>>>>>>> Hi Kevin,
>>>>>>>>>
>>>>>>>>> Thanks for the review. Yes, that lack of initialization of
>>>>>>>>> lowest_offset_from_sym is a bug. I'm real surprised the
>>>>>>>>> compiler didn't catch it as it will be uninitialized garbage
>>>>>>>>> the first time it is referenced. Fortunately usually the
>>>>>>>>> eventual offset is very small if not 0, so probably this never
>>>>>>>>> prevented a proper match. I think there's also another bug:
>>>>>>>>>
>>>>>>>>> 415 uintptr_t offset_from_sym = offset - sym->offset;
>>>>>>>>>
>>>>>>>>> "offset" is the passed in offset, essentially the address of
>>>>>>>>> the symbol we are interested in, but given as an offset from
>>>>>>>>> the start of the DSO. "sym->offset" is also an offset from the
>>>>>>>>> start of the DSO. It could be located before or after
>>>>>>>>> "offset". This means the math could result in a negative
>>>>>>>>> number, which when converted to unsigned would be a very large
>>>>>>>>> positive number. This happens whenever you check a symbol that
>>>>>>>>> is actually located after the address you are looking up. The
>>>>>>>>> end result is harmless, because it just means there's no way
>>>>>>>>> we will match that symbol, which is what you want, but it
>>>>>>>>> would be good to clean this up.
>>>>>>>>>
>>>>>>>>> I think what is best is to use ptrdiff_t and initialize
>>>>>>>>> lowest_offset_from_sym to -1. I've updated the webrev:
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/index.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 7/7/20 4:09 AM, Kevin Walls wrote:
>>>>>>>>>> Hi Chris,
>>>>>>>>>>
>>>>>>>>>> Yes I think this looks good.
>>>>>>>>>>
>>>>>>>>>> Question: In nearest_symbol, do we need to initialize
>>>>>>>>>> lowest_offset_from_sym to something impossibly high, as if it
>>>>>>>>>> defaults to zero we never find a better/nearer result?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Kevin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07/07/2020 06:10, Chris Plummer wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Please help review the following:
>>>>>>>>>>>
>>>>>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.00/index.html
>>>>>>>>>>>
>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8247515
>>>>>>>>>>>
>>>>>>>>>>> The CR contains a description of the issues being addressed.
>>>>>>>>>>> There is also no test for this symbol lookup support yet. It
>>>>>>>>>>> will be there after I push JDK-8247516 and JDK-8247514,
>>>>>>>>>>> which are both blocked by the CR.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8247516
>>>>>>>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8247514
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>>
>>>>>>>>>>> Chris
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20200724/dfea502d/attachment-0001.htm>
More information about the serviceability-dev
mailing list