RFR(M): 8247515: OSX pc_to_symbol() lookup does not work with core files

Kevin Walls kevin.walls at oracle.com
Fri Jul 24 09:17:19 UTC 2020


Thanks Chris - all sounds good to me.  Thanks for all the MachO insights...


On 23/07/2020 22:27, Chris Plummer wrote:
> Just a minor update of some new findings (no new code change). The DB 
> hash table being used by default will overwrite an existing entry, not 
> duplicate it, and this is indeed what was happening. The second entry 
> added was the one with a 0 offset. When I enable the R_NOOVERWRITE 
> flag, it stops the overwrite and that also fixes the problem, but 
> that's only because the entry with offset 0 comes last. The fix I've 
> done is better since it avoids the offet 0 entry altogether, so even 
> if it came first it would not get used.
>
> thanks,
>
> Chris
>
> On 7/22/20 10:25 PM, Chris Plummer wrote:
>> Hi Kevin,
>>
>> Thanks for the review. Unfortunately there was yet another bug which 
>> I have now addressed. Although testing with mach5 went fine, when I 
>> tried with a local build I had issues. SA couldn't even get past an 
>> early part of the core file handling which involves doing some 
>> adjustments related to CDS. It needs to look up a couple of hotspot 
>> symbols by name to do this, and get their values (such as 
>> _SharedBaseAddress). Although the symbol -> address lookup seemed to 
>> work, the values retrieved from the address were garbage. After some 
>> debugging I noticed the 3 symbols being looked up all had the same 
>> address. Then I noticed this address was at offset 0 of the libjvm 
>> segment. After a lot more debugging I found the problem. These 
>> symbols were actually in the symbol table twice, once with a proper 
>> offset and once with an offset of 0. I'm not sure why the ones with 
>> an offset of 0 were there (other than they originated in the mach-o 
>> symbol table).
>>
>> The reason this didn't always happen is because SA takes all the 
>> symbols it finds and adds them to a hash table for fast symbol -> 
>> address lookup. If a symbol is in there twice, it's a tossup which 
>> you'll get. It could change from build to build in fact. The trigger 
>> for my local build was probably how I ran configure, which likely is 
>> not the same as mach5, although I'm  unsure if this just gave me the 
>> unlucky hashing, or if in fact it resulted in the entries with offset 
>> 0. In any case, the fix is to ignore entries with offset 0. Here's 
>> the updated webrev:
>>
>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.03/index.html
>>
>> All the changes since webrev.02 are in build_symtab() in symtab.c. 
>> Besides ignoring entries with offset 0 to fix the bug, I also did 
>> some cleanup. There used to be two loops to iterate over the symbols. 
>> There wasn't really a good reason for this, so now there is just one. 
>> Also, it no longer adds entries with a file offset 0, an offset into 
>> the string section of 0, or an empty string. By doing this the size 
>> of the libjvm symbol table went from about 240k entries to 90k. Since 
>> it was originally allocated at it's full potential size, it's now 
>> reallocate to the smaller size after symbol table processing is over.
>>
>> thanks,
>>
>> Chris
>>
>> On 7/22/20 2:45 AM, Kevin Walls wrote:
>>>
>>> Thanks Chris, yes looks good, I like that we check the library 
>>> bounds before calling nearest_symbol.
>>>
>>> --
>>> Kevin
>>>
>>>
>>> On 21/07/2020 21:05, Chris Plummer wrote:
>>>> Hi Serguei and Kevin,
>>>>
>>>> The webrev has been updated:
>>>>
>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.02/index.html
>>>> https://bugs.openjdk.java.net/browse/JDK-8247515
>>>>
>>>> Two issues were addressed:
>>>>
>>>> (1) Code in symbol_for_pc() assumed the caller had first checked to 
>>>> make sure that the symbol is in a library, where-as some callers 
>>>> assume NULL will be returned if the symbol is not in a library. 
>>>> This is the case for pstack for example, which first blindly does a 
>>>> pc to symbol lookup, and only if that returns null does it then 
>>>> check if the pc is in the codecache or interpreter. The logic in 
>>>> symbol_for_pc() assumed that if the pc was greater than the start 
>>>> address of the last library in the list, then it must be in that 
>>>> library. So in stack traces the frames for compiled or interpreted 
>>>> pc's showed up as the last symbol in the last library, plus some 
>>>> very large offset. The fix is to now track the size of libraries so 
>>>> we can do a proper range check.
>>>>
>>>> (2) There are issues with finding system libraries. See [1] 
>>>> JDK-8249779. Because of this I disabled support for trying to 
>>>> locate them. This was done in ps_core.c, and there are 
>>>> "JDK-8249779" comment references in the code where I did this. The 
>>>> end result of this is that PMap for core files won't show system 
>>>> libraries, and PStack for core files won't show symbols for 
>>>> addresses in system libraries. Note that currently support for PMap 
>>>> and PStack is disabled for OSX, but I will shortly send out a 
>>>> review to enable them for OSX core files as part of the fix for [2] 
>>>> JDK-8248882.
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8249779
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8248882
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>> On 7/14/20 5:46 PM, serguei.spitsyn at oracle.com wrote:
>>>>> Okay, I'll wait for new webrev version to review.
>>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>
>>>>> On 7/14/20 17:40, Chris Plummer wrote:
>>>>>> Hi Serguie,
>>>>>>
>>>>>> Thanks for reviewing. This webrev is in limbo right now as I 
>>>>>> discovered some issues that Kevin and I have been discussing off 
>>>>>> line. One is that the code assumes the caller has first checked 
>>>>>> to make sure that the symbol is in a library, where-as the actual 
>>>>>> callers assume NULL will be returned if the symbol is not in a 
>>>>>> library. The end result is that we end up returning a symbol, 
>>>>>> even for address in the code cache or interpreter. So in stack 
>>>>>> traces these frame show up as the last symbol in the last 
>>>>>> library, plus some very large offset. I have a fix for that were 
>>>>>> now I track the size of each library. But there is another issue 
>>>>>> with the code that tries to discover all libraries in the core 
>>>>>> file. It misses a very large number of system libraries. We 
>>>>>> understand why, but are not sure of the solution. I might just 
>>>>>> change to code to only worry about user libraries (JDK libs and 
>>>>>> other JNI libs).
>>>>>>
>>>>>> Some comments below also.
>>>>>>
>>>>>> On 7/14/20 4:37 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>> I like the suggestion from Kevin below.
>>>>>> I'm not sure which suggestion since I updated the webrev based on 
>>>>>> his initial suggestion.
>>>>>>>
>>>>>>> I have a couple of minor comments so far.
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c.frames.html
>>>>>>> 313 if (!lib->next || lib->next->base >= addr) {
>>>>>>> I wonder if the check above has to be:
>>>>>>> 313 if (!lib->next || lib->next->base > addr) {
>>>>>> Yes, > would be better, although this goes away with my fix that 
>>>>>> tracks the size of each library.
>>>>>>>
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c.frames.html
>>>>>>> 417 if (offset_from_sym >= 0) { // ignore symbols that comes 
>>>>>>> after "offset"
>>>>>>> Replace: comes => come
>>>>>> Ok.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> Chris
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Serguei
>>>>>>>
>>>>>>>
>>>>>>> On 7/8/20 03:23, Kevin Walls wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Sure -- I was thinking lowest_offset_from_sym initialising 
>>>>>>>> starting at a high positive integer (that would now be 
>>>>>>>> PTRDIFF_MAX I think) to save a comparison with e.g. -1, you can 
>>>>>>>> just check if the new offset is less than lowest_offset_from_sym
>>>>>>>>
>>>>>>>> With the ptrdiff_t change you made, this all looks good to me 
>>>>>>>> however you decide. 8-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/07/2020 21:17, Chris Plummer wrote:
>>>>>>>>> Hi Kevin,
>>>>>>>>>
>>>>>>>>> Thanks for the review. Yes, that lack of initialization of 
>>>>>>>>> lowest_offset_from_sym is a bug. I'm real surprised the 
>>>>>>>>> compiler didn't catch it as it will be uninitialized garbage 
>>>>>>>>> the first time it is referenced. Fortunately usually the 
>>>>>>>>> eventual offset is very small if not 0, so probably this never 
>>>>>>>>> prevented a proper match. I think there's also another bug:
>>>>>>>>>
>>>>>>>>>  415       uintptr_t offset_from_sym = offset - sym->offset;
>>>>>>>>>
>>>>>>>>> "offset" is the passed in offset, essentially the address of 
>>>>>>>>> the symbol we are interested in, but given as an offset from 
>>>>>>>>> the start of the DSO. "sym->offset" is also an offset from the 
>>>>>>>>> start of the DSO. It could be located before or after 
>>>>>>>>> "offset". This means the math could result in a negative 
>>>>>>>>> number, which when converted to unsigned would be a very large 
>>>>>>>>> positive number. This happens whenever you check a symbol that 
>>>>>>>>> is actually located after the address you are looking up. The 
>>>>>>>>> end result is harmless, because it just means there's no way 
>>>>>>>>> we will match that symbol, which is what you want, but it 
>>>>>>>>> would be good to clean this up.
>>>>>>>>>
>>>>>>>>> I think what is best is to use ptrdiff_t and initialize 
>>>>>>>>> lowest_offset_from_sym to -1. I've updated the webrev:
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.01/index.html 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 7/7/20 4:09 AM, Kevin Walls wrote:
>>>>>>>>>> Hi Chris,
>>>>>>>>>>
>>>>>>>>>> Yes I think this looks good.
>>>>>>>>>>
>>>>>>>>>> Question: In nearest_symbol, do we need to initialize 
>>>>>>>>>> lowest_offset_from_sym to something impossibly high, as if it 
>>>>>>>>>> defaults to zero we never find a better/nearer result?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Kevin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 07/07/2020 06:10, Chris Plummer wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Please help review the following:
>>>>>>>>>>>
>>>>>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247515/webrev.00/index.html 
>>>>>>>>>>>
>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8247515
>>>>>>>>>>>
>>>>>>>>>>> The CR contains a description of the issues being addressed. 
>>>>>>>>>>> There is also no test for this symbol lookup support yet. It 
>>>>>>>>>>> will be there after I push JDK-8247516 and JDK-8247514, 
>>>>>>>>>>> which are both blocked by the CR.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8247516
>>>>>>>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8247514
>>>>>>>>>>>
>>>>>>>>>>> thanks,
>>>>>>>>>>>
>>>>>>>>>>> Chris
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20200724/dfea502d/attachment-0001.htm>


More information about the serviceability-dev mailing list