RFR: 8242181: [Linux] Show source information when printing native stack traces in hs_err files [v5]

Wed Mar 30 06:41:36 UTC 2022

On Wed, 30 Mar 2022 06:35:27 GMT, Christian Hagedorn <chagedorn at openjdk.org> wrote:

>> Hi Christian,
>> 
>> this is impressive work. It's a big change, and I had a look at part of it. I'll continue tomorrow.
>> 
>> In general, I'm concerned with the use of both UL and ResourceArea in this code. I know the use of UL has been discussed, but still. Use of RA will prevent us from getting useful callstacks if we crash and Thread::current is NULL or invalid. I'd feel better if we were to consistently rely on an outside scratch buffer (like we usually do in error reporting). Even raw ::malloc would be better IMHO.
>> 
>> Another concern was safety, since this is a potential attack vector with manipulated Dwarf files, if someone manages to provove a crash. Maybe far fetched, but still. Would be good to get SonarCloud readings for this code e.g.
>> 
>> More remarks inline.
>> 
>> Cheers, Thomas
>
>> this is impressive work. It's a big change, and I had a look at part of it. I'll continue tomorrow.
> 
> Thanks a lot Thomas for your careful review! I'm in the process of working through your comments and will come back with an update today or later this week.
> 
>> In general, I'm concerned with the use of both UL and ResourceArea in this code. I know the use of UL has been discussed, but still. 
> 
> I agree that it is problematic but I think it would be good to keep some logging around when later coming back to the parser code (and that's the only reason I think that you ever want to turn these logs on). I can currently think of two options:
> 
> - Leave UL in and just guard it with an additional new develop flag to exclude the logs from unfiltered UL logging. This would allow us to kinda accept the risks for debugging purposes. That's not really a good design though but we could keep the log levels with their time stamps.
> - Replace all UL calls with `tty` and also guard them with a new develop flag and play around with `Verbose` and `WizardMode` to keep the different log levels. That's not great either but I think it's safer to use and we only want the logs on rare occasions anyways - so it might be acceptable to use these verbose flags even though we should generally get away from them.
> 
>> Use of RA will prevent us from getting useful callstacks if we crash and Thread::current is NULL or invalid. I'd feel better if we were to consistently rely on an outside scratch buffer (like we usually do in error reporting). Even raw ::malloc would be better IMHO.
> 
> The idea of a scratch buffer sounds good. I'll check if I can replace all the `NEW_RESOURCE_ARRAY` usages with it.
>  
>> Another concern was safety, since this is a potential attack vector with manipulated Dwarf files, if someone manages to provove a crash. Maybe far fetched, but still. Would be good to get SonarCloud readings for this code e.g.
> 
> I was also concerned about that and I'm very thankful that you've spotted some issues already! I think minimizing the risk of a potential attack should be a top priority. We should definitely add some more checks. What do you think about the usage of `_JVM_DWARF_PATH` to load a DWARF file? I'm not sure how safe it is. I originally had it enabled for debug builds only.
> 
>> We see test errors on Linux ppcle and x64 in gtests:
> 
> Could you try running it with `-Xlog:dwarf=info/debug` in order to find out why it failed? It might not have found the symbols. Is the JTreg test `TestDwarf.java` working? But there is now another problem that since using GCC 11.2 (change done for Oracle builds with [JDK-8283057](https://bugs.openjdk.java.net/browse/JDK-8283057)), it emits unsupported DWARF 5 for some DWARF sections, at least on my machine, which is unfortunate. Maybe that's also the reason you see the failures if you use GCC 11.2. Maybe we can mitigate this problem by forcing GCC to use DWARF 4 for now. Could that be done by using the `-gdwarf` GCC flag? @erikj79
> 
>> We also see Problems in runtime/ErrorHandling and in jfr/jvm/TestDumpOnCrash. Mostly, these tests now have much longer runtimes (about factor 2). With TestDumpOnCrash, both the error file writer and the test itself timeouted on some of our slower machines.
> 
> Are these timeouts on ppcle and x64? We could also try to add `-Xlog:dwarf=info/debug` to the runs to get some rough idea of the time required to parse DWARF. I'll have a look at the these tests.
> 
> Thanks,
> Christian

> @chhagedorn
> 
> > There is still some code that could be shared though like opening a DWARF file with its checks or reading an LEB 128 etc. Might be worth to investigate further if the two implementations can be merged/reused to some extent. But I propose to file a separate RFE for that. What do you think?
> 
> Yeah, let's investigate about it in another RFE.
> 
> IMHO we can share some codes about DWARF between HotSpot and SA, and also we might need DWARF-based call frame parser in HotSpot because some 3rd-party native libraries don't use base pointer (RBP) to store SP due to optimization. In SA side, it would be useful if we can check native source file and line number in mixed jstack with your change.

I see, then it makes sense to unify these parsers later.

> 
> So I want to unify DWARF parser (processor) between HotSpot and SA, but it might be long journey... thus I agree with you to file it as another RFE.

Sounds good, I'll file an RFE and link it to this RFE.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7126