RFR: JDK-8292351: tty should always live [v4]

Thu Aug 18 05:48:14 UTC 2022

On Wed, 17 Aug 2022 09:30:48 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> The default stream object tty is used in many places but its lifetime is limited. It gets born not-quite at the beginning of VM initialization and dies in DestroyVM. This leaves time windows before VM initialization and after VM cleanup where logging to tty crashes.
>> 
>> This has been bugging me in the past, especially when wanting to use tty in code that runs very early (NMT preinit system, for example), and also causes problems for code that runs post-cleanup. Mostly this affects logging and error logging.
>> 
>> tty should always be safe to write to, and that is trivial to do.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Change comment

Hi David,

> Hi Thomas,
> 
> Yes this is consuming too much of our time - sorry about that but this does need to be aired.

Sorry for seeming impatient. If there are differences, we have to understand the needs of all and find a good compromise.

My experiences may differ from yours since my work revolve(d) a lot around the fringes of the VM - porting, app integration, observability - and in those areas you find yourself often outside the VM life span.

> The compatibility issue is not "crash versus no crash"; the issue is the ability to add new early/late logging that goes to a fixed destination that the user cannot control and which the existing control flags will not affect. Such logging should be extremely rare (as it bypasses the normal control mechanisms) so making it highly visible in the source code, e.g. by using tty_safe, may not be a bad thing. Code that runs during init, especially very early init, and during termination, is already a special code. It has to be independent of all VM services that have not yet started or may have already terminated. We have often been bitten by code that runs in a context the author had not expected or intended (especially static initializers and destructors) with implicit dependencies that only get exposed when another bug arises (as per the mallocTracker issue).

I think the difference between us is what we consider worse. For us, vanishing VMs are really bad since we usually have no or no easy access to the situation, and log files are our only tool. That is why we were so intent on hardening out error reporting. 

So I usually consider any output important and like to see it. That said, I can see that unimportant tty output getting stuck in the middle of log files can cause grief since parsers would get thrown off. We also have experienced that.

> 
> I agree there are issues with determining how far out you can push the stream destruction in the termination process but I still think it may be worth looking into in the context of the current failure. I also note that your code doesn't (and can't?) account for the possibility of races on the use of the tty during termination, so it is still not completely safe.

We can look into extending the lifespan of defaultStream, but before that, I'd like to check if this is even needed. defaultStream is surprisingly complex - it inherits from xmlStream for some reason and seems to have ties to the compiler, etc. I'm also worried about bitrot here, even if it is today safe to use, it may not be tomorrow if someone changes it. So we not only need to check, simplify, and potentially fix it, we also need guards against regressions. 

For this patch, I'm aiming for a minimally invasive patch here that is also easy to downport.

About races, that one is true. If someone concurrently uses tty while it gets destroyed, we will crash. My patch does not help here.

> 
> Early VM init could be considered as a separate issue. Really it is only things like NMT that have issues here because you want it enabled (potentially from the initial load of libjvm) but can't use any VM services at that time.

You would be surprised how much code is running before VM initialization via C++ dynamic initialization. Mostly UL and Compiler-related. That was the reason why we needed NMT preinit - because there were a lot of calls to os::malloc before initialization. I really dislike this since it randomizes initialization order and makes debugging very complicated. But it is what it is.

 > We obviously need to have a backup plan for error/crash reporting (as all bets are off by then anyway!), but early logging should be a special case - how do you even turn it on when argument processing has not yet happened? Do we still have that hack to directly read a special environment variable for NMT?

Oh, no. That was the whole reason for the NMT rewrite https://bugs.openjdk.org/browse/JDK-8256844. It made NMT late-initializable and usable in custom launchers (e.g. also during gtests). But for that, it needs some early magic to deal with above mentioned early os::malloc calls.

> 
> So, sorry, but I don't see this as a simple "slam-dunk" issue. But at the end of the day, I can only voice my concerns. If others think this is okay and approve it then so be it.

How about a compromise then?

I consider post-VM-exit the more important part. And also the more likely to fail. Here the process has a "VM history" that may cause code to access tty. For example, the NMT C-Heap corruption checker you encountered. Or things like error handling: We don't disable the signal handlers so that we will handle crashes till the bitter end. And things like AGCT, which unfortunately exist.

So, how about

- Pre init: We silently swallow all output to tty without crashing. But we give the user a way to switch output on via environment variable (that is really the only way before arg parsing), e.g. "VM_ENABLE_EARLY_TTY". If that is set, we set tty to stdout.

- Post exit: We extend the lifespan of tty "in spirit": after exit, we assign tty either to stdout or stderr, depending on what option the user had set. We default to stdout, since that is where todays tty=NULL crashes would have been written to. That would preserve the final defaultStream flush and make sure any subsequent output is written immediately since its unbuffered.

In the future, we can clean up and simplify defaultStream, make it work beyond the VM lifespan, and add regression checks for this.

Cheers, Thomas

-------------

PR: https://git.openjdk.org/jdk/pull/9874