[crac] RFR: Draft: Move more FD tracking to java layer

Wed Jun 7 13:08:30 UTC 2023

On Wed, 7 Jun 2023 10:51:41 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:

> The PR develops the idea of file descriptor tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear to Java developers, improving the experience of developing for CRaC. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace from when the FD was created. Second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road.
> 
> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define whether and how to report the FD to the user. E.g. Socket can describe its port and address without deep inspection of the process internals. It turns out that Socket.toString() provides enough information (but the reporting can be extended later if required).
> 
> 
> 	Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464]
> 		at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123)
> 		at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128)
> 		... 7 more
> 
> 
> A FileDescriptor claims itself in case there is a bug in the JDK where no higher-level object claims the FD. The FD then provides only a very short description for debugging. With the stack trace for the FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming.
> 
> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR.

You've shown an example with a socket, but is there any case where this is really helpful? The socket example provides no extra information compared to what we already had. I guess you intend to implement the claiming part later for most of the 'owners', but I think that most of the time the FileDescriptor is created within the owner's constructor/init method, and therefore the owner would be obvious from the stack trace. One benefit would be not requiring a re-run with stack trace collection, but when you figure out that the FD was created by RandomAccessFile you're not really any closer to the part of the code you need to fix - you wouldn't scan all your dependencies for any use of RAF.
You mention investigating heap dumps - I don't think that's needed anymore when you have the stack trace. Could you present a real case (preferably in a test!) where knowing the ownership in terms of references (as from the heap dump) is more practical?
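
To illustrate the stack-trace argument, here is a minimal sketch. It is not CRaC code: TrackedFd and Owner are hypothetical stand-ins for FileDescriptor and an owner such as RandomAccessFile. The descriptor records a stack trace when it is constructed, and the first frame outside the descriptor class already names the owner's constructor.

```java
public class CreationTraceSketch {
    /** Hypothetical stand-in for a file descriptor that remembers where it was created. */
    public static final class TrackedFd {
        public final StackTraceElement[] createdAt = new Throwable().getStackTrace();
    }

    /** Hypothetical owner that creates its descriptor inside its own constructor. */
    public static final class Owner {
        public final TrackedFd fd;
        public Owner() { this.fd = new TrackedFd(); }
    }

    /** Returns the first frame that is not inside TrackedFd itself, as "class.method". */
    public static String creator(TrackedFd fd) {
        for (StackTraceElement frame : fd.createdAt) {
            if (!frame.getClassName().equals(TrackedFd.class.getName())) {
                return frame.getClassName() + "." + frame.getMethodName();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // The reported frame is the Owner constructor, so the owner is visible
        // from the creation stack trace without any extra claiming step.
        System.out.println(creator(new Owner().fd));
    }
}
```

The same trace, read further down, also shows which application code constructed the owner, which is the part a plain "created by RandomAccessFile" message does not tell you.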

One risk I perceive with this approach is when the FD is not owned directly, but through a chain of possible owners, e.g. FileDescriptor -> FileOutputStream -> FileWriter -> LibraryClass -> ApplicationClass. You would have to propagate the FileDescriptor through the layers, breaking encapsulation and adding more and more code, or lose the information about ownership anyway.
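
The chain problem can be sketched as follows. This is an illustration under assumptions, not the API proposed in the PR: Fd, Claims, Stream and AppLog are hypothetical names, and "the last claimer is the one reported" is an assumed policy. The higher layer can only take over the claim if the lower layer exposes its descriptor.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class ClaimChainSketch {
    /** Hypothetical stand-in for a tracked file descriptor. */
    public static final class Fd {
        public final int value;
        public Fd(int value) { this.value = value; }
    }

    /** Hypothetical registry: the most recent claimer of an Fd is the one reported. */
    public static final class Claims {
        private final Map<Fd, Object> owners = new IdentityHashMap<>();
        public void claim(Fd fd, Object owner) { owners.put(fd, owner); }
        public String report(Fd fd) {
            Object owner = owners.get(fd);
            return owner == null ? "unclaimed fd " + fd.value : owner.toString();
        }
    }

    /** Lowest layer: owns the Fd directly and claims it on construction. */
    public static final class Stream {
        public final Fd fd; // exposed only so that a wrapper can re-claim it
        public Stream(Fd fd, Claims claims) { this.fd = fd; claims.claim(fd, this); }
        @Override public String toString() { return "Stream[fd=" + fd.value + "]"; }
    }

    /** Higher layer: knows the useful context (the path) but not the Fd, unless Stream leaks it. */
    public static final class AppLog {
        public final Stream out;
        public final String path;
        public AppLog(Stream out, String path) { this.out = out; this.path = path; }
        public void claimThrough(Claims claims) { claims.claim(out.fd, this); }
        @Override public String toString() { return "AppLog[path=" + path + "]"; }
    }

    public static void main(String[] args) {
        Claims claims = new Claims();
        Fd fd = new Fd(7);
        AppLog log = new AppLog(new Stream(fd, claims), "app.log");
        System.out.println(claims.report(fd)); // prints "Stream[fd=7]"
        log.claimThrough(claims);              // requires access to Stream.fd
        System.out.println(claims.report(fd)); // prints "AppLog[path=app.log]"
    }
}
```

Without the claimThrough step the report stops at the lowest-level owner; with it, every intermediate layer has to expose or forward the descriptor.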

test/jdk/jdk/crac/fileDescriptors/OpenFileDetectionTest.java line 51:

> 49:     @Override
> 50:     public void exec() throws Exception {
> 51:         try (var file = new RandomAccessFile("/etc/passwd", "r")) {

What would happen with the FileReader? It would be better not to remove the test for that particular case, but to add one for RandomAccessFile.

-------------

PR Review: https://git.openjdk.org/crac/pull/79#pullrequestreview-1467539190
PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1221514924