[code-reflection] RFR: Explicit Arena passing to Tensor construction and Session execution [v2]

Wed Mar 5 11:08:06 UTC 2025

On Wed, 5 Mar 2025 08:16:57 GMT, Adam Sotona <asotona at openjdk.org> wrote:

>>> Is there a way to allocate an individually GCed memory segment without a cost of creating the whole auto arena (and similarly an individually freed MS without a new confined arena cost)?
>> 
>> no
>
> In such case there are only two options:
> 1. Always create a new auto arena so each tensor, session and session result has its own life cycle.
> 2. Let user decide about the life cycle by passing arena to each tensor and session construction and session run.
> 
> This PR enables #2, with default to #1 for tensors construction.

Even assuming we had some way to allocate an automatic segment w/o an arena, you would still be creating a segment backed by a new GC-managed lifetime. The arena creation only makes it explicit when you are creating such a lifetime.

For automatic segments, we could even provide such an arena-free factory -- we considered it in the past, but pruned it as it didn't really add much (creating an automatic segment is still an expensive operation, as objects have to be registered against a `Cleaner` object -- this is typically a problem for `ByteBuffer`s).
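
To make the "GC-managed lifetime" point concrete, here is a minimal sketch, assuming the finalized `java.lang.foreign` API (Java 22 or later). The class and method names are made up for illustration. It shows that an automatic arena gives you no handle for deterministic deallocation: calling `close()` on it is rejected, and the segment stays usable for as long as it is reachable.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class; not part of any existing API.
final class AutoLifetimeSketch {

    // Returns true if the automatic arena refuses an explicit close()
    // and the segment is still readable afterwards.
    static boolean autoArenaRejectsClose() {
        Arena auto = Arena.ofAuto(); // new GC-managed lifetime
        MemorySegment segment = auto.allocate(Long.BYTES, Long.BYTES);
        segment.set(ValueLayout.JAVA_LONG, 0, 42L);
        try {
            auto.close(); // automatic arenas cannot be closed explicitly
            return false;
        } catch (UnsupportedOperationException expected) {
            // Memory is released only after the segment becomes unreachable.
            return segment.get(ValueLayout.JAVA_LONG, 0) == 42L;
        }
    }
}
```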

For non-automatic arenas this is not an option, for several reasons:
* An arena is a *capability* -- only a client with an arena can deallocate memory. If memory segments had a `free` method, you could no longer safely pass a segment you created to an API -- you would always have to defend against that API potentially closing the segment by calling `free`. We wrestled with this problem for a long time, by allowing defensive, non-closeable views of segments -- it really didn't look like a great world to live in.
* There is a natural "one to many" relationship between lifetime and segments. It is very common to have many segments share the same lifetime. This is a good way to discipline the use of off-heap memory -- as any time you have segments with different lifetimes interacting with each other, you can end up with a potential use-after-free issue. This is covered in more detail [here](https://cr.openjdk.org/~mcimadamore/panama/why_lifetimes.html).
* The performance model of shared arenas naturally leans towards a model where you group segments together. That is, allowing safe memory release of individual memory segments would be just too expensive to support.
* It is always possible to "drop down a level" and have something more similar to `malloc`/`free` -- by using `Linker` + `reinterpret` -- an example can be found [here](https://github.com/openjdk/jdk/pull/19308/files#diff-e1f7dec1e858c9df738ebebd894dbd10a2885e81264ece400db3620a56f33c3eR58)
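
The first two points can be sketched as follows, again assuming the Java 22+ `java.lang.foreign` API and a made-up class name. Two segments share one confined arena. Only the holder of the arena can release them, and they become inaccessible together when the arena is closed.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class; not part of any existing API.
final class SharedLifetimeSketch {

    // Returns true if both segments die with the arena and
    // access after close is rejected rather than reading freed memory.
    static boolean segmentsDieTogether() {
        Arena arena = Arena.ofConfined(); // normally used in try-with-resources
        MemorySegment a = arena.allocate(Integer.BYTES, Integer.BYTES);
        MemorySegment b = arena.allocate(Integer.BYTES, Integer.BYTES);
        a.set(ValueLayout.JAVA_INT, 0, 1);
        b.set(ValueLayout.JAVA_INT, 0, a.get(ValueLayout.JAVA_INT, 0) + 1);

        // The arena is the capability: code that only sees 'a' or 'b'
        // has no way to free them.
        arena.close();

        boolean bothDead = !a.scope().isAlive() && !b.scope().isAlive();
        try {
            a.get(ValueLayout.JAVA_INT, 0); // use-after-free attempt
            return false;
        } catch (IllegalStateException expected) {
            return bothDead;
        }
    }
}
```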

> In such case there are only two options:

I believe the key question here is which lifetimes we want developers to control. The more I look at this, the more it seems to me that when creating envs, sessions or global tensors you don't really care about "deterministic deallocation", so using an automatic lifetime is a good choice.

But, when you run a model, you likely need a lot of transient resources, and it would be nice to be able to control the lifetime of those resources. 

This leads, I think, to a model/API where e.g. a session can be created w/o an arena (an automatic arena will be picked for you). The session will let you allocate session-wide tensors -- which will stay alive as long as the session is alive.

In order to run a model, you have to get some kind of `AutoCloseable` object from the session. This object would be backed by a confined arena, and would also act as a tensor allocator -- but this time, the lifetime of these tensors would be tied to that of the confined arena (in a way, the lifetime of this `AutoCloseable` is "nested" in the lifetime of the session). Example usage:

OnnxSession session = OnnxSession.of();
Tensor<Float> session_tensor = session.allocateTensor(...);
...
try (RunContext runContext = session.newRunContext()) {
    Tensor<Float> run_tensor = runContext.allocateTensor(...);
    Tensor<Float> result = runContext.run(... session_tensor ... run_tensor ... );
    // can use 'result' here
} // both run_tensor and result are deallocated here

(the names are not great and are for illustrative purposes only -- it is possible that instead of exposing `RunContext` as an `AutoCloseable` it might make sense to expose it as a `run` function taking a lambda expression on `OnnxSession` -- but the essence is the same).
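
For what it's worth, the shape of such an API can be prototyped on top of plain arenas. The following is only a sketch under several assumptions: it needs Java 22+, `SketchSession` and `RunContext` are hypothetical names, tensors are modelled as raw `float` segments, and `run` is a stand-in that adds two inputs element-wise instead of invoking ONNX.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch; names and behavior are illustrative only.
final class SketchSession {
    // Session-wide tensors get an automatic (GC-managed) lifetime.
    private final Arena sessionArena = Arena.ofAuto();

    MemorySegment allocateTensor(float... values) {
        return fill(sessionArena, values);
    }

    RunContext newRunContext() {
        return new RunContext();
    }

    static final class RunContext implements AutoCloseable {
        // Transient tensors live in a confined arena owned by this context.
        private final Arena runArena = Arena.ofConfined();

        MemorySegment allocateTensor(float... values) {
            return fill(runArena, values);
        }

        // Stand-in for running a model: element-wise sum of two
        // equally sized inputs, with the result owned by this context.
        MemorySegment run(MemorySegment a, MemorySegment b) {
            long n = a.byteSize() / Float.BYTES;
            MemorySegment out = runArena.allocate(a.byteSize(), Float.BYTES);
            for (long i = 0; i < n; i++) {
                float sum = a.getAtIndex(ValueLayout.JAVA_FLOAT, i)
                          + b.getAtIndex(ValueLayout.JAVA_FLOAT, i);
                out.setAtIndex(ValueLayout.JAVA_FLOAT, i, sum);
            }
            return out;
        }

        @Override
        public void close() {
            runArena.close(); // frees run tensors and results together
        }
    }

    // Copies the given values into a new segment allocated from 'arena'.
    private static MemorySegment fill(Arena arena, float... values) {
        MemorySegment segment = arena.allocate((long) values.length * Float.BYTES, Float.BYTES);
        for (int i = 0; i < values.length; i++) {
            segment.setAtIndex(ValueLayout.JAVA_FLOAT, i, values[i]);
        }
        return segment;
    }
}
```

Note that no arena appears in the signatures seen by the caller: the session tensor outlives every run, while anything allocated through the run context becomes inaccessible once the context is closed.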

IMHO the above strikes a nice balance between deterministic and non-deterministic deallocation, and the resulting API doesn't seem too complex to use (no arenas need to be created or be passed anywhere). It is of course only one of the many possible ways to model some of the capabilities of ONNX with a higher-level Java API -- there are probably other (and better) options.

-------------

PR Review Comment: https://git.openjdk.org/babylon/pull/337#discussion_r1981179834