Project Loom technical questions

Sun Aug 1 08:53:34 UTC 2021

Hi Ignaz,

Another point that is important to me is that Java needs efficient garbage collection for transient, short-lived data. That’s why we have generational garbage collectors that can scan a root set comprising the thread stacks plus the remembered set, and then trace through the young generation without going through the old generation, which could be large and contain very little garbage.

Now if you have a million threads, their stacks might start to dominate the amount of memory used in the system, and render a young generation GC roughly as inefficient as an old generation GC. “Just” scanning the stacks suddenly becomes one of the costliest operations. Basically, we would no longer have an efficient way of collecting short-lived data. In a language implementation like Go, where there is no generational garbage collection, you always collect the whole heap anyway, so that would not come at a cost that you are not already paying.

On a related note, the collectors that don’t have infrastructure for concurrent thread stack scanning would naturally have their GC pause times increased by an unacceptable amount. 

By moving the stacks of threads that are not currently running to the Java heap, we can allow the ones that are inactive and long-lived to move into the old generation, and hence stop interfering with young collections. This keeps young generation collections snappy: they collect faster and don’t blow up pause times. That is rather difficult to achieve with kernel thread stacks. With a million threads running on, say, 8 cores, it seems likely that quite a few won’t be scheduled right away. Their stacks can then move into the old generation, where they stay out of the way of the actively processed operations, which allocate a bunch of short-lived transient data.
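
To make the scenario concrete, here is a minimal sketch of a program with many mostly idle virtual threads. It assumes JDK 21 or later, where the API is Thread.ofVirtual(); the Loom builds from the time of this thread may spell it differently. The class and method names (ParkedVirtualThreads, runParked) are made up for illustration. The code only creates the workload. It does not measure GC behaviour, and whether a given parked stack is actually promoted to the old generation is up to the collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of the JDK or of Project Loom.
class ParkedVirtualThreads {

    // Starts n virtual threads that all wait on one latch, then releases them.
    // Returns how many of them ran to completion.
    static int runParked(int n) {
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>(n);

        for (int i = 0; i < n; i++) {
            threads.add(Thread.ofVirtual().unstarted(() -> {
                try {
                    // A virtual thread blocked here is unmounted from its
                    // carrier thread; its frames are kept on the Java heap.
                    release.await();
                    completed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        for (Thread t : threads) {
            t.start();
        }

        // A real application would do its short-lived allocations here,
        // while most of the threads sit idle.
        release.countDown();

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // Assumed size for illustration; raise it towards a million if the
        // heap is large enough.
        System.out.println(runParked(100_000));
    }
}
```

Running this with GC logging enabled (for example -Xlog:gc) and comparing it against a variant that uses platform threads is one way to look at the effect described above. I have not included numbers because they depend heavily on the collector and the heap settings.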

So basically, by having user level threads, the threads themselves can move into the Java heap and play nicely with generational collectors and GC latencies, in a way that isn’t relevant for language implementations that don’t have generational GC. And generational GCs are great to have in a lot of workloads.

Hope this helps.

/Erik

> On 31 Jul 2021, at 17:25, Ignaz Birnstingl <ignazb at gmail.com> wrote:
> 
> Hi Ron,
> 
> Thanks for replying!
> Questions 2. and 3. are answered.
> 
>>> The default stack size for platform threads in Java is 1 MB on Linux and Mac. Kernel threads cannot resize
>>> their stack because they do not know how it’s used by the language; user-mode threads can. Loom’s virtual
>>> threads automatically grow and shrink depending on how much stack is currently used. TLABs are unrelated,
>>> and are associated with the OS threads internally used by the Java runtime rather than with virtual threads.
> If your process starts a million threads then for each thread 1 MB of stack would be reserved in its address space. Since address space in 64 bit applications is big enough that should not be a problem.
> But since the memory would initially not be used this would not contribute to the process' RSS. Or at least it should not. So it should not contribute to the "memory usage" which is considered for memory limits in container environments.
> Therefore I would argue that the memory usage for stacks should be roughly the same for kernel threads and virtual threads.
> 
> Having one million TLABs would certainly have more memory overhead than - say - 8. That is where I see the biggest benefit of using virtual threads.
> But this problem could theoretically be mitigated with core-local allocation buffers: Instead of having allocation buffers per kernel thread these would have to be per CPU core. Of course that would mean that special care would have to be taken by the JVM if/when a thread gets moved to a different CPU core.
> 
> -- 
> Ignaz