Jetty and Loom

Ron Pressler ron.pressler at oracle.com
Tue Jan 5 10:58:08 UTC 2021


P.S.

There’s another point in your email that I think deserves some more elaboration,
especially since I’ve heard it mentioned elsewhere: “CPU-bound tasks can be handled.”

There might be an assumption that platform threads “handle” CPU-bound tasks
“better” than virtual threads. This is misleading and might cause people to
focus on the wrong thing.

Let’s put it to the test: pick some CPU-bound task and run it in a virtual thread
and in a platform thread, join the thread and compare the completion time.
Now repeat with 3 threads, 100, and 100,000. Virtual threads would perform
similarly to platform threads, and, overall, their scalability would actually
be much better.
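As a minimal sketch of that experiment (assuming the Thread.ofVirtual() /
Thread.ofPlatform() builder API; the exact spelling has varied across Loom
EA builds):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadFactory;

    public class CpuBoundComparison {
        static volatile long sink; // keeps the JIT from eliding the work

        // A purely CPU-bound task: no blocking, just arithmetic.
        static void spin() {
            long acc = 0;
            for (long i = 0; i < 100_000_000L; i++) acc += i * 31;
            sink = acc;
        }

        static long runAndJoinMillis(ThreadFactory factory, int n)
                throws InterruptedException {
            List<Thread> threads = new ArrayList<>();
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                Thread t = factory.newThread(CpuBoundComparison::spin);
                t.start();
                threads.add(t);
            }
            for (Thread t : threads) t.join();
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            // Push n toward 100,000 to see the scalability difference;
            // platform threads may hit OS limits well before that.
            for (int n : new int[] {1, 3, 100}) {
                System.out.printf("n=%d: virtual %d ms, platform %d ms%n", n,
                        runAndJoinMillis(Thread.ofVirtual().factory(), n),
                        runAndJoinMillis(Thread.ofPlatform().factory(), n));
            }
        }
    }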

Now, you mentioned it in the context of platform-thread pools, so repeat the
experiment comparing one virtual thread per task vs. a task submitted to the
pool. Virtual threads might have some added overhead, but you’ll find that it
is negligible to the point of not being worth thinking about (if it isn’t —
let us know!)
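In executor terms (Executors.newVirtualThreadPerTaskExecutor() is the current
spelling; treat the exact factory method as an assumption):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PoolVsPerTask {
        static volatile long sink;

        // A small CPU-bound unit of work.
        static void task() {
            long acc = 0;
            for (long i = 0; i < 1_000_000L; i++) acc += i * 31;
            sink = acc;
        }

        static long timeMillis(ExecutorService executor, int tasks)
                throws InterruptedException {
            long start = System.nanoTime();
            for (int i = 0; i < tasks; i++) executor.submit(PoolVsPerTask::task);
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            System.out.println("fixed pool:      "
                    + timeMillis(Executors.newFixedThreadPool(cores), 10_000) + " ms");
            System.out.println("thread per task: "
                    + timeMillis(Executors.newVirtualThreadPerTaskExecutor(), 10_000) + " ms");
        }
    }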

What people actually mean is that some situations may arise where CPU-bound
tasks running on some small number of virtual threads, say, 100, might impact
the latency of a non-CPU-bound operation running on the 101st thread more than
in the case of platform threads.
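A synthetic illustration of that scenario (a sketch; the 2-second busy loop
is arbitrary):

    public class SchedulingDelay {
        // Burns CPU without ever blocking or yielding.
        static void busyMillis(long ms) {
            long deadline = System.nanoTime() + ms * 1_000_000;
            while (System.nanoTime() < deadline) { /* spin */ }
        }

        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < 100; i++) {
                Thread.ofVirtual().start(() -> busyMillis(2_000));
            }
            long t0 = System.nanoTime();
            // The "101st" thread: trivial work, but it must wait for a carrier.
            Thread probe = Thread.ofVirtual().start(() ->
                    System.out.println("scheduling delay: "
                            + (System.nanoTime() - t0) / 1_000_000 + " ms"));
            probe.join();
        }
    }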

This is fixable, but it’s as yet unclear to us how often the situation arises
in practice. If you could come up with a realistic, and relatively common example 
where this is a problem, it could also be of use to us.

— Ron


On 5 January 2021 at 09:50:06, Ron Pressler (ron.pressler at oracle.com) wrote:

> > I see these overclaims by Loom as harmful
>
> These aren’t claims but explanations and advice aimed at directing users toward
> a frame of mind appropriate to understanding the feature.
>
> > they make people like me want to test those claims
>
> Not claims, but good! Added bonus.
>
> > if current limitations were more clearly acknowledged.
>
> What limitations? It sounds as if you’re bothered by statements of the kind,
> “with garbage collection you don’t have to think about memory management, as
> it is done for you,” which you take to be claims, and believe they don’t
> acknowledge the fact that RAM is still limited. But stating that limitation
> does not help people understand the difference between GC and manual memory
> management. Also, 1. I think it’s obvious, and 2. we have said that you can’t
> have an infinite number of virtual threads and that you are still limited by
> RAM.
>
> > if that limit is below the system's capacity for kernel threads, then just
> > use a pool of kernel threads
>
> As I’ve tried to explain here, https://inside.java/2020/08/07/loom-performance/
> the main benefit of virtual threads comes from their *number*. If you need
> very few threads (not below the OS capacity, but below the CPU core count), then
> there is no point. It’s like using a 64-bit variable to count up to 10.
>
> Having said that, note that what matters is the *total* limit on threads. If
> there is some operation foo() with a concurrency limit of 4, it might still
> make sense to have 50K threads, if they also do things other than call foo().
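> As a sketch (a plain java.util.concurrent.Semaphore; foo() stands in for
> whatever the limited operation is):
>
>     import java.util.concurrent.Semaphore;
>
>     class FooLimiter {
>         // At most 4 threads may run foo() at a time; the other virtual
>         // threads queue here only when they actually call it.
>         private static final Semaphore FOO_PERMITS = new Semaphore(4);
>
>         static void guardedFoo() throws InterruptedException {
>             FOO_PERMITS.acquire();
>             try {
>                 foo();
>             } finally {
>                 FOO_PERMITS.release();
>             }
>         }
>
>         static void foo() { /* the operation with a concurrency limit of 4 */ }
>     }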
>
> In addition, I think you’re trying to “reverse engineer” virtual threads and
> make assumptions about their internal design, overheads, and the like, that
> might be mistaken and/or in flux. For example, the “additional layer of scheduling”
> often *reduces* cost. Think about it this way: a cache adds a whole other
> layer of logic above the disk, but the difference in cost between a disk
> read and a cache read can be such that overall performance improves.
>
> If you can prove you’ve found an actual issue, do that and we might be able to
> address it. We’ll gladly put your tendency to focus on the negative to good use,
> but there’s no point focusing on hypothetical negatives. The goal of virtual
> threads is to allow a better, easier management of resources, i.e. *if there is
> a way* to use X resources to do Y things, where Y >> X, then Loom should make it
> easier. If you find a situation where it doesn’t do *that*, let us know.
>
> > So I come back to: if you need to limit concurrency to 100s or 1000s, or
> > you need ThreadLocals as a resource cache, or you have CPU-bound tasks,
> > then pooling kernel threads is still a valid technique and should not be
> > forgotten.
>
> Pooling platform threads is useful if only because it is needed to schedule
> virtual threads, and no one is suggesting it should be forgotten, but it
> should be deemphasised *when discussing virtual threads*. Still, I think some
> of the things you mention are more habit than actual good advice. For example,
> why does it matter to you that your resource cache be implemented as ThreadLocals
> specifically?
>
> > Furthermore, there are reasons other than start time that may favour
> > pooling implementations over alternatives
>
> What reasons? I’m not asking in order to taunt you, but to make this discussion
> more anchored.
>
> > The point is not the absolute size of the stack, but that both thread types
> > still need similar real space for real stacks, which can be the limiting
> > factor to the number of threads available. Preallocated stacks have
> > pros/cons as do dynamic stacks. I don't think either is better, they are
> > just different.
>
> I don’t understand this. It’s like saying that a string in Java and a string
> in C need a similar amount of RAM. Don’t you think it misses the point of a GC,
> which is supposed to help you manage the RAM you do have? That it doesn’t
> help you squeeze in more strings at a time isn’t a “limitation” on its job.
>
> Platform threads and virtual threads are “just different.” They’re different
> in the number of threads they can juggle, and that number is the point. If, say,
> you find that you can sometimes have more platform threads than virtual threads,
> now that would be an interesting limitation that we’d need to fix.
>
> — Ron
>
> On 5 January 2021 at 08:25:04, Greg Wilkins (gregw at webtide.com) wrote:
>
> > Alan,
> >
> > On Mon, 4 Jan 2021 at 20:44, Alan Bateman wrote:
> >
> > > I haven't seen any memes go by on these topics.
> > >
> >
> > Meme as in "as a concept, belief, or practice, that spreads from person to
> > person" , not as in pictures with text. I'm seeing contagion in the wild
> > of thoughts like "X is no longer a problem because Loom", which I believe
> > have been encouraged by statements like "just spawn another" and "forget
> > thread pools". I see these over claims by Loom as harmful, as it makes
> > people like me want to test those claims, and thus the conversation becomes
> > about what Loom can't do rather than about what it can do. It encourages
> > a focus on the negative, when there are lots of positives that could be the
> > focus if current limitations were more clearly acknowledged.
> >
> > Perhaps these claims may ultimately prove to be true, but even if so, it is
> > some time until widespread availability. Better to under-promise and
> > over-deliver than over-promise and under-deliver.
> >
> > > There have been a few examples of people trying out the builds that
> > > created thread pools of virtual threads, usually by replacing a
> > > ThreadFactory for a fixed thread pool when they meant to replace the thread
> > > pool.
> > >
> >
> > I agree that pooling virtual threads is often not going to be a sensible
> > thing to do, but perhaps for different reasons.
> >
> > Sure, a semaphore can be used to limit concurrency of virtual threads, but
> > if that limit is below the system's capacity for kernel threads, then just
> > use a pool of kernel threads rather than a pool of virtual threads so you
> > get other advantages:
> >
> > - Maximum stack usage is preallocated, so fewer GC and OOM issues
> > - ThreadLocals work as lazy resource pools (sketched after this list)
> > - No cost of a second layer of scheduling
> > - CPU-bound tasks can be handled
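> > The ThreadLocal point above, concretely (a minimal sketch):
> >
> >     import java.util.concurrent.ExecutorService;
> >     import java.util.concurrent.Executors;
> >
> >     class PooledBuffers {
> >         // With a fixed pool of N platform threads, at most N buffers are
> >         // ever allocated, each reused by every task run on its thread.
> >         static final ThreadLocal<byte[]> BUFFER =
> >                 ThreadLocal.withInitial(() -> new byte[64 * 1024]);
> >
> >         static final ExecutorService POOL = Executors.newFixedThreadPool(8);
> >
> >         static void handle() {
> >             byte[] buf = BUFFER.get(); // lazily created once per pool thread
> >             // ... use buf as scratch space ...
> >         }
> >     }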
> >
> > If, however, the limit on concurrency is very large, then you can't use an
> > Executor backed by a thread pool of kernel threads, but you can use one
> > that uses a semaphore to limit the concurrency of an infinite supply of
> > virtual threads (a sketch follows the list), but you have to be aware of
> > the differences:
> >
> > - you still have a maximal bound on stack usage, but it is large and
> > dynamically allocated from the heap, and thus probably needs to be
> > explicitly tested and GC-tuned.
> > - Using ThreadLocals as lazy resource pools won't work, but that is a poor
> > choice anyway with large concurrency. ThreadLocals will work fine for
> > passing calling context... in fact they may be simpler to use than the
> > equivalents for async APIs.
> > - There is a cost to the second layer of scheduling
> > - CPU-bound tasks can't be handled
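> > Such an executor might look like this (a sketch; Thread.ofVirtual() is the
> > current builder spelling, which has varied across Loom builds):
> >
> >     import java.util.concurrent.Executor;
> >     import java.util.concurrent.Semaphore;
> >
> >     // An unbounded supply of virtual threads, with a semaphore capping
> >     // how many of them run their task at any one time.
> >     class SemaphoreLimitedExecutor implements Executor {
> >         private final Semaphore permits;
> >
> >         SemaphoreLimitedExecutor(int maxConcurrency) {
> >             this.permits = new Semaphore(maxConcurrency);
> >         }
> >
> >         @Override
> >         public void execute(Runnable task) {
> >             Thread.ofVirtual().start(() -> {
> >                 try {
> >                     permits.acquire();
> >                     try {
> >                         task.run();
> >                     } finally {
> >                         permits.release();
> >                     }
> >                 } catch (InterruptedException e) {
> >                     Thread.currentThread().interrupt();
> >                 }
> >             });
> >         }
> >     }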
> >
> > So I come back to, if you need to limit concurrency to 100s or 1000s, or
> > you need threadlocals as a resource cache or you have CPU bound tasks,
> > then pooling kernel threads is a still valid technique and should not be
> > forgotten.
> >
> > [aside to Ron - I'm not saying an Executor that limits concurrency is a
> > thread pool; I'm saying that a thread pool is one possible implementation
> > of an Executor that limits concurrency. Furthermore, there are reasons
> > other than start time that may favour pooling implementations over
> > alternatives]
> >
> >
> > > Virtual threads can run existing code, but I would be concerned with the
> > > memory footprint if they are just used to run a lot of existing bloat. I'm
> > > looking at stack traces every day too, and they are deeper than in many other
> > > languages/runtimes, but 1000+ frame stack traces seem a bit excessive when the
> > > work to do is relatively simple. Compiled frames are very efficient, and
> > > virtual threads would be much happier with thread stacks that are a few KB.
> > >
> >
> > I agree that 1000+ frame stacks are excessive, and the blog also shows how
> > good the JVM is at optimising stack frames. However, I picked 1000 as approx
> > 25% of the default capacity of allocated kernel thread stacks. If your
> > stacks never approach 1000+ frames, then you are only using < 25% of the
> > preallocated stack space, and that is something that can be tuned to give
> > more capacity for kernel threads... just as heap and GC can be tuned to
> > give more capacity for dynamic stacks.
> >
> > The point is not the absolute size of the stack, but that both thread types
> > still need similar real space for real stacks, which can be the limiting
> > factor to the number of threads available. Preallocated stacks have
> > pros/cons as do dynamic stacks. I don't think either is better, they are
> > just different.
> >
> > There has been some exploration and prototypes of cancellation, including
> > > exploring how both mechanisms can co-exist but we haven't been happy with
> > > it. It's a topic that the project will come back to.
> > >
> >
> > Yet a recent post presents deadlines and cancellation as a fait accompli
> > feature of Loom. I think that is more over-promising. Perhaps something
> > about virtual threads makes this goal more achievable, but until it has
> > been achieved it is a mistake to claim it as a feature.
> >
> > > For now, Object.wait works with the FJP ManagedBlocker mechanism to
> > > increase parallelism during the wait. So the pool of 16 carrier threads will
> > > increase when executing code where Object.wait is used.
> >
> >
> > Ah, that is interesting to know. I had written a test that confirmed
> > virtual threads are deferred by CPU-bound tasks, but I had not checked
> > synchronized.
> >
> > So Loom can spawn new kernel threads if Object.wait is used! What is the
> > limit of that? In a pathological case of an application doing lots of
> > Object.waiting, could it eventually end up 1:1 kernel:virtual threads? If
> > so, then I guess there is an issue of many spawned virtual threads suddenly
> > needing too many kernel threads? I can't see how either OOM or deadlock
> > can be avoided generally? Sure apps can be rewritten, but Jetty is
> > not in control of the applications deployed on it.
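> > One way to probe that question (a sketch; it assumes, as in current Loom
> > builds, that ThreadMXBean counts only platform threads, so growth in
> > getThreadCount() would show the carrier pool inflating):
> >
> >     import java.lang.management.ManagementFactory;
> >
> >     public class WaitInflation {
> >         static final Object LOCK = new Object();
> >
> >         public static void main(String[] args) throws InterruptedException {
> >             System.out.println("platform threads before: "
> >                     + ManagementFactory.getThreadMXBean().getThreadCount());
> >             for (int i = 0; i < 10_000; i++) {
> >                 Thread.ofVirtual().start(() -> {
> >                     synchronized (LOCK) {
> >                         try { LOCK.wait(); } catch (InterruptedException ignored) { }
> >                     }
> >                 });
> >             }
> >             Thread.sleep(5_000); // give the scheduler time to compensate
> >             System.out.println("platform threads after:  "
> >                     + ManagementFactory.getThreadMXBean().getThreadCount());
> >         }
> >     }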
> >
> > > The goal is to make it possible to write code that scales as well as async
> > > code does but without the complexity of async. The intention is that
> > > debuggers, profilers, and everything else just works as people expect.
> > >
> >
> > My sense of Loom at the moment is that it is not that far off that goal
> > for a significant range of applications... I just don't yet see it
> > close to a totally general solution for all applications and containers.
> >
> > Sorry to again focus on the negatives. I hope our work in Jetty will
> > focus on the positives.
> >
> > cheers
> >
> > --
> > Greg Wilkins CTO http://webtide.com
>


