[External] : Re: jstack, profilers and other tools

Alex Otenko oleksandr.otenko at gmail.com
Wed Jul 20 18:49:59 UTC 2022


Waiting for requests doesn't count towards response time. Waiting for
responses from downstream does, but it is doable with rather small thread
counts - perhaps a witness to just how large a proportion of response
time that waiting typically is.

I agree virtual threads allow you to build things differently. But it's
not just the awkwardness of the async API that changes. A blocking API
eliminates inversion of control and changes the allocation pattern.

For example, first you allocate 16KB to read efficiently, then you are
blocked - perhaps for seconds, while the user is having a think. So 10k
connections pin 160MB while waiting for a request. With an async API
nothing is pinned, because you don't allocate until you know who's
read-ready. This may no longer be a problem with modern JVMs, but it
needs calling out.
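
To make it concrete, here is a minimal sketch of the two patterns
(illustrative only; error handling and selected-key management elided):

    import java.io.InputStream;
    import java.net.Socket;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    class AllocationPatterns {
        // Blocking style, one (virtual) thread per connection: the 16KB
        // buffer is allocated before the read, so it stays pinned for as
        // long as the connection sits idle between requests.
        static void blockingRead(Socket socket) throws Exception {
            byte[] buf = new byte[16 * 1024];
            InputStream in = socket.getInputStream();
            int n = in.read(buf); // may block for seconds
            // ... parse the n bytes ...
        }

        // Selector style: nothing is allocated until a channel reports
        // read-ready, so idle connections pin no buffer memory.
        static void selectorRead(Selector selector) throws Exception {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                ByteBuffer buf = ByteBuffer.allocate(16 * 1024); // late
                ((SocketChannel) key.channel()).read(buf);
                // ... parse, then clear the selected-key set ...
            }
        }
    }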

The lack of inversion of control can be seen as an inability to control
who reads when. So to speak: would you rather read 100 bytes from each of
10k requests, or 10KB from each of 100 requests? This has an impact on
memory footprint and on who can make progress. It makes sense to read
more requests when you run out of things to do, but not if you have
already saturated the bottleneck resource.
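
If you want some of that control back with blocking reads, a counting
semaphore is one way to get it; a sketch (the limit and names are made
up for illustration):

    import java.net.Socket;
    import java.util.concurrent.Semaphore;

    class BoundedReads {
        // Cap how many requests are read and processed concurrently,
        // even though every connection has its own (virtual) thread.
        static final Semaphore IN_FLIGHT = new Semaphore(100);

        static void handle(Socket conn) throws Exception {
            IN_FLIGHT.acquire(); // blocking here is cheap on a virtual thread
            try {
                // ... read and process one request from conn ...
            } finally {
                IN_FLIGHT.release();
            }
        }
    }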


Alex

On Tue, 19 Jul 2022, 12:02 Ron Pressler, <ron.pressler at oracle.com> wrote:

> First, async APIs and lightweight user-mode threads were invented because
> we empirically know there are many systems where the number of threads is
> the first bottleneck on maximum throughput that they encounter. The purpose
> of async APIs/lightweight threads is to allow those systems to hit the
> other, higher, limits on capacity. Of course, there are systems that hit
> other bottlenecks first, but the number of threads limitation is known to
> be very common. If a theory tells you it shouldn’t be, then it’s the
> theory that should be revised (hint: the portion of total latency spent
> waiting for I/O is commonly *very* high; if you want to use Little’s law
> for that calculation, note that adding concurrency for fanout increases L
> and reduces W by the same factor, so it's handy to add together the total
> time spent waiting for, say, 5-50 outgoing microservice calls as if they
> were done sequentially, and compare that against the total time spent
> composing the results; this total “wait latency” is frequently in the
> hundreds of milliseconds — far higher than the actual request latency — and
> easily two orders of magnitude higher than CPU time).
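>
> For illustration, with made-up numbers: suppose a request fans out to
> 10 downstream calls of 50ms each, run in parallel, plus 5ms of CPU to
> compose the results. Counted as if sequential, the wait latency is 10 x
> 50ms = 500ms, versus 5ms of CPU: two orders of magnitude. The fanout
> multiplies L by 10 while dividing the per-operation W by 10, so the
> product L x W, and hence the demand for threads, is the same whichever
> way you count.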
>
> Second, the number of *threads* (as opposed to the number of concurrent
> operations) has, at most, a negligible impact on contention. If we’re
> talking about low-level memory contention, then what matters is the number
> of processing cores, not the number of threads (beyond the number of
> cores), and if we’re talking about other resources, then that contention is
> part of the other limits on concurrency (and so throughput) in the system,
> and the way it is reached — be it with many threads or one — is irrelevant.
> It is true that various scheduling algorithms — whether used on threads or
> on async constructs, the relevant scheduling problems and algorithms are the
> same — could reduce some overhead, but we’re talking about effects that are
> orders of magnitude lower than what can be achieved by reducing artificial
> limits on concurrency, though they could matter for the very last drop of
> performance; I go through the calculation of the effect of scheduling
> overhead here: https://inside.java/2020/08/07/loom-performance/. In
> short, the impact of scheduling can only be high if the total amount of
> time spent on scheduling is significant when compared to the time spent
> waiting for I/O.
>
> — Ron
>
>
>
> On 19 Jul 2022, at 09:22, Alex Otenko <oleksandr.otenko at gmail.com> wrote:
>
> Thanks, that's what I was trying to get across, too.
>
> Also, 10k threads serving requests doesn't mean that the concurrency is
> in the thousands. In the thought experiment it is. In practice... well,
> if systems are fine with dozens or even hundreds of threads, there should
> be no problem doubling the thread count, if doing so can double, or at
> least improve, throughput. In my experience this is not the case. There
> are even famous systems with self-tuning thread pool sizes, and I worked
> on the self-tuning algorithm. I have seen various apps and workloads that
> use that system and haven't seen any that would reach a maximum thread
> count of even a few hundred, even on a fairly large machine. So whereas I
> never found anything wrong with the claim that thread count is one of the
> caps on throughput, I find the claim that allowing a thread per request
> is going to improve concurrency problematic exactly because there are
> other caps. There surely are workloads that are bottlenecked on a thread
> count that can't grow into the thousands, but in my practice I haven't
> seen a single one of this kind. If we had thousands of threads per CPU,
> they would need to be waiting so much of the time that the business logic
> must be very trivial.
>
> For example, take the thought experiment with 10k threads and a 0.5s
> response time. By Little's law that is 20k requests/s, so on a 1-CPU
> machine each request gets at most 1s / 20k = 50 microseconds on the CPU,
> and is waiting for something for the rest of the time. If it's waiting
> for a lock or a pooled resource, you may be better off having fewer
> threads (coarsening contention). So it had better be some network
> connection, or something of the kind. So for 499.95ms it is waiting on
> that, and does request parsing, response construction, etc. in 50
> microseconds. This sort of profile is not a very common pattern.
>
> If we consider tens of CPUs for 10k threads, it starts to look far less
> impressive in terms of the number of threads.
>
> That's all about concurrency and threads as a bottleneck resource. There
> are other important uses of threads, but those are not about increasing
> concurrency.
>
>
> Ok, I reckon the topic got bashed to smithereens.
>
> Alex
>
> On Mon, 18 Jul 2022, 22:57 Ron Pressler, <ron.pressler at oracle.com> wrote:
>
>> “Concurrency rises with throughput”, which is just a mathematical fact,
>> is not the same as the claim — that no one is making — that one can *raise*
>> throughput by adding threads. However, it is the same as the claim that the
>> *maximum* throughput might rise if the *maximum* number of threads is
>> increased, because that’s just how dependent variables can work in
>> mathematics, as I’ll try explaining.
>>
>> There is no “more threads to get *better throughput*”, and there is no
>> question about “applying” Little’s law. Little’s law is simply the maths
>> that tells us how many requests are being concurrently served in some
>> system. There is no getting around it. In a system with 10K requests/s,
>> each taking 500ms on average, there *are* 5K concurrent requests. If the
>> program is written in the thread-per-request style, then it *has* at least
>> 5K threads. Now, if the rate of requests doubles to 20K req/s and the
>> system doesn’t collapse, then there must be at least 10K threads
>> serving them.
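>>
>> (In symbols, Little's law is L = λW: concurrency = throughput × latency.
>> With a throughput of 10K requests/s and W = 0.5s, L = 10,000 × 0.5 =
>> 5,000 concurrent requests; doubling the throughput to 20K/s at the same
>> W gives L = 10,000.)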
>>
>> Note that the increase in threads doesn’t raise the throughput, but it
>> must accompany it. However, because concurrency rises with throughput, the
>> *maximum* number of threads does pose an upper bound on throughput.
>>
>> It is very important to understand the difference between “adding
>> processing units could decrease latency in a data-parallel program” and
>> “concurrency rises with throughput in a concurrent program.” In the former,
>> the units are an independent variable, and in the latter they’re not — i.e.
>> when the throughput is higher there are more threads, but adding threads
>> doesn’t increase the throughput.
>>
>> And yet, because this forms an *upper bound* on throughput, the ability
>> to have more threads is a prerequisite to raising the maximum attainable
>> throughput (with the thread-per-request style). So raising the number of
>> threads cannot possibly increase throughput, and yet raising the maximum
>> number of threads could increase maximum throughput (until it’s bounded by
>> something else). That’s just how dependent variables work when talking
>> about upper/lower bounds.
>>
>> — Ron
>>
>> On 18 Jul 2022, at 19:01, Alex Otenko <oleksandr.otenko at gmail.com> wrote:
>>
>> I think I have made it clear that I am not sceptical about the ability to
>> spawn threads in large numbers, and that all I am sceptical about is the
>> use of Little's law in the way you did. You made it look like one needs
>> thousands of threads to get better throughput, whereas typical numbers are
>> much more modest than that. In practice you can't heedlessly add more
>> threads, as at some point you get response time degrading with no
>> improvement to throughput.
>>
>> On Sun, 17 Jul 2022, 10:59 Ron Pressler, <ron.pressler at oracle.com> wrote:
>>
>>> If your thread-per-request system is getting 10K req/s (on average),
>>> each request takes 500ms (on average) to handle, and this can be sustained
>>> (i.e. the system is stable), then it doesn’t matter how much CPU or RAM is
>>> consumed, how much network bandwidth you’re using, or even how many
>>> machines you have: the (average) number of threads you’re running *is* no
>>> less than 5K (and, in practice, will usually be several times that).
>>>
>>> So it’s not that adding more threads is going to increase throughput (in
>>> fact, it won’t; having 1M threads will do nothing in this case), it’s that
>>> the number of threads is an upper bound on L (among all other upper bounds
>>> on L). Conversely, reaching a certain throughput requires some minimum
>>> number of threads.
>>>
>>> As to how many thread-per-request systems do or would hit the OS-thread
>>> boundary before they hit others, that’s an empirical question, and I think
>>> it is well-established that there are many such systems, but if you’re
>>> sceptical and think that user-mode threads/asynchronous APIs have little
>>> impact, you can just wait and see.
>>>
>>> — Ron
>>>
>>> On 16 Jul 2022, at 20:30, Alex Otenko <oleksandr.otenko at gmail.com>
>>> wrote:
>>>
>>> That's the indisputable bit. The contentious part is that adding more
>>> threads is going to increase throughput.
>>>
>>> Supposing that 10k threads are there, and you actually need them, you
>>> should get a concurrency level of 10k. Let's see what that means in
>>> practice.
>>>
>>> If it is a 1-CPU machine, 10k requests in flight at any given time
>>> means each is waiting for 99.99% of the time. Or, out of 1 second each
>>> spends 100 microseconds on the CPU and waits for something for the rest
>>> of the time (or, out of a 100ms response time, 10 microseconds on CPU -
>>> barely enough to parse a REST request). This can't be the case for the
>>> majority of workflows.
>>>
>>> Of course, having 10k threads for less than 1 second each doesn't mean
>>> you are getting concurrency that is unattainable with fewer threads.
>>>
>>> The bottom line is that by adding threads you aren't necessarily
>>> increasing concurrency.
>>>
>>> On Fri, 15 Jul 2022, 10:19 Ron Pressler, <ron.pressler at oracle.com>
>>> wrote:
>>>
>>>> The number of threads doesn’t “do” or fail to do anything for you. If
>>>> requests arrive at 100K per second and each takes 500ms to process, then
>>>> the number of threads you’re using *is* at least 50K (assuming
>>>> thread-per-request) in a stable system, that’s all. That is the physical
>>>> meaning: the formula tells you what the quantities *are* in a stable
>>>> system.
>>>>
>>>> Because in a thread-per-request program every concurrent request takes
>>>> up at least one thread, the formula (while it does not immediately tell
>>>> you how many machines are used, or what the RAM, CPU, and network
>>>> bandwidth utilisation is) does give you a lower bound on the total
>>>> number of live threads. Conversely, the number of threads gives an upper
>>>> bound on L.
>>>>
>>>> As to the rest about splitting into subtasks, that increases L and
>>>> reduces W by the same factor, so when applying Little’s law it’s handy to
>>>> treat W as the total latency, *as if* it was processed sequentially, if
>>>> we’re interested in L being the number of concurrent requests. More about
>>>> that here: https://inside.java/2020/08/07/loom-performance/
>>>>
>>>> — Ron
>>>>
>>>> On 15 Jul 2022, at 09:37, Alex Otenko <oleksandr.otenko at gmail.com>
>>>> wrote:
>>>>
>>>> You quickly jumped to a *therefore*.
>>>>
>>>> Newton's second law binds force, mass and acceleration. But you can't
>>>> say that you can decrease mass by increasing acceleration, if the force
>>>> remains the same. That is, the statement would be arithmetically correct,
>>>> but it would have no physical meaning.
>>>>
>>>> Adding threads allows you to do more work. But you can't do more work at
>>>> will - the amount of work going through the system is a quantity
>>>> independent of your design.
>>>>
>>>> Now, what you could do at will is split the work into subtasks.
>>>> Virtual threads allow you to do this at very little cost. However, you
>>>> still can't talk about an increase in concurrency due to Little's law,
>>>> because - enter Amdahl - the response time changes.
>>>>
>>>> Say 100k requests/s get split into 10 subtasks each, each runnable
>>>> independently. Amdahl says your response time goes down 10-fold. So 100k
>>>> requests/s times 1ms gives a concurrency of 100. Concurrency got
>>>> reduced. Not surprising at all, because now each request spends 10x less
>>>> time in the system.
>>>>
>>>> What about subtasks? Aren't we running more of them? Does this mean
>>>> concurrency increased?
>>>>
>>>> Yes, 100k requests beget 1m subtasks. We can't compare concurrency
>>>> directly, because the definition of the unit of work changed: it was W,
>>>> it became W/10. But let's see anyway. So we have 1m tasks/s, each
>>>> finishing in 1ms - concurrency is 1000. The same as before splitting the
>>>> work, once you match the change of response time. I treat this like I
>>>> would any change of units of measurement.
>>>>
>>>>
>>>> So whereas I see a lot of good in being able to spin up lots of
>>>> short-lived threads, I don't see how you can claim that concurrency
>>>> increases, or that Little's law somehow controls throughput.
>>>>
>>>>
>>>> Alex
>>>>
>>>> On Thu, 14 Jul 2022, 11:01 Ron Pressler, <ron.pressler at oracle.com>
>>>> wrote:
>>>>
>>>>> Little’s law tells us what the relationship between concurrency,
>>>>> throughput and latency is if the system is stable. It tells us that if
>>>>> latency doesn’t decrease, then concurrency rises with throughput (again, if
>>>>> the system is stable). Therefore, to support high throughput you need a
>>>>> high level of concurrency. Since the Java platform’s unit of concurrency is
>>>>> the thread, to support high throughput you need a high number of threads.
>>>>> There might be other things you also need more of, but you *at least* need
>>>>> a high number of threads.
>>>>>
>>>>> The number of threads is an *upper bound* on concurrency, because the
>>>>> platform cannot make concurrent progress on anything without a thread (with
>>>>> the caveat in the next paragraph). There might be other upper bounds, too
>>>>> (e.g. you need enough memory to concurrently store all the working data for
>>>>> your concurrent operations), but the number of threads *is* an upper bound,
>>>>> and the one virtual threads are there to remove.
>>>>>
>>>>> Of course, as JEP 425 explains, you could abandon threads altogether
>>>>> and use some other construct as your unit of concurrency, but then you lose
>>>>> platform support.
>>>>>
>>>>> In any event, virtual threads exist to support a high number of
>>>>> threads, as Little’s law requires, therefore, if you use virtual threads,
>>>>> you have a high number of them.
>>>>>
>>>>> — Ron
>>>>>
>>>>> On 14 Jul 2022, at 08:12, Alex Otenko <oleksandr.otenko at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Ron,
>>>>>
>>>>> It looks like you are unconvinced. Let me try with illustrative numbers.
>>>>>
>>>>> The users opening their laptops at 9am don't know how many threads you
>>>>> have. So throughput remains 100k ops/sec in both setups below. Suppose, in
>>>>> the first setup we have a system that is stable with 1000 threads. Little's
>>>>> law tells us that the response time cannot exceed 10ms in this case.
>>>>> Little's law does not prescribe response time, by the way; it is merely a
>>>>> consequence of the statement that the system is stable: it couldn't have
>>>>> been stable if its response time were higher.
>>>>>
>>>>> Now, let's create one thread per request. One claim is that this
>>>>> increases concurrency (and I object to this point alone). Suppose this
>>>>> means concurrency becomes 100k. Little's law says that the response time
>>>>> must be 1 second. Sorry, but that's hardly an improvement! In fact, for
>>>>> any concurrency greater than 1000 you must get a response time higher
>>>>> than the 10ms we got with 1000 threads. This is not what we want.
>>>>> Fortunately, this is
>>>>> not what happens either.
>>>>>
>>>>> Really, thread count in the thread-per-request design has little to do
>>>>> with concurrency level. Concurrency level is a derived quantity. It only
>>>>> tells us how many requests are making progress at any given time in a
>>>>> system that experiences request arrival rate R and which is able to process
>>>>> them in time T. The only thing you can control through system design is
>>>>> response time T.
>>>>>
>>>>> There are good reasons to design a system that way, but Little's law
>>>>> is not one of them.
>>>>>
>>>>> On Wed, 13 Jul 2022, 14:29 Ron Pressler, <ron.pressler at oracle.com>
>>>>> wrote:
>>>>>
>>>>>> The application of Little’s law is 100% correct. Little’s law tells
>>>>>> us that the number of threads must *necessarily* rise if throughput is to
>>>>>> be high. Whether or not that alone is *sufficient* might depend on the
>>>>>> concurrency level of other resources as well. The number of threads is not
>>>>>> the only quantity that limits the L in the formula, but L cannot be higher
>>>>>> than the number of threads. Obviously, if the system’s level of concurrency
>>>>>> is bounded at a very low level — say, 10 — then having more than 10 threads
>>>>>> is unhelpful, but as we’re talking about a program that uses virtual
>>>>>> threads, we know that is not the case.
>>>>>>
>>>>>> Also, Little’s law describes *stable* systems; i.e. it says that *if*
>>>>>> the system is stable, then a certain relationship must hold. While it is
>>>>>> true that the rate of arrival might rise without bound, if the number of
>>>>>> threads is insufficient to meet it, then the system is no longer stable
>>>>>> (normally that means that queues are growing without bound).
>>>>>>
>>>>>> — Ron
>>>>>>
>>>>>> On 13 Jul 2022, at 14:00, Alex Otenko <oleksandr.otenko at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> This is an incorrect application of Little's Law. The law only posits
>>>>>> that there is a connection between quantities. It doesn't specify which
>>>>>> variables depend on which. In particular, throughput is not a free
>>>>>> variable.
>>>>>>
>>>>>> Throughput is something outside your control. 100k users open their
>>>>>> laptops at 9am and log in within 1 second - that's it, you have a throughput
>>>>>> of 100k ops/sec.
>>>>>>
>>>>>> Then, based on the response time the system is able to deliver, you
>>>>>> can tell what concurrency makes sense here. Adding threads is not going
>>>>>> to change anything - certainly not if threads are not the bottleneck
>>>>>> resource. Threads become the bottleneck when you have the hardware to
>>>>>> run them, but not the threads themselves.
>>>>>>
>>>>>> On Tue, 12 Jul 2022, 15:47 Ron Pressler, <ron.pressler at oracle.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11 Jul 2022, at 22:13, Rob Bygrave <robin.bygrave at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> *> An existing application that migrates to using virtual threads
>>>>>>> doesn’t replace its platform threads with virtual threads*
>>>>>>>
>>>>>>> What I have been confident about to date based on the testing I've
>>>>>>> done is that we can use Jetty with a Loom-based thread pool and that has
>>>>>>> worked very well. That is replacing current platform threads with virtual
>>>>>>> threads. I'm suggesting this will frequently be sub 1000 virtual threads.
>>>>>>> Ron, are you suggesting this isn't a valid use of virtual threads or am I
>>>>>>> reading too much into what you've said here?
>>>>>>>
>>>>>>>
>>>>>>> The throughput advantage of virtual threads comes from one aspect —
>>>>>>> their *number* — as explained by Little’s law. A web server employing
>>>>>>> virtual threads would not replace a pool of N platform threads with a
>>>>>>> pool of N virtual threads, as that does not increase the number of
>>>>>>> threads required to increase throughput. Rather, it replaces the pool
>>>>>>> of N platform threads with an unpooled ExecutorService that spawns at
>>>>>>> least one new virtual thread for every HTTP serving task. Only that can
>>>>>>> increase the number of threads sufficiently to improve throughput.
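>>>>>>>
>>>>>>> A minimal sketch of that replacement (the class and method names
>>>>>>> here are illustrative, not Jetty's API):
>>>>>>>
>>>>>>>     import java.util.concurrent.ExecutorService;
>>>>>>>     import java.util.concurrent.Executors;
>>>>>>>
>>>>>>>     class Server {
>>>>>>>         // Before: a fixed pool of N platform threads caps the
>>>>>>>         // number of concurrently served requests at N.
>>>>>>>         static final ExecutorService POOL =
>>>>>>>                 Executors.newFixedThreadPool(200);
>>>>>>>
>>>>>>>         // After: an unpooled executor that spawns one new virtual
>>>>>>>         // thread per task, so the thread count tracks the number
>>>>>>>         // of in-flight requests instead of a fixed N.
>>>>>>>         static final ExecutorService PER_TASK =
>>>>>>>                 Executors.newVirtualThreadPerTaskExecutor();
>>>>>>>
>>>>>>>         static void serve(Runnable httpTask) {
>>>>>>>             PER_TASK.submit(httpTask); // fresh thread per request
>>>>>>>         }
>>>>>>>     }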
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > *unusual* for an application that has any virtual threads to have
>>>>>>> fewer than, say, 10,000
>>>>>>>
>>>>>>> In the case of http server use of virtual threads, I feel the use of
>>>>>>> *unusual* is too strong. That is, when we are using virtual threads
>>>>>>> for application code handling of http request/response (like Jetty + Loom),
>>>>>>> I suspect this is frequently going to operate with less than 1000
>>>>>>> concurrent requests per server instance.
>>>>>>>
>>>>>>>
>>>>>>> 1000 concurrent requests would likely translate to more than 10,000
>>>>>>> virtual threads due to fanout (JEPs 425 and 428 cover this). In fact, even
>>>>>>> without fanout, every HTTP request might wish to spawn more than one
>>>>>>> thread, for example to have one thread for reading and one for writing. The
>>>>>>> number 10,000, however, is just illustrative. Clearly, an application with
>>>>>>> virtual threads will have some large number of threads (significantly
>>>>>>> larger than applications with just platform threads), because the ability
>>>>>>> to have a large number of threads is what virtual threads are for.
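>>>>>>>
>>>>>>> For illustration, such a reader/writer split per connection might
>>>>>>> look like this (readLoop, writeLoop and conn are placeholders):
>>>>>>>
>>>>>>>     // Two dedicated virtual threads per connection: one reading,
>>>>>>>     // one writing; both are cheap enough to block freely.
>>>>>>>     Thread reader = Thread.ofVirtual().start(() -> readLoop(conn));
>>>>>>>     Thread writer = Thread.ofVirtual().start(() -> writeLoop(conn));
>>>>>>>     reader.join();
>>>>>>>     writer.join();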
>>>>>>>
>>>>>>> The important point is that tooling needs to adapt to a high number
>>>>>>> of threads, which is why we’ve added a tool that’s designed to make sense
>>>>>>> of many threads, where jstack might not be very useful.
>>>>>>>
>>>>>>> — Ron
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>