[External] : Re: Experience using virtual threads in EA 23-loom+4-102
Robert Engels
robaho at icloud.com
Wed Jul 3 18:01:06 UTC 2024
As an example, the following demonstrates that the lack of timeslicing can be problematic for a “fair” system (make sure the task count is at least the number of available processors). This can be worked around with enough observability and proper context checking and propagation, but sometimes that isn’t trivial, and the “hang” will lurk until a certain production load hits it.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        Semaphore s = new Semaphore(0);
        // ExecutorService executor = Executors.newCachedThreadPool(); // this will work
        ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor(); // this will hang
        // Pin every carrier thread with a CPU-bound spin loop.
        for (int i = 0; i < 16; i++) {
            executor.submit(() -> { while (true); });
        }
        Thread.sleep(5000);
        // Without timeslicing, this task never gets a carrier thread.
        executor.submit(() -> s.release());
        s.acquire();
        System.out.println("all done");
        System.exit(0);
    }
}
> On Jul 3, 2024, at 12:47 PM, Robert Engels <robaho at icloud.com> wrote:
>
> Tbc, I am referring to cpu bound tasks of equal duration (cpu time).
>
>> On Jul 3, 2024, at 12:46 PM, Robert Engels <robaho at icloud.com> wrote:
>>
>> Tbh, I didn’t quite understand this:
>>
>> "Latency is not generally improved by time sharing regardless of the number of CPUs. In some situations time sharing will make it (potentially much) better, and in others it will make it (potentially much) worse. “
>>
>> Because it is referring to two different things in my opinion.
>>
>> I would have stated it as:
>>
>> “Tail latency is improved for cpu bound tasks by timesharing regardless of the number of CPUs”.
>>
>> I don’t see how timesharing can ever make tail latency worse - as normally the context switch overhead is a very small percentage of the timeslice allotment.
>>
>> Also, the statement:
>>
>> "IO preempts threads both in the OS and with the virtual thread scheduler even without time sharing.”
>>
>> is not correct according to what I know about most OSes. An OS without timeslicing will never pre-empt a completely CPU bound task - it will run to completion or be killed - those are the only options (and the latter is close to Thread.stop() which as we know is problematic).
>>
>>> On Jul 3, 2024, at 12:39 PM, Attila Kelemen <attila.kelemen85 at gmail.com> wrote:
>>>
>>> I think you somewhat misunderstood Ron's comment on "same". "Same" means that they are progressing the same task. For example, you have a large composite task which is made up of 100 small chunks, and then you start these 100 chunks of work in parallel. If the scheduler is fair, then what you will see is 0%, 0%, ... and then suddenly 100% when all of them complete (assuming total fairness). In the non-fair case, you will instead see incremental progress: a few chunks done, then a few more, etc.
>>>
>>> Robert Engels <robaho at icloud.com> wrote (on Wed, Jul 3, 2024, 18:44):
>>> I don't think that is correct - but I could be wrong.
>>>
>>> With platform threads, the min and max latency in a completely CPU-bound scenario should be very close to the average with a completely fair scheduler (when all tasks/threads are submitted at the same time).
>>>
>>> Without timesharing, the average will be the same, but the min and max latencies will be far off the average - the tasks submitted first complete very quickly, and the tasks submitted at the end take a very long time because they must wait for all of the tasks before them to complete.
>>>
>>> In regard to the “enough CPUs” comment, I only meant that if there are enough CPUs and a “somewhat” balanced workload, it is unlikely that the CPU-bound tasks could consume all of the carrier threads given a random distribution. If you have more active tasks than CPUs and the majority of the tasks are CPU bound, the IO tasks are going to suffer in a non-timesliced scenario - they will be stalled waiting for a carrier thread - even though the amount of CPU they need is very small.
>>>
>>> This has a lot of info on the subject, https://docs.kernel.org/scheduler/sched-design-CFS.html, including:
>>>
>>> On real hardware, we can run only a single task at once, so we have to introduce the concept of “virtual runtime.” The virtual runtime of a task specifies when its next timeslice would start execution on the ideal multi-tasking CPU described above. In practice, the virtual runtime of a task is its actual runtime normalized to the total number of running tasks.
>>>
>>> I recommend https://opensource.com/article/19/2/fair-scheduling-linux for an in-depth discussion of how the dynamic timeslices are computed.
>>>
>>> The Linux scheduler relies on timeslicing in order to have a “fair” system. I think most Java “server”-type applications strive for fairness as well - i.e. long tail latencies in anything are VERY bad (thus the constant fight against long GC pauses - better to amortize them for consistency).
>>>
>>
>