Custom scheduler: Using loom for deterministic simulation testing

Wed Jul 9 03:11:42 UTC 2025

Controlled concurrency testing (CCT) is not only about testing but also about
debugging. While the JDK issue mentioned in the blog post isn't technically
a bug, deterministic testing helps answer crucial questions like: Is my
application buggy? Is there an issue with my library? Am I using the wrong
library? The key advantage is that once you identify a bug, you can replay
it deterministically every time.

CCT benefits both single-process concurrent systems and distributed systems
by systematically exploring different thread interleavings, which
accelerates race condition discovery. The Go race detector only reports data
races, while many deterministic testing frameworks are designed to find a
broader range of race conditions.
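
To make the distinction concrete, here is a minimal sketch of a race
condition that is not a data race. The class and step names are hypothetical
and come from neither Fray nor Go. Two logical threads each check a balance
and then withdraw; every access goes through an AtomicInteger, so a data race
detector has nothing to report, yet one interleaving still overdraws. The
interleaving is passed in explicitly so the outcome is repeatable.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: a check-then-act race with no data race.
class CheckThenAct {
    /** Runs the steps of two logical threads in the given order; returns the final balance. */
    static int run(List<String> schedule) {
        AtomicInteger balance = new AtomicInteger(100); // every access is atomic
        boolean[] approved = new boolean[2];            // per-thread result of the check
        for (String step : schedule) {
            int t = step.charAt(0) - 'A';               // "A..." -> 0, "B..." -> 1
            if (step.endsWith("check")) {
                approved[t] = balance.get() >= 100;     // check
            } else if (approved[t]) {
                balance.addAndGet(-100);                // act on a possibly stale check
            }
        }
        return balance.get();
    }

    public static void main(String[] args) {
        // Serial order: the second check sees the first withdrawal.
        System.out.println(run(List.of("A.check", "A.act", "B.check", "B.act"))); // prints 0
        // Interleaved order: both checks pass before either withdrawal.
        System.out.println(run(List.of("A.check", "B.check", "A.act", "B.act"))); // prints -100
    }
}
```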

Coincidentally, I'm currently interning at Antithesis
<https://antithesis.com/> where we're exploring the integration of Fray
with the deterministic hypervisor to combine the strengths of both
approaches. Fray excels at exploring thread interleavings but doesn't
handle network traffic, file I/O, and other sources of non-determinism.
Meanwhile, a deterministic hypervisor provides a deterministic environment
but relies on the kernel's scheduler to run concurrent programs, which could
be less efficient.

I really like the idea of using a customized user-space scheduler for CCT.
I know shuttle <https://github.com/awslabs/shuttle> and loom-rs
<https://github.com/tokio-rs/loom> are doing this. One challenge I’m
thinking about is that applications may mix green threads with physical
threads. This could be messy because the CCT framework itself relies on
concurrency primitives.
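
As a rough illustration of the user-space scheduler idea, here is a toy model
in Java. It is an assumption-laden sketch, not the API of Fray, shuttle, or
loom: each logical thread is modeled as a queue of steps, and a seeded Random
picks which thread runs its next step. Because the seed fully determines the
choices, the same seed always replays the same interleaving.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Toy model (hypothetical, not a real framework API): logical threads are
// queues of steps, executed one step at a time on a single real thread.
class SeededScheduler {
    /** Runs three logical threads of three steps each; returns the order in which steps ran. */
    static List<String> run(long seed) {
        List<String> trace = new ArrayList<>();
        List<Deque<Runnable>> threads = new ArrayList<>();
        for (String name : List.of("A", "B", "C")) {
            Deque<Runnable> steps = new ArrayDeque<>();
            for (int i = 0; i < 3; i++) {
                String label = name + i;
                steps.add(() -> trace.add(label));      // a step just records itself
            }
            threads.add(steps);
        }
        Random random = new Random(seed);               // the only source of scheduling choice
        while (!threads.isEmpty()) {
            Deque<Runnable> chosen = threads.get(random.nextInt(threads.size()));
            chosen.poll().run();                        // run exactly one step of the chosen thread
            if (chosen.isEmpty()) {
                threads.remove(chosen);                 // finished threads leave the run queue
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run(1L));
        System.out.println(run(1L).equals(run(1L)));    // same seed, same interleaving
    }
}
```

A real CCT tool has to intercept actual synchronization points rather than
hand-written steps, which is where mixing green and physical threads gets
complicated.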

From my experience, handling all non-determinism within the JVM is
particularly challenging. When I attempted this approach in Fray, it made
the framework both cumbersome and fragile. In contrast, I'm impressed by
how elegantly Antithesis's deterministic hypervisor solves these issues for
Fray. For network and I/O operations, Fray can simply mark threads as
blocked when applications perform network operations and unblock them upon
completion while running inside the hypervisor.
https://github.com/cmu-pasta/fray/blob/main/core/src/main/kotlin/org/pastalab/fray/core/controllers/ReactiveNetworkController.kt
You may compare it against a version where Fray tries to manage the network
IO itself:
https://github.com/cmu-pasta/fray/blob/main/core/src/main/kotlin/org/pastalab/fray/core/controllers/ProactiveNetworkController.kt

For an open-source solution, RR <https://github.com/rr-debugger/rr>+Fray
sounds promising, but I haven’t tried it myself. Of course, this combination
cannot test distributed systems.

Ao

On Sat, Jul 5, 2025 at 4:26 PM Ryan Yeats <ryeats at gmail.com> wrote:

> Hi,
> I have also been looking into deterministic simulation on the side and
> belatedly saw some interest about it on the loom email list with the
> "Custom scheduler: Customize current time and timed waits" thread. I wanted
> to briefly expand on the topic of using Loom to make deterministic
> simulation testing more easily available to the java community.
>
> BLUF: Deterministic simulation testing makes multi-threaded logic
> single-threaded and fuzz tests the execution order, greatly simplifying
> discovery of race conditions and deadlocks. See this blog post by Ao Li
> <https://aoli.al/blogs/jdk-bug/> for an example doing this in Java and
> the paper behind it, Fray: An Efficient General-Purpose Concurrency
> Testing Platform for the JVM <https://arxiv.org/abs/2501.12618>.
>
> The above explanation omits how hard it is to actually make Java or any
> language deterministic. This is where James Baker had a clever idea
> <https://jbaker.io/2022/05/09/project-loom-for-distributed-systems/> of
> using Loom: since we could replace the virtual thread scheduler with a
> deterministic one, Loom does most of the work for us without any extra
> frameworks. This works brilliantly; however, there are three main caveats:
>
>    1. It's not easy to replace the virtual thread scheduler.
>    2. The aforementioned scheduler has no understanding of delays so any
>    Thread.sleep or Object.wait introduces continuations back into the
>    scheduled execution pool non-deterministically. This affects the execution
>    order so bugs can't always be reproduced.
>    3. IO also introduces continuations back to the scheduler execution
>    pool non-deterministically.
>
> The first issue can currently be solved by using reflection to make the virtual
> thread constructor public
> <https://github.com/ryeats/loom-dst/blob/main/src/main/java/org/example/SchedulableVirtualThreadFactory.java>. The
> second issue can currently be solved by controlling system time by
> replacing the byte code for all calls to System.nanoTime(),
> System.currentTimeMillis() and Instant.now() using an agent at runtime
> <https://github.com/cmu-pasta/fray/blob/main/instrumentation/base/src/main/kotlin/org/pastalab/fray/instrumentation/base/visitors/TimeInstrumenter.kt>.
> The third likely has no general solution but I am interested in hearing
> ideas.
>
> Even if there is no solution to IO non-determinism and developers have to
> stub out all IO, deterministic simulation is still an incredibly promising
> tool for making concurrent and distributed systems programming much simpler
> and safer. I think because of this use case there would be a lot of benefit
> if the Loom API allowed instrumenting the scheduler, even better if we
> could access the DELAYED_TASK_SCHEDULERS which understand delays.
>
> Thank you for your time. I have been super excited to use virtual threads
> and amazed by the ingenuity that brought them to Java.
>
> Ryan