RFR: 8372584: [Linux]: Replace reading proc to get thread user CPU time with clock_gettime [v7]

Tue Dec 23 13:36:28 UTC 2025

On Wed, Dec 3, 2025 at 10:35 AM Kevin Walls <kevinw at openjdk.org> wrote:

> On Tue, 2 Dec 2025 20:59:41 GMT, Jonas Norlinder <jnorlinder at openjdk.org>
> wrote:
>
> >> Since kernel v2.6.12 the Linux ABI has supported encoding the clock
> >> type in the last three bits of the clockid. Setting the bits to 001
> >> (CPUCLOCK_VIRT) will result in the kernel returning only user time.
> >> POSIX compliant implementations of pthread_getcpuclockid for the
> >> Linux kernel default to constructing a clockid with 010
> >> (CPUCLOCK_SCHED) set, which returns system+user time, which is what
> >> the POSIX standard mandates, see POSIX.1-2024/IEEE Std 1003.1-2024
> >> §3.90. This patch joins the family of glibc, musl etc. that utilize
> >> this bit pattern.
> >>
> >> This PR also results in improved performance and thus a reduced
> >> observer effect, especially for the 100th percentile (max).
> >>
> >> Before patch:
> >>
> >> Benchmark                  Mode      Cnt  Score    Error  Units
> >> CPUTime.execute          sample  7506555  0.008 ±  0.001  ms/op
> >> CPUTime.execute:p0.00    sample           0.008           ms/op
> >> CPUTime.execute:p0.50    sample           0.008           ms/op
> >> CPUTime.execute:p0.90    sample           0.008           ms/op
> >> CPUTime.execute:p0.95    sample           0.008           ms/op
> >> CPUTime.execute:p0.99    sample           0.012           ms/op
> >> CPUTime.execute:p0.999   sample           0.015           ms/op
> >> CPUTime.execute:p0.9999  sample           0.021           ms/op
> >> CPUTime.execute:p1.00    sample           1.030           ms/op
> >>
> >>
> >> After patch:
> >>
> >> Benchmark                  Mode      Cnt   Score    Error  Units
> >> CPUTime.execute          sample  8984189  ≈ 10⁻³           ms/op
> >> CPUTime.execute:p0.00    sample           ≈ 10⁻³           ms/op
> >> CPUTime.execute:p0.50    sample           ≈ 10⁻³           ms/op
> >> CPUTime.execute:p0.90    sample           ≈ 10⁻³           ms/op
> >> CPUTime.execute:p0.95    sample           ≈ 10⁻³           ms/op
> >> CPUTime.execute:p0.99    sample            0.001           ms/op
> >> CPUTime.execute:p0.999   sample            0.001           ms/op
> >> CPUTime.execute:p0.9999  sample            0.006           ms/op
> >> CPUTime.execute:p1.00    sample            0.054           ms/op
> >>
> >>
> >> Testing: `java/lang/management/ThreadMXBean/ThreadUserTime.java` and
> >> the added microbenchmark.
> >
> > Jonas Norlinder has updated the pull request incrementally with one
> > additional commit since the last revision:
> >
> >   Align signature to standard
>
> Looks good - I remember that fix for parsing the program binary name
> containing brackets, good to have it gone.
>
> -------------
>
> Marked as reviewed by kevinw (Reviewer).
>
> PR Review:
> https://git.openjdk.org/jdk/pull/28556#pullrequestreview-3534064399
>

Apologies for reviving an old thread. I was experimenting with this change,
and I believe there is a further optimisation opportunity: when the clockid
has its TID set to 0, the kernel treats it as 'the current task' (which is
what getCurrentThreadUserTime() requires) and avoids the radix lookup
required for an arbitrary TID.
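
To make the idea concrete, here is a minimal sketch in C of how such a
clockid could be built and queried. It is not the code from the patch: the
helper names and the two constants are assumptions for illustration, with
the constants mirroring the kernel's internal clockid encoding (they are
not exported by userspace headers).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Assumed constants: they mirror the kernel's private clockid encoding
 * and are defined here only for illustration. */
enum {
    CPUCLOCK_VIRT_BITS     = 1, /* user time only */
    CPUCLOCK_PERTHREAD_BIT = 4  /* clock refers to a thread, not a process */
};

/* Hypothetical helper: build a per-thread user-time clockid for `tid`.
 * The upper bits hold the bitwise complement of the TID, so tid == 0
 * yields the value -3, which the kernel reads as "the calling thread". */
static clockid_t thread_user_clockid(uint32_t tid) {
    return (clockid_t)((~tid << 3) | CPUCLOCK_PERTHREAD_BIT | CPUCLOCK_VIRT_BITS);
}

/* Hypothetical helper: user CPU time of the calling thread in
 * nanoseconds, or -1 if clock_gettime() rejects the clockid. */
static long long current_thread_user_time_ns(void) {
    struct timespec ts;
    if (clock_gettime(thread_user_clockid(0), &ts) != 0) {
        return -1;
    }
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Passing a real TID to thread_user_clockid() gives the general form used
for arbitrary threads; only the tid == 0 case takes the shortcut described
above.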

The change:
https://github.com/jerrinot/jdk/compare/master...jerrinot:jdk:jh_faster_getCurrentThreadUserTime
The benchmark from https://github.com/openjdk/jdk/pull/28556 (switched to
nanos + more iterations + fork count):

Before:
Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  4347067      81.746 ± 0.510  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               69.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               90.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              230.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1980.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           653312.000          ns/op

After:
Benchmark                                             Mode      Cnt       Score   Error  Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  5081223      70.813 ± 0.325  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               59.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               70.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               80.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              170.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1830.000          ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           425472.000          ns/op

There is around a 13% latency improvement on average (81.7 ns/op down to
70.8 ns/op). It increases coupling to kernel internals a bit further, but
the original patch already does that by poking the lower bits, and Linux
has a strong policy on ABI stability.

Would you be interested in merging a similar patch?

Cheers,
Jaromir Hamala

-- 
“Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away.”
Antoine de Saint Exupéry