RFR: 8372584: [Linux]: Replace reading proc to get thread user CPU time with clock_gettime [v7]
Jonas Norlinder
jonas.norlinder at oracle.com
Mon Jan 5 09:38:01 UTC 2026
Hi Jaromir,
That sounds interesting :), as long as we are confident that your observation is part of the user ABI. Feel free to submit a PR and I will happily review it. Also add a link or reasoning to confirm that it is part of the user ABI.
Thank you,
Jonas
From: hotspot-runtime-dev <hotspot-runtime-dev-retn at openjdk.org> on behalf of Jaromir Hamala <jaromir.hamala at gmail.com>
Date: Tuesday, 23 December 2025 at 14:36
To: Kevin Walls <kevinw at openjdk.org>
Cc: hotspot-runtime-dev at openjdk.org <hotspot-runtime-dev at openjdk.org>
Subject: Re: RFR: 8372584: [Linux]: Replace reading proc to get thread user CPU time with clock_gettime [v7]
On Wed, Dec 3, 2025 at 10:35 AM Kevin Walls <kevinw at openjdk.org> wrote:
On Tue, 2 Dec 2025 20:59:41 GMT, Jonas Norlinder <jnorlinder at openjdk.org> wrote:
>> Since kernel v2.6.12, the Linux ABI has supported encoding the clock type in the last three bits of the clock ID. Setting these bits to 001 (CPUCLOCK_VIRT) makes the kernel return only user time. POSIX-compliant implementations of pthread_getcpuclockid for the Linux kernel default to constructing a clock ID with 010 (CPUCLOCK_SCHED) set, which returns system+user time, as the POSIX standard mandates; see POSIX.1-2024/IEEE Std 1003.1-2024 §3.90. This patch joins the family of glibc, musl, etc. that utilize this bit pattern.
>>
>> This PR also results in improved performance and thus a reduced observer effect, especially for the 100th percentile (max).
>>
>> Before patch:
>>
>> Benchmark                Mode        Cnt  Score  Error    Units
>> CPUTime.execute          sample  7506555  0.008  ± 0.001  ms/op
>> CPUTime.execute:p0.00    sample           0.008           ms/op
>> CPUTime.execute:p0.50    sample           0.008           ms/op
>> CPUTime.execute:p0.90    sample           0.008           ms/op
>> CPUTime.execute:p0.95    sample           0.008           ms/op
>> CPUTime.execute:p0.99    sample           0.012           ms/op
>> CPUTime.execute:p0.999   sample           0.015           ms/op
>> CPUTime.execute:p0.9999  sample           0.021           ms/op
>> CPUTime.execute:p1.00    sample           1.030           ms/op
>>
>>
>> After patch:
>>
>> Benchmark                Mode        Cnt   Score  Error  Units
>> CPUTime.execute          sample  8984189  ≈ 10⁻³         ms/op
>> CPUTime.execute:p0.00    sample           ≈ 10⁻³         ms/op
>> CPUTime.execute:p0.50    sample           ≈ 10⁻³         ms/op
>> CPUTime.execute:p0.90    sample           ≈ 10⁻³         ms/op
>> CPUTime.execute:p0.95    sample           ≈ 10⁻³         ms/op
>> CPUTime.execute:p0.99    sample            0.001         ms/op
>> CPUTime.execute:p0.999   sample            0.001         ms/op
>> CPUTime.execute:p0.9999  sample            0.006         ms/op
>> CPUTime.execute:p1.00    sample            0.054         ms/op
>>
>>
>> Testing: `java/lang/management/ThreadMXBean/ThreadUserTime.java` and the added microbenchmark.
>
> Jonas Norlinder has updated the pull request incrementally with one additional commit since the last revision:
>
> Align signature to standard
Looks good - I remember that fix for parsing the program binary name containing brackets, good to have it gone.
-------------
Marked as reviewed by kevinw (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/28556#pullrequestreview-3534064399
Apologies for reviving an old thread. I was experimenting with this change, and I believe there is a further optimisation opportunity: when the clockid has its TID set to 0, the kernel treats it as 'the current task' (which is exactly what getCurrentThreadUserTime() requires) and avoids the radix tree lookup required for an arbitrary TID.
The change: https://github.com/jerrinot/jdk/compare/master...jerrinot:jdk:jh_faster_getCurrentThreadUserTime
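As a rough illustration of the encoding discussed above (this is not the code from the linked branch, and it is written in Python rather than HotSpot C++ purely for brevity), the sketch below builds a per-thread clock ID by hand and reads user time through clock_gettime. The helper names thread_clockid and user_time_ns are hypothetical. The per-thread flag value 0b100 is taken from my reading of the kernel's CPUCLOCK_PERTHREAD_MASK, so treat it as an assumption; the 001/010 type bits are the ones named in the PR description. Linux only.

```python
import threading
import time

# Clock type bits as named in the PR description.
CPUCLOCK_VIRT = 0b001       # user time only
CPUCLOCK_SCHED = 0b010      # user + system time
# Assumption: mirrors the kernel's CPUCLOCK_PERTHREAD_MASK (thread, not process).
CPUCLOCK_PERTHREAD = 0b100


def thread_clockid(tid: int, clock_type: int) -> int:
    """Encode a per-thread CPU clock ID: ~tid in the upper bits, type in the low bits."""
    return (~tid << 3) | CPUCLOCK_PERTHREAD | clock_type


def user_time_ns(tid: int = 0) -> int:
    """User CPU time of thread `tid` in nanoseconds; tid 0 means the calling thread."""
    return time.clock_gettime_ns(thread_clockid(tid, CPUCLOCK_VIRT))


if __name__ == "__main__":
    sum(i * i for i in range(2_000_000))  # burn some user CPU time first
    by_tid = user_time_ns(threading.get_native_id())  # explicit TID
    by_zero = user_time_ns()                          # TID 0 = current thread
    print(f"explicit TID: {by_tid} ns, TID 0: {by_zero} ns")
```

Both calls read the same clock for the calling thread, so the second value should be equal to or slightly larger than the first; the only intended difference is that the TID 0 form lets the kernel skip the task lookup.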
The benchmark from https://github.com/openjdk/jdk/pull/28556 (switched to nanos + more iterations + fork count):
Before:
Benchmark                                           Mode        Cnt       Score  Error    Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  4347067      81.746  ± 0.510  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               69.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               80.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               90.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               90.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               90.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              230.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1980.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           653312.000           ns/op
After:
Benchmark                                           Mode        Cnt       Score  Error    Units
ThreadMXBeanBench.getCurrentThreadUserTime          sample  5081223      70.813  ± 0.325  ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               59.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               70.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               70.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               70.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               80.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              170.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1830.000           ns/op
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           425472.000           ns/op
Mean latency improves by roughly 13% (from 81.7 ns/op to 70.8 ns/op).
This increases the coupling to kernel internals a bit further, but the original patch already does that by setting the lower bits of the clockid directly, and Linux has a strong policy on ABI stability.
Would you be interested in merging a similar patch?
Cheers,
Jaromir Hamala
--
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
Antoine de Saint Exupéry