<div dir="ltr"><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Wed, Dec 3, 2025 at 10:35 AM Kevin Walls <<a href="mailto:kevinw@openjdk.org">kevinw@openjdk.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Tue, 2 Dec 2025 20:59:41 GMT, Jonas Norlinder <<a href="mailto:jnorlinder@openjdk.org" target="_blank">jnorlinder@openjdk.org</a>> wrote:<br>

<br>

>> Since kernel v2.6.12 the Linux ABI has supported encoding the clock type in the last three bits of the clockid. Setting the bits to 001 (CPUCLOCK_VIRT) makes the kernel return only user time. POSIX-compliant implementations of pthread_getcpuclockid for the Linux kernel default to constructing a clockid with 010 (CPUCLOCK_SCHED) set, which returns system+user time, as the POSIX standard mandates; see POSIX.1-2024/IEEE Std 1003.1-2024 §3.90. This patch joins the family of glibc, musl etc. that utilise this bit pattern.<br>

>> <br>

>> This PR also results in improved performance and thus a reduced observer effect, especially for the 100th percentile (max).<br>

>> <br>

>> Before patch:<br>

>> <br>

>> Benchmark                  Mode      Cnt  Score    Error  Units<br>

>> CPUTime.execute          sample  7506555  0.008 ±  0.001  ms/op<br>

>> CPUTime.execute:p0.00    sample           0.008           ms/op<br>

>> CPUTime.execute:p0.50    sample           0.008           ms/op<br>

>> CPUTime.execute:p0.90    sample           0.008           ms/op<br>

>> CPUTime.execute:p0.95    sample           0.008           ms/op<br>

>> CPUTime.execute:p0.99    sample           0.012           ms/op<br>

>> CPUTime.execute:p0.999   sample           0.015           ms/op<br>

>> CPUTime.execute:p0.9999  sample           0.021           ms/op<br>

>> CPUTime.execute:p1.00    sample           1.030           ms/op<br>

>> <br>

>> <br>

>> After patch:<br>

>> <br>

>> Benchmark                  Mode      Cnt   Score    Error  Units<br>

>> CPUTime.execute          sample  8984189  ≈ 10⁻³           ms/op<br>

>> CPUTime.execute:p0.00    sample           ≈ 10⁻³           ms/op<br>

>> CPUTime.execute:p0.50    sample           ≈ 10⁻³           ms/op<br>

>> CPUTime.execute:p0.90    sample           ≈ 10⁻³           ms/op<br>

>> CPUTime.execute:p0.95    sample           ≈ 10⁻³           ms/op<br>

>> CPUTime.execute:p0.99    sample            0.001           ms/op<br>

>> CPUTime.execute:p0.999   sample            0.001           ms/op<br>

>> CPUTime.execute:p0.9999  sample            0.006           ms/op<br>

>> CPUTime.execute:p1.00    sample            0.054           ms/op<br>

>> <br>

>> <br>

>> Testing: `java/lang/management/ThreadMXBean/ThreadUserTime.java` and the added microbenchmark.<br>

><br>

> Jonas Norlinder has updated the pull request incrementally with one additional commit since the last revision:<br>

> <br>

>   Align signature to standard<br>

<br>

Looks good - I remember that fix for parsing the program binary name containing brackets, good to have it gone.<br>

<br>

-------------<br>

<br>

Marked as reviewed by kevinw (Reviewer).<br>

<br>

PR Review: <a href="https://git.openjdk.org/jdk/pull/28556#pullrequestreview-3534064399" rel="noreferrer" target="_blank">https://git.openjdk.org/jdk/pull/28556#pullrequestreview-3534064399</a><br>

</blockquote></div><div><br clear="all"></div><div>Apologies for reviving an old thread. I was experimenting with this change, and I believe there is a further optimisation opportunity: when the clockid encodes a TID of 0, the kernel treats it as 'the current task' (which is exactly what getCurrentThreadUserTime() requires) and avoids the radix lookup required for an arbitrary TID. <br></div><div><br></div>
<div>The change: <a href="https://github.com/jerrinot/jdk/compare/master...jerrinot:jdk:jh_faster_getCurrentThreadUserTime">https://github.com/jerrinot/jdk/compare/master...jerrinot:jdk:jh_faster_getCurrentThreadUserTime</a> <br></div>
<div>The benchmark from <a href="https://github.com/openjdk/jdk/pull/28556">https://github.com/openjdk/jdk/pull/28556</a> (switched to nanos + more iterations + fork count):</div><div><br></div>
<div>Before:</div>
<div>Benchmark                                             Mode      Cnt       Score   Error  Units<br>
ThreadMXBeanBench.getCurrentThreadUserTime          sample  4347067      81.746 ± 0.510  ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               69.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               80.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               90.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               90.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               90.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              230.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1980.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           653312.000          ns/op<br>
<br>
After:<br>
Benchmark                                             Mode      Cnt       Score   Error  Units<br>
ThreadMXBeanBench.getCurrentThreadUserTime          sample  5081223      70.813 ± 0.325  ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.00    sample               59.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.50    sample               70.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.90    sample               70.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.95    sample               70.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.99    sample               80.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.999   sample              170.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p0.9999  sample             1830.000          ns/op<br>
ThreadMXBeanBench.getCurrentThreadUserTime:p1.00    sample           425472.000          ns/op</div><div><br></div>
<div>That is roughly a 13% improvement in average latency (81.746 ns/op down to 70.813 ns/op). <br></div>
<div>It couples the code to kernel internals a bit further, but the original patch already does that by setting the lower bits of the clockid, and Linux has a strong policy on ABI stability. <br></div><div><br></div>
<div>Would you be interested in merging a similar patch? <br></div><div><br></div>
<div>Cheers,</div><div>Jaromir Hamala</div><div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature">“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”<br>Antoine de Saint Exupéry</div></div>
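For readers unfamiliar with the clockid encoding discussed above, here is a minimal C sketch of the idea, not the code from the linked branch. It assumes Linux, where clockid_t is a 32-bit int, and it redefines the kernel's CPUCLOCK_* constants locally because userspace headers do not export them. The helper names make_thread_cpuclock and current_thread_user_nanos are hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <time.h>

/* Local copies of the kernel's clock-type bits (assumption: these mirror
 * the kernel's internal definitions; they are not in userspace headers). */
enum {
    CPUCLOCK_PROF = 0,           /* profiling clock */
    CPUCLOCK_VIRT = 1,           /* user time only */
    CPUCLOCK_SCHED = 2,          /* user + system time */
    CPUCLOCK_PERTHREAD_MASK = 4  /* set: per-thread clock, clear: per-process */
};

/* Encode a per-thread CPU clock id: the bitwise-inverted TID goes in the
 * upper bits, the per-thread flag and the clock type in the low three bits.
 * Unsigned arithmetic avoids left-shifting a negative value. */
static clockid_t make_thread_cpuclock(pid_t tid, int type) {
    assert(tid >= 0 && type >= 0 && type <= 2);
    unsigned int bits = (~(unsigned int)tid << 3)
                      | (unsigned int)CPUCLOCK_PERTHREAD_MASK
                      | (unsigned int)type;
    return (clockid_t)bits;
}

/* User-mode CPU time of the calling thread in nanoseconds, or -1 on error.
 * A TID of 0 asks the kernel for "the current task". */
static long long current_thread_user_nanos(void) {
    struct timespec ts;
    clockid_t id = make_thread_cpuclock(0, CPUCLOCK_VIRT);
    if (clock_gettime(id, &ts) != 0) {
        return -1;
    }
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

With this encoding, TID 0 plus CPUCLOCK_VIRT yields the clockid -3, so a caller never needs to look up its own TID before calling clock_gettime.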