RFR: 7903722: JMH: Add xctrace-based perfnorm profiler for macOS

Filipp Zhinkin fzhinkin at openjdk.org
Tue May 7 06:06:26 UTC 2024


Implementation of a perfnorm-like profiler for macOS based on the `xctrace` command-line tool bundled with Xcode.

While the profiler is tested and seems to be working well, I consider it a preliminary version and am open to discussion on what it should measure and how.

Currently, the profiler only supports sampling PMU counters using the `CPU Counters` instrument provided by the Instruments app / xctrace.
Unfortunately, the `CPU Counters` instrument has no default settings, unlike the `Time Profiler` and `CPU Profiler` instruments used by the recently merged `xctraceasm` profiler.
To use `CPU Counters`, a user has to create a template in the Instruments UI, select PMU events, save the template, and then supply it to `xctracenorm` as an argument.

This workflow not only prevents using the profiler without preliminary manual configuration, but also becomes annoying when measuring multiple events, as xctrace, unlike perf_events, does not support event multiplexing.

Thankfully, command-line-based configuration and default parameters can be emulated by building a custom Instruments package that imports data from `CPU Counters` and also supplies all required parameters.
As you can guess, there's no way to get information about supported PMU events directly from xctrace, but it can be fetched from KPEP database files stored in `/usr/share/kpep`.
`xctracenorm` relies on that data to validate events specified by a user, if any, and also to print a help message that gives some insight into what can be sampled.
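
To illustrate the validation step, here is a minimal sketch, not the code from the PR. It assumes the set of supported event names has already been extracted from a KPEP file; the class name, the method names, and the `NO_SUCH_EVENT` event are hypothetical and used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: checks user-selected event names against the
// names assumed to be already loaded from a KPEP database file.
final class EventValidationSketch {

    // Returns the requested names that are missing from the supported set,
    // preserving the order in which they were requested.
    static List<String> findUnsupported(Collection<String> requested, Set<String> supported) {
        List<String> unknown = new ArrayList<>();
        for (String event : requested) {
            if (!supported.contains(event)) {
                unknown.add(event);
            }
        }
        return unknown;
    }

    // Fails fast with a readable message instead of handing bad input to xctrace.
    static void requireSupported(Collection<String> requested, Set<String> supported) {
        List<String> unknown = findUnsupported(requested, supported);
        if (!unknown.isEmpty()) {
            throw new IllegalArgumentException("Unsupported PMU events: " + unknown);
        }
    }

    public static void main(String[] args) {
        // Event names taken from the M2 output shown in this message.
        Set<String> supported = new HashSet<>(Arrays.asList(
                "INST_ALL", "CORE_ACTIVE_CYCLE", "INST_BRANCH", "BRANCH_MISPRED_NONSPEC"));
        requireSupported(Arrays.asList("INST_ALL", "CORE_ACTIVE_CYCLE"), supported);
        System.out.println(findUnsupported(Arrays.asList("INST_ALL", "NO_SUCH_EVENT"), supported));
    }
}
```

A name check like this is only part of the picture: the actual profiler also has to verify that the selected events can be sampled simultaneously, which this sketch does not attempt.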

To sum up, a few things were implemented to make the `xctracenorm` profiler work:
- CPU model detection using `sysctl`;
- KPEP file parsing to extract information about the PMU and all supported events;
- validation of the selected performance events;
- Instruments package building (generate XML, call a builder tool), with packages cached in `~/Library/Caches/org.openjdk.jmh`;
- xctrace execution, extraction of the resulting samples, and aggregation;
- sample postprocessing to calculate some additional metrics, like CPI and branch misprediction ratio.
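
The postprocessing step can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the class and method names are hypothetical, the input is assumed to be a map of per-operation counter values, and the event names are the Apple Silicon ones from the M2 run reported in this message (other CPUs use different names).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives ratio metrics from per-operation counter values.
final class DerivedMetricsSketch {

    static Map<String, Double> derive(Map<String, Double> perOp) {
        Map<String, Double> out = new LinkedHashMap<>();
        Double cycles = perOp.get("CORE_ACTIVE_CYCLE");
        Double insts = perOp.get("INST_ALL");
        Double branches = perOp.get("INST_BRANCH");
        Double misses = perOp.get("BRANCH_MISPRED_NONSPEC");

        // CPI and IPC need both counters and non-zero denominators.
        if (cycles != null && insts != null && cycles > 0 && insts > 0) {
            out.put("CPI", cycles / insts);
            out.put("IPC", insts / cycles);
        }
        // Share of branch instructions among all retired instructions.
        if (branches != null && insts != null && insts > 0) {
            out.put("INST_BRANCH density (of instructions)", branches / insts);
        }
        // Mispredicted branches per executed branch.
        if (misses != null && branches != null && branches > 0) {
            out.put("Branch miss ratio", misses / branches);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> perOp = new LinkedHashMap<>();
        perOp.put("CORE_ACTIVE_CYCLE", 10.541);
        perOp.put("INST_ALL", 30.031);
        perOp.put("INST_BRANCH", 3.850);
        derive(perOp).forEach((name, value) -> System.out.printf("%s = %.3f%n", name, value));
    }
}
```

Metrics whose inputs are missing are simply omitted, so a run that samples only a subset of events still yields the ratios that can be computed from it.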

Currently, if a user does not specify any additional options, `xctracenorm` samples the instructions, cycles, branches, and mispredicted branches events.
These were selected as events that should be supported on all hardware macOS runs on; only 4 events were selected for the same reason.

Profiling results look like this on an M2-based MacBook:

java -jar ./benchmarks.jar -prof xctracenorm -f 1 JMHSample_35_Profilers.Atomic
...
Benchmark                                                                 Mode  Cnt   Score   Error                               Units
JMHSample_35_Profilers.Atomic.test                                        avgt    5   4.055 ± 0.185                               ns/op
JMHSample_35_Profilers.Atomic.test:BRANCH_MISPRED_NONSPEC                 avgt       ≈ 10⁻⁵                                        #/op
JMHSample_35_Profilers.Atomic.test:Branch miss ratio                      avgt       ≈ 10⁻⁶          BRANCH_MISPRED_NONSPEC/INST_BRANCH
JMHSample_35_Profilers.Atomic.test:CORE_ACTIVE_CYCLE                      avgt       10.541                                        #/op
JMHSample_35_Profilers.Atomic.test:CPI                                    avgt        0.351                  CORE_ACTIVE_CYCLE/INST_ALL
JMHSample_35_Profilers.Atomic.test:INST_ALL                               avgt       30.031                                        #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH                            avgt        3.850                                        #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH density (of instructions)  avgt        0.128                        INST_BRANCH/INST_ALL
JMHSample_35_Profilers.Atomic.test:IPC                                    avgt        2.849                  INST_ALL/CORE_ACTIVE_CYCLE



Here are some alternatives to the existing implementation (mostly of the default-parameters mode):
- Create an Instruments template and bundle it with JMH instead of generating a package dynamically: templates have a proprietary and obscure format, and it's unclear whether a template created on one device will work on other devices, or whether it will keep working after an Xcode update. Also, dynamic package generation allows selecting PMU events from the CLI when running benchmarks, which seems much more convenient than setting up templates in the UI.
- Don't parse KPEP files: data from these files is used for validation and also lets users find out what can be sampled without resorting to a separate tool. Validation is necessary as xctrace simply crashes when something is wrong with the selected events.

Open questions and things that require some fixes:
* [ ] Currently, only PMU sampling is supported; however, the profiler could also sample some software events like context switches, virtual memory operations, and syscall statistics. It's unclear whether any of that should ever be supported.
* [ ] Some parts of the profiling process, namely KPEP parsing and Instruments package building, are covered by tests in the `jmh-core-it` module (`XCTraceSupportTest`). The tested functions are package-private in `jmh-core`, so to test them from the `it` module I had to call everything through reflection, which definitely doesn't look good. I'm not sure how to both keep the API private and test it from another module, so I'm looking forward to any advice on that.
* [ ] `CPU Counters` doesn't work inside VMs (at the very least because there are no KPEP files for VM CPU IDs, so xctrace can't load them to fetch info about PMU events; and even if a KPEP file is there, there's no way to sample PMU counters inside a VM), so all positive profiler tests are currently skipped on GH agents. I'm not sure what could be done here to improve testability.
* [x] When running tests locally, surefire forks used in `jmh-core-it` cause interference between `xctraceasm` and `xctracenorm` tests leading to test failures. I'm currently looking into how to overcome the issue and serialize these tests' execution. In the worst case, tests for both profilers could be placed in a single test class, I guess.

Speaking of testing, I manually ran basic scenarios (listing profilers, printing a help message, listing supported events, and, finally, profiling) on Intel-, M1- and M2-based MacBooks.
The results and the script I used to collect the data can be found here: https://gist.github.com/fzhinkin/f7c5db00f5e3417191d66994ed880818

-------------

Commit messages:
 - 7903722: Add extra tests
 - 7903722: Scan all possible KPEP file locations
 - 7903722: Serialize xctrace tests execution
 - 7903722: simplified code, added missing docs, supported branch events
 - 7903722: Improve events preprocessing
 - 7903722: Refactor KPEP database loading
 - 7903722: compute AS Arm64 instructions density metrics
 - 7903722: check if all listed events could be sampled simultaneously
 - 7903722: Add xctracenorm profiler
 - 7903722: Get rid of custom event aliases
 - ... and 4 more: https://git.openjdk.org/jmh/compare/6d6ce631...ecba1544

Changes: https://git.openjdk.org/jmh/pull/131/files
  Webrev: https://webrevs.openjdk.org/?repo=jmh&pr=131&range=00
  Issue: https://bugs.openjdk.org/browse/CODETOOLS-7903722
  Stats: 5800 lines in 16 files changed: 5779 ins; 0 del; 21 mod
  Patch: https://git.openjdk.org/jmh/pull/131.diff
  Fetch: git fetch https://git.openjdk.org/jmh.git pull/131/head:pull/131

PR: https://git.openjdk.org/jmh/pull/131

