RFR: 8344009: Improve compiler memory statistics
Thomas Stuefe
stuefe at openjdk.org
Thu Feb 13 16:14:02 UTC 2025
Greetings,
This is a rewrite of the compiler memory statistics. The primary new feature is the ability to track allocations by C2 phase, which allows for a much faster and more thorough analysis of footprint issues.
Tracking arena memory movement is not trivial, since one needs to follow the ebb and flow of allocations over nested C2 phases. A phase typically allocates more than it releases, accruing new nodes and resource-area data. A phase can also release more than it allocates, when arenas carried over from other phases go out of scope within it. Finally, a phase can have high temporary peaks that vanish before it ends.
I wanted to track that information correctly and display it clearly in a way that is easy to understand.
The patch implements per-phase tracking by instrumenting the `TracePhase` stack object (thanks to @rwestrel for this idea).
The nice thing about this technique is that it also allows for quick analysis of a suspected hot spot (e.g., the inside of a loop): drop a TracePhase with a descriptive name in there, and you can see the allocations made inside that scope.
The statistic gives us two new forms of output:
1) At the moment the compilation memory *peaked*, we now get a detailed breakdown of that peak usage per phase:
Arena Usage by Arena Type and compilation phase, at arena usage peak of 58817816:
Phase                         Total       ra     node     comp  type index reglive regsplit    cienv   other
none                        1205512   155104   982984    33712     0     0       0        0        0   33712
parse                      11685376   720016  6578728  1899064     0     0       0        0  1832888  654680
optimizer                    916584        0   556416        0     0     0       0        0        0  360168
escapeAnalysis              1983400        0  1276392   707008     0     0       0        0        0       0
connectionGraph              720016        0        0   621832     0     0       0        0    98184       0
macroEliminate               196448        0   196448        0     0     0       0        0        0       0
iterGVN                      327440        0   196368   131072     0     0       0        0        0       0
incrementalInline           3992816        0  3043704   621832     0     0       0        0   261824   65456
incrementalInline_igvn       458512        0   163640   294872     0     0       0        0        0       0
incrementalInline_inline      32728        0        0    32728     0     0       0        0        0       0
                             337880        0   108504   229376     0     0       0        0        0       0
idealLoop                   2499568        0   566696  1932872     0     0       0        0        0       0
idealLoopVerify              327600        0        0   327600     0     0       0        0        0       0
ccp                           65456        0    32728        0     0     0       0        0        0   32728
macroExpand                 1898544        0  1570944   327600     0     0       0        0        0       0
graphReshape                 347920        0   347920        0     0     0       0        0        0       0
matcher                     9480400  1817928  6417448  1245024     0     0       0        0        0       0
postselect_cleanup           163800   163800        0        0     0     0       0        0        0       0
scheduler                    458192    32728   425464        0     0     0       0        0        0       0
regalloc                     178072   178072        0        0     0     0       0        0        0       0
ctorChaitin                   39432    39432        0        0     0     0       0        0        0       0
regAllocSplit               1865496    32728  1832768        0     0     0       0        0        0       0
chaitinCoalesce1            1277112   196608  1080504        0     0     0       0        0        0       0
peephole                      32728        0    32728        0     0     0       0        0        0       0
output                     17868312 17868312        0        0     0     0       0        0        0       0
shorten branches             458472    65456    32728   360288     0     0       0        0        0       0
This is mostly self-explanatory. In this example, when the compilation hit its peak of 58 MB, the table shows how much memory (column `Total`) had been allocated on behalf of each individual C2 phase. Note that "none" means allocations that happened outside of any TracePhase scope. The columns following `Total` break the per-phase total down into the individual arena types (resource area, node arena, etc.).
2) We also get a detailed timeline of phase execution and the gradual memory buildup. This also shows large memory spikes that were confined to a single phase and were therefore invisible before:
Allocation timeline by phase:
Phase seq. number Bytes Nodes
>0 (outside) 102120 (+102120) 3 (+3)
>1 parse 11787496 (+11685376) 7151 (+7148)
<0 (cont.) (outside) 11787496 (+0) 7151 (+0)
>2 optimizer 11787496 (+0) 7151 (+0)
>3 iterGVN 12180392 (+392896) 6313 (-838)
<2 (cont.) optimizer 12180392 (+0) 6313 (+0)
>4 incrementalInline 12180392 (+0) 6313 (+0)
>5 incrementalInline_inline 12213120 (+32728) 6330 (+17)
<4 (cont.) incrementalInline 12213120 (+0) 6330 (+0)
>6 incrementalInline_pru 12213120 (+0) 6287 (-43)
<4 (cont.) incrementalInline 12213120 (+0) 6287 (+0)
>7 incrementalInline_igvn 12213120 (+0) 6286 (-1)
<4 (cont.) incrementalInline 16631400 (+4418280) 17200 (+10914)
>8 incrementalInline_pru 16631400 (+0) 10185 (-7015)
<4 (cont.) incrementalInline 16631400 (+0) 10185 (+0)
>9 incrementalInline_igvn 17122640 (+491240) 9374 (-811)
<4 (cont.) incrementalInline 17122640 (+0) 9374 (+0)
<2 (cont.) optimizer 17122640 (+0) 9396 (+22)
>10 incrementalInline_pru 17122640 (+0) 9360 (-36)
<2 (cont.) optimizer 17122640 (+0) 9360 (+0)
>11 incrementalInline_igvn 17122640 (+0) 9353 (-7)
<2 (cont.) optimizer 17745072 (+622432) 9318 (-35)
>12 18082952 (+337880) 9317 (-1)
<2 (cont.) optimizer 18082952 (+0) 9317 (+0)
>13 idealLoop 18420712 (+337760) 9247 (-70) significant temporary peak: 19762880 (+1679928)
< SNIP SNIP >
<41 (cont.) regalloc 52599072 (+0) 23604 (+0)
>71 chaitinSelect 52599072 (+0) 23604 (+0)
<41 (cont.) regalloc 52599072 (+0) 23604 (+0)
>72 postAllocCopyRemoval 52599072 (+0) 19285 (-4319) significant temporary peak: 58359200 (+5760128)
<41 (cont.) regalloc 52599072 (+0) 19285 (+0)
>73 mergeMultidefs 52599072 (+0) 19285 (+0)
<41 (cont.) regalloc 52599072 (+0) 19285 (+0)
>74 fixupSpills 52599072 (+0) 19256 (-29)
<41 (cont.) regalloc 40458304 (-12140768) 19256 (+0)
<0 (cont.) (outside) 40458304 (+0) 19256 (+0)
>75 blockOrdering 40458304 (+0) 19268 (+12)
<0 (cont.) (outside) 40458304 (+0) 19268 (+0)
>76 peephole 40491032 (+32728) 19332 (+64)
<0 (cont.) (outside) 40491032 (+0) 19332 (+0)
>77 output 40491032 (+0) 19336 (+4)
>78 shorten branches 40949504 (+458472) 19239 (-97)
<77 (cont.) output 40949504 (+0) 19240 (+1) significant temporary peak: 58817816 (+17868312)
>79 bldOopMaps 48294864 (+7345360) 19240 (+0)
<77 (cont.) output 48294864 (+0) 19240 (+0)
>80 fill buffer 50635256 (+2340392) 19574 (+334)
<77 (cont.) output 50635256 (+0) 19574 (+0)
>81 install_code 50635256 (+0) 19574 (+0)
<77 (cont.) output 50634272 (-984) 19574 (+0)
<0 (cont.) (outside) 2225624 (-48408648) 0 (-19574)
The timeline shows the individual phases in the order in which they were executed. The left side is the tree of possibly nested TracePhase scopes. Every phase execution (a phase can run multiple times) gets a unique phase sequence number. The tree shows those sequence numbers, and when a child phase ends, you can see that we are back in the enclosing phase ("cont."). After that follow the phase name, the arena footprint at that point (Bytes, with the delta caused by this phase in parentheses), and the node count (Nodes, likewise with its delta). If the phase caused a large temporary spike, it shows up as a "significant temporary peak".
Here you can find a full printout of this information on a run of springboot petclinic with `-XX:CompileCommand=memstat,*.*` via `jcmd spring Compiler.memory verbose`: https://gist.github.com/tstuefe/9d00e7129d4cdf5d1dbc294a80cbe4ed
----
We also now print more detailed information if the JVM hits a MemLimit and crashes (option `-XX:CompileCommand=memlimit`). Note that debug JVMs run with a memlimit of 1 GB by default, and we have already found a couple of footprint issues with this setting. See an example of a generated hs-err file here: https://gist.github.com/tstuefe/777ecd68b313097a8e9020ac9fea239a
----
Performance costs: not much to tell here; we only do a tiny bit more work per arena chunk allocation and after each compilation ends, nothing crazy. Nevertheless, I tested with SPECjbb before and after this patch; nothing rose above background noise.
Footprint costs: the new information makes the per-compilation data considerably bulkier. That is why I reduced the amount of information stored long-term: we now only keep the 32 most expensive compilations. To see the cost of every compilation in its gory details, use the `-XX:CompileCommand=memstat,*.*,print` option. Of course, you can always reduce the scope of the tracking, e.g. limit it to a certain class only.
---
Tests:
- I tested manually on x64 Linux. GHAs ran (including the new tests).
- Tests at SAP are green
-------------
Commit messages:
- more code grooming and comments
- grooming
- wip
- Improved compiler memory statistics
Changes: https://git.openjdk.org/jdk/pull/23530/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23530&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8344009
Stats: 1753 lines in 28 files changed: 1187 ins; 237 del; 329 mod
Patch: https://git.openjdk.org/jdk/pull/23530.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23530/head:pull/23530
PR: https://git.openjdk.org/jdk/pull/23530