RFR: 8327247: C2 uses up to 2GB of RAM to compile complex string concat in extreme cases [v3]

Sun Apr 14 14:36:07 UTC 2024

On Sat, 13 Apr 2024 17:54:54 GMT, Brett Okken <duke at openjdk.org> wrote:

>> Possibly. I tried a few simple variants that initialized the `StringBuilder` capacity at various guesses, such as sum of constant sizes + some factor based on args, but across a diverse set of micros that gives both some wins and some regressions. Getting the estimation just right is pretty tricky, especially when size of the arguments are arbitrary (like for any String/Object arg).
>
> What are the scenarios which had regressions? 
> Given the conservative growth for StringBuilder, it surprises me a bit that any scenario would regress.

I took a second look and it turns out that there were neither regressions nor improvements in my test, only a few false positives: For C2 the SB chain is recognized by the (fragile) C2 OptimizeStringConcat pass and transformed into a shape where the initial size in java bytecode - if any - is ignored.

With C1 having an initial size can have a significant effect. One way to provoke a regression there is to have a huge constant suffix while underestimating the size of the operands, which can lead to excessive allocation:

Name                             Cnt     Base    Error       Test    Error   Unit  Change
StringConcat.concat23StringConst   5  385,268 ±  5,238    341,251 ±  2,606  ns/op   1,13x (p = 0,000*)
  :gc.alloc.rate                     6039,095 ± 82,309  12764,169 ± 97,146 MB/sec   2,11x (p = 0,000*)
  :gc.alloc.rate.norm                2440,003 ±  0,000   4568,002 ±  0,000   B/op   1,87x (p = 0,000*)

Still a bit better on throughput from less actual copying.

*If* the `StringBuilder` is sized sufficiently (constants + args * N) things look much better, of course: 

-XX:TieredStopAtLevel=1
Name                             Cnt     Base    Error      Test    Error   Unit  Change
StringConcat.concat23StringConst   5  385,268 ±  5,238   259,628 ±  2,515  ns/op   1,48x (p = 0,000*)
  :gc.alloc.rate                     6039,095 ± 82,309  8902,803 ± 86,563 MB/sec   1,47x (p = 0,000*)
  :gc.alloc.rate.norm                2440,003 ±  0,000  2424,002 ±  0,000   B/op   0,99x (p = 0,000*)

For most cases having a size based on the sum of the constants plus some small factor per argument seem to be a decent heuristic - for C1. Since this adds just one bytecode to the generated method I guess it wouldn't hurt.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18690#discussion_r1564753612