RFR: 8333893: Optimization for StringBuilder append boolean & null [v4]

Emanuel Peter epeter at openjdk.org
Wed Jun 12 11:42:13 UTC 2024


On Tue, 11 Jun 2024 11:35:28 GMT, Shaojin Wen <duke at openjdk.org> wrote:

>> After PR https://github.com/openjdk/jdk/pull/16245, C2 optimizes stores into primitive arrays by combining values ​​into larger stores.
>> 
>> This PR rewrites the code of appendNull and append(boolean) methods so that these two methods can be optimized by C2.
>
> Shaojin Wen has updated the pull request incrementally with one additional commit since the last revision:
> 
>   revert

I am now doing some work for you. Please read it very carefully, and try to retrace everything yourself. I expect you to do this kind of analysis yourself next time and present it to me (if you decide to continue working on this PR).

I run your test:
`./java -XX:+TraceMergeStores -Xbatch -XX:CompileCommand=printcompilation,*::* -XX:+PrintInlining AppendNullTest.java`

I see:

10329 1994    b  4       AppendNullTest$StringBuilder::appendNull (75 bytes)
[TraceMergeStores]: Replace
 404  StoreB  === 388 282 402 339  [[ 429 415 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:15 (line 115) AppendNullTest$StringUTF16::putCharsAt @ bci:3 (line 87) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 429  StoreB  === 414 404 427 23  [[ 473 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:24 (line 116) AppendNullTest$StringUTF16::putCharsAt @ bci:3 (line 87) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 473  StoreB  === 414 429 471 340  [[ 497 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:15 (line 115) AppendNullTest$StringUTF16::putCharsAt @ bci:11 (line 88) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 497  StoreB  === 414 473 495 23  [[ 541 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:24 (line 116) AppendNullTest$StringUTF16::putCharsAt @ bci:11 (line 88) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 541  StoreB  === 414 497 539 341  [[ 565 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:15 (line 115) AppendNullTest$StringUTF16::putCharsAt @ bci:20 (line 89) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 565  StoreB  === 414 541 563 23  [[ 609 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:24 (line 116) AppendNullTest$StringUTF16::putCharsAt @ bci:20 (line 89) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 609  StoreB  === 414 565 607 341  [[ 633 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:15 (line 115) AppendNullTest$StringUTF16::putCharsAt @ bci:29 (line 90) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
 633  StoreB  === 414 609 631 23  [[ 16 ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8;  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; !jvms: AppendNullTest$StringUTF16::putChar @ bci:24 (line 116) AppendNullTest$StringUTF16::putCharsAt @ bci:29 (line 90) AppendNullTest$StringBuilder::appendNull @ bci:63 (line 25)
[TraceMergeStores]: with
 758  ConL  === 0  [[ 759 ]]  #long:30399761348886638
 759  StoreL  === 414 282 402 758  [[ ]]  @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8; mismatched  Memory: @byte[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=8;
                              @ 9   AppendNullTest$StringBuilder::ensureCapacityInternal (39 bytes)   inline (hot)
                                @ 24   AppendNullTest$StringBuilder::newCapacity (43 bytes)   failed to inline: too big
                                @ 32   java.util.Arrays::copyOf (33 bytes)   failed to inline: low call site frequency
                              @ 18   AppendNullTest$StringBuilder::isLatin1 (13 bytes)   inline (hot)
                              @ 63   AppendNullTest$StringUTF16::putCharsAt (33 bytes)   inline (hot)
                                @ 3   AppendNullTest$StringUTF16::putChar (26 bytes)   inline (hot)
                                @ 11   AppendNullTest$StringUTF16::putChar (26 bytes)   inline (hot)
                                @ 20   AppendNullTest$StringUTF16::putChar (26 bytes)   inline (hot)
                                @ 29   AppendNullTest$StringUTF16::putChar (26 bytes)   inline (hot)


So. We have a compilation of `AppendNullTest$StringBuilder::appendNull`. As you can see a number of things are inlined (recursively), and some are not inlined (with reasons why).
For example, `AppendNullTest$StringUTF16::putCharsAt` is inlined, and from that method, we inline `AppendNullTest$StringUTF16::putChar` 4 times.
So now this compilation unit has all of the combine code, and we start optimizing in C2.
We see that there is one successful `MergeStore` optimization, namely with the `StoreB` from `AppendNullTest$StringUTF16::putChar @ bci:15 (line 115)` and `AppendNullTest$StringUTF16::putChar @ bci:15 (line 115)`. Since the `putChar` was inlined 4 times, we have 4 pairs of them, in total 8 `StoreB`. These are combined together into a single `StoreL`. Success.

since you created a `StringBuilder sb = new StringBuilder(true);` where `isLatin1()` is always `false`, we actually will only compile the code in the `false` path, and add a `uncommon trap` on the `true` path. That is why we would not expect any optimizations to happen on the `true` path, we are not compiling it!

As a proof, I will run this:
`./java -XX:+TraceMergeStores -Xbatch -XX:CompileCommand=printcompilation,*::* -XX:+PrintInlining -XX:CompileCommand=printideal,*::appendNull AppendNullTest.java > log.log`

Now I can see this IR node in the `PrintIdeal` part of the log:
`322  CallStaticJava  === 318 281 309 8 9 (321 10 25 683 1 650 ) [[ 323 ]] # Static uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0')  void ( int ) C=0.000100 AppendNullTest$StringBuilder::isLatin1 @ bci:4 (line 77) reexecute AppendNullTest$StringBuilder::appendNull @ bci:18 (line 19) !jvms: AppendNullTest$StringBuilder::isLatin1 @ bci:4 (line 77) AppendNullTest$StringBuilder::appendNull @ bci:18 (line 19)`
It says `uncommon_trap(reason='unstable_if'...)`. Unstable if means we decide to only compile one side of the branch, if we ever encounter the other branch at runtime, we deoptimize, i.e. we drop out of the C2 code back to the interpreter.

Now let us look at the assembly code:
`./java -XX:+TraceMergeStores -Xbatch -XX:CompileCommand=printcompilation,*::* -XX:+PrintInlining -XX:CompileCommand=printideal,*::appendNull -XX:CompileCommand=print,*::appendNull AppendNullTest.java > log.log`

There are always two sections, one with the "OptoAssembly" and one with "Assembly". They correspond to each other, the former is more high level, the latter more low-level. I like to look at the "Assembly" one, because that is what we actually end up running on the hardware, in my case this is x86-64 assembly.
After looking a bit around, I found this section:

  0x00007f05ccc5d57a:   movslq %edx,%rdx                    ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - AppendNullTest$StringUTF16::putChar at 15 (line 115)
                                                            ; - AppendNullTest$StringUTF16::putCharsAt at 3 (line 87)
                                                            ; - AppendNullTest$StringBuilder::appendNull at 63 (line 25)
  0x00007f05ccc5d57d:   cmp    %r11d,%edi
  0x00007f05ccc5d580:   jae    0x00007f05ccc5d6fc
 ;; B7: #       out( N384 ) <- in( B6 )  Freq: 0.999995
  0x00007f05ccc5d586:   mov    %r10d,0xc(%rbx)              ;*putfield count {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - AppendNullTest$StringBuilder::appendNull at 70 (line 27)
  0x00007f05ccc5d58a:   movabs $0x6c006c0075006e,%r10
  0x00007f05ccc5d594:   mov    %r10,0x10(%r8,%rdx,1)        ;*synchronization entry
                                                            ; - AppendNullTest$StringBuilder::appendNull at -1 (line 16)

This is interesting, because `movabs $0x6c006c0075006e,%r10` loads the constant into the `r10` register, and this is the same constant that our `MergeStores` optimization generated, i.e. `0x6c006c0075006e == 30399761348886638`.
But there is more, I think (but not completely sure her) that the `cmp` is a bounds check on the `byte[]` array.

Side note: if we do "normal" accesses on byte arrays, we bounds check. If we use `Unsafe` there is usually no bounds check. The bounds check can sometimes incur some overhead of course, but generally they are very very predictable for the branch predictor (another low level detail of the CPU that you can read up on your own if you are interested).

Does this make sense so far, or do you have questions?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2162793445


More information about the core-libs-dev mailing list