RFR: 8333893: Optimization for StringBuilder append boolean & null

Tue Jun 11 06:30:13 UTC 2024

On Mon, 10 Jun 2024 23:10:05 GMT, Shaojin Wen <duke at openjdk.org> wrote:

>> @wenshao 
>>> I think the performance of the Unsafe branch may be the best data for the C2 optimizer. @eme64 can help me see if C2 can do it?
>> 
>> Have you tried to see if the optimization actually was done/taken? You can use the `TraceMergeStores,` flag. Can you present the generated assembly code of the benchmarks, and explain the difference based on the generated assembly code? You can run JMH penchmarks with `perf`. These two blogs may help you:
>> 
>> http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html
>> https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_meet_jmh_prof_perfasm
>> 
>> @liach I don't think it makes a difference if it is `int` or `byte` constants. Or what exactly is the code change you are proposing?
>
> @eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.
> 
> class AbstractStringBuilder {
>     private AbstractStringBuilder appendNull() {
>     	// ...
> 		StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l');
> 		// ...
> 	}
> 
> 	public AbstractStringBuilder append(boolean b) {
> 	    // ...
> 		StringUTF16.putCharsAt(val, count, 't', 'r', 'u', 'e');
> 		// ...
> 		StringUTF16.putCharsAt(val, count, 'f', 'a', 'l', 's', 'e');
> 		// ...
> 	}
> }
> 
> class StringUTF16 {
>    public static void putCharsAt(byte[] value, int i, char c1, char c2, char c3, char c4) {
>         putChar(value, i    , c1);
>         putChar(value, i + 1, c2);
>         putChar(value, i + 2, c3);
>         putChar(value, i + 3, c4);
>     }
> 
>     @IntrinsicCandidate
>     // intrinsic performs no bounds checks
>     static void putChar(byte[] val, int index, int c) {
>         assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
>         index <<= 1;
>         val[index++] = (byte)(c >> HI_BYTE_SHIFT);
>         val[index]   = (byte)(c >> LO_BYTE_SHIFT);
>     }
> }
> 
> 
> The code for manually merging storage is as follows, without using Unsafe:
> 
> class AbstractStringBuilder {
>     static final long NULL_UTF16;
>     static final long TRUE_UTF16;
>     static final long FALS_UTF16;
> 
>     static {
>         byte[] bytes = new byte[8];
> 
>         StringUTF16.putCharsAt(bytes, 0, 'n', 'u', 'l', 'l');
>         NULL_UTF16 = getLong(bytes, 0);
> 
>         StringUTF16.putCharsAt(bytes, 0, 't', 'r', 'u', 'e');
>         TRUE_UTF16 = getLong(bytes, 0);
> 
>         StringUTF16.putCharsAt(bytes, 0, 'f', 'a', 'l', 's');
>         FALS_UTF16 = getLong(bytes, 0);
>     }
> 
>     private static long getLong(byte[] bytes, int offset) {
>         return (((long)bytes[offset    ] & 0xff)      ) |
>                (((long)bytes[offset + 1] & 0xff) <<  8) |
>                (((long)bytes[offset + 2] & 0xff) << 16) |
>                (((long)bytes[offset + 3] & 0xff) << 24) |
>                (((long)bytes[offset + 4] & 0xff) << 32) |
>                (((long)bytes[offset + 5] & 0xff) << 40) |
>                (((long)bytes[offset + 6] & 0xff) << 48) |
>                (((long)bytes[offset + 7] & 0xff) << 56);
>     }
> 
>     private static void setLong(byte[] array, int offset, long value) {
>         array[offset]     = (byte)  value;
>         array[offset + 1] = (byte) (value >> 8);
>         array[offset + 2] = (byte) (value >> 16);
>         array[offset + 3] = (byte) (value >> 24);
>         arra...

@wenshao 
> @eme64 It seems that when the following code uses StringUTF16.putChar, C2's optimization is not as good as the manual merging and storage effect.

As I asked above, you will need to provide some evidence / generated assembly / perf data, and logs from `TraceMergeStores`. I currently do not have time to produce these myself, and I think they would be crucial to determine where the missing performance has gone. See my earlier comment:
https://github.com/openjdk/jdk/pull/19626#issuecomment-2158533469

And please also try @cl4es advide here:
https://github.com/openjdk/jdk/pull/19626#issuecomment-2159509806

And sure, maybe you need some public API for setting multiple bytes at once, which the `MergeStores` optimization can optimize. I'm a C2 engineer, so I leave that up to the library folks ;)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2159895757