RFR: 8352316: More MergeStoreBench

Emanuel Peter epeter at openjdk.org
Thu Mar 27 07:15:07 UTC 2025


On Thu, 27 Mar 2025 05:17:44 GMT, Shaojin Wen <swen at openjdk.org> wrote:

>> I'm a developer of fastjson2. According to third-party benchmarks from https://github.com/fabienrenaud/java-json-benchmark, our library demonstrates the best performance. I would like to contribute some of these optimization techniques to OpenJDK, ideally by having C2 (the JIT compiler) directly support them.
>> 
>> Below is an example related to this PR. We have a JavaBean that needs to be serialized to a JSON string:
>> 
>> 
>> * JavaBean
>> 
>> class Bean {
>> 	public int value;
>> }
>> 
>> 
>> * Target JSON Output
>> 
>> {"value":123}
>> 
>> 
>> * CodeGen-Generated JSONSerializer
>> fastjson2 uses ASM to generate a serializer class like the following. The methods writeNameValue0, writeNameValue1, and writeNameValue2 are candidate implementations. Among them, writeNameValue2 is the fastest when the field name length is 8, as it leverages UNSAFE.putLong for direct memory operations:
>> 
>> 
>> class BeanJSONSerializer {
>> 	private static final String name = "\"value\":";
>> 	private static final byte[] nameBytes = name.getBytes();
>> 	private static final long nameLong = UNSAFE.getLong(nameBytes, ARRAY_BYTE_BASE_OFFSET);
>> 
>> 	int writeNameValue0(byte[] bytes, int off, int value) {
>> 		name.getBytes(0, 8, bytes, off);
>> 		off += 8;
>> 		return writeInt32(bytes, off, value);
>> 	}
>> 
>> 	int writeNameValue1(byte[] bytes, int off, int value) {
>> 		System.arraycopy(nameBytes, 0, bytes, off, 8);
>> 		off += 8;
>> 		return writeInt32(bytes, off, value);
>> 	}
>> 
>> 
>> 	int writeNameValue2(byte[] bytes, int off, int value) {
>> 		UNSAFE.putLong(bytes, ARRAY_BYTE_BASE_OFFSET + off, nameLong);
>> 		off += 8;
>> 		return writeInt32(bytes, off, value);
>> 	}
>> }
>> 
>> 
>> We propose that the C2 compiler could optimize cases where the field name length is 4 or 8 bytes by automatically using direct memory operations similar to writeNameValue2. This would eliminate the need for manual unsafe operations in user code and improve serialization performance for common patterns.
>
>> @wenshao Do you have any insight from this benchmark? What was your motivation for it?
>> 
>> I also wonder if an IR test for some of the cases would be helpful. IR tests give us more info about what the compiler produced, and if there is a change in VM behaviour the IR test catches it in regular testing. Benchmarks are not run regularly, and regressions would therefore not be caught.
> 
> I submitted this benchmark to show that the System.arraycopy and String.getBytes variants can be outperformed by Unsafe.putInt/putLong. I hope C2 can do this optimization automatically.

@wenshao 

> I hope C2 can do this optimization automatically.

Did you check if it does or does not do that? Can you investigate what the generated code is for `String.getBytes`? Does that not create an allocation, which would make things much slower? And it may even do some more complicated encoding things, which is a lot of overhead. So that would explain your performance result, at least partially, right?

I'm also not convinced that you are comparing apples to apples here.


Benchmark                                     Mode  Cnt      Score      Error  Units
MergeStoreBench.putNull_arraycopy             avgt    5   8029.622 ±   60.856  ns/op

This does an array copy, so an array load AND an array store, right?

This one even has to do allocations, loads and stores (though you need to investigate and check):

MergeStoreBench.putNull_getBytes              avgt    5   6171.538 ±    5.845  ns/op


On the other hand, this does NOT have to do an array load or allocations, just a simple store:

MergeStoreBench.putNull_unsafePutInt          avgt    5    235.302 ±    2.004  ns/op
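
To make the apples-to-apples point concrete, here is a rough sketch of the three shapes as I understand them. This is my own illustration, not the benchmark code: the class and method names are made up, and I use a VarHandle view instead of Unsafe only so the sketch compiles on its own.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch only: not the benchmark methods, just the shapes being compared.
class StoreShapes {
    static final String NAME = "\"value\":";  // 8 ASCII characters
    static final byte[] NAME_BYTES = NAME.getBytes(StandardCharsets.US_ASCII);
    static final VarHandle LONG_VH =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // arraycopy: loads from one array and stores into another, no allocation.
    static void arraycopyShape(byte[] dst, int off) {
        System.arraycopy(NAME_BYTES, 0, dst, off, 8);
    }

    // getBytes: copies characters out of the String; whether this allocates
    // or does extra encoding work under the hood is exactly what needs checking.
    @SuppressWarnings("deprecation")
    static void getBytesShape(byte[] dst, int off) {
        NAME.getBytes(0, 8, dst, off);
    }

    // single wide store: no array loads, no allocation, just one 8-byte store
    // (the benchmark uses Unsafe.putInt/putLong; the VarHandle here just
    // stands in for it so the sketch is self-contained).
    static void singleStoreShape(byte[] dst, int off, long nameAsLong) {
        LONG_VH.set(dst, off, nameAsLong);
    }
}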


Is there actually a benchmark in this series that makes use of individual byte stores that get merged to an int store? Because that is the whole point of MergeStores, right?
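
For reference, the kind of pattern MergeStores is aimed at looks roughly like this. Again just my own sketch, not code taken from the benchmark:

// Four adjacent byte stores of one int value; this is the shape that C2's
// MergeStores optimization is meant to merge into a single 4-byte store.
class ByteStores {
    static void putIntLE(byte[] bytes, int off, int value) {
        bytes[off]     = (byte)  value;
        bytes[off + 1] = (byte) (value >>> 8);
        bytes[off + 2] = (byte) (value >>> 16);
        bytes[off + 3] = (byte) (value >>> 24);
    }
}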

Do you really need to use `String.getBytes`? I mean maybe with proper escape analysis etc the whole allocation could be avoided. But that would require a much deeper analysis.

Back to this:

> I hope C2 can do this optimization automatically.

Can you investigate what code it generates, and what kinds of optimizations are missing to make it close in performance to the `Unsafe` benchmark?

I don't have time to do all the deep investigations myself. But feel free to ask me if you have more questions.

@wenshao Since we don't seem to be comparing apples to apples here, it would be even more important to leave comments at the benchmarks saying what operations (loads, stores, allocations, etc.) are happening, what we know is optimized, and what we think could be optimized in the future.
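
Something along these lines, purely as a sketch of the kind of comment I mean (the class, method and operation counts below are hypothetical, not taken from MergeStoreBench):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class DocumentedBench {
    final byte[] dst = new byte[64];
    final byte[] src = "\"value\":".getBytes();

    // Per invocation: copies 8 bytes with System.arraycopy, i.e. array loads
    // plus array stores, no allocation. Not expected to be affected by
    // MergeStores, because the intrinsified copy does not expose individual
    // byte stores that could be merged.
    @Benchmark
    public void putName_arraycopy() {
        System.arraycopy(src, 0, dst, 0, 8);
    }
}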

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24108#issuecomment-2756960435
PR Comment: https://git.openjdk.org/jdk/pull/24108#issuecomment-2756965014

