RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store

Tue Jan 23 08:51:30 UTC 2024

On Tue, 23 Jan 2024 03:18:39 GMT, Dean Long <dlong at openjdk.org> wrote:

>>> I took a quick look and I don't see where the code checks that all the candidate stores are using MemNode::unordered and that there aren't memory barriers in between.
>> 
>> @dean-long How would the graph look like if there were memory barriers? Would there not be something on the memory graph or control graph which is neither a Store nor a RangeCheck? Any what exactly is your concern about `MemNode::unordered`? Would you mind explaining or giving some examples?
>
>> How would the graph look like if there were memory barriers? Would there not be something on the memory graph or control graph which is neither a Store nor a RangeCheck?
> 
> Yes, I think so.  See for example GraphKit::insert_mem_bar().
> 
>> Any what exactly is your concern about `MemNode::unordered`? Would you mind explaining or giving some examples?
> 
> Any stores with explicit Release semantics, for example, probably expect that ordering to preserved.  It looks like such a store can be generated using Unsafe.putByteRelease().  So I would think that the only stores that can take part in this optimization should be using "unordered".  How would your optimization apply to the following cases?
> 
> case 1:
> Unsafe.putByteRelease(array1, offset+0, 'A');
> Unsafe.putByteRelease(array1, offset+1, 'A');
> 
> case2:
> Unsafe.putByteRelease(array1, offset+0, 'A');
> Unsafe.putByteRelease(array1, offset+3, 'A');
> Unsafe.putByteRelease(array1, offset+1, 'A');
> 
> In case 1, I don't think it's safe to turn this into a 16-bit store unless the store is atomic, otherwise another thread might see the writes in the wrong order.  Likewise for case 2, another thread shouldn't be able to observe the writes in the wrong order.

@dean-long Ok, but if there is something else on the memory graph than the Stores, or something else on the control-graph than the RangeCheck, then the optimization bails out. I also created an example with `Unsafe.putByteRelease`, and some other variants:

import jdk.internal.misc.Unsafe;

public class Test {
    static final Unsafe UNSAFE = Unsafe.getUnsafe();

    public static void main(String[] strArr) {
        byte[] a = new byte[100];
        for (int i = 0; i < 10_000; i++) {
            test1(a);
            test2(a);
            test3(a);
            test4(a);
        }
    }

    static void test1(byte[] a) {
        UNSAFE.putByte(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 0, (byte)0xf1);
        UNSAFE.putByte(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 1, (byte)0xf2);
        UNSAFE.putByte(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 2, (byte)0xf3);
        UNSAFE.putByte(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 3, (byte)0xf4);
    }

    static void test2(byte[] a) {
        UNSAFE.putByteVolatile(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 0, (byte)0xf1);
        UNSAFE.putByteVolatile(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 1, (byte)0xf2);
        UNSAFE.putByteVolatile(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 2, (byte)0xf3);
        UNSAFE.putByteVolatile(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 3, (byte)0xf4);
    }

    static void test3(byte[] a) {
        UNSAFE.putByteRelease(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 0, (byte)0xf1);
        UNSAFE.putByteRelease(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 1, (byte)0xf2);
        UNSAFE.putByteRelease(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 2, (byte)0xf3);
        UNSAFE.putByteRelease(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 3, (byte)0xf4);
    }

    static void test4(byte[] a) {
        UNSAFE.putByteOpaque(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 0, (byte)0xf1);
        UNSAFE.putByteOpaque(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 1, (byte)0xf2);
        UNSAFE.putByteOpaque(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 2, (byte)0xf3);
        UNSAFE.putByteOpaque(a, UNSAFE.ARRAY_BYTE_BASE_OFFSET + 3, (byte)0xf4);
    }
}

`java --add-modules java.base --add-exports java.base/jdk.internal.misc=ALL-UNNAMED --add-exports java.base/jdk.internal.util=ALL-UNNAMED -XX:CompileCommand=compileonly,Test::test* -XX:CompileCommand=printcompilation,Test::test* -XX:+TraceMergeStores -XX:-PrintIdeal Test.java`

Only `test1` applies the optimization. If you enable `PrintIdeal`, then you can see that all others have `MemBarCPUOrder` nodes, which are on both the memory and control path. That's the reason my algorithm does not merge them.

I'll add this to the regression tests.

Your "case 2" would not optimize anyway, since I do not re-order anything, I only merge the stores if they are adjacent with increasing indices.

Does this satisfy you, or are you worried about other cases too?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1905577461