SPARC <was> Re: RFR 8151163 All Buffer implementations should leverage Unsafe unaligned accessors

Paul Sandoz paul.sandoz at oracle.com
Wed Mar 30 09:36:23 UTC 2016


Hi,

Some performance analysis on SPARC revealed two anomalies:

1) the first was an embarrassing bug in the buffer views calculating the incorrect offset, which has been rectified in the latest webrev:

http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151163-buffer-unsafe-unaligned-access/webrev/src/java.base/share/classes/java/nio/ByteBufferAs-X-Buffer.java.template.sdiff.html

(We need to add stronger tests for views e.g. access is consistent with the byte buffer being viewed.)
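A minimal sketch of such a consistency test (class and method names are mine, not from the webrev): every value read through a LongBuffer view should equal the value composed byte-by-byte from the backing ByteBuffer under the buffer's byte order (big-endian by default).

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

public class ViewConsistency {
    // returns true if the LongBuffer view agrees with bytes composed
    // directly from the backing buffer (default order is big-endian)
    static boolean consistent(ByteBuffer bb) {
        LongBuffer view = bb.asLongBuffer();
        for (int i = 0; i < view.limit(); i++) {
            long expected = 0;
            for (int j = 0; j < 8; j++) {
                // shift bytes in high-to-low to match big-endian order
                expected = (expected << 8) | (bb.get(i * 8 + j) & 0xffL);
            }
            if (view.get(i) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(64);
        for (int i = 0; i < bb.capacity(); i++) {
            bb.put(i, (byte) (i * 31));
        }
        if (!consistent(bb)) {
            throw new AssertionError("view disagrees with backing buffer");
        }
    }
}
```

The same check could be repeated over direct buffers, sliced buffers with odd offsets, and both byte orders.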


2) the second is a performance regression with direct ByteBuffer.getLong for unaligned accesses of 8 bytes: ~ a 2x regression. When access is aligned there is ~ a 4x improvement. One could argue that on average the gains far outweigh the losses :-) and on x86 there are only gains.


It’s java.nio.Bits.getLong(long a, boolean bigEndian) vs. Unsafe.getLongUnaligned(Object o, long offset, boolean bigEndian) [1].


Unsafe.getLongUnaligned needs to perform up to three alignment checks before accessing bytes, and then optionally performs a byte swap. Bits.getLong always accesses individual bytes and composes them using the requested endianness, so it requires no additional byte swap.

Since SPARC is Big Endian and buffers are by default Big Endian, I can rule out any byte swapping, but swapping could potentially increase the regression if Little Endian is chosen [2].
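To make the swap cost concrete: choosing the non-native order is semantically a Long.reverseBytes over the native-order read, which is what Little Endian access would add on SPARC. A small standalone illustration (not code from the patch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    static long readBig(ByteBuffer bb) {
        return bb.order(ByteOrder.BIG_ENDIAN).getLong(0);
    }

    static long readLittle(ByteBuffer bb) {
        return bb.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(8);
        for (int i = 0; i < 8; i++) {
            bb.put(i, (byte) (i + 1)); // bytes 0x01..0x08 in memory order
        }
        // the little-endian read is exactly a byte swap of the
        // big-endian one; on a big-endian machine that swap is the
        // extra work when Little Endian is chosen
        if (readLittle(bb) != Long.reverseBytes(readBig(bb))) {
            throw new AssertionError();
        }
    }
}
```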

When access is performed in loops this can cost, as the alignment checks are not hoisted out. Theoretically they could be hoisted for regular 2-, 4-, or 8-byte strides through the buffer contents, since in such cases the alignment of the base address can be checked once up front. I am not sure how complicated that would be to support.
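A hypothetical sketch of what hoisting might look like for a stride-8 loop: check the base address once and select a path, instead of re-checking inside getLongUnaligned on every element. All names below are illustrative, nothing here is from the patch, and the "aligned" path merely models what would be a single 8-byte load on real hardware.

```java
public class HoistSketch {
    // models the unaligned slow path: byte-at-a-time big-endian compose
    static long getLongByBytes(byte[] a, int off) {
        long x = 0;
        for (int j = 0; j < 8; j++) {
            x = (x << 8) | (a[off + j] & 0xffL);
        }
        return x;
    }

    // models the aligned fast path; in the VM this would be a single
    // 8-byte load rather than byte composition
    static long getLongAligned(byte[] a, int off) {
        return getLongByBytes(a, off);
    }

    static long sum(byte[] a, int base, int count) {
        long s = 0;
        if ((base & 7) == 0) {
            // alignment check hoisted: with an 8-byte stride every
            // subsequent offset stays 8-byte aligned
            for (int i = 0; i < count; i++) {
                s += getLongAligned(a, base + 8 * i);
            }
        } else {
            for (int i = 0; i < count; i++) {
                s += getLongByBytes(a, base + 8 * i);
            }
        }
        return s;
    }
}
```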

I do not know the SPARC instruction set well enough to say whether we could do something clever as an intrinsic.


No regression is observed on SPARC for buffer views. This is because Bits.getLong(ByteBuffer bb, int bi, boolean bigEndian) is used, so byte accesses go through ByteBuffer, and it currently appears SPARC does not optimize that so well (compared to, say, x86). If it did, then I think the original updates to managed ByteBuffers to leverage Unsafe.get*Unaligned would also have introduced a regression, and perhaps there is still that risk in certain cases.


I am reluctant to proceed with this patch until more analysis on SPARC is performed from which we can make a more informed decision.

Paul.

[1]

Specifically, in Bits:

static long getLong(long a, boolean bigEndian) {
    return bigEndian ? getLongB(a) : getLongL(a);
}

static long getLongB(long a) {
    return makeLong(_get(a    ),
                    _get(a + 1),
                    _get(a + 2),
                    _get(a + 3),
                    _get(a + 4),
                    _get(a + 5),
                    _get(a + 6),
                    _get(a + 7));
}

private static long makeLong(byte b7, byte b6, byte b5, byte b4,
                             byte b3, byte b2, byte b1, byte b0)
{
    return ((((long)b7       ) << 56) |
            (((long)b6 & 0xff) << 48) |
            (((long)b5 & 0xff) << 40) |
            (((long)b4 & 0xff) << 32) |
            (((long)b3 & 0xff) << 24) |
            (((long)b2 & 0xff) << 16) |
            (((long)b1 & 0xff) <<  8) |
            (((long)b0 & 0xff)      ));
}
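For a concrete value, the composition above can be checked standalone. This demo copies makeLong verbatim; note the reversed parameter order, where b7 is the first byte in memory and lands in the most significant position:

```java
public class BitsMakeLongDemo {
    // verbatim copy of the Bits.makeLong shown above
    static long makeLong(byte b7, byte b6, byte b5, byte b4,
                         byte b3, byte b2, byte b1, byte b0) {
        return ((((long) b7       ) << 56) |
                (((long) b6 & 0xff) << 48) |
                (((long) b5 & 0xff) << 40) |
                (((long) b4 & 0xff) << 32) |
                (((long) b3 & 0xff) << 24) |
                (((long) b2 & 0xff) << 16) |
                (((long) b1 & 0xff) <<  8) |
                (((long) b0 & 0xff)      ));
    }

    public static void main(String[] args) {
        // bytes 0x01..0x08 in memory order compose to 0x0102030405060708
        long v = makeLong((byte) 0x01, (byte) 0x02, (byte) 0x03, (byte) 0x04,
                          (byte) 0x05, (byte) 0x06, (byte) 0x07, (byte) 0x08);
        if (v != 0x0102030405060708L) {
            throw new AssertionError(Long.toHexString(v));
        }
    }
}
```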


and on the Unsafe path (the first method below is the direct buffer's private helper that calls into Unsafe):

private long getLong(long a) {
    long x = unsafe.getLongUnaligned(null, a, bigEndian);
    return (x);
}

public final long getLongUnaligned(Object o, long offset, boolean bigEndian) {
    return convEndian(bigEndian, getLongUnaligned(o, offset));
}

@HotSpotIntrinsicCandidate
public final long getLongUnaligned(Object o, long offset) {
    if ((offset & 7) == 0) {
        return getLong(o, offset);
    } else if ((offset & 3) == 0) {
        return makeLong(getInt(o, offset),
                        getInt(o, offset + 4));
    } else if ((offset & 1) == 0) {
        return makeLong(getShort(o, offset),
                        getShort(o, offset + 2),
                        getShort(o, offset + 4),
                        getShort(o, offset + 6));
    } else {
        return makeLong(getByte(o, offset),
                        getByte(o, offset + 1),
                        getByte(o, offset + 2),
                        getByte(o, offset + 3),
                        getByte(o, offset + 4),
                        getByte(o, offset + 5),
                        getByte(o, offset + 6),
                        getByte(o, offset + 7));
    }
}

private static long makeLong(byte i0, byte i1, byte i2, byte i3, byte i4, byte i5, byte i6, byte i7) {
    return ((toUnsignedLong(i0) << pickPos(56, 0))
          | (toUnsignedLong(i1) << pickPos(56, 8))
          | (toUnsignedLong(i2) << pickPos(56, 16))
          | (toUnsignedLong(i3) << pickPos(56, 24))
          | (toUnsignedLong(i4) << pickPos(56, 32))
          | (toUnsignedLong(i5) << pickPos(56, 40))
          | (toUnsignedLong(i6) << pickPos(56, 48))
          | (toUnsignedLong(i7) << pickPos(56, 56)));
}
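On a big-endian platform such as SPARC, pickPos(top, pos) resolves to top - pos (and to pos on little-endian), so the first byte in memory lands in the most significant position. A standalone sketch under that assumption, with the platform endianness hard-coded as a constant:

```java
public class PickPosDemo {
    static final boolean BIG_ENDIAN = true; // as on SPARC

    // assumed behaviour of Unsafe's pickPos on a big-endian platform
    static int pickPos(int top, int pos) {
        return BIG_ENDIAN ? top - pos : pos;
    }

    static long toUnsignedLong(byte b) {
        return b & 0xffL;
    }

    // mirrors the Unsafe.makeLong shown above: i0 is the first byte
    // in memory and gets shift 56 when BIG_ENDIAN is true
    static long makeLong(byte i0, byte i1, byte i2, byte i3,
                         byte i4, byte i5, byte i6, byte i7) {
        return ((toUnsignedLong(i0) << pickPos(56, 0))
              | (toUnsignedLong(i1) << pickPos(56, 8))
              | (toUnsignedLong(i2) << pickPos(56, 16))
              | (toUnsignedLong(i3) << pickPos(56, 24))
              | (toUnsignedLong(i4) << pickPos(56, 32))
              | (toUnsignedLong(i5) << pickPos(56, 40))
              | (toUnsignedLong(i6) << pickPos(56, 48))
              | (toUnsignedLong(i7) << pickPos(56, 56)));
    }
}
```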


[2]

The comment in sparc.ad suggests a spill might occur, which could cost.

instruct bytes_reverse_long(iRegL dst, stackSlotL src) %{
  match(Set dst (ReverseBytesL src));

  // Op cost is artificially doubled to make sure that load or store
  // instructions are preferred over this one which requires a spill
  // onto a stack slot.
  ins_cost(2*DEFAULT_COST + MEMORY_REF_COST);
  format %{ "LDXA   $src, $dst\t!asi=primary_little" %}

  ins_encode %{
    __ set($src$$disp + STACK_BIAS, O7);
    __ ldxa($src$$base$$Register, O7, Assembler::ASI_PRIMARY_LITTLE, $dst$$Register);
  %}
  ins_pipe( iload_mem );
%}
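At the Java level the operation matched by the ReverseBytesL node above is Long.reverseBytes; on SPARC the rule implements it as a little-endian ASI load from a stack slot, hence the spill concern. For reference:

```java
public class ReverseDemo {
    public static void main(String[] args) {
        long be = 0x0102030405060708L;
        // the byte swap the ReverseBytesL node performs
        if (Long.reverseBytes(be) != 0x0807060504030201L) {
            throw new AssertionError();
        }
    }
}
```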


> On 10 Mar 2016, at 18:06, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
> 
> 
>> On 8 Mar 2016, at 19:27, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:
>> 
>>>> The changes in this webrev take advantage of those for JDK-8149469
>>>> and apply the unsafe double addressing scheme so certain byte buffer
>>>> view implementations can work across heap and direct buffers. This
>>>> should improve the performance on x86 for:
>>> 
>>> I understand the idea, but I think we would need to verify this before
>>> pushing.
>>> 
>> 
>> Admittedly I am leaning on the rationale/motivation for the previous changes to use Unsafe unaligned accessors.
>> 
>> I am less confident about the impact on non-x86 platforms.
>> 
>> I have some VarHandles related benchmark code [*] i can use to get some numbers.
>> 
> 
> Here are some preliminary perf numbers for x86 so far:
> 
>  http://cr.openjdk.java.net/~psandoz/jdk9/JDK-8151163-buffer-unsafe-unaligned-access/perf/ArrayViewTest.java
> 
> Observations:
> 
> - no regression for aligned and misaligned access
> 
> - big performance boost for LongBuffer view accesses over a managed ByteBuffer.
> 
> - there are some curious smaller differences between ByteBuffer.getLong, LongBuffer.get, direct and managed [*], which may come down to the addressing mode used for access in unrolled loops (and how variables are hoisted out of the loop). Need to analyse the generated code. Those are not blockers and could be swept up in another issue.
> 
> I need to run this on SPARC, but so far it is looking good.
> 
> Paul.
> 
> [*] bb_direct_long_view < bb_direct_long = bb_managed_long < bb_managed_long_view
> 



