VarHandles & LMAX Disruptor

Mon Aug 3 01:53:02 UTC 2015

Hi,

Having played with it a bit more, I can get the bound check removed under
one specific arrangement of the code, the following is for the
RingBufferFields.elementAt:

1)  return (E) entries[((int) sequence) & (entries.length - 1)];

This works and I get the pattern mentioned Paul's example.  There is a
drawback to implementing it this way, details below.

2)  return (E) entries[(int) sequence & (entries.length - 1)];

Note the position of the brackets.  In this case entries.length is being
promoted to a long before applying the mask, then being cast down to an
int, which seems to trip up the compiler.

3)  return (E) entries[((int) sequence) & indexMask];  // indexMask is a
final field of value entries.length - 1

I suspect this is because the compiler can't assume that final fields are
actually final.

4)  return (E) entries[BUFFER_PAD + (((int) sequence) & (entries.length -
((BUFFER_PAD * 2) + 1)))];

I'm guessing it gets just too complex for the compiler to figure our that
the resulting value will be less that entries.length.

One of the niggles with approach #1, is that I need to reference
entries.length at the call site and can't use a separate field to store the
value.  This can cause issue in a concurrent ring-buffer-like structures as
the array length can end up sharing a cache line with the array itself.  So
if a producer thread writes into one of the first 15 array slots (worst
case), the whole line including the part containing the length field will
be invalid on the consumer's core resulting in an additional cache miss in
certain cases.  This is probably less of an issue for the default set up of
the disruptor as we don't actually replace the references inside of the
entries array, but a number of fast queue implementations would suffer.
The developer is caught having to make a trade-off between reducing false
sharing and removing bounds checking.

Mike.

On 1 August 2015 at 00:18, Paul Sandoz <paul.sandoz at oracle.com> wrote:

>
> On 31 Jul 2015, at 07:43, Michael Barker <mikeb01 at gmail.com> wrote:
>
> > Hi Paul,
> >
> > I've had a look at the patch that you mentioned and AFAICT it doesn't
> seem to be able to eliminate the bounds check, even if I remove the array
> padding.  Just to confirm my analysis the assembler around the aaload
> instruction looks like:
> >
> > 0x00007f627d6b3aea: cmp    %ebx,%edx
> > 0x00007f627d6b3aec: jae    0x00007f627d6b3e0c
> > 0x00007f627d6b3af2: mov    %r8,%r13
> > 0x00007f627d6b3af5: mov    %r11d,%r8d
> > 0x00007f627d6b3af8: mov    0x10(%rax,%rdx,4),%r11d  ;*aaload
> >
> > I'm guessing that the cmp/jae pair is the bounds check?
> >
>
> Yes, i am assuming that is generated code of RingBuffer.elementAt (aaload)
> and not MultiProducerSequencer. I suspect the padding is throwing the
> compiler off the scent (it does not know that indexMask is less than
> entries.length - BUFFER_INDEX).
>
> For an array access such as:
>
>   int j = i & (a.length - 1);
>   X x = a[j];
>
> I would expect to observe something like:
>
>   test   %edi,%edi
>   jbe    …
>
> The bound check should get strength reduced to checking if the array
> length is zero, and i would expect such a test to be hoisted out of any
> loop (and get folded into another dominating array length check, if any).
>
> Here is an example, given this benchmark method:
>
> @Benchmark
> public int relaxed_r_aa() {
>     Value[] _receiver = receiver;
>     int sum = 0;
>     for (int i = start; i < end; i++) {
>         int j = i & (_receiver.length - 1);
>         sum += _receiver[j].i;
>     }
>     return sum;
> }
>
> Without the patch the following code is generated (when loop unrolling is
> switched off):
>
> : mov    0xc(%r12,%rbx,8),%r9d  ;*arraylength
> : mov    %r9d,%r11d
> : dec    %r11d
> : lea    (%r12,%rbx,8),%rcx
> : xor    %eax,%eax          ;*iload_3
>
> <loop>: mov    %r11d,%ebp
> : and    %r10d,%ebp         ;*iand
> : cmp    %r9d,%ebp
> : jae    <bounds-fail>
> : mov    0x10(%rcx,%rbp,4),%edi  ;*aaload
> : add    0xc(%r12,%rdi,8),%eax  ;*iadd
> : inc    %r10d              ;*iinc
> : cmp    %r8d,%r10d
> : jl     <loop>  ;*if_icmpge
>
>
> When the patch is applied the following code is generated:
>
> : mov    0xc(%r12,%rbp,8),%r9d  ;*arraylength
> : test   %r9d,%r9d
> : jbe    <bounds-fail>
> : dec    %r9d
> : lea    (%r12,%rbp,8),%r8
> : data32 nopw 0x0(%rax,%rax,1)
> : data32 data32 xchg %ax,%ax  ;*iload_3
>
> <loop>: mov    %r9d,%ebx
> : and    %r11d,%ebx
> : mov    0x10(%r8,%rbx,4),%ebx  ;*aaload
> : add    0xc(%r12,%rbx,8),%eax  ;*iadd
> : inc    %r11d              ;*iinc
> : cmp    %r10d,%r11d
> : jl     <loop>  ;*if_icmpge
>
>
>
> > A quick note on the patch, it merged, but didn't compile.  There seemed
> to be a signature change on the signature of ConINode::make, so I changed
> line 1311 in subnode.cpp from 'return ConINode::make(phase->C, 1);' to
> 'return ConINode::make(1);'.  That let it compile, but I don't really
> understand the code so I'm not sure if it is semantically correct.
> >
>
> Oops sorry about that, i uploaded a new revision of the patch the same fix
> that you applied.
>
> Paul.
>
>