VarHandles & LMAX Disruptor

Tue Aug 4 15:15:36 UTC 2015

Hi Michael,

Thanks for the detailed response.

Not sure how we can solve this particular use-case, contended arrays, without either HotSpot getting smarter on such bounds checks or supporting alternative array layouts.

You could try out #3 by setting the experimental flag:

  -XX:+UnlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields

But i don’t know if HotSpot will correctly associate indexMask with the array length.

Paul.

On 3 Aug 2015, at 03:53, Michael Barker <mikeb01 at gmail.com> wrote:

> Hi,
> 
> Having played with it a bit more, I can get the bound check removed under one specific arrangement of the code, the following is for the RingBufferFields.elementAt:
> 
> 1)  return (E) entries[((int) sequence) & (entries.length - 1)];
> 
> This works and I get the pattern mentioned Paul's example.  There is a drawback to implementing it this way, details below.
> 
> 2)  return (E) entries[(int) sequence & (entries.length - 1)];
> 
> Note the position of the brackets.  In this case entries.length is being promoted to a long before applying the mask, then being cast down to an int, which seems to trip up the compiler.
> 
> 3)  return (E) entries[((int) sequence) & indexMask];  // indexMask is a final field of value entries.length - 1
> 
> I suspect this is because the compiler can't assume that final fields are actually final.
> 
> 4)  return (E) entries[BUFFER_PAD + (((int) sequence) & (entries.length - ((BUFFER_PAD * 2) + 1)))];
> 
> I'm guessing it gets just too complex for the compiler to figure our that the resulting value will be less that entries.length.
> 
> One of the niggles with approach #1, is that I need to reference entries.length at the call site and can't use a separate field to store the value.  This can cause issue in a concurrent ring-buffer-like structures as the array length can end up sharing a cache line with the array itself.  So if a producer thread writes into one of the first 15 array slots (worst case), the whole line including the part containing the length field will be invalid on the consumer's core resulting in an additional cache miss in certain cases.  This is probably less of an issue for the default set up of the disruptor as we don't actually replace the references inside of the entries array, but a number of fast queue implementations would suffer.  The developer is caught having to make a trade-off between reducing false sharing and removing bounds checking.
> 
> Mike.
> 
> 
> On 1 August 2015 at 00:18, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> On 31 Jul 2015, at 07:43, Michael Barker <mikeb01 at gmail.com> wrote:
> 
> > Hi Paul,
> >
> > I've had a look at the patch that you mentioned and AFAICT it doesn't seem to be able to eliminate the bounds check, even if I remove the array padding.  Just to confirm my analysis the assembler around the aaload instruction looks like:
> >
> > 0x00007f627d6b3aea: cmp    %ebx,%edx
> > 0x00007f627d6b3aec: jae    0x00007f627d6b3e0c
> > 0x00007f627d6b3af2: mov    %r8,%r13
> > 0x00007f627d6b3af5: mov    %r11d,%r8d
> > 0x00007f627d6b3af8: mov    0x10(%rax,%rdx,4),%r11d  ;*aaload
> >
> > I'm guessing that the cmp/jae pair is the bounds check?
> >
> 
> Yes, i am assuming that is generated code of RingBuffer.elementAt (aaload) and not MultiProducerSequencer. I suspect the padding is throwing the compiler off the scent (it does not know that indexMask is less than entries.length - BUFFER_INDEX).
> 
> For an array access such as:
> 
>   int j = i & (a.length - 1);
>   X x = a[j];
> 
> I would expect to observe something like:
> 
>   test   %edi,%edi
>   jbe    …
> 
> The bound check should get strength reduced to checking if the array length is zero, and i would expect such a test to be hoisted out of any loop (and get folded into another dominating array length check, if any).
> 
> Here is an example, given this benchmark method:
> 
> @Benchmark
> public int relaxed_r_aa() {
>     Value[] _receiver = receiver;
>     int sum = 0;
>     for (int i = start; i < end; i++) {
>         int j = i & (_receiver.length - 1);
>         sum += _receiver[j].i;
>     }
>     return sum;
> }
> 
> Without the patch the following code is generated (when loop unrolling is switched off):
> 
> : mov    0xc(%r12,%rbx,8),%r9d  ;*arraylength
> : mov    %r9d,%r11d
> : dec    %r11d
> : lea    (%r12,%rbx,8),%rcx
> : xor    %eax,%eax          ;*iload_3
> 
> <loop>: mov    %r11d,%ebp
> : and    %r10d,%ebp         ;*iand
> : cmp    %r9d,%ebp
> : jae    <bounds-fail>
> : mov    0x10(%rcx,%rbp,4),%edi  ;*aaload
> : add    0xc(%r12,%rdi,8),%eax  ;*iadd
> : inc    %r10d              ;*iinc
> : cmp    %r8d,%r10d
> : jl     <loop>  ;*if_icmpge
> 
> 
> When the patch is applied the following code is generated:
> 
> : mov    0xc(%r12,%rbp,8),%r9d  ;*arraylength
> : test   %r9d,%r9d
> : jbe    <bounds-fail>
> : dec    %r9d
> : lea    (%r12,%rbp,8),%r8
> : data32 nopw 0x0(%rax,%rax,1)
> : data32 data32 xchg %ax,%ax  ;*iload_3
> 
> <loop>: mov    %r9d,%ebx
> : and    %r11d,%ebx
> : mov    0x10(%r8,%rbx,4),%ebx  ;*aaload
> : add    0xc(%r12,%rbx,8),%eax  ;*iadd
> : inc    %r11d              ;*iinc
> : cmp    %r10d,%r11d
> : jl     <loop>  ;*if_icmpge
> 
> 
> 
> > A quick note on the patch, it merged, but didn't compile.  There seemed to be a signature change on the signature of ConINode::make, so I changed line 1311 in subnode.cpp from 'return ConINode::make(phase->C, 1);' to 'return ConINode::make(1);'.  That let it compile, but I don't really understand the code so I'm not sure if it is semantically correct.
> >
> 
> Oops sorry about that, i uploaded a new revision of the patch the same fix that you applied.
> 
> Paul.
> 
>