Probably a bug

Kirill A. Korinsky kirill at korins.ky
Tue Feb 18 16:04:58 UTC 2020


Looks like it is Shenandoah bug

Why I think so?

I've covered AbstractNodeQueue by very detailed logging and has this:

DEBUG AbstractNodeQueue: - (Node(2395, tid: 38, at: 322151611089285)) setNext(), newNext: Node(2994, tid: 38, at: 322151611097285)
DEBUG AbstractNodeQueue: - (AbstractNodeQueue(2235, tid: 38, at: 322151611115585)) add, n: Node(2994, tid: 38, at: 322151611123985); o: Node(2395, tid: 38, at: 322151611132085)

    public final void add(final T value) {
        final Node<T> n = new Node<T>(value);
        Node<T> o = getAndSet(n); // unsafe.getAndSetObject
        o.setNext(n);
        LOGGER.debug("(" + this + ") " + "add, n: " + n + "; value: " + value + "; o: " + o);
    }

    protected final void setNext(final Node<T> newNext) {
        Unsafe.instance.putOrderedObject(this, nextOffset, newNext);
    }

    public String toString() {
        return "AbstractNodeQueue(" + nodeNumber + ", tid: " + Thread.currentThread().getId() + ", at: " + System.nanoTime() + ")";
    }

    public String toString() {
        return "Node(" + nodeNumber + ", tid: " + Thread.currentThread().getId() + ", at: " + System.nanoTime() + ")";
    }


DEBUG AbstractNodeQueue: - (AbstractNodeQueue(2235, tid: 38, at: 322151751689585)) pollNode, next: null; tail: Node(2395, tid: 38, at: 322151751698185); get(): Node(2994, tid: 38, at: 322151751707485)

    public final Node<T> pollNode() {
      final Node<T> tail = (Node<T>) Unsafe.instance.getObjectVolatile(this, tailOffset);
      Node<T> next = tail.next();
      long spins = 0;
      LOGGER.debug("(" + this + ") " + "pollNode, next: " + next + "; tail: " + tail + "; get(): " + get());
      if (next == null && get() != tail) {
          // if tail != head this is not going to change until producer makes progress
          // we can avoid reading the head and just spin on next until it shows up
          do {
              spins++;
              next = tail.next();
          } while (next == null);
      }


The thread is looping inside do...while

As soon as I replaced it `getObjectVolatile` and `putOrderedObject` to direct access to volatile nothing is changed.

DEBUG AbstractNodeQueue: - (Node(2526, tid: 38, at: 323848958634985, next: null)) setNext(), newNext: Node(3116, tid: 38, at: 323848958641285, next: null)
DEBUG AbstractNodeQueue: - (AbstractNodeQueue(2367, tid: 38, at: 323848958656285)) add, n: Node(3116, tid: 38, at: 323848958664185, next: null); o: Node(2526, tid: 38, at: 323848958672785, next: Node(3116, tid: 38, at: 323848958679985, next: null))
DEBUG AbstractNodeQueue: - (AbstractNodeQueue(2367, tid: 38, at: 323849089231985)) pollNode, next: null; tail: Node(2526, tid: 38, at: 323849089241685, next: null); get(): Node(3116, tid: 38, at: 323849089251185, next: null)

(here I've added to log next to show current value)

I've tried to reimplement Node class to `public static class Node<T> extends AtomicReference<Node<T>> ` doesn't help.

I've tried just a field and it doesn't help.

-- 
wbr, Kirill



> On 17. Feb 2020, at 15:59, Kirill A. Korinsky <kirill at korins.ky> wrote:
> 
> Good day,
> 
> I'd like to ask for advice because it looks like I've discovered something that might be related to Shenandoah bug.
> 
> I haven't got any proof that it is inside Shenandoah, nor a simple test case to reproduce it.
> 
> It appears inside Akka and you can read my hunting with Akka team here: https://github.com/akka/akka/issues/28601 <https://github.com/akka/akka/issues/28601>
> 
> As summary:
>  - it appears as infinite loop inside Akka queue that is lock-free linked-queue that's implemented via getObjectVolatile(), getAndSet() and few more atomic/unsafe calls.
>  - if I've enabled any debugging such as XX:+ShenandoahVerify the bug is disappear => I can't provide any hs_err_log :(
>  - it exists on OpenJDK-8 from fedora 31 and at shipilev/openjdk-shenandoah:8-fastdebug
>  - it is very difficult to achieve and it is very fragile. In real life, it appears only at one and bigger cluster, at my synthetic test case it requires to bootstrap an application and uses the unreachable Akka system
>  - to achieve this bug I should have a lot of garbage inside heap that produced by bootstrapping an application when it builds its index. The index has size 0,5gb..1gb (and the heap is 2gb) and the size depends on DB that is continuously updating, and the bug is achievable at any possibly size of the index.
>  - if I switch to G1 for example it disappears.
> 
> Right now I have two possible sources of this bug:
>  - very strange race condition inside Akka.
>  - a bug inside Shenandoah that is related to the missed barrier or deeper.
> 
> To eliminate or confirm Shenandoah related possibility I need some advice on how to do it because I can't prepare easy to reproduce code :(
> 
> -- 
> wbr, Kirill
> 



More information about the shenandoah-dev mailing list