A question about memory order on x86_64

Liu Xin navy.xliu at gmail.com
Mon Jun 21 20:49:13 UTC 2021


hi, David,

Thanks. This is my blind spot. Let me go through your description. There are
three threads: M is the main thread, which is doing the log configuration
changes; A is the async log thread, which is flushing; L is a user thread
which is emitting logs.

M: flush1/sem_wait -------------------- flush1/sem_post -- flush2/sem_wait -- flush2/sem_post -- delete_output()
L: ---- enqueue(log1) --------------------------------------------------------------------------------------
A: -------- dequeue(log1) ---- run/sem_wait ------------------------------------------------- write_blocking(log1)

This is the timeline you described, isn't it? A key piece of information I was
missing is that sem_wait() (POSIX semaphores) doesn't respect FIFO order.
I thought that because thread-A had invoked run/sem_wait() first, the kernel
must wake up thread-A immediately once flush1/sem_post() is called. That may
NOT be the case, right?
It's possible that thread-A stays descheduled while thread-M keeps going.
Under the hood, I see that glibc implements sem_wait using the futex syscall.
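
To double-check my understanding, here is a minimal standalone sketch (plain
POSIX semaphores, not HotSpot code) of the behaviour I got wrong: the waiter
blocks first, but after sem_post() the posting thread is free to keep running,
and may even take the semaphore again, before the waiter is ever scheduled.

  #include <semaphore.h>
  #include <pthread.h>
  #include <unistd.h>
  #include <cstdio>

  static sem_t io_sem;

  static void* waiter(void*) {
    sem_wait(&io_sem);               // blocks first, like thread-A in run()
    printf("A: finally resumed\n");
    return nullptr;
  }

  int main() {
    sem_init(&io_sem, 0, 0);
    pthread_t a;
    pthread_create(&a, nullptr, waiter, nullptr);
    sleep(1);                        // let the waiter block on io_sem

    sem_post(&io_sem);               // "flush1/sem_post": A is now runnable...
    printf("M: still running after sem_post\n");  // ...but M need not be preempted
    sem_post(&io_sem);               // M may post again ("flush2")...
    sem_wait(&io_sem);               // ...and may take a token back before A ever ran
    printf("M: kept going; A may still be asleep\n");

    pthread_join(a, nullptr);
    sem_destroy(&io_sem);
    return 0;
  }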

In my test, I stupidly put two set_log_config() calls in a loop. Therefore, M
invokes two flush() operations in a row!
https://github.com/openjdk/jdk/pull/4408/files#diff-25062d200033077d44dacb5aae6b8141316b4bc10149c061cf047fe7db250d98R306
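
For reference, the shape of the test was roughly this (simplified from memory;
the exact code is behind the PR link above, and set_log_config() is the gtest
helper that re-parses the log configuration):

  // pseudo-reconstruction of my test loop, not the literal code:
  // each set_log_config() triggers a flush of the async buffer and may
  // delete/recreate the file output, so M flushes twice per iteration
  // while the L threads keep emitting logs concurrently.
  for (int i = 0; i < 100; i++) {
    set_log_config(TestLogFileName, "logging=debug");   // flush + reconfigure
    set_log_config(TestLogFileName, "all=off");         // flush + delete output
  }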

Let me try to simulate this and see whether I can come up with a fix.
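
One way I can think of to simulate it (just a sketch of what I plan to try;
the release of io_sem at the end of write() isn't shown in your snippet, so
the call name below is assumed from the Semaphore API) is to force the
descheduling you described with a test-only delay:

  // hypothetical, test-only change at the very end of AsyncLogWriter::write():
  // pretend the flushing thread is descheduled right before it releases
  // io_sem, so a second flush() plus the delete_output() can slip in while
  // the async log thread still holds stale output pointers in its local
  // "logs" list.
  os::naked_short_sleep(100 /* ms */);
  _io_sem.signal();    // release io_sem as usual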

Thanks,
-lx




On Sun, Jun 20, 2021 at 6:54 PM David Holmes <david.holmes at oracle.com>
wrote:

> Hi Xin,
>
> I'm looking at your code again and apologies if we covered this
> previously but I'm not seeing how things are supposed to work exactly.
> Looking at the write() function:
>
> void AsyncLogWriter::write() {
>    AsyncLogBuffer logs;
>    bool own_io = false;
>
>    { // critical region
>      AsyncLogLocker locker;
>
>      _buffer.pop_all(&logs);
>      // append meta-messages of dropped counters
>      AsyncLogMapIterator dropped_counters_iter(logs);
>      _stats.iterate(&dropped_counters_iter);
>      own_io = _io_sem.trywait();
>    }
>
>    LinkedListIterator<AsyncLogMessage> it(logs.head());
>    if (!own_io) {
>      _io_sem.wait();
>    }
>
>    while (!it.is_empty()) {
>      AsyncLogMessage* e = it.next();
>      char* msg = e->message();
>
>      if (msg != nullptr) {
>        e->output()->write_blocking(e->decorations(), msg);
>
> Suppose that a flush() operation has called write() and it has reached
> the point where it is about to release the io_sem at the end and gets
> descheduled. Other threads then perform logging operations that include
> a specific output N. The async log thread executes the code above, and
> stashes those log messages off in the local "logs" and then blocks on
> io_sem_wait. Let the original flush release the io_sem, and a second
> flush() commence due to the log configuration update that will delete
> output N. That flush also completes and releases the io_sem. The async
> log thread now acquires io_sem and attempts to write to the output that
> was just deleted.
>
> David
> -----
>
> On 21/06/2021 6:50 am, Liu, Xin wrote:
> > Hi, Hotspot experts,
> >
> > I am debugging a PR which uses lock-free programming. David Holmes
> > helped me submit this patch to the internal regression infra (thanks a
> > lot). We got intermittent failures at tier4~7. There are two distinct
> > crash sites, but they're similar. I guess the reason is that the async
> > log thread observed a different memory state from other threads.
> >
> > https://github.com/openjdk/jdk/pull/4408#issuecomment-863788283
> >
> > I have spent a lot of effort on reproducing this problem. Eventually, I
> > got a crash in docker, and luckily the coredump file and hs_err report
> > were both saved.
> >
> > In the core dump file, I see this stacktrace in the async log thread:
> >
> > (gdb) bt
> > ...
> > #24 0x00007f76e1ab0e90 in flockfile () from /lib64/libc.so.6
> > #25 0x00007f76e3d0c253 in FileLocker (file=0xbabababababababa,
> > this=<synthetic pointer>) at
> > /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
> > #26 LogFileStreamOutput::write (this=0x55ea4d504bd0, decorations=...,
> > msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at
> > /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
> > #27 0x00007f76e3d0b572 in LogFileOutput::write_blocking
> > (this=0x55ea4d504bd0, decorations=..., msg=0x7f7624079b00
> > "ConcurrentLogsite 3 emits a log") at
> > /jdk/src/hotspot/share/logging/logFileOutput.cpp:308
> > #28 0x00007f76e3d05bbd in write (this=0x55ea4d2e1af0) at
> > /jdk/src/hotspot/share/logging/logAsyncWriter.hpp:106
> > #29 AsyncLogWriter::run (this=0x55ea4d2e1af0) at
> > /jdk/src/hotspot/share/logging/logAsyncWriter.cpp:163
> > #30 0x00007f76e44696a0 in Thread::call_run (this=0x55ea4d2e1af0) at
> > /jdk/src/hotspot/share/runtime/thread.cpp:394
> > #31 0x00007f76e3fdf9b4 in thread_native_entry (thread=0x55ea4d2e1af0) at
> > /jdk/src/hotspot/os/linux/os_linux.cpp:720
> > #32 0x00007f76e162cea5 in start_thread () from /lib64/libpthread.so.0
> > #33 0x00007f76e1b439fd in clone () from /lib64/libc.so.6
> >
> > The corresponding code is here. _stream is 0xbabababababababa?
> >
> > int LogFileStreamOutput::write(const LogDecorations& decorations, const
> > char* msg) {
> >    const bool use_decorations = !_decorators.is_empty();
> >
> >    int written = 0;
> >    FileLocker flocker(_stream);
> >    if (use_decorations)
> >
> > Here is the relevant disassembly code.
> > ...prolog
> > 0x7f76e3d0c223 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+19>      mov    %rdi,%rbx
> > 0x7f76e3d0c226 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+22>      sub    $0x128,%rsp
> > 0x7f76e3d0c22d <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+29>      mov    0x28(%rdi),%r15
> > 0x7f76e3d0c231 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+33>      mov    0x20(%rdi),%r12d
> > 0x7f76e3d0c235 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+37>      mov    %rdx,-0x148(%rbp)
> > 0x7f76e3d0c23c <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+44>      mov    %r15,%rdi
> > 0x7f76e3d0c23f <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+47>      mov    %fs:0x28,%rax
> > 0x7f76e3d0c248 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+56>      mov    %rax,-0x38(%rbp)
> > 0x7f76e3d0c24c <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+60>      xor    %eax,%eax
> > 0x7f76e3d0c24e <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+62>      callq  0x7f76e3feb090 <os::flockfile(_IO_FILE*)>
> > 0x7f76e3d0c253 <LogFileStreamOutput::write(LogDecorations const&, char
> > const*)+67>      test   %r12d,%r12d
> >
> >
> > We can see that r15 = load 0x28(%rdi), and its value is bad. However, in
> > memory its value is correct: this->_stream is 0x55ea4d0f8d00.
> >
> > (gdb) p /x $r15
> > $13 = 0xbabababababababa
> >
> > (gdb) p /x $rbx
> > $27 = 0x55ea4d504bd0
> >
> > (gdb) x /1g $rbx + 0x28
> > 0x55ea4d504bf8: 0x000055ea4d0f8d00
> >
> > In the main thread, we can see that output->_stream is well defined.
> >
> > (gdb) p output
> > $24 = (LogOutput *) 0x55ea4d504bd0
> >
> > (gdb) p ((LogFileOutput*)output)->_stream
> > $23 = (FILE *) 0x55ea4d0f8d00
> >
> >
> > -------------------------------------------------------------------------
> >
> > Can I say it's a memory ordering issue on x86_64? TBH, I can't explain
> > how that happens.
> >
> > At an earlier moment in the main thread, LogFileOutput::initialize() was
> > called and _stream was initialized. Since then, the program has at least
> > gone through:
> >
> > 1. Atomic::add (a lock-prefixed instruction, which implies a store-load
> > barrier?).
> > 2. sem_wait/post (acquire-release semantics).
> >
> > Therefore, I believe the current thread must see the value written by
> > LogFileOutput::initialize(), i.e. _stream should be (FILE *)
> > 0x55ea4d0f8d00, not 0xbababa...
> >
> > Last but not least: to further understand this problem and verify a
> > patch, I think the key is to find a reproducer. Any thoughts on how to
> > reproduce this kind of problem? So far, I have only come across this
> > crash once on my side. I used a docker image of Oracle Linux 7.9 (glibc
> > 2.17) with cpuset 0~2. Besides one core dump and a crash report, I can't
> > reproduce it anymore. I think I must be overlooking some important
> > factors here. I'd really appreciate any heads-up.
> >
> > I notice that Oracle is using 3rd-gen Zen machines to run tests. Are they
> > any different from conventional x86_64? In AMD's manual[1], section 7.4.2,
> > Table 7-3, here is the description of lock-prefixed instructions:
> >
> > k — All loads and stores associated with the I/O and locked
> > instructions complete to memory (no buffered stores) before a load or
> > store from a subsequent instruction is issued.
> >
> > It seems to me that this doesn't imply a "bidirectional full memory
> > barrier" or store-load barrier; it just has acquire semantics. I read the
> > story behind this change and Aleksey's article[2]. I'm just not sure the
> > following code is equivalent to "mfence" on EPYC machines, because AMD's
> > manual is different from Intel's.
> >
> > inline void OrderAccess::fence() {
> >     // always use locked addl since mfence is sometimes expensive
> > #ifdef AMD64
> >    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
> > #else
> >    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
> > #endif
> >    compiler_barrier();
> > }
> >
> >
> > [1] https://www.amd.com/system/files/TechDocs/24593.pdf
> > [2]
> >
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_footnote_1
> >
> > thanks,
> > --lx
> >
> >
> >
>

