A question about memory order on x86_64

David Holmes david.holmes at oracle.com
Mon Jun 21 22:02:58 UTC 2021


On 22/06/2021 6:49 am, Liu Xin wrote:
> hi, David,
> 
> Thanks. This is my blind spot. Let me go through your description. There 
> are 3 threads. M is the main thread, which is doing logConfiguration 
> changes. A is the async log thread, which is flushing. L is a user thread 
> which is emitting logs.
> 
> M: flush1/sem_wait ---- ---- ---- ---- ---- ---- ---- flush1/sem_post 
> ---- flush2/sem_wait ---- flush2/sem_post ---- 
> delete_output() ----
> L: ---- ---- enqueue(log1) ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
> A: ---- ---- ---- ---- ---- dequeue(log1) ---- run/sem_wait ---- ---- 
> ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 
> write_blocking(log1)
> 
> This is the timeline you described, isn't it?

Probably close enough.

> A key piece of missing information 
> is that sem_wait() (POSIX semaphore) doesn't respect FIFO order.
> I thought that because Thread-A had invoked run/sem_wait() first, the 
> kernel must wake up thread A immediately when flush1/sem_post() is called. 
> That may NOT be the case, right?

Right, there are no guarantees that when M calls signal/post to wake up A, 
A will actually get to run, consume the token, and so return from 
sem_wait. If the implementation is based on a futex then it likely behaves 
like a mutex: the newly woken thread will contend with all other threads 
to try to claim the semaphore. In such cases, if M calls sem_wait very 
shortly after sem_post, then M will likely reclaim the semaphore itself.
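
To make that concrete, here is a minimal standalone sketch (plain POSIX 
semaphores, illustrative only, not the HotSpot code) of how the poster can 
reclaim the token before the previously blocked thread ever runs:

  #include <semaphore.h>
  #include <pthread.h>
  #include <cstdio>

  // Illustrative only: sem_post() does not hand the token to a particular
  // waiter, so the posting thread can win the next race for the semaphore.
  static sem_t io_sem;

  static void* thread_A(void*) {
    sem_wait(&io_sem);             // A: may stay blocked even after M posts
    std::printf("A got the semaphore\n");
    sem_post(&io_sem);
    return nullptr;
  }

  int main() {
    sem_init(&io_sem, 0, 0);
    pthread_t a;
    pthread_create(&a, nullptr, thread_A, nullptr);

    sem_post(&io_sem);             // M: flush1 posts ...
    sem_wait(&io_sem);             // M: flush2 waits and may reclaim the
                                   //    token before A is even scheduled
    std::printf("M holds the semaphore again\n");
    sem_post(&io_sem);

    pthread_join(a, nullptr);
    sem_destroy(&io_sem);
    return 0;
  }

Whether A or M wins that second race is just a scheduling accident, which 
is exactly the non-FIFO behaviour described above.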

> It's possible that thread-A is descheduled and thread-M can keep going. 
> Under the hood, I see that glibc implements sem_wait using the futex syscall.
> 
> In my test, I stupidly put two set_log_config() in a loop. Therefore, M 
> invokes two flush() in a row!
> https://github.com/openjdk/jdk/pull/4408/files#diff-25062d200033077d44dacb5aae6b8141316b4bc10149c061cf047fe7db250d98R306
> 
> Let me try to simulate this and see if I have a fix.

I'm beginning to think that the only safe thing to do here is to 
lock out configuration changes while any write is in progress.
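
Roughly along these lines (an illustrative sketch with simplified, 
hypothetical types, not a proposed patch): one lock that both the write 
path and configuration changes must hold, so an output can never be 
deleted while a write to it is in flight:

  #include <mutex>
  #include <cstdio>

  // Illustrative sketch only; types and names here are hypothetical.
  struct Output {
    void write_blocking(const char* msg) { std::printf("%s\n", msg); }
  };

  static std::mutex config_lock;    // serializes writes vs. config changes

  void async_write(Output* out, const char* msg) {
    std::lock_guard<std::mutex> guard(config_lock);
    out->write_blocking(msg);       // output cannot be deleted concurrently
  }

  void reconfigure(Output*& out) {
    std::lock_guard<std::mutex> guard(config_lock);
    delete out;                     // safe: no write can be in progress
    out = nullptr;
  }

The obvious cost is that a long write_blocking() call would stall 
configuration changes for its duration.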

David
-----

> Thanks,
> -lx
> 
> 
> 
> 
> On Sun, Jun 20, 2021 at 6:54 PM David Holmes <david.holmes at oracle.com> wrote:
> 
>     Hi Xin,
> 
>     I'm looking at your code again and apologies if we covered this
>     previously, but I'm not seeing how things are supposed to work exactly.
>     Looking at the write() function:
> 
>     void AsyncLogWriter::write() {
>         AsyncLogBuffer logs;
>         bool own_io = false;
> 
>         { // critical region
>           AsyncLogLocker locker;
> 
>           _buffer.pop_all(&logs);
>           // append meta-messages of dropped counters
>           AsyncLogMapIterator dropped_counters_iter(logs);
>           _stats.iterate(&dropped_counters_iter);
>           own_io = _io_sem.trywait();
>         }
> 
>         LinkedListIterator<AsyncLogMessage> it(logs.head());
>         if (!own_io) {
>           _io_sem.wait();
>         }
> 
>         while (!it.is_empty()) {
>           AsyncLogMessage* e = it.next();
>           char* msg = e->message();
> 
>           if (msg != nullptr) {
>             e->output()->write_blocking(e->decorations(), msg);
> 
>     Suppose that a flush() operation has called write() and it has reached
>     the point where it is about to release the io_sem at the end and gets
>     descheduled. Other threads then perform logging operations that include
>     a specific output N. The async log thread executes the code above, and
>     stashes those log messages off in the local "logs" and then blocks on
>     io_sem_wait. Let the original flush release the io_sem, and a second
>     flush() commence due to the log configuration update that will delete
>     output N. That flush also completes and releases the io_sem. The async
>     log thread now acquires io_sem and attempts to write to the output that
>     was just deleted.
> 
>     David
>     -----
> 
>     On 21/06/2021 6:50 am, Liu, Xin wrote:
>      > Hi, Hotspot experts,
>      >
>      > I am debugging a PR which uses lock-free programming. David Holmes
>      > helped me submit this patch to the internal regression infra (thanks a
>      > lot). We got intermittent failures at tiers 4~7. There are two distinct
>      > crash sites, but they're similar. I guess the reason is that the
>      > asynclog thread observed a different memory state from other threads.
>      >
>      > https://github.com/openjdk/jdk/pull/4408#issuecomment-863788283
>      >
>      > I have spent a lot of effort on reproducing this problem. Eventually, I
>      > got a crash in docker, and luckily the core dump file and hs_err report
>      > were both saved.
>      >
>      > In the core dump file, I see this stacktrace in the asynclog thread.
>      >
>      > (gdb) bt
>      > ...
>      > #24 0x00007f76e1ab0e90 in flockfile () from /lib64/libc.so.6
>      > #25 0x00007f76e3d0c253 in FileLocker (file=0xbabababababababa,
>      > this=<synthetic pointer>) at
>      > /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
>      > #26 LogFileStreamOutput::write (this=0x55ea4d504bd0, decorations=...,
>      > msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at
>      > /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
>      > #27 0x00007f76e3d0b572 in LogFileOutput::write_blocking
>      > (this=0x55ea4d504bd0, decorations=..., msg=0x7f7624079b00
>      > "ConcurrentLogsite 3 emits a log") at
>      > /jdk/src/hotspot/share/logging/logFileOutput.cpp:308
>      > #28 0x00007f76e3d05bbd in write (this=0x55ea4d2e1af0) at
>      > /jdk/src/hotspot/share/logging/logAsyncWriter.hpp:106
>      > #29 AsyncLogWriter::run (this=0x55ea4d2e1af0) at
>      > /jdk/src/hotspot/share/logging/logAsyncWriter.cpp:163
>      > #30 0x00007f76e44696a0 in Thread::call_run (this=0x55ea4d2e1af0) at
>      > /jdk/src/hotspot/share/runtime/thread.cpp:394
>      > #31 0x00007f76e3fdf9b4 in thread_native_entry
>     (thread=0x55ea4d2e1af0) at
>      > /jdk/src/hotspot/os/linux/os_linux.cpp:720
>      > #32 0x00007f76e162cea5 in start_thread () from /lib64/libpthread.so.0
>      > #33 0x00007f76e1b439fd in clone () from /lib64/libc.so.6
>      >
>      > The corresponding code is here. _stream is 0xbabababababababa?
>      >
>      > int LogFileStreamOutput::write(const LogDecorations& decorations,
>      >                                const char* msg) {
>      >    const bool use_decorations = !_decorators.is_empty();
>      >
>      >    int written = 0;
>      >    FileLocker flocker(_stream);
>      >    if (use_decorations)
>      >
>      > Here is the relevant disassembly code.
>      > ...prolog
>      > 0x7f76e3d0c223 <LogFileStreamOutput::write(LogDecorations const&, char const*)+19>      mov    %rdi,%rbx
>      > 0x7f76e3d0c226 <LogFileStreamOutput::write(LogDecorations const&, char const*)+22>      sub    $0x128,%rsp
>      > 0x7f76e3d0c22d <LogFileStreamOutput::write(LogDecorations const&, char const*)+29>      mov    0x28(%rdi),%r15
>      > 0x7f76e3d0c231 <LogFileStreamOutput::write(LogDecorations const&, char const*)+33>      mov    0x20(%rdi),%r12d
>      > 0x7f76e3d0c235 <LogFileStreamOutput::write(LogDecorations const&, char const*)+37>      mov    %rdx,-0x148(%rbp)
>      > 0x7f76e3d0c23c <LogFileStreamOutput::write(LogDecorations const&, char const*)+44>      mov    %r15,%rdi
>      > 0x7f76e3d0c23f <LogFileStreamOutput::write(LogDecorations const&, char const*)+47>      mov    %fs:0x28,%rax
>      > 0x7f76e3d0c248 <LogFileStreamOutput::write(LogDecorations const&, char const*)+56>      mov    %rax,-0x38(%rbp)
>      > 0x7f76e3d0c24c <LogFileStreamOutput::write(LogDecorations const&, char const*)+60>      xor    %eax,%eax
>      > 0x7f76e3d0c24e <LogFileStreamOutput::write(LogDecorations const&, char const*)+62>      callq  0x7f76e3feb090 <os::flockfile(_IO_FILE*)>
>      > 0x7f76e3d0c253 <LogFileStreamOutput::write(LogDecorations const&, char const*)+67>      test   %r12d,%r12d
>      >
>      >
>      > We can see that r15 = load 0x28(%rdi) and its value is bad. However,
>      > in memory, its value is correct: this->_stream is 0x55ea4d0f8d00.
>      >
>      > (gdb) p /x $r15
>      > $13 = 0xbabababababababa
>      >
>      > (gdb) p /x $rbx
>      > $27 = 0x55ea4d504bd0
>      >
>      > (gdb) x /1g $rbx + 0x28
>      > 0x55ea4d504bf8: 0x000055ea4d0f8d00
>      >
>      > In the main thread, we can see that output->_stream is well defined.
>      >
>      > (gdb) p output
>      > $24 = (LogOutput *) 0x55ea4d504bd0
>      >
>      > (gdb) p ((LogFileOutput*)output)->_stream
>      > $23 = (FILE *) 0x55ea4d0f8d00
>      >
>      >
>      >
>     -------------------------------------------------------------------------
>      >
>      > Can I say it's a memory ordering issue on x86_64? TBH, I can't explain
>      > how that happens.
>      >
>      > At an earlier point in the main thread, LogFileOutput::initialize() was
>      > called and _stream was initialized. Since then, the program has at
>      > least gone through:
>      >
>      > 1. Atomic::add (a lock-prefixed instruction, which implies a store-load
>      > barrier?).
>      > 2. sem_wait/post (acquire-release semantics).
>      >
>      > Therefore, I believe the current thread must see the value written by
>      > LogFileOutput::initialize(), i.e. _stream should be (FILE *)
>      > 0x55ea4d0f8d00, not 0xbababa...
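>      >
>      > (As an illustrative aside, not the HotSpot code: the publication
>      > pattern being relied on here is essentially release/acquire, e.g.
>      > sketched with std::atomic:
>      >
>      >   #include <atomic>
>      >   #include <cstdio>
>      >
>      >   std::atomic<FILE*> stream{nullptr};
>      >
>      >   // main thread: initialize, then publish; sem_post / a lock-prefixed
>      >   // RMW should be at least as strong as this release store
>      >   void publish(FILE* f) {
>      >     stream.store(f, std::memory_order_release);
>      >   }
>      >
>      >   // async log thread: an acquire (sem_wait) that observes the post
>      >   // must also observe everything written before it, including _stream
>      >   FILE* consume() {
>      >     return stream.load(std::memory_order_acquire);
>      >   }
>      >
>      > so once the async thread has returned from sem_wait after the post, it
>      > should also see the initialized _stream.)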
>      >
>      > Last but not least, a question. To further understand this problem and
>      > verify the patch, I think the key is to find a reproducer. Any thoughts
>      > on how to reproduce this kind of problem?  So far, I have only come
>      > across this crash once on my side. I used a docker image of Oracle
>      > Linux 7.9 (glibc 2.17) with cpuset 0~2.  Besides a core dump and a
>      > crash report, I can't reproduce it anymore. I think I must be
>      > overlooking some important factors here. I'd really appreciate any
>      > pointers.
>      >
>      > I notice that Oracle is using the 3rd-gen Zen architecture to run
>      > tests. Is it any different from conventional x86_64? In AMD's
>      > manual[1], section 7.4.2, Table 3, here is the description of
>      > lock-prefixed instructions:
>      >
>      > k  — All loads and stores associated with the I/O and locked
>      > instructions complete to memory (no buffered stores) before a load or
>      > store from a subsequent instruction is issued.
>      >
>      > It seems to me that it doesn't imply a "bidirectional full memory
>      > barrier" or store-load barrier; it just has acquire semantics. I read
>      > the story behind this change and Aleksey's article[2]. I'm just not
>      > sure the following code is equivalent to "mfence" on EPYC machines,
>      > because AMD's manual is different from Intel's.
>      >
>      > inline void OrderAccess::fence() {
>      >     // always use locked addl since mfence is sometimes expensive
>      > #ifdef AMD64
>      >    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
>      > #else
>      >    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
>      > #endif
>      >    compiler_barrier();
>      > }
>      >
>      >
>      > [1] https://www.amd.com/system/files/TechDocs/24593.pdf
>      > [2] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_footnote_1
>      >
>      > thanks,
>      > --lx
>      >
>      >
>      >
> 

