A question about memory order on x86_64

David Holmes david.holmes at oracle.com
Mon Jun 21 01:53:36 UTC 2021


Hi Xin,

I'm looking at your code again, and apologies if we covered this 
previously, but I'm not seeing exactly how things are supposed to work. 
Looking at the write() function:

void AsyncLogWriter::write() {
   AsyncLogBuffer logs;
   bool own_io = false;

   { // critical region
     AsyncLogLocker locker;

     _buffer.pop_all(&logs);
     // append meta-messages of dropped counters
     AsyncLogMapIterator dropped_counters_iter(logs);
     _stats.iterate(&dropped_counters_iter);
     own_io = _io_sem.trywait();
   }

   LinkedListIterator<AsyncLogMessage> it(logs.head());
   if (!own_io) {
     _io_sem.wait();
   }

   while (!it.is_empty()) {
     AsyncLogMessage* e = it.next();
     char* msg = e->message();

     if (msg != nullptr) {
       e->output()->write_blocking(e->decorations(), msg);

Suppose that a flush() operation has called write(), has reached the 
point where it is about to release the io_sem at the end, and gets 
descheduled. Other threads then perform logging operations that include 
a specific output N. The async log thread executes the code above, 
stashes those log messages off in the local "logs", and then blocks on 
_io_sem.wait(). Let the original flush release the io_sem, and a second 
flush() commence due to a log configuration update that will delete 
output N. That flush also completes and releases the io_sem. The async 
log thread now acquires the io_sem and attempts to write to the output 
that was just deleted.
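
Roughly, the interleaving is (the thread labels below are just for 
illustration):

   flusher #1   : write() drains the buffer, finishes its writes, and is
                  about to release _io_sem when it gets descheduled
   log sites    : enqueue messages targeting output N
   async thread : pop_all() moves those messages into the local "logs";
                  _io_sem.trywait() fails, so it blocks in _io_sem.wait()
   flusher #1   : releases _io_sem
   flusher #2   : triggered by the configuration update, deletes output N,
                  then releases _io_sem
   async thread : acquires _io_sem and calls write_blocking() on the
                  now-deleted output N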

David
-----

On 21/06/2021 6:50 am, Liu, Xin wrote:
> Hi, Hotspot experts,
> 
> I am debugging a PR which uses lock-free programming. David Holmes
> helped me submit this patch to the internal regression infra (thanks a
> lot). We got intermittent failures at tier4~7. There are two distinct
> crash sites, but they're similar. I guess the reason is that the async
> log thread observed a different memory state from other threads.
> 
> https://github.com/openjdk/jdk/pull/4408#issuecomment-863788283
> 
> I have spent a lot of effort on reproducing this problem. Eventually, I
> got a crash in docker, and luckily the core dump file and hs_err report
> were both saved.
> 
> In the core dump file, I see this stacktrace in the async log thread.
> 
> (gdb) bt
> ...
> #24 0x00007f76e1ab0e90 in flockfile () from /lib64/libc.so.6
> #25 0x00007f76e3d0c253 in FileLocker (file=0xbabababababababa,
> this=<synthetic pointer>) at
> /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
> #26 LogFileStreamOutput::write (this=0x55ea4d504bd0, decorations=...,
> msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at
> /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
> #27 0x00007f76e3d0b572 in LogFileOutput::write_blocking
> (this=0x55ea4d504bd0, decorations=..., msg=0x7f7624079b00
> "ConcurrentLogsite 3 emits a log") at
> /jdk/src/hotspot/share/logging/logFileOutput.cpp:308
> #28 0x00007f76e3d05bbd in write (this=0x55ea4d2e1af0) at
> /jdk/src/hotspot/share/logging/logAsyncWriter.hpp:106
> #29 AsyncLogWriter::run (this=0x55ea4d2e1af0) at
> /jdk/src/hotspot/share/logging/logAsyncWriter.cpp:163
> #30 0x00007f76e44696a0 in Thread::call_run (this=0x55ea4d2e1af0) at
> /jdk/src/hotspot/share/runtime/thread.cpp:394
> #31 0x00007f76e3fdf9b4 in thread_native_entry (thread=0x55ea4d2e1af0) at
> /jdk/src/hotspot/os/linux/os_linux.cpp:720
> #32 0x00007f76e162cea5 in start_thread () from /lib64/libpthread.so.0
> #33 0x00007f76e1b439fd in clone () from /lib64/libc.so.6
> 
> The corresponding code is here. _stream is 0xbabababababababa?
> 
> int LogFileStreamOutput::write(const LogDecorations& decorations, const
> char* msg) {
>    const bool use_decorations = !_decorators.is_empty();
> 
>    int written = 0;
>    FileLocker flocker(_stream);
>    if (use_decorations)
> 
> Here is the relevant disassembly code.
> ...prolog
> 0x7f76e3d0c223 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+19>      mov    %rdi,%rbx
> 0x7f76e3d0c226 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+22>      sub    $0x128,%rsp
> 0x7f76e3d0c22d <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+29>      mov    0x28(%rdi),%r15
> 0x7f76e3d0c231 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+33>      mov    0x20(%rdi),%r12d
> 0x7f76e3d0c235 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+37>      mov    %rdx,-0x148(%rbp)
> 0x7f76e3d0c23c <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+44>      mov    %r15,%rdi
> 0x7f76e3d0c23f <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+47>      mov    %fs:0x28,%rax
> 0x7f76e3d0c248 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+56>      mov    %rax,-0x38(%rbp)
> 0x7f76e3d0c24c <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+60>      xor    %eax,%eax
> 0x7f76e3d0c24e <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+62>      callq  0x7f76e3feb090 <os::flockfile(_IO_FILE*)>
> 0x7f76e3d0c253 <LogFileStreamOutput::write(LogDecorations const&, char
> const*)+67>      test   %r12d,%r12d
> 
> 
> We can see that r15 = load 0x28(%rdi) and its value is bad. However, in
> memory its value is correct: this->_stream is 0x55ea4d0f8d00.
> 
> (gdb) p /x $r15
> $13 = 0xbabababababababa
> 
> (gdb) p /x $rbx
> $27 = 0x55ea4d504bd0
> 
> (gdb) x /1g $rbx + 0x28
> 0x55ea4d504bf8: 0x000055ea4d0f8d00
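> 
> To map that back to the source (assuming, from the dump above, that
> _stream lives at offset 0x28 in the output object), the two instructions
> of interest are doing roughly:
> 
>    // mov 0x28(%rdi),%r15                    =>  r15 = this->_stream
>    // mov %r15,%rdi ; callq os::flockfile    =>  os::flockfile(_stream)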
> 
> In the main thread, we can see that output->_stream is well defined.
> 
> (gdb) p output
> $24 = (LogOutput *) 0x55ea4d504bd0
> 
> (gdb) p ((LogFileOutput*)output)->_stream
> $23 = (FILE *) 0x55ea4d0f8d00
> 
> 
> -------------------------------------------------------------------------
> 
> Can I say it's a memory ordering issue on x86_64? TBH, I can't explain
> how that happens.
> 
> At an earlier point in the main thread, LogFileOutput::initialize() was
> called and _stream was initialized. Since then, the program has gone
> through at least:
> 
> 1. Atomic::add (a lock-prefixed instruction, which implies a store-load
> barrier?).
> 2. sem_wait/post (acquire-release semantics).
> 
> Therefore, I believe the current thread must see the value written by
> LogFileOutput::initialize(), i.e. _stream should be (FILE *)
> 0x55ea4d0f8d00, not 0xbababa...
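> 
> A minimal sketch (not the actual HotSpot code) of the publication
> pattern I'm assuming here, with g_stream standing in for _stream and
> g_sem for _io_sem:
> 
>    #include <cstdio>
>    #include <semaphore.h>
> 
>    FILE* g_stream = nullptr;
>    sem_t g_sem;                          // sem_init(&g_sem, 0, 0) elsewhere
> 
>    void initializer() {                  // main thread
>      g_stream = fopen("test.log", "w");  // plain store
>      sem_post(&g_sem);                   // release: should publish the store
>    }
> 
>    void consumer() {                     // async log thread
>      sem_wait(&g_sem);                   // acquire: pairs with sem_post
>      fputs("hello\n", g_stream);         // expected to see the new value
>    }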
> 
> Last but not least: to further understand this problem and verify the
> patch, I think the key is to find a reproducer. Any thoughts on how to
> reproduce this kind of problem?  So far, I have only come across this
> crash once on my side. I used a docker image of Oracle Linux 7.9 (glibc
> 2.17) with cpuset 0~2.  Besides that one core dump and crash report, I
> can't reproduce it anymore. I think I must be overlooking some important
> factors here.  I'd really appreciate some pointers.
> 
> I notice that Oracle is using 3rd-gen Zen machines to run tests. Is that
> any different from conventional x86_64? In AMD's manual[1], 7.4.2,
> Table-3, here is the description of lock-prefixed instructions:
> 
> k — All loads and stores associated with the I/O and locked
> instructions complete to memory (no buffered stores) before a load or
> store from a subsequent instruction is issued.
> 
> It seems to me that this doesn't imply a "bidirectional full memory
> barrier" or a store-load barrier; it just has acquire semantics. I read
> the story behind this change and Aleksey's article[2]. I'm just not sure
> the following code is equivalent to "mfence" on EPYC machines, because
> AMD's manual is different from Intel's.
> 
> inline void OrderAccess::fence() {
>     // always use locked addl since mfence is sometimes expensive
> #ifdef AMD64
>    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
> #else
>    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
> #endif
>    compiler_barrier();
> }
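> 
> For context, the property I'm worried about is the classic
> store-buffering (Dekker) case: without a full StoreLoad barrier between
> the store and the following load, both threads may read the old value.
> A minimal sketch (my own example, assuming the HotSpot OrderAccess
> declarations are available):
> 
>    volatile int x = 0, y = 0;
>    int r1, r2;
> 
>    void thread1() {
>      x = 1;
>      OrderAccess::fence();   // must keep the load of y below the store to x
>      r1 = y;
>    }
> 
>    void thread2() {
>      y = 1;
>      OrderAccess::fence();
>      r2 = x;
>    }
> 
>    // If fence() were only an acquire barrier, r1 == 0 && r2 == 0 would be
>    // a permitted outcome; a full StoreLoad barrier forbids it.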
> 
> 
> [1] https://www.amd.com/system/files/TechDocs/24593.pdf
> [2]
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_footnote_1
> 
> thanks,
> --lx
> 