A question about memory order on x86_64
Liu, Xin
xxinliu at amazon.com
Sun Jun 20 20:50:03 UTC 2021
Hi, Hotspot experts,
I am debugging a PR which uses lock-free programming. David Holmes
helped me submit this patch to the internal regression infra (thanks a
lot). We got intermittent failures at tier4~7. There are two distinct
crash sites, but they're similar. My guess is that the asynclog thread
observed a different memory state from the other threads.
https://github.com/openjdk/jdk/pull/4408#issuecomment-863788283
I have spent a lot of effort on reproducing this problem. Eventually, I
got a crash in docker, and luckily the core dump file and hs_err report
were both saved.
In the core dump file, I see this stack trace in the asynclog thread.
(gdb) bt
...
#24 0x00007f76e1ab0e90 in flockfile () from /lib64/libc.so.6
#25 0x00007f76e3d0c253 in FileLocker (file=0xbabababababababa,
this=<synthetic pointer>) at
/jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
#26 LogFileStreamOutput::write (this=0x55ea4d504bd0, decorations=...,
msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at
/jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
#27 0x00007f76e3d0b572 in LogFileOutput::write_blocking
(this=0x55ea4d504bd0, decorations=..., msg=0x7f7624079b00
"ConcurrentLogsite 3 emits a log") at
/jdk/src/hotspot/share/logging/logFileOutput.cpp:308
#28 0x00007f76e3d05bbd in write (this=0x55ea4d2e1af0) at
/jdk/src/hotspot/share/logging/logAsyncWriter.hpp:106
#29 AsyncLogWriter::run (this=0x55ea4d2e1af0) at
/jdk/src/hotspot/share/logging/logAsyncWriter.cpp:163
#30 0x00007f76e44696a0 in Thread::call_run (this=0x55ea4d2e1af0) at
/jdk/src/hotspot/share/runtime/thread.cpp:394
#31 0x00007f76e3fdf9b4 in thread_native_entry (thread=0x55ea4d2e1af0) at
/jdk/src/hotspot/os/linux/os_linux.cpp:720
#32 0x00007f76e162cea5 in start_thread () from /lib64/libpthread.so.0
#33 0x00007f76e1b439fd in clone () from /lib64/libc.so.6
The corresponding code is here. How can _stream be 0xbabababababababa?

int LogFileStreamOutput::write(const LogDecorations& decorations, const char* msg) {
  const bool use_decorations = !_decorators.is_empty();
  int written = 0;
  FileLocker flocker(_stream);
  if (use_decorations)
    ...
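For context, judging from frames #24/#25 above, FileLocker is a small RAII
helper that locks the FILE* in its constructor and unlocks it in its
destructor, so whatever garbage happens to be in _stream is passed straight
into flockfile(). Roughly (my own sketch of the idea, not the exact HotSpot
code):

class FileLocker {
 private:
  FILE* _file;
 public:
  explicit FileLocker(FILE* file) : _file(file) {
    os::flockfile(_file);      // this is frame #24/#25 in the backtrace above
  }
  ~FileLocker() {
    os::funlockfile(_file);    // released when the scope ends
  }
};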
Here is the relevant disassembly code.
...prolog
0x7f76e3d0c223 <LogFileStreamOutput::write(LogDecorations const&, char const*)+19>   mov    %rdi,%rbx
0x7f76e3d0c226 <LogFileStreamOutput::write(LogDecorations const&, char const*)+22>   sub    $0x128,%rsp
0x7f76e3d0c22d <LogFileStreamOutput::write(LogDecorations const&, char const*)+29>   mov    0x28(%rdi),%r15
0x7f76e3d0c231 <LogFileStreamOutput::write(LogDecorations const&, char const*)+33>   mov    0x20(%rdi),%r12d
0x7f76e3d0c235 <LogFileStreamOutput::write(LogDecorations const&, char const*)+37>   mov    %rdx,-0x148(%rbp)
0x7f76e3d0c23c <LogFileStreamOutput::write(LogDecorations const&, char const*)+44>   mov    %r15,%rdi
0x7f76e3d0c23f <LogFileStreamOutput::write(LogDecorations const&, char const*)+47>   mov    %fs:0x28,%rax
0x7f76e3d0c248 <LogFileStreamOutput::write(LogDecorations const&, char const*)+56>   mov    %rax,-0x38(%rbp)
0x7f76e3d0c24c <LogFileStreamOutput::write(LogDecorations const&, char const*)+60>   xor    %eax,%eax
0x7f76e3d0c24e <LogFileStreamOutput::write(LogDecorations const&, char const*)+62>   callq  0x7f76e3feb090 <os::flockfile(_IO_FILE*)>
0x7f76e3d0c253 <LogFileStreamOutput::write(LogDecorations const&, char const*)+67>   test   %r12d,%r12d
We can see that %r15 = load 0x28(%rdi), and its value is bad. However, in
memory its value is correct: this->_stream is 0x55ea4d0f8d00.
(gdb) p /x $r15
$13 = 0xbabababababababa
(gdb) p /x $rbx
$27 = 0x55ea4d504bd0
(gdb) x /1g $rbx + 0x28
0x55ea4d504bf8: 0x000055ea4d0f8d00
In the main thread, we can see that output->_stream is well defined.
(gdb) p output
$24 = (LogOutput *) 0x55ea4d504bd0
(gdb) p ((LogFileOutput*)output)->_stream
$23 = (FILE *) 0x55ea4d0f8d00
-------------------------------------------------------------------------
Can I say it's a memory ordering issue on x86_64? TBH, I can't explain
how that happens.
Earlier on, in the main thread, LogFileOutput::initialize() was called and
_stream was initialized. Since then, the program has at least gone through:
1. Atomic::add (a lock-prefixed instruction, which should imply a store-load
barrier?).
2. sem_wait/post (acquire-release semantics).
Therefore, I believe the current thread must see the value written by
LogFileOutput::initialize(), i.e. _stream should be (FILE *)
0x55ea4d0f8d00, not 0xbababa...
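To spell out the happens-before chain I am relying on, here is a minimal,
stand-alone sketch (my own, not the HotSpot code; all names are made up):
the field is written before sem_post() in one thread and only read after
sem_wait() in the other, so the reader should never observe a stale value.

#include <semaphore.h>
#include <pthread.h>
#include <cstdio>

static FILE* g_stream = nullptr;   // stands in for LogFileOutput::_stream
static sem_t g_sem;

static void* async_writer(void*) {
  sem_wait(&g_sem);                // acquire: pairs with sem_post() below
  // Everything written before sem_post() must be visible here, so
  // g_stream should be the real FILE*, never 0xbaba... garbage.
  std::fprintf(g_stream, "hello\n");
  return nullptr;
}

int main() {
  sem_init(&g_sem, 0, 0);
  pthread_t tid;
  pthread_create(&tid, nullptr, async_writer, nullptr);

  g_stream = std::fopen("test.log", "w");  // plays the role of LogFileOutput::initialize()
  sem_post(&g_sem);                        // release: publishes g_stream

  pthread_join(tid, nullptr);
  std::fclose(g_stream);
  sem_destroy(&g_sem);
  return 0;
}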
Last but not least, a question. To further understand this problem and
verify the patch, I think the key is to find a reproducer. Any thoughts on
how to reproduce this kind of problem? So far, I have only come across this
crash once on my side. I used a docker image of Oracle Linux 7.9 (glibc
2.17) with cpuset 0~2. Besides that one core dump and crash report, I can't
reproduce it anymore. I think I must be overlooking some important factor
here. I would really appreciate any pointers.
I notice that Oracle is using the 3rd-gen Zen architecture to run tests. Is
it anything different from conventional x86_64? In AMD's manual[1], section
7.4.2, Table 3, here is the description of lock-prefixed instructions:

k — All loads and stores associated with the I/O and locked
instructions complete to memory (no buffered stores) before a load or
store from a subsequent instruction is issued.

It seems to me that this doesn't imply a "bidirectional full memory barrier"
or a store-load barrier; it only gives acquire semantics. I read the story
behind this change and Aleksey's article[2]. I'm just not sure the
following code is equivalent to "mfence" on EPYC machines, because AMD's
manual is different from Intel's.
inline void OrderAccess::fence() {
  // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  compiler_barrier();
}
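If it helps, this is the kind of store-buffering litmus test (my own
sketch, not from the JDK) that I would run on an EPYC box to check
empirically whether that lock-prefixed addl really orders stores before
subsequent loads. If it does act as a full fence, the r1 == 0 && r2 == 0
outcome should never appear.

#include <atomic>
#include <cstdio>
#include <thread>

static std::atomic<int> x{0}, y{0};
static int r1, r2;

static inline void locked_addl_fence() {
  // The same instruction HotSpot's OrderAccess::fence() uses on AMD64.
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
}

int main() {
  long forbidden = 0;
  for (long i = 0; i < 1000000; i++) {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);

    std::thread t1([] {
      x.store(1, std::memory_order_relaxed);
      locked_addl_fence();                     // should keep the store above from being
      r1 = y.load(std::memory_order_relaxed);  // reordered with this later load
    });
    std::thread t2([] {
      y.store(1, std::memory_order_relaxed);
      locked_addl_fence();
      r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    if (r1 == 0 && r2 == 0) {
      forbidden++;                             // would indicate store-load reordering
    }
  }
  std::printf("forbidden (r1==0 && r2==0) outcomes: %ld\n", forbidden);
  return 0;
}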
[1] https://www.amd.com/system/files/TechDocs/24593.pdf
[2] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_footnote_1
thanks,
--lx