A question about memory order on x86_64
David Holmes
david.holmes at oracle.com
Mon Jun 21 23:28:20 UTC 2021
On 22/06/2021 8:02 am, David Holmes wrote:
> On 22/06/2021 6:49 am, Liu Xin wrote:
>> hi, David,
>>
>> Thanks. This is my blindspot. Let me go through your description.
>> There are 3 threads. M is the main thread which is doing
>> logConfiguration changes.
>> A is the async log thread, which is flushing. L is a user thread which
>> is emitting logs.
>>
>> M: flush1/sem_wait ------------------ flush1/sem_post -- flush2/sem_wait -- flush2/sem_post -- delete_output()
>> L: -- enqueue(log1)
>> A: ---- dequeue(log1) -- run/sem_wait -------------------------------------------------------- write_blocking(log1)
>>
>> This is the timeline you described, isn't it?
>
> Probably close enough.
>
>> A key piece of information I was missing is that sem_wait() (POSIX
>> semaphore) doesn't respect FIFO order.
>> I thought that because Thread-A invoked run/sem_wait() first, the
>> kernel must wake up thread A immediately once flush1/sem_post() is
>> called. That may NOT be the case, right?
>
> Right, there are no guarantees that when M calls signal/post to wake up
> A, A will actually get to run, consume the token and so return from
> sem_wait. If the implementation is based on futex then it is likely
> similar to mutex operation: the newly woken thread will contend with
> all other threads to try and claim the semaphore. In such cases, if M
> calls sem_wait very shortly after sem_post then it will likely reclaim
> the semaphore itself.
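
To make the "no FIFO guarantee" point concrete, here is a minimal
standalone sketch (plain POSIX, not HotSpot code; all names are made up)
in which the posting thread can re-acquire the semaphore it just posted
before the long-blocked waiter ever runs. Whether the trywait below
succeeds is timing dependent, which is exactly the point:

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <unistd.h>

  static sem_t io_sem;

  static void* waiter(void* arg) {
    sem_wait(&io_sem);              // blocks first, like the async log thread
    printf("waiter finally got the token\n");
    return NULL;
  }

  int main() {
    sem_init(&io_sem, 0, 0);
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);                       // let the waiter block in sem_wait() first

    sem_post(&io_sem);              // like flush1's sem_post in thread M
    // Nothing forces the waiter to run now; the poster may barge in and
    // consume the token itself, like flush2's sem_wait in thread M.
    if (sem_trywait(&io_sem) == 0) {
      printf("poster re-acquired the semaphore before the waiter ran\n");
      sem_post(&io_sem);            // hand the token back so the waiter can exit
    }
    pthread_join(t, NULL);
    sem_destroy(&io_sem);
    return 0;
  }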
>
>> It's possible that thread-A is descheduled and thread-M can keep
>> going. Under the hood, I see that glibc implements sem_wait using
>> the futex syscall.
>>
>> In my test, I stupidly put two set_log_config() in a loop. Therefore,
>> M invokes two flush() in a row!
>> https://github.com/openjdk/jdk/pull/4408/files#diff-25062d200033077d44dacb5aae6b8141316b4bc10149c061cf047fe7db250d98R306
>>
>>
>> Let me try to simulate this and see if I have a fix.
>
> I'm beginning to think that the only safe thing to do here is to
> lock-out configuration changes while any write is in progress.
But we can't do that as we could block indefinitely while writing.
Backing up a step, let's ignore async logging for a moment. How does
synchronous logging deal with this problem? What ensures that a log
request can't use an output that is being concurrently deleted by a
configuration change?
David
> David
> -----
>
>> Thanks,
>> -lx
>>
>>
>>
>>
>> On Sun, Jun 20, 2021 at 6:54 PM David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Xin,
>>
>> I'm looking at your code again, and apologies if we covered this
>> previously, but I'm not seeing how things are supposed to work
>> exactly.
>> Looking at the write() function:
>>
>> void AsyncLogWriter::write() {
>> AsyncLogBuffer logs;
>> bool own_io = false;
>>
>> { // critical region
>> AsyncLogLocker locker;
>>
>> _buffer.pop_all(&logs);
>> // append meta-messages of dropped counters
>> AsyncLogMapIterator dropped_counters_iter(logs);
>> _stats.iterate(&dropped_counters_iter);
>> own_io = _io_sem.trywait();
>> }
>>
>> LinkedListIterator<AsyncLogMessage> it(logs.head());
>> if (!own_io) {
>> _io_sem.wait();
>> }
>>
>> while (!it.is_empty()) {
>> AsyncLogMessage* e = it.next();
>> char* msg = e->message();
>>
>> if (msg != nullptr) {
>> e->output()->write_blocking(e->decorations(), msg);
>>
>> Suppose that a flush() operation has called write() and has reached
>> the point where it is about to release the io_sem at the end, and gets
>> descheduled. Other threads then perform logging operations that
>> include a specific output N. The async log thread executes the code
>> above, stashes those log messages away in the local "logs", and then
>> blocks in _io_sem.wait(). Let the original flush release the io_sem,
>> and a second flush() commence due to the log configuration update that
>> will delete output N. That flush also completes and releases the
>> io_sem. The async log thread now acquires the io_sem and attempts to
>> write to the output that was just deleted.
>>
>> David
>> -----
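
The hazard can be mirrored in a deliberately buggy, self-contained
sketch (made-up Output/Message types, not the real UL classes). Run
under AddressSanitizer it reports a use-after-free at the fprintf, which
has the same shape as the crash discussed below: the async thread
dereferences an output that the configuration change has already
deleted.

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  // Made-up stand-ins for LogFileOutput and a queued AsyncLogMessage.
  struct Output  { FILE* stream; };
  struct Message { Output* output; const char* text; };

  static sem_t   io_sem;
  static Message stashed;            // plays the role of the local "logs" list

  static void* async_log_thread(void* arg) {
    // The async thread has already popped the message and now blocks,
    // like _io_sem.wait() in write().
    sem_wait(&io_sem);
    // By the time it gets here, output N may already be gone.
    fprintf(stashed.output->stream, "%s\n", stashed.text);   // use-after-free
    return NULL;
  }

  int main() {
    sem_init(&io_sem, 0, 0);
    Output* n = (Output*)malloc(sizeof(Output));
    n->stream = stdout;

    // A user thread enqueued a message that refers to output N.
    stashed.output = n;
    stashed.text   = "ConcurrentLogsite 3 emits a log";

    pthread_t t;
    pthread_create(&t, NULL, async_log_thread, NULL);

    // The configuration change deletes output N ...
    memset(n, 0xba, sizeof(Output));   // mimic debug-build zapping of freed memory
    free(n);
    // ... and only then releases the io_sem (the second flush completing).
    sem_post(&io_sem);

    pthread_join(t, NULL);
    sem_destroy(&io_sem);
    return 0;
  }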
>>
>> On 21/06/2021 6:50 am, Liu, Xin wrote:
>> > Hi, Hotspot experts,
>> >
>> > I am debugging a PR which uses lock-free programming. David Holmes
>> > helped me submit this patch to the internal regression infra (thanks
>> > a lot). We got intermittent failures at tiers 4~7. There are two
>> > distinct crash sites, but they're similar. I guess the reason is that
>> > the asynclog thread observed a different memory state from other
>> > threads.
>> >
>> > https://github.com/openjdk/jdk/pull/4408#issuecomment-863788283
>> >
>> > I have spent a lot of effort on reproducing this problem.
>> > Eventually, I got a crash in docker, and luckily the coredump file
>> > and hs_err report were both saved.
>> >
>> > In the core dump file, I see this stack trace in the asynclog thread.
>> >
>> > (gdb) bt
>> > ...
>> > #24 0x00007f76e1ab0e90 in flockfile () from /lib64/libc.so.6
>> > #25 0x00007f76e3d0c253 in FileLocker (file=0xbabababababababa,
>> >     this=<synthetic pointer>) at /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
>> > #26 LogFileStreamOutput::write (this=0x55ea4d504bd0, decorations=...,
>> >     msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at /jdk/src/hotspot/share/logging/logFileStreamOutput.cpp:125
>> > #27 0x00007f76e3d0b572 in LogFileOutput::write_blocking (this=0x55ea4d504bd0, decorations=...,
>> >     msg=0x7f7624079b00 "ConcurrentLogsite 3 emits a log") at /jdk/src/hotspot/share/logging/logFileOutput.cpp:308
>> > #28 0x00007f76e3d05bbd in write (this=0x55ea4d2e1af0) at /jdk/src/hotspot/share/logging/logAsyncWriter.hpp:106
>> > #29 AsyncLogWriter::run (this=0x55ea4d2e1af0) at /jdk/src/hotspot/share/logging/logAsyncWriter.cpp:163
>> > #30 0x00007f76e44696a0 in Thread::call_run (this=0x55ea4d2e1af0) at /jdk/src/hotspot/share/runtime/thread.cpp:394
>> > #31 0x00007f76e3fdf9b4 in thread_native_entry (thread=0x55ea4d2e1af0) at /jdk/src/hotspot/os/linux/os_linux.cpp:720
>> > #32 0x00007f76e162cea5 in start_thread () from /lib64/libpthread.so.0
>> > #33 0x00007f76e1b439fd in clone () from /lib64/libc.so.6
>> >
>> > The corresponding code is here. _stream is 0xbabababababababa?
>> >
>> > int LogFileStreamOutput::write(const LogDecorations& decorations, const char* msg) {
>> >   const bool use_decorations = !_decorators.is_empty();
>> >
>> >   int written = 0;
>> >   FileLocker flocker(_stream);
>> >   if (use_decorations)
>> >
>> > Here is the relevant disassembly code.
>> > ...prolog
>> > 0x7f76e3d0c223 <LogFileStreamOutput::write(LogDecorations const&, char const*)+19>  mov    %rdi,%rbx
>> > 0x7f76e3d0c226 <LogFileStreamOutput::write(LogDecorations const&, char const*)+22>  sub    $0x128,%rsp
>> > 0x7f76e3d0c22d <LogFileStreamOutput::write(LogDecorations const&, char const*)+29>  mov    0x28(%rdi),%r15
>> > 0x7f76e3d0c231 <LogFileStreamOutput::write(LogDecorations const&, char const*)+33>  mov    0x20(%rdi),%r12d
>> > 0x7f76e3d0c235 <LogFileStreamOutput::write(LogDecorations const&, char const*)+37>  mov    %rdx,-0x148(%rbp)
>> > 0x7f76e3d0c23c <LogFileStreamOutput::write(LogDecorations const&, char const*)+44>  mov    %r15,%rdi
>> > 0x7f76e3d0c23f <LogFileStreamOutput::write(LogDecorations const&, char const*)+47>  mov    %fs:0x28,%rax
>> > 0x7f76e3d0c248 <LogFileStreamOutput::write(LogDecorations const&, char const*)+56>  mov    %rax,-0x38(%rbp)
>> > 0x7f76e3d0c24c <LogFileStreamOutput::write(LogDecorations const&, char const*)+60>  xor    %eax,%eax
>> > 0x7f76e3d0c24e <LogFileStreamOutput::write(LogDecorations const&, char const*)+62>  callq  0x7f76e3feb090 <os::flockfile(_IO_FILE*)>
>> > 0x7f76e3d0c253 <LogFileStreamOutput::write(LogDecorations const&, char const*)+67>  test   %r12d,%r12d
>> >
>> >
>> > We can see that r15 = load 0x28(%rdi) and its value is bad. However,
>> > in memory, its value is correct: this->_stream is 0x55ea4d0f8d00.
>> >
>> > (gdb) p /x $r15
>> > $13 = 0xbabababababababa
>> >
>> > (gdb) p /x $rbx
>> > $27 = 0x55ea4d504bd0
>> >
>> > (gdb) x /1g $rbx + 0x28
>> > 0x55ea4d504bf8: 0x000055ea4d0f8d00
>> >
>> > In the main thread, we can see that output->_stream is well defined.
>> >
>> > (gdb) p output
>> > $24 = (LogOutput *) 0x55ea4d504bd0
>> >
>> > (gdb) p ((LogFileOutput*)output)->_stream
>> > $23 = (FILE *) 0x55ea4d0f8d00
>> >
>> >
>> >
>> >
>> > -------------------------------------------------------------------------
>> >
>> > Can I say it's a memory ordering issue on x86_64? TBH, I can't
>> > explain how that happens.
>> >
>> > At an earlier moment in the main thread, LogFileOutput::initialize()
>> > was called and _stream was initialized. Since then, the program has
>> > at least gone through:
>> >
>> > 1. Atomic::add (a lock-prefixed instruction, which implies a
>> > store-load barrier?).
>> > 2. sem_wait/post (acquire-release semantics).
>> >
>> > Therefore, I believe the current thread must see the value written by
>> > LogFileOutput::initialize(), i.e. _stream should be (FILE *)
>> > 0x55ea4d0f8d00, not 0xbababa...
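
The intended happens-before chain can be written down as a small sketch
(std::atomic is used here as a stand-in for whatever release/acquire
pairing the sem_post/sem_wait or lock-prefixed RMW provides; this is not
HotSpot code). If the async thread really does pass through such a
pairing after _stream is set, it cannot observe a stale value; the
0xbaba pattern then looks more like a freed block zapped by a debug
build than a missing fence, consistent with the use-after-free scenario
described earlier in the thread.

  #include <atomic>
  #include <cassert>
  #include <cstdio>
  #include <thread>

  static FILE*             stream = nullptr;   // plain field, like _stream
  static std::atomic<bool> published{false};   // stands in for the later sem/RMW pairing

  int main() {
    std::thread writer([] {
      stream = stdout;                                   // like LogFileOutput::initialize()
      published.store(true, std::memory_order_release);  // everything above is ordered before this
    });

    std::thread reader([] {
      while (!published.load(std::memory_order_acquire)) {
        // spin until the release is observed
      }
      // Acquire pairs with the release: the plain store to 'stream' must
      // be visible here, so the reader cannot see an uninitialized value.
      assert(stream != nullptr);
      fprintf(stream, "saw the initialized stream\n");
    });

    writer.join();
    reader.join();
    return 0;
  }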
>> >
>> > Last but not least, a question. To further understand this problem
>> > and verify the patch, I think the key is to find a reproducer. Any
>> > thoughts on how to reproduce this kind of problem? So far, I have
>> > only come across this crash once on my side. I used a docker image of
>> > Oracle Linux 7.9 (glibc 2.17) with cpuset 0~2. Besides the core dump
>> > and the crash report, I can't reproduce it anymore. I think I must be
>> > overlooking some important factors here. I'd really appreciate some
>> > pointers.
>> >
>> > I notice that Oracle is using 3rd-gen Zen machines to run tests. Is
>> > that in any way different from conventional x86_64? In AMD's manual
>> > [1], section 7.4.2, Table-3, here is the description of locked
>> > instructions:
>> >
>> > k — All loads and stores associated with the I/O and locked
>> > instructions complete to memory (no buffered stores) before a load or
>> > store from a subsequent instruction is issued.
>> >
>> > It seems to me that this doesn't imply a "bidirectional full memory
>> > barrier" or store-load barrier; it just has acquire semantics. I read
>> > the story behind this change and Aleksey's article [2]. I'm just not
>> > sure the following code is equivalent to "mfence" on EPYC machines,
>> > because AMD's manual is worded differently from Intel's.
>> >
>> > inline void OrderAccess::fence() {
>> >   // always use locked addl since mfence is sometimes expensive
>> > #ifdef AMD64
>> >   __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
>> > #else
>> >   __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
>> > #endif
>> >   compiler_barrier();
>> > }
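
One way to probe this empirically on an EPYC box is the classic
store-buffering litmus test; a crude sketch (new code, not from the JDK)
follows. Without a StoreLoad barrier between the store and the load,
x86 permits both threads to read 0. If the lock-prefixed addl really
acts as a full fence, the (0,0) outcome should never show up, while
replacing fence_under_test() with a no-op should make it appear fairly
quickly:

  #include <atomic>
  #include <cstdio>
  #include <thread>

  static std::atomic<int> X{0}, Y{0};

  // The same instruction OrderAccess::fence() uses on AMD64.
  static inline void fence_under_test() {
    __asm__ volatile("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
  }

  int main() {
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {        // crude: one thread pair per iteration
      X.store(0, std::memory_order_relaxed);
      Y.store(0, std::memory_order_relaxed);
      std::atomic<bool> go{false};
      int r1 = -1, r2 = -1;
      std::thread t1([&] {
        while (!go.load(std::memory_order_relaxed)) {}   // wait for the start signal
        X.store(1, std::memory_order_relaxed);           // plain mov on x86
        fence_under_test();
        r1 = Y.load(std::memory_order_relaxed);
      });
      std::thread t2([&] {
        while (!go.load(std::memory_order_relaxed)) {}
        Y.store(1, std::memory_order_relaxed);
        fence_under_test();
        r2 = X.load(std::memory_order_relaxed);
      });
      go.store(true, std::memory_order_relaxed);
      t1.join(); t2.join();
      if (r1 == 0 && r2 == 0) both_zero++;               // store-load reordering observed
    }
    printf("(0,0) observed %d times\n", both_zero);
    return 0;
  }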
>> >
>> >
>> > [1] https://www.amd.com/system/files/TechDocs/24593.pdf
>> > [2] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_footnote_1
>> >
>> > thanks,
>> > --lx
>> >
>> >
>> >
>>