<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div dir="ltr"></div><div dir="ltr">One more thing: I am having a hard time believing your trace overhead is 4 ns. A memory access that hits the L2 cache takes about 10 ns, so your metrics don’t seem possible except perhaps for trivial traces/microbenchmarks. </div><div dir="ltr"><br></div><div dir="ltr">Add in the capability of other threads to read the traces - which means cache lines have to be flushed - and it doesn’t seem possible. </div><div dir="ltr"><br><blockquote type="cite">On Feb 20, 2023, at 7:57 PM, Robert Engels <rengels@ix.netcom.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><meta http-equiv="content-type" content="text/html; charset=utf-8"><div dir="ltr"></div><div dir="ltr">Also, check out <a href="https://rakyll.org/profiler-labels/">https://rakyll.org/profiler-labels/</a></div><div dir="ltr"><br><blockquote type="cite">On Feb 20, 2023, at 7:49 PM, Robert Engels <rengels@ix.netcom.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><meta http-equiv="content-type" content="text/html; charset=utf-8"><div dir="ltr"></div><div dir="ltr">Sorry, github.com/robaho/goanalyzer</div><div dir="ltr"><br><blockquote type="cite">On Feb 20, 2023, at 7:44 PM, Robert Engels <rengels@ix.netcom.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><meta http-equiv="content-type" content="text/html; charset=utf-8"><div dir="ltr"></div><div dir="ltr">I would argue that the trace as shown isn’t very useful. For instance, what makes thread group 2-1 any different from 2-2? (This is the bad side effect of async code and arbitrary fork/join thread pools in Java.) </div><div dir="ltr"><br></div><div dir="ltr">I would take a look at Go - which has had to deal with this problem for quite a while. You can look at github.com/robaho/go-analyzer for some additional information. </div><div dir="ltr"><br></div><div dir="ltr">You want to trace events - specifically exceptional events - and be able to trace them with enough information to diagnose performance issues. You can’t manually review 1M traces in a UX, so you need to present the data in a way that surfaces potential issues.</div><div dir="ltr"><br></div><div dir="ltr">The easiest way to do this is via histograms: discard data from the trace buffer once you realize it is not exceptional. </div><div dir="ltr"><br><blockquote type="cite">On Feb 20, 2023, at 7:09 PM, Carl M <java@rkive.org> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr">
<meta charset="UTF-8">
<div>
For context, my tracer produces a trace that can be loaded into the Chrome DevTools trace viewer. It looks like this: <a href="https://github.com/perfmark/perfmark/blob/v0.26.x/doc/screenshot.png">https://github.com/perfmark/perfmark/blob/v0.26.x/doc/screenshot.png</a>
</div>
<div class="default-style">
</div>
<div class="default-style">
</div>
<div class="default-style">
I'm thinking now that even if the performance issues I encountered are fixed, the UX for viewing such a trace would need to be fixed too.
</div>
<div class="default-style">
</div>
<blockquote type="cite">
<div>
On 02/20/2023 4:46 PM PST robert engels <<a href="mailto:rengels@ix.netcom.com">rengels@ix.netcom.com</a>> wrote:
</div>
<div>
</div>
<div>
</div>
<div>
I don’t think a pool of buffers matters.
</div>
<div>
</div>
<div>
It depends on the number of events - or at least it should. If 3 threads write 1M events each, or 1M threads write 3 events each, it is the same data size, and by using a ThreadLocal you avoid the concurrency overhead.
</div>
<div>
</div>
<div>
If the trace is a histogram-like structure, then it could matter, because a histogram can be shared by many threads (with CAS updates) - in that case it should be a global structure, not a ThreadLocal. If you want to reduce contention you can use N histograms and index into them by the virtual thread's hash code mod N (sketched below).
</div>
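<div>
</div>
<div>
A minimal sketch of that sharding idea - the class name, shard count, and power-of-two bucket scheme below are only illustrative, not from any existing library:
</div>
<div>
</div>
<pre>
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sharded latency histogram: SHARDS counter arrays, each updated
// with atomic (CAS/fetch-add) increments, with the shard picked from the
// current thread's hash code so concurrent writers rarely collide.
final class ShardedHistogram {
    private static final int SHARDS = 16;   // ideally on the order of the core count
    private static final int BUCKETS = 64;  // one bucket per power-of-two duration range

    private final AtomicLongArray[] shards = new AtomicLongArray[SHARDS];

    ShardedHistogram() {
        for (int i = 0; i < SHARDS; i++) {
            shards[i] = new AtomicLongArray(BUCKETS);
        }
    }

    // Record one duration; contention is limited to threads that hash to the same shard.
    void record(long nanos) {
        int shard = (Thread.currentThread().hashCode() & 0x7fffffff) % SHARDS;
        int bucket = Math.min(BUCKETS - 1, 63 - Long.numberOfLeadingZeros(Math.max(1L, nanos)));
        shards[shard].getAndIncrement(bucket);
    }

    // A reader merges all shards into one view - approximate, but good enough to
    // decide whether an event was exceptional.
    long[] snapshot() {
        long[] merged = new long[BUCKETS];
        for (AtomicLongArray shard : shards) {
            for (int b = 0; b < BUCKETS; b++) {
                merged[b] += shard.get(b);
            }
        }
        return merged;
    }
}
</pre>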
<div>
</div>
<blockquote type="cite">
<div>
On Feb 20, 2023, at 6:33 PM, Ron Pressler <<a href="mailto:ron.pressler@oracle.com">ron.pressler@oracle.com</a>> wrote:
</div>
<div>
</div>
<div>
Hi.
</div>
<div>
</div>
<div>
The method to disable thread locals has been a source of confusion, and we’re likely to remove it. It was never intended as a mode that libraries must support, but as a way to enforce constraints in some very special situations. In any case, it has been consistently misunderstood as an ordinary mode that needs to be supported, and so it is likely going away.
</div>
<div>
</div>
<div>
However, given that if virtual threads are present at all you can assume there’s a very large number of them (as that’s why they’re used) — tens of thousands *at least* — you should ask yourself whether an individual buffer for each thread is really what you want. A small pool of buffers, similar in number to the number of cores — ~1000x smaller than the number of threads — might be a better way to go. You can start with a ConcurrentLinkedQueue to store the buffers, and have threads take and return buffers to that queue. If contention is a noticeable problem, you can do something more sophisticated with an array that is randomly accessed in some way and entries are CASed in and out.
</div>
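<div>
</div>
<div>
For example, a rough sketch of such a pool - the buffer type and method names here are only placeholders for whatever the tracer actually records:
</div>
<div>
</div>
<pre>
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical buffer pool: roughly one buffer per core rather than one per
// (virtual) thread. Threads borrow a buffer, record into it, and return it.
final class TraceBufferPool {
    private final ConcurrentLinkedQueue<TraceBuffer> free = new ConcurrentLinkedQueue<>();

    TraceBufferPool(int size) {
        for (int i = 0; i < size; i++) {
            free.add(new TraceBuffer());
        }
    }

    // Borrow a buffer for the duration of a traced operation, then hand it back.
    void trace(Runnable traced) {
        TraceBuffer buf = free.poll();
        if (buf == null) {
            buf = new TraceBuffer();  // pool exhausted: allocate (or drop the event)
        }
        try {
            buf.beginEvent();
            traced.run();
            buf.endEvent();
        } finally {
            free.offer(buf);          // return the buffer for the next thread
        }
    }

    static final class TraceBuffer {
        void beginEvent() { /* record start timestamp, event id, ... */ }
        void endEvent()   { /* record matching end timestamp */ }
    }
}

// Usage: new TraceBufferPool(Runtime.getRuntime().availableProcessors()).trace(work);
</pre>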
<div>
</div>
<div>
— Ron
</div>
<div>
</div>
<div>
</div>
<blockquote type="cite">
<div>
On 20 Feb 2023, at 23:15, Carl M <<a href="mailto:java@rkive.org">java@rkive.org</a>> wrote:
</div>
<div>
</div>
<div>
While testing out Virtual Threads with project Loom, I encountered some challenges that I was hoping this mailing list could provide guidance on.
</div>
<div>
</div>
<div>
I have a tracing library that uses ThreadLocals for recording events and timing info. The concurrency is structured so that each thread is the sole writer to its own trace buffer, but separate threads can come in and read that data asynchronously. I am using ThreadLocals to avoid contention between multiple tracing threads. Secondarily, I depend on threads exiting for automatic cleanup of the per-thread trace data.
</div>
<div>
</div>
<div>
Virtual threads present a challenge that is hard to overcome, because I can't find a way to tell whether ThreadLocals are supported. One of the value propositions of my library is that it has a consistent and low overhead. Specifically, calling ThreadLocal.set() throws an UnsupportedOperationException in the event that thread locals are not allowed. With virtual threads, the likelihood of this happening is much higher, since users are now able to create threads cheaply. I have explored several workarounds, but not being able to tell whether ThreadLocals are supported is the one problem I can't seem to cleanly overcome. Some ideas that did not pan out:
</div>
<div>
</div>
<div>
* Use a ConcurrentHashMap to implement my own "ThreadLocal"-like solution. Two problems come up: 1. it's easy to accidentally keep the thread alive (the map holds a strong reference to it), and 2. when ThreadLocals are supported, my library doesn't get the speedup from them.
</div>
<div>
</div>
<div>
* Use an AtomicReferenceArray and hash into a fixed number of buckets. This avoids using the Thread as a key, and pays a minor cost of synchronizing on the bucket when recording trace data. In effect it's a poor man's ThreadLocal. However, if I get unlucky there will be contention on a bucket, and the array doesn't naturally reshard itself the way CHM does. (A rough sketch follows after this list.)
</div>
<div>
</div>
<div>
* Do nothing. This causes callers to allocate a lot of memory, since ThreadLocal.initialValue() gets called over and over, leading to unpredictable tracer overhead. There is a small but noticeable cost to creating the initial value (like registering with the reader), so this ends up not being practical.
</div>
<div>
</div>
<div>
* A hybrid: use a ThreadLocal when supported, and fall back to the CHM or ARA mentioned above. This is the solution I came up with (also sketched after this list): my ThreadLocal calls get() but has no initialValue() override. If the value is null, I attempt to set it. If that throws, I write the value to the CHM/ARA and check there first for future get() calls. The problem with this is that the exception from set() causes an unacceptable amount of overhead for something that should have been very cheap. Checking whether the thread is virtual isn't sufficient to know whether TLs are supported, so I can't check the class name of the thread a priori. And, since multiple types of threads are calling into my library, I can't require callers to use TLs.
</div>
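<div>
</div>
<div>
For the AtomicReferenceArray idea, a rough sketch (simplified, with illustrative names, not my actual code) looks like:
</div>
<div>
</div>
<pre>
import java.util.concurrent.atomic.AtomicReferenceArray;

// Fixed number of buckets chosen by thread hash; each bucket holds a lazily
// created trace buffer that may be shared by several threads, so writers
// synchronize on it - a "poor man's ThreadLocal".
final class BucketedTraceStore {
    private static final int BUCKETS = 256;  // fixed; an unlucky hash distribution still contends

    private final AtomicReferenceArray<TraceBuffer> buckets =
            new AtomicReferenceArray<>(BUCKETS);

    void record(long startNanos, long endNanos) {
        int i = (Thread.currentThread().hashCode() & 0x7fffffff) % BUCKETS;
        TraceBuffer buf = buckets.get(i);
        if (buf == null) {
            TraceBuffer fresh = new TraceBuffer();
            // CAS so only one thread installs the bucket's buffer.
            if (buckets.compareAndSet(i, null, fresh)) {
                buf = fresh;
            } else {
                buf = buckets.get(i);  // another thread won the race
            }
        }
        // Unlike a true ThreadLocal, the buffer may be shared, so writes synchronize.
        synchronized (buf) {
            buf.add(startNanos, endNanos);
        }
    }

    static final class TraceBuffer {
        void add(long start, long end) { /* append the event */ }
    }
}
</pre>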
<div>
</div>
<div>
</div>
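<div>
And the hybrid looks roughly like this (again simplified, with illustrative names); the painful part is that the only way to discover that set() is unsupported is the thrown exception itself:
</div>
<div>
</div>
<pre>
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Prefer a plain ThreadLocal; for threads where ThreadLocal.set() throws
// UnsupportedOperationException, fall back to a map keyed by Thread.
final class HybridTraceBufferHolder {
    // No initialValue() override, so get() returns null when unset instead of allocating.
    private final ThreadLocal<TraceBuffer> local = new ThreadLocal<>();
    private final Map<Thread, TraceBuffer> fallback = new ConcurrentHashMap<>();

    TraceBuffer current() {
        TraceBuffer buf = local.get();
        if (buf != null) {
            return buf;            // fast path: thread-locals supported and already set
        }
        Thread t = Thread.currentThread();
        buf = fallback.get(t);     // a thread that failed set() lands here on every call
        if (buf != null) {
            return buf;
        }
        buf = new TraceBuffer();
        try {
            local.set(buf);
        } catch (UnsupportedOperationException e) {
            // Thread-locals not permitted on this thread: remember the buffer in the map.
            // Note this strongly references the Thread until it is explicitly removed.
            fallback.put(t, buf);
        }
        return buf;
    }

    static final class TraceBuffer { /* per-thread trace events */ }
}
</pre>
<div>
</div>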
<div>
I'm kind of at a loss as to how to efficiently fall back to a slower implementation when TLs aren't supported, since I can't tell whether they are or not (i.e. I can't tell if the electric fence is on without touching it). Again, I'd prefer to keep the fast ThreadLocals when they are supported.
</div>
<div>
</div>
<div>
</div>
<div>
I'm looking for ideas (or just to register feedback) with this email, and have otherwise been very happy with the progress on Project Loom.
</div>
<div>
</div>
<div>
Carl
</div>
</blockquote>
</blockquote>
</blockquote>
</div></blockquote></div></blockquote></div></blockquote></div></blockquote></body></html>