Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess

Fri Oct 20 19:53:04 UTC 2017

This would could also make sense for asynchronous logging with Log4j 2. We currently disabled capturing location information for asynchronous logging as capturing the information before passing the information to the thread doing the logging is fairly expensive. Being able to capture the stack trace and then use StackWalker to walk it in another thread would be very useful.

Ralph

> On Oct 20, 2017, at 10:35 AM, Rafael Winterhalter <rafael.wth at gmail.com> wrote:
> 
> Hei Mandy,
> thank you for your quick reply.
> 
> The benchmark is already only capturing the stack frame object. I ran the
> benchmark with the transformation as a comparison and it is indeed slower
> due to extra work. The JIT code suggests the native method plus monitor to
> be the major cost here but it is of course necessary to have it there to
> fetch the stack information.
> 
> Without this transformation in the user  thread, capturing stack frame
> objects of 100 frames is however still about 2.5 times as expensive as
> constructing a throwable as a snapshot.
> 
> Just to not be misunderstood: the total cost is reduced by using the stack
> walker, even compared to using java lang secrets under Java 8 but the
> inability to process the stack outside of a worker thread makes it
> inapplicable for my purposes due to latency requirements.
> 
> Thank you for considering this use case and all your great work!
> 
> Cheers, Rafael
> 
> 
> Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung at oracle.com>:
> 
> Hi Rafael,
> 
> Thanks for the feedback.  We did some investigation in understanding the
> overhead of Throwable if it used StackWalker API [1].  It did come to mind
> whether the StackWalker API should provide a way to walk the backtrace
> which we should do the investigation with JDK-8141239.
> 
> The benchmark compares capturing 35 StackTraceElements for apple-to-apple
> comparison which is fair.  I am curious on the perf difference when you
> capture only StackFrame objects?  This would save the overhead to construct
> StackTraceElement objects (and its associated string objects).
> 
> Mandy
> [1] https://bugs.openjdk.java.net/browse/JDK-8141239
> 
> 
> On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
> 
>> Hello,
>> 
>> a typical patern when reading the stack of the current thread in tooling
>> like performance monitoring used to imply the creation of an instance of
>> Throwable and to process this instance's attached stack in another thread.
>> The performance cost is shared about 10/90 for creating a new throwable
>> compared to reading its frames, so this is really a worthy optimization.
>> 
>> It is also common to use the JavaLangAccess API which offers selective
>> access of single frames.
>> 
>> This API does no longer exist as it was superseeded by the Stack Walker API
>> which is of course much safer and even a more performant alternative when
>> looking at the total performance. However, using a stack walker, it is no
>> longer possible to move the stack processing out of the user thread but it
>> must be done at the moment the snapshot of the stack is taken. It turns out
>> that this increases latency dramatically when processing stacks compared to
>> the asyncronous alternative.
>> 
>> In a quick benchmark, it seems like walking 35 frames of a 100 frames stack
>> allows me 70k operations per second whereas creating a new throwable yields
>> about 200k operations per second. Also, within a less isolated test, I can
>> infer this additional overhead from the actual latency numbers of a web
>> service when using the stack walker API to extract the top 35 frames
>> compared to the "old" solution using JavaLangAccess.
>> 
>> For this reason, it seems to be the best solution to avoid the stack walker
>> when aiming for latency at the moment if the stack is not required
>> immediately and if access resources are available in other threads.
>> 
>> I would therefore like to propose to extend the stack walker API to allow
>> walking the stack of an existing throwable to allow for similar performance
>> as with JavaLangAccess. I understand that the VM must do more work
>> altogether. When receving the full stack from a throwable, this takes about
>> three times as long. In practice, for a product I am involved in, this
>> casues a noticable overhead when running a Java 9 VM compared to Java 8.
>> 
>> Alternatively, it would of course even be better if one could take a
>> snapshot of only the top x frames to walk on this object rather then a
>> throwable.
>> 
>> I have added my benchmarks (snapshot for the current user thread operation,
>> complete for the entire processing) into this Gist:
>> https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
>> 
>> Thank you for any feedback on my finding!
>> 
>> Best regards, Rafael
>> 
>