Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess

Fri Oct 20 16:11:20 UTC 2017

Hi Rafael,

Thanks for the feedback.  We did some investigation into the overhead 
of Throwable if it were to use the StackWalker API [1].  The question of 
whether the StackWalker API should provide a way to walk a Throwable's 
backtrace did come to mind, and we should investigate it as part of 
JDK-8141239.

The benchmark captures 35 StackTraceElements in both cases, which is a 
fair apples-to-apples comparison.  I am curious what the perf 
difference is when you capture only StackFrame objects.  That would save 
the overhead of constructing StackTraceElement objects (and their 
associated string objects).
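
To make the question concrete, here is a minimal sketch of the two 
variants I have in mind.  It is not taken from the benchmark in the 
Gist; the class name FrameCapture, the constant LIMIT and the helper 
atDepth are hypothetical and only serve the illustration.

```java
import java.lang.StackWalker.StackFrame;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of the benchmark in the Gist.
class FrameCapture {
    static final StackWalker WALKER = StackWalker.getInstance();
    static final int LIMIT = 35; // number of top frames to keep

    // Variant A: keep the StackFrame objects as handed out by the walker.
    static List<StackFrame> captureFrames() {
        return WALKER.walk(s -> s.limit(LIMIT).collect(Collectors.toList()));
    }

    // Variant B: additionally convert every frame to a StackTraceElement,
    // which allocates the element and its associated strings.
    static List<StackTraceElement> captureElements() {
        return WALKER.walk(s -> s.limit(LIMIT)
                .map(StackFrame::toStackTraceElement)
                .collect(Collectors.toList()));
    }

    // Hypothetical helper: runs the capture below `depth` extra frames
    // so that the stack is deeper than LIMIT.
    static <T> T atDepth(int depth, Supplier<T> capture) {
        if (depth <= 0) {
            return capture.get();
        }
        T result = atDepth(depth - 1, capture);
        return result;
    }
}
```

Calling FrameCapture.atDepth(100, FrameCapture::captureFrames) and the 
same with captureElements from a benchmark harness would show how much 
of the cost comes from the StackTraceElement conversion alone.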

Mandy
[1] https://bugs.openjdk.java.net/browse/JDK-8141239

On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
> Hello,
>
> a typical pattern when reading the stack of the current thread in tooling
> like performance monitoring used to involve creating an instance of
> Throwable and processing this instance's attached stack in another thread.
> The performance cost is split about 10/90 between creating a new throwable
> and reading its frames, so this is really a worthwhile optimization.
>
> It is also common to use the JavaLangAccess API which offers selective
> access of single frames.
>
> This API no longer exists, as it was superseded by the Stack Walker API,
> which is of course much safer and even a more performant alternative when
> looking at the total performance. However, with a stack walker it is no
> longer possible to move the stack processing out of the user thread; it
> must be done at the moment the snapshot of the stack is taken. It turns out
> that this increases latency dramatically when processing stacks compared to
> the asynchronous alternative.
>
> In a quick benchmark, walking 35 frames of a 100-frame stack
> allows me 70k operations per second, whereas creating a new throwable yields
> about 200k operations per second. Also, within a less isolated test, I can
> infer this additional overhead from the actual latency numbers of a web
> service when using the stack walker API to extract the top 35 frames
> compared to the "old" solution using JavaLangAccess.
>
> For this reason, the best solution at the moment when aiming for low
> latency seems to be to avoid the stack walker, provided the stack is not
> required immediately and excess resources are available in other threads.
>
> I would therefore like to propose extending the stack walker API to allow
> walking the stack of an existing throwable, which would allow for similar
> performance as with JavaLangAccess. I understand that the VM must do more
> work altogether. When receiving the full stack from a throwable, this takes
> about three times as long. In practice, for a product I am involved in, this
> causes a noticeable overhead when running a Java 9 VM compared to Java 8.
>
> Alternatively, it would of course be even better if one could take a
> snapshot of only the top x frames and walk on this object rather than on a
> throwable.
>
> I have added my benchmarks (snapshot for the current user thread operation,
> complete for the entire processing) into this Gist:
> https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
>
> Thank you for any feedback on my finding!
>
> Best regards, Rafael
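
For reference, the deferred pattern described in the quoted message can 
be sketched roughly as follows.  This is only an illustration under 
stated assumptions: the class DeferredStackCapture, the method snapshot 
and the single worker thread are hypothetical, and the sketch uses the 
public Throwable.getStackTrace(), which materializes all frames, rather 
than the internal JavaLangAccess API with its selective per-frame access.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of "snapshot now, process later".
class DeferredStackCapture {
    // Single daemon worker that takes the decoding off the user thread.
    static final ExecutorService WORKER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "stack-processor");
        t.setDaemon(true);
        return t;
    });

    static Future<StackTraceElement[]> snapshot(int maxFrames) {
        // User thread: creating the Throwable records the backtrace.
        Throwable snapshot = new Throwable();
        // Worker thread: decoding the backtrace into StackTraceElements
        // happens here, away from the latency-sensitive caller.
        return WORKER.submit(() -> {
            StackTraceElement[] all = snapshot.getStackTrace();
            return Arrays.copyOf(all, Math.min(maxFrames, all.length));
        });
    }
}
```

A caller would invoke DeferredStackCapture.snapshot(35) on the user 
thread and consume the returned Future elsewhere.  With StackWalker 
there is no equivalent split, since walk() runs its function on the 
calling thread while the stack is being traversed.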