Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess

Rafael Winterhalter rafael.wth at gmail.com
Fri Oct 20 17:35:06 UTC 2017


Hi Mandy,
thank you for your quick reply.

The benchmark is already capturing only the StackFrame objects. I ran the
benchmark with the StackTraceElement transformation as a comparison and it
is indeed slower due to the extra work. The JIT-compiled code suggests that
the native method plus the monitor is the major cost here, but that call is
of course necessary to fetch the stack information.
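
Roughly, the two variants compared in the benchmark look like the following
(a simplified sketch, not the exact code from the Gist; the frame limit of
35 mirrors the benchmark):

    import java.util.List;
    import java.util.stream.Collectors;

    public class FrameCaptureVariants {

        private static final StackWalker WALKER = StackWalker.getInstance();

        // What the benchmark measures: only the raw StackFrame objects are
        // collected; no StackTraceElement is created.
        static List<StackWalker.StackFrame> captureFrames() {
            return WALKER.walk(frames ->
                    frames.limit(35).collect(Collectors.toList()));
        }

        // The comparison variant with the transformation: each frame is
        // additionally converted to a StackTraceElement, which also creates
        // the associated string objects.
        static List<StackTraceElement> captureElements() {
            return WALKER.walk(frames ->
                    frames.limit(35)
                          .map(StackWalker.StackFrame::toStackTraceElement)
                          .collect(Collectors.toList()));
        }
    }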

Even without this transformation in the user thread, capturing the
StackFrame objects of a 100-frame stack is still about 2.5 times as
expensive as constructing a throwable as a snapshot.

Just so I am not misunderstood: the total cost is reduced by using the
stack walker, even compared to using JavaLangAccess via shared secrets
under Java 8, but the inability to move the stack processing out of the
user thread into a worker thread makes it inapplicable for my purposes due
to latency requirements.
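
For clarity, this is the kind of two-phase pattern I mean, which a
throwable allows but a stack walker does not (a hypothetical sketch; the
executor setup is only for illustration):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class DeferredStackProcessing {

        private static final ExecutorService WORKER =
                Executors.newSingleThreadExecutor();

        static void onMonitoredCall() {
            // Cheap in the user thread: only the native backtrace is recorded.
            Throwable snapshot = new Throwable();
            // The expensive part runs later in a worker thread: materializing
            // and inspecting the StackTraceElement objects.
            WORKER.submit(() -> {
                for (StackTraceElement element : snapshot.getStackTrace()) {
                    process(element);
                }
            });
        }

        static void process(StackTraceElement element) {
            // placeholder for the actual monitoring logic
        }
    }

With the StackWalker API, the StackFrame objects must be materialized
inside walk(), i.e. in the user thread, so there is no cheap snapshot that
can be handed off to a worker.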

Thank you for considering this use case and all your great work!

Cheers, Rafael


On 20.10.2017 18:11, "mandy chung" <mandy.chung at oracle.com> wrote:

Hi Rafael,

Thanks for the feedback.  We did some investigation into the overhead of
Throwable if it used the StackWalker API [1].  It did come to mind whether
the StackWalker API should provide a way to walk an existing backtrace; we
should investigate that as part of JDK-8141239.

The benchmark captures 35 StackTraceElements for an apples-to-apples
comparison, which is fair.  I am curious about the performance difference
when you capture only StackFrame objects.  This would save the overhead of
constructing the StackTraceElement objects (and their associated string
objects).
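
Something along these lines, for instance (just a sketch; the limit of 35
matches your benchmark):

    import java.util.List;
    import java.util.stream.Collectors;

    public class StackFrameOnlyCapture {

        // Collects only the raw StackFrame objects; StackTraceElements (and
        // their associated strings) are created only if toStackTraceElement()
        // is later called on a frame.
        static List<StackWalker.StackFrame> capture() {
            return StackWalker.getInstance().walk(frames ->
                    frames.limit(35).collect(Collectors.toList()));
        }
    }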

Mandy
[1] https://bugs.openjdk.java.net/browse/JDK-8141239


On 10/20/17 6:32 AM, Rafael Winterhalter wrote:

> Hello,
>
> a typical pattern for reading the stack of the current thread in tooling
> such as performance monitoring used to be to create an instance of
> Throwable and to process that instance's attached stack in another thread.
> The performance cost splits roughly 10/90 between creating a new throwable
> and reading its frames, so deferring the frame processing is really a
> worthy optimization.
>
> It is also common to use the JavaLangAccess API, which offers selective
> access to single frames.
>
> This API no longer exists, as it was superseded by the StackWalker API,
> which is of course much safer and even more performant when looking at the
> total cost. However, using a stack walker, it is no longer possible to
> move the stack processing out of the user thread; it must be done at the
> moment the snapshot of the stack is taken. It turns out that this
> increases latency dramatically when processing stacks compared to the
> asynchronous alternative.
>
> In a quick benchmark, walking 35 frames of a 100-frame stack gives me
> about 70k operations per second, whereas creating a new throwable yields
> about 200k operations per second. Also, in a less isolated test, I can
> see this additional overhead in the actual latency numbers of a web
> service when using the StackWalker API to extract the top 35 frames
> compared to the "old" solution using JavaLangAccess.
>
> For this reason, it currently seems best to avoid the stack walker when
> aiming for low latency, provided the stack is not required immediately and
> processing resources are available in other threads.
>
> I would therefore like to propose extending the StackWalker API to allow
> walking the stack of an existing throwable, to allow for performance
> similar to JavaLangAccess. I understand that the VM must do more work
> altogether; receiving the full stack from a throwable takes about three
> times as long. In practice, for a product I am involved in, this causes a
> noticeable overhead when running a Java 9 VM compared to Java 8.
>
> Alternatively, it would of course be even better if one could take a
> snapshot of only the top x frames and walk that object rather than a
> throwable.
>
> I have added my benchmarks (snapshot for the operation in the current
> user thread, complete for the entire processing) to this Gist:
> https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
>
> Thank you for any feedback on my finding!
>
> Best regards, Rafael
>

