Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess

Fri Oct 20 13:32:33 UTC 2017

Hello,

a typical pattern for reading the stack of the current thread in tooling
such as performance monitoring used to be to create an instance of
Throwable and to process this instance's attached stack in another thread.
The performance cost splits roughly 10/90 between creating a new throwable
and reading its frames, so this is a really worthwhile optimization.
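
As a minimal sketch of this pattern (the class, method and thread names are
made up for illustration and are not taken from my benchmarks), the user
thread only creates the throwable and a worker thread reads its frames:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncStackCapture {

    // Hypothetical background worker; a daemon thread so it never blocks JVM exit.
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "stack-processor");
        thread.setDaemon(true);
        return thread;
    });

    static CompletableFuture<String> captureTopMethod() {
        // User thread: only records the stack snapshot inside the throwable.
        Throwable snapshot = new Throwable();
        // Worker thread: materializes the StackTraceElement array and reads it.
        return CompletableFuture.supplyAsync(
                () -> snapshot.getStackTrace()[0].getMethodName(), WORKER);
    }

    public static void main(String[] args) {
        System.out.println(captureTopMethod().join());
    }
}
```

The top frame reported is the method that created the throwable, even though
the frames are read on a different thread.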

It is also common to use the JavaLangAccess API, which offers selective
access to single frames.

This API no longer exists, as it was superseded by the Stack Walker API,
which is of course much safer and even the more performant alternative when
looking at total performance. However, with a stack walker it is no
longer possible to move the stack processing out of the user thread; it
must be done at the moment the snapshot of the stack is taken. It turns out
that this increases latency dramatically when processing stacks compared to
the asynchronous alternative.
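
For comparison, a minimal sketch of the stack walker variant (again with
made-up class and method names, and not the benchmark code itself): the
frames can only be consumed inside the walk call, so all of the work happens
on the user thread.

```java
import java.util.List;
import java.util.stream.Collectors;

class EagerStackWalk {

    private static final StackWalker WALKER = StackWalker.getInstance();

    // Collects the top `limit` frames; the frame stream is only valid inside walk(...).
    static List<StackWalker.StackFrame> topFrames(int limit) {
        return WALKER.walk(frames -> frames.limit(limit).collect(Collectors.toList()));
    }

    // Builds an artificially deep stack before walking it.
    static List<StackWalker.StackFrame> recurse(int depth, int limit) {
        return depth == 0 ? topFrames(limit) : recurse(depth - 1, limit);
    }

    public static void main(String[] args) {
        // Roughly mirrors the scenario of walking 35 frames of a 100-frame stack.
        List<StackWalker.StackFrame> frames = recurse(100, 35);
        System.out.println(frames.size() + " frames, top: " + frames.get(0).getMethodName());
    }
}
```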

In a quick benchmark, walking 35 frames of a 100-frame stack yields 70k
operations per second, whereas creating a new throwable yields about 200k
operations per second. In a less isolated test, I can also infer this
additional overhead from the actual latency numbers of a web service when
using the Stack Walker API to extract the top 35 frames compared to the
"old" solution using JavaLangAccess.

For this reason, the best solution at the moment when aiming for low latency
seems to be to avoid the stack walker if the stack is not required
immediately and if resources are available in other threads.

I would therefore like to propose extending the Stack Walker API to allow
walking the stack of an existing throwable, which would allow for performance
similar to that of JavaLangAccess. I understand that the VM must do more work
altogether. When retrieving the full stack from a throwable, this takes about
three times as long. In practice, for a product I am involved in, this
causes a noticeable overhead when running on a Java 9 VM compared to Java 8.

Alternatively, it would of course be even better if one could take a
snapshot of only the top x frames and walk this object rather than a
throwable.

I have added my benchmarks ("snapshot" for the operation on the current user
thread, "complete" for the entire processing) to this Gist:
https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70

Thank you for any feedback on my finding!

Best regards, Rafael