Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess
Hello, a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization. It is also common to use the JavaLangAccess API which offers selective access of single frames. This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative. In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess. For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads. I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8. Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable. I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70 Thank you for any feedback on my finding! Best regards, Rafael
Hi Rafael, stream.iterator() is usually super slow*, did you try with toArray() or forEach() instead ? Rémi * you want to see a push based API (Stream) as a pull based API (Iterator) ----- Mail original -----
De: "Rafael Winterhalter" <rafael.wth@gmail.com> À: "core-libs-dev" <core-libs-dev@openjdk.java.net> Envoyé: Vendredi 20 Octobre 2017 15:32:33 Objet: Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my finding!
Best regards, Rafael
Hi Remi, thank you for your quick reply! I just changed the benchmark to use: stackFrameStream.skip(startFrame).limit(stackDepth).collect(Collectors.toList()); and it yields a small improvement. Best regards, Rafael 2017-10-20 15:54 GMT+02:00 Remi Forax <forax@univ-mlv.fr>:
Hi Rafael, stream.iterator() is usually super slow*, did you try with toArray() or forEach() instead ?
Rémi * you want to see a push based API (Stream) as a pull based API (Iterator)
----- Mail original -----
De: "Rafael Winterhalter" <rafael.wth@gmail.com> À: "core-libs-dev" <core-libs-dev@openjdk.java.net> Envoyé: Vendredi 20 Octobre 2017 15:32:33 Objet: Decreased latency performance with Stack Walker API compared to sun.misc.JavaLangAccess
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my finding!
Best regards, Rafael
Hi Rafael, Thanks for the feedback. We did some investigation in understanding the overhead of Throwable if it used StackWalker API [1]. It did come to mind whether the StackWalker API should provide a way to walk the backtrace which we should do the investigation with JDK-8141239. The benchmark compares capturing 35 StackTraceElements for apple-to-apple comparison which is fair. I am curious on the perf difference when you capture only StackFrame objects? This would save the overhead to construct StackTraceElement objects (and its associated string objects). Mandy [1] https://bugs.openjdk.java.net/browse/JDK-8141239 On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my finding!
Best regards, Rafael
Hei Mandy, thank you for your quick reply. The benchmark is already only capturing the stack frame object. I ran the benchmark with the transformation as a comparison and it is indeed slower due to extra work. The JIT code suggests the native method plus monitor to be the major cost here but it is of course necessary to have it there to fetch the stack information. Without this transformation in the user thread, capturing stack frame objects of 100 frames is however still about 2.5 times as expensive as constructing a throwable as a snapshot. Just to not be misunderstood: the total cost is reduced by using the stack walker, even compared to using java lang secrets under Java 8 but the inability to process the stack outside of a worker thread makes it inapplicable for my purposes due to latency requirements. Thank you for considering this use case and all your great work! Cheers, Rafael Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung@oracle.com>: Hi Rafael, Thanks for the feedback. We did some investigation in understanding the overhead of Throwable if it used StackWalker API [1]. It did come to mind whether the StackWalker API should provide a way to walk the backtrace which we should do the investigation with JDK-8141239. The benchmark compares capturing 35 StackTraceElements for apple-to-apple comparison which is fair. I am curious on the perf difference when you capture only StackFrame objects? This would save the overhead to construct StackTraceElement objects (and its associated string objects). Mandy [1] https://bugs.openjdk.java.net/browse/JDK-8141239 On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my finding!
Best regards, Rafael
This would could also make sense for asynchronous logging with Log4j 2. We currently disabled capturing location information for asynchronous logging as capturing the information before passing the information to the thread doing the logging is fairly expensive. Being able to capture the stack trace and then use StackWalker to walk it in another thread would be very useful. Ralph
On Oct 20, 2017, at 10:35 AM, Rafael Winterhalter <rafael.wth@gmail.com> wrote:
Hei Mandy, thank you for your quick reply.
The benchmark is already only capturing the stack frame object. I ran the benchmark with the transformation as a comparison and it is indeed slower due to extra work. The JIT code suggests the native method plus monitor to be the major cost here but it is of course necessary to have it there to fetch the stack information.
Without this transformation in the user thread, capturing stack frame objects of 100 frames is however still about 2.5 times as expensive as constructing a throwable as a snapshot.
Just to not be misunderstood: the total cost is reduced by using the stack walker, even compared to using java lang secrets under Java 8 but the inability to process the stack outside of a worker thread makes it inapplicable for my purposes due to latency requirements.
Thank you for considering this use case and all your great work!
Cheers, Rafael
Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung@oracle.com>:
Hi Rafael,
Thanks for the feedback. We did some investigation in understanding the overhead of Throwable if it used StackWalker API [1]. It did come to mind whether the StackWalker API should provide a way to walk the backtrace which we should do the investigation with JDK-8141239.
The benchmark compares capturing 35 StackTraceElements for apple-to-apple comparison which is fair. I am curious on the perf difference when you capture only StackFrame objects? This would save the overhead to construct StackTraceElement objects (and its associated string objects).
Mandy [1] https://bugs.openjdk.java.net/browse/JDK-8141239
On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70
Thank you for any feedback on my finding!
Best regards, Rafael
I created https://bugs.openjdk.java.net/browse/JDK-8189752 to track this. Mandy On 10/20/17 10:35 AM, Rafael Winterhalter wrote:
Hei Mandy, thank you for your quick reply.
The benchmark is already only capturing the stack frame object. I ran the benchmark with the transformation as a comparison and it is indeed slower due to extra work. The JIT code suggests the native method plus monitor to be the major cost here but it is of course necessary to have it there to fetch the stack information.
Without this transformation in the user thread, capturing stack frame objects of 100 frames is however still about 2.5 times as expensive as constructing a throwable as a snapshot.
Just to not be misunderstood: the total cost is reduced by using the stack walker, even compared to using java lang secrets under Java 8 but the inability to process the stack outside of a worker thread makes it inapplicable for my purposes due to latency requirements.
Thank you for considering this use case and all your great work!
Cheers, Rafael
Am 20.10.2017 18:11 schrieb "mandy chung" <mandy.chung@oracle.com <mailto:mandy.chung@oracle.com>>:
Hi Rafael,
Thanks for the feedback. We did some investigation in understanding the overhead of Throwable if it used StackWalker API [1]. It did come to mind whether the StackWalker API should provide a way to walk the backtrace which we should do the investigation with JDK-8141239.
The benchmark compares capturing 35 StackTraceElements for apple-to-apple comparison which is fair. I am curious on the perf difference when you capture only StackFrame objects? This would save the overhead to construct StackTraceElement objects (and its associated string objects).
Mandy [1] https://bugs.openjdk.java.net/browse/JDK-8141239 <https://bugs.openjdk.java.net/browse/JDK-8141239>
On 10/20/17 6:32 AM, Rafael Winterhalter wrote:
Hello,
a typical patern when reading the stack of the current thread in tooling like performance monitoring used to imply the creation of an instance of Throwable and to process this instance's attached stack in another thread. The performance cost is shared about 10/90 for creating a new throwable compared to reading its frames, so this is really a worthy optimization.
It is also common to use the JavaLangAccess API which offers selective access of single frames.
This API does no longer exist as it was superseeded by the Stack Walker API which is of course much safer and even a more performant alternative when looking at the total performance. However, using a stack walker, it is no longer possible to move the stack processing out of the user thread but it must be done at the moment the snapshot of the stack is taken. It turns out that this increases latency dramatically when processing stacks compared to the asyncronous alternative.
In a quick benchmark, it seems like walking 35 frames of a 100 frames stack allows me 70k operations per second whereas creating a new throwable yields about 200k operations per second. Also, within a less isolated test, I can infer this additional overhead from the actual latency numbers of a web service when using the stack walker API to extract the top 35 frames compared to the "old" solution using JavaLangAccess.
For this reason, it seems to be the best solution to avoid the stack walker when aiming for latency at the moment if the stack is not required immediately and if access resources are available in other threads.
I would therefore like to propose to extend the stack walker API to allow walking the stack of an existing throwable to allow for similar performance as with JavaLangAccess. I understand that the VM must do more work altogether. When receving the full stack from a throwable, this takes about three times as long. In practice, for a product I am involved in, this casues a noticable overhead when running a Java 9 VM compared to Java 8.
Alternatively, it would of course even be better if one could take a snapshot of only the top x frames to walk on this object rather then a throwable.
I have added my benchmarks (snapshot for the current user thread operation, complete for the entire processing) into this Gist: https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70 <https://gist.github.com/raphw/96e7c81d7c719cf7991b361bb7266c70>
Thank you for any feedback on my finding!
Best regards, Rafael
participants (4)
-
mandy chung
-
Rafael Winterhalter
-
Ralph Goers
-
Remi Forax