So, I'm pretty dubious, mostly because of the risks mentioned in the JEP. If you need a flag-check and two code paths for every String method, that's going to make the String class slower and more bloated, and make it very difficult for the JIT compiler to do its job inlining and intrinsifying calls to String methods.

At Google, we spent a fair bit of time last year climbing out of the performance hole that trimming substrings dropped us into - we had a fair bit of code that was built around substrings being approximately memory-neutral, and it cost us a lot of GC overhead and rewriting to make the change. The JDK itself still has exposed APIs that make tradeoffs based on cheap substrings (the URL(String) constructor does a lot of this, for example).

The proposed change here has the potential to do the opposite with most String operations - trading less GC overhead for more mutator cost. But String operations are a pretty big chunk of CPU time, on the whole. Does anyone really have a sense of how to make this kind of decision? The JEP seems mostly to be hoping that other organizations will do the testing for you.

(I agree that it is worth doing some experimentation in this area, but I wanted to say this early, because if I could reach back in time and tell you *not* to make the substring change, I would. We seriously considered simply backing it out locally, but it would have been a lot of effort for us to maintain that kind of patch, and we didn't want our performance tradeoffs to be that much different from the stock JDK's.)

Jeremy

On Thu, May 14, 2015 at 4:05 PM, <mark.reinhold@oracle.com> wrote:
New JEP Candidate: http://openjdk.java.net/jeps/254
- Mark
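Jeremy's "flag-check and two code paths" worry can be made concrete. Below is a hypothetical sketch of that dual-path pattern; the names `coder`, `LATIN1`, and `UTF16` follow the JEP's description, but this is not the actual JDK implementation, and the UTF-16 byte order here is an arbitrary assumption:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "flag-check and two code paths" pattern that
// every String operation would carry. Not the real JDK code; UTF-16 bytes
// are assumed little-endian for illustration.
final class CompactString {
    static final byte LATIN1 = 0, UTF16 = 1;

    private final byte[] value; // 1 byte/char for Latin-1, 2 bytes/char for UTF-16
    private final byte coder;   // the flag every method must check

    CompactString(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    int length() {
        return coder == LATIN1 ? value.length : value.length >> 1;
    }

    char charAt(int index) {
        if (coder == LATIN1) {                  // path 1: one byte per char
            return (char) (value[index] & 0xff);
        }
        int i = index << 1;                     // path 2: decode two bytes
        return (char) ((value[i] & 0xff) | ((value[i + 1] & 0xff) << 8));
    }

    public static void main(String[] args) {
        CompactString ascii = new CompactString(
                "abc".getBytes(StandardCharsets.ISO_8859_1), LATIN1);
        System.out.println(ascii.length());            // 3
        System.out.println(ascii.charAt(1));           // b

        CompactString accented = new CompactString(
                new byte[] { (byte) 0xE9, 0x00 }, UTF16); // U+00E9
        System.out.println((int) accented.charAt(0));     // 233
    }
}
```

The JIT either has to prove the coder constant at a call site or compile both paths, which is exactly the inlining and intrinsification pressure the thread discusses.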
(getting back to this)

Hi Jeremy,

On 05/16/2015 03:34 AM, Jeremy Manson wrote:
So, I'm pretty dubious, mostly because of the risks mentioned in the JEP. If you need a flag-check and two code paths for every String method, that's going to make the String class slower and more bloated, and make it very difficult for the JIT compiler to do its job inlining and intrinsifying calls to String methods.
Yes, we know that's a potential problem, e.g. outlined here: http://cr.openjdk.java.net/~shade/density/equals.txt The hope is that the string coder check would be amortized by the substantial performance improvement from the ubiquitous (and optimized) Latin-1 operations. Also, getting a few code-generation quirks kicked out may further offset the perceived performance costs of doing this (you can pull off such a trick every so often, but not all the time).
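To give a concrete feel for the equals case analyzed in that writeup, here is a simplified sketch (an assumption-laden illustration, not the JDK's actual intrinsic): because any Latin-1-representable string is stored as Latin-1, strings with different coders can never be equal, and the Latin-1 comparison touches half the bytes of a char-wise one:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified sketch of a coder-aware equals. The real JDK intrinsic differs;
// this only shows why the coder check can pay for itself.
final class EqualsSketch {
    static final byte LATIN1 = 0, UTF16 = 1;

    static boolean stringsEqual(byte[] v1, byte c1, byte[] v2, byte c2) {
        if (c1 != c2) {
            return false;       // different coders can never hold equal strings
        }
        return Arrays.equals(v1, v2); // byte-wise; Latin-1 touches half the memory
    }

    public static void main(String[] args) {
        byte[] a = "hello".getBytes(StandardCharsets.ISO_8859_1);
        byte[] b = "hello".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(stringsEqual(a, LATIN1, b, LATIN1)); // true
        System.out.println(stringsEqual(a, LATIN1, b, UTF16));  // false
    }
}
```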
The proposed change here has the potential to do the opposite with most String operations - trading less GC overhead for more mutator cost. But String operations are a pretty big chunk of CPU time, on the whole.
The thing is, many mutator ops on Strings are also improved, because the data becomes more easily cacheable and/or requires fewer steps to complete (think of vectorization that takes half as many instructions).
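The "half as many instructions" point is simple arithmetic. Assuming (for illustration only) 32-byte vector registers, a linear scan of n ASCII characters needs half as many vector iterations when each character occupies one byte instead of two:

```java
// Rough iteration-count illustration for a vectorized linear scan.
// The 32-byte vector width is an assumption (e.g. AVX2-class hardware).
final class VectorMath {
    static long iterations(int chars, int bytesPerChar, int vectorBytes) {
        long totalBytes = (long) chars * bytesPerChar;
        return (totalBytes + vectorBytes - 1) / vectorBytes; // ceiling division
    }

    public static void main(String[] args) {
        int n = 4096, vec = 32;
        System.out.println(iterations(n, 2, vec)); // char[] scan: 256 iterations
        System.out.println(iterations(n, 1, vec)); // byte[] scan: 128 iterations
    }
}
```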
Does anyone really have a sense of how to make this kind of decision? The JEP seems mostly to be hoping that other organizations will do the testing for you.
It is not true that the JEP hopes to have other organizations do the testing for it. The JEP tries to make clear that this is a performance-sensitive change, so early testing and feedback are greatly appreciated. So, if you have String-intensive workloads in your org, can you try running the prototype JDK against them? Our early runs on the workloads we care about show appealing improvements. That said, the decision to integrate has not been made yet, as we don't have the complete performance picture and/or a fully tested prototype. In other words, there are quite a few blank spots to fill in. Your data may be part of that picture when we decide whether to integrate in JDK 9.
(I agree that it is worth doing some experimentation in this area, but I wanted to say this early, because if I could reach back in time and tell you *not* to make the substring change, I would. We seriously considered simply backing it out locally, but it would have been a lot of effort for us to maintain that kind of patch, and we didn't want our performance tradeoffs to be that much different from the stock JDK's.)
This is your golden ticket: if you come back with concrete data in hand showing that the particular tradeoff the JEP makes is not sensible for your applications, it will be considered in the decision to integrate. But it should be real data and/or a benchmark simulating a real-world scenario, not just theoretical appeals -- we know how misguided those can get. Thanks, -Aleksey
TL;DR: In principle, we'd love to do more early testing of HotSpot / JDK features, but our benchmarks are, honestly, not all that great. We end up having to test against live services, which makes this sort of thing really hard.

More info than you need: there are two real problems here:

1) To do apples-to-apples comparisons, we have to make sure that *our* patches all work with whatever version of HotSpot we're testing.

2) Pulling down a new JDK 9 - even an official release - usually means that there are a lot of instabilities, half-finished work, and inefficiencies, so we can't really run tests very well against it. That's not a knock on HotSpot developers; the only way to know about some of these problems is to run the JDK in infrastructure like ours. (An example of something that hit us hard that no one else would notice: http://hg.openjdk.java.net/jdk9/hs/hotspot/rev/5ba37c4c0578 )

It took us months to forward-port all of our patches to JDK 8, and we've spent the last six months getting it to the point that we're comfortable enough to ship to our users (just in time for JDK 7 EOL!). That's required disabling tiered compilation, heavily tweaking code cache flushing (which is still causing us CPU regressions), rewriting various parts of the metaspace to behave more efficiently, and fixing various incompatibilities with our internal patches.

That's completely apart from the dozens of backwards incompatibilities introduced in JDK 8 that triggered a very, very, very large code cleanup effort, from the new hash iteration order to the Unicode update to the fact that calling flush on a closed BufferedOutputStream now throws an exception. (We actually ended up randomizing our hash iteration order, which helps us guard against broken code, is slightly more secure, and means that we never get bitten by that as part of an upgrade again.)

In short, upgrading is in no sense cheap for us, and trying out new features is hard.
We usually restrict ourselves to using new features that can be more-or-less cleanly patched to the version of the JDK we're using and hidden behind a flag. This is an important enough change that we might be able to make some effort, but we'll have to see how it goes. Jeremy On Mon, Jun 1, 2015 at 1:31 AM, Aleksey Shipilev < aleksey.shipilev@oracle.com> wrote:
For what it's worth, we would welcome this change. We took a large memory hit and a small performance hit when we upgraded from 1.6 to 1.7 in some of our memory-bound applications.
From a purely performance perspective, the most expensive CPU operations are memory accesses these days. Anything that halves memory reads will likely produce better performance.
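As a back-of-the-envelope illustration of the memory halving, here is a sketch estimate for a 16-character ASCII string. All sizes are assumptions (64-bit JVM with compressed oops: 12-byte object headers, 16-byte array headers, 8-byte alignment), and the field layouts are simplified models, not the actual JDK classes:

```java
// Rough heap-footprint model: a char[]-backed String vs a Latin-1
// byte[]-backed one. Header/alignment constants are assumptions for a
// typical 64-bit HotSpot with compressed oops.
final class Footprint {
    static long align8(long n) { return (n + 7) & ~7L; }

    // char[]-backed String: 12-byte header + value ref (4) + hash (4),
    // plus a char[] at 2 bytes per character
    static long utf16String(int n) {
        return align8(12 + 4 + 4) + align8(16 + 2L * n);
    }

    // byte[]-backed compact String: one extra coder byte,
    // plus a byte[] at 1 byte per Latin-1 character
    static long compactString(int n) {
        return align8(12 + 4 + 4 + 1) + align8(16 + (long) n);
    }

    public static void main(String[] args) {
        System.out.println(utf16String(16));   // 72 bytes
        System.out.println(compactString(16)); // 56 bytes
    }
}
```

The array payload halves, but the fixed object and array headers don't, so the total saving for short strings is meaningful yet less than 2x.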
From an implementation perspective, having used 1.6's compressed strings feature in production, we are comfortable that neither our code nor any of our dependencies relies on String's internal representation in a way that would cause a significant backward-compatibility issue.
Thanks Moh
-----Original Message----- From: core-libs-dev [mailto:core-libs-dev-bounces@openjdk.java.net] On Behalf Of mark.reinhold@oracle.com Sent: Thursday, May 14, 2015 7:05 PM To: xueming.shen@oracle.com Cc: core-libs-dev@openjdk.java.net Subject: JEP 254: Compact Strings
New JEP Candidate: http://openjdk.java.net/jeps/254
- Mark
From a purely performance perspective, the most expensive CPU operations are memory accesses these days.
Very true ... for random accesses.

Anything that halves memory reads will likely produce better performance.

This part is a bit unclear for the proposed changes. While it's true that single byte encoding will be denser than two byte, most string ops end up walking the backing store linearly; prefetch (either implicit h/w or software-assisted) could hide the memory access latency.

Personally, what I'd like to see is fusing storage of String with its backing data, irrespective of encoding (i.e. removing the indirection to fetch the char[] or byte[]).

On Mon, May 18, 2015 at 10:24 AM, Rezaei, Mohammad A. <Mohammad.Rezaei@gs.com> wrote:
On 05/18/2015 05:35 PM, Vitaly Davidovich wrote:
This part is a bit unclear for the proposed changes. While it's true that single byte encoding will be denser than two byte, most string ops end up walking the backing store linearly; prefetch (either implicit h/w or software-assisted) could hide the memory access latency.
It will still pollute the caches, though, and generally incur more instructions (e.g. think about the vectorized scan of the char[] array -- the compressed version takes half as many instructions).
Personally, what I'd like to see is fusing storage of String with its backing data, irrespective of encoding (i.e. removing the indirection to fetch the char[] or byte[]).
This is not the target of this JEP, and the groundwork for String-char[] fusion is handled elsewhere (I put my hopes on Valhalla, which will explore the exact path for adding the "exotic" object shapes to the runtime). String-char[] fusion neither conflicts with the Compact Strings optimization nor provides an alternative to it. Removing the "excess" headers from the backing char[] array would solve the "static" overhead in Strings, while string compaction would further compact the backing storage. Thanks, -Aleksey.
Hi Aleksey,

While it's true that the denser format will require fewer cachelines, my experience is that most strings are smaller than a single cacheline worth of storage, maybe two lines in some cases; there's just a ton of them in the heap. So the heap footprint should be substantially reduced, but I'm not sure the cache pollution will be significantly reduced.

There's currently no vectorization of char[] scanning (or any vectorization other than memcpy, for that matter) - are you referring to the recent Intel contributions here, or is there a plan to further improve vectorization in time for this JEP? Just curious.

I agree that string fusion is separate from this change, and we've discussed this before. It just seems to me like that's the bigger perf problem today, since even tiny/small strings (very common, IME) incur the indirection and bloat overhead, so I would have liked to see that addressed first. If you're saying that's fully on Valhalla's plate, OK, but I haven't seen anything proposed there yet.

Thanks

sent from my phone
On Jun 1, 2015 4:50 AM, "Aleksey Shipilev" <aleksey.shipilev@oracle.com> wrote:
On 06/01/2015 03:54 PM, Vitaly Davidovich wrote:
While it's true that the denser format will require fewer cachelines, my experience is that most strings are smaller than a single cacheline worth of storage, maybe two lines in some cases; there's just a ton of them in the heap. So the heap footprint should be substantially reduced, but I'm not sure the cache pollution will be significantly reduced.
This calculation assumes object allocations are granular to cache lines. They are not: if a String takes less space within the cache line, *more* object data can be squeezed in there. In other words, with compact Strings, the entire data set can take fewer cache lines, thus improving performance.
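In numbers, the data-set-level argument looks like this. All inputs are assumptions (64-byte cache lines; the 72-byte vs 56-byte per-string figures model a 16-char ASCII string on a typical 64-bit JVM), and the packing is idealized:

```java
// Idealized data-set view: with dense allocation, the number of cache lines
// touched tracks total bytes, not per-object rounding. Line size and
// per-string sizes below are assumptions.
final class CacheLines {
    static long linesTouched(long totalBytes) {
        return (totalBytes + 63) / 64; // ceiling over 64-byte lines
    }

    public static void main(String[] args) {
        long strings = 1_000_000;
        long utf16Total   = strings * 72; // char[]-backed 16-char string
        long compactTotal = strings * 56; // Latin-1 byte[]-backed equivalent
        System.out.println(linesTouched(utf16Total));   // 1125000
        System.out.println(linesTouched(compactTotal)); // 875000
    }
}
```

Whether real heaps pack densely enough for this to show up outside microbenchmarks is exactly the point Vitaly disputes in the next message.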
There's currently no vectorization of char[] scanning (or any vectorization other than memcpy, for that matter) - are you referring to the recent Intel contributions here, or is there a plan to further improve vectorization in time for this JEP? Just curious.
String methods are intensely intrinsified (and vectorized in those implementations). String::equals, String::compareTo, and some encoding/decoding come to mind.

I really, really invite you to read the collateral materials from the JEP, where we have explored quite a few performance characteristics already.

Thanks, -Aleksey.
My calculation doesn't assume cacheline granularity; I'm looking strictly at the strings. What's allocated next to/around them is completely arbitrary, circumstantial, uncontrollable to a large extent, and often not repeatable. If you're claiming that some second- or even third-order locality effects will be measurable, I don't know how :). I'm sure there will be some, as theoretically it's possible, but it'll be hard to demonstrate that on anything other than specially crafted microbenchmarks.

Ok, you're talking about some string intrinsics and not general char[] vectorization - fair enough.

sent from my phone
On Jun 1, 2015 9:31 AM, "Aleksey Shipilev" <aleksey.shipilev@oracle.com> wrote:
participants (5):
- Aleksey Shipilev
- Jeremy Manson
- mark.reinhold@oracle.com
- Rezaei, Mohammad A.
- Vitaly Davidovich