FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Tue Feb 19 21:35:51 UTC 2013

I think it's really easy for users to write a SubSequence by
themselves if they need to solve the problem; there is no need for
String to do it, and definitely not at the expense of breaching
previous explicit contracts. We should solve this problem externally,
explicitly; not internally with some surprising plumbings.

The patch may not be very helpful anyway. It is unlikely that an
application can simply switch from using sub-String to using
sub-CharSequence, since CharSequence is not nearly as convenient as
String - CharSequence has no contract on hashCode/equals, and no
operations like indexOf() etc. CharSequence is still not widely used,
not even in the standard lib - among java.** classes, besides
subclasses of CharSequence/Appendable, there are only 4 classes with
public methods accepting CharSequence args:  Normalizer, Matcher,
Pattern, CharsetEncoder.

A new top level public class that mimics String is probably going to
be helpful; this class may not be worthy of being included in
java.lang; it could be provided by 3rd parties. The class should

1. be able to wrap any CharSequence and char[] source
2. reference the source without copying, till explicitly asked to
3. though not immutable, remain constant if source remains constant
4. have well defined hashCode/equals
5. have convenience operations like indexOf()
6. include popular methods from various StringUtils
7. have well defined space/time cost on all methods

Let me call it "Strand". If anyone got a problem with the change of
String impl, we can refer him to wrap the string in a Strand, and
operate on the strand instead.

Zhong Yu

On Mon, Feb 18, 2013 at 11:27 PM, Mike Duigou <mike.duigou at oracle.com> wrote:
> Hello all;
>
> JDK 7u6 included a significant change to java.lang.String. The change was internal to the String implementation and didn't result in any API changes but it does have a significant impact on the performance of some uses cases. (See http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html for an earlier discussion)
>
> Prior to 7u6 String maintained two fields "count" and "offset" which described the location within the character array of the String's characters. In 7u6 the count and offset fields were removed and String instances no longer share their character arrays with other String instances. Each substring(), subSequence() or split() now creates entirely new String instances for it's results. Before 7u6 two or more instances of String could share the same backing character array. A number of String operations; clone(), split(), subString(), subSequence() did not copy the backing character array but created new String instances pointing at their character array with appropriate "count" and "offset" values.
>
> As with most sharing techniques, there are tradeoffs. The "count/offset" approach works reasonably well for cases where the substrings have a shorter lifetime than the original. Frequently it was found though that the large character arrays "leaked" through small Strings derived from larger String instances. Examples would be tokens parsed with substring() from HTTP headers being used as keys in Maps. This caused the entire header character array from the original header String to not be garbage collected (or worse the entire set of headers for a request).
>
> Our benchmarking and application testing prior to changing the String implementation in 7u6 showed that it was a net benefit to not share the character array instances. The benchmarking and performance testing for this change actually began in 2007 and was very extensive. Benchmarking and performance analysis since the release of 7u6 continues to indicate that removal of sharing is the better approach. It is extremely unlikely that we would consider returning to the pre-7u6 implementation (in case you were thinking of suggesting that).
>
> There are some cases where the new approach can result in significant performance penalties. This is a "For Your Consideration" review not a pre-push changeset. The review changeset is a weakening of the "never share the String character array" rule and it means that it would suffer from exactly the same weakness. Few applications currently use subSequence() most currently use subString(). Applications which would benefit from this change would have to switch to using subSequence(). Apps can safely switch to subSequence in anticipation of this change because currently subSequence() is identical to substring(). This means that should this changeset not be integrated app code would suffer no penalty and if this change is eventually integrated then app performance will improve.
>
> http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
>
> From our current testing we found that applications currently using subSequence() failed if the equals(), hashCode() and toString() implementations did not exactly match String. Additionally we had to change String.equals() so that it recognizes can return "true" for matching instances of String.SubSequence.
>
> You will see some unfortunate potential usage patterns in the presented implementation--most specifically, calling toString() on the result of a String.subSequence() results in a new String instance being created (ouch!). I would like to eliminate the caching of the hashCode() result but it appears that it is frequently used and failing to cache the hash code results in greatly decreased performance for the relevant applications. Currently TreeSet and TreeMap which use natural order fail for data sets of mixed String and String.SubSequence. I believe it is necessary for natural order sorting to work for mixed collections of String and String.SubSequence instances.
>
> Would this proposal cause your applications any problems? Is this proposal absolutely necessary for your application to have adequate performance? Have you already made other accommodations for the altered performance behaviour of Strings introduced in 7u6? Other thoughts?
>
> Mike