FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()

Tue Feb 19 14:11:06 UTC 2013

Hi Mike,
I agree with Peter,
a new method is better than changing subSequence to return a 
CharSequence implementation that try to fake itself as a String.

And as a guys that have written several parsers (for HTTP servers or not),
ByteBuffer/CharBuffer was introduced in 1.4 to write effective parsers 
in Java,
no one should try write a Java parser using String buffers anymore.

Rémi

On 02/19/2013 10:28 AM, Peter Levart wrote:
> Hi Mike,
>
> Regarding the implementation details: I think it would be better to 
> reference just the String's value[] array in the String.SubSequence 
> instead of the String instance itself. Two reasons come to mind:
>
> - eliminate another indirection when accessing the array
> - when referencing the String instance, code like this would leak 
> implementation details (again):
>
> String s = new String("...");
> CharSequence cs = s.subSequence(0, s.length()-1);
> WeakReference<String> wr = new WeakReference<>(s);
> // ... if we keep a reference to cs, wr will never be cleared...
>
> Regarding the strategy: It's very unfortunate that code exists that 
> uses String.subSequence result in a way treating it as an object with 
> value semantics for equals() and hashCode() despite the specification 
> for CharSequence clearly stating:
>
> /"This interface does not refine the general contracts of the 
> //|equals| 
> <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#equals%28java.lang.Object%29>//and 
> //|hashCode| 
> <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29>//methods. 
> The result of comparing two objects that implement //CharSequence//is 
> therefore, in general, undefined. Each object may be implemented by a 
> different class, and there is no guarantee that each class will be 
> capable of testing its instances for equality with those of the other. 
> It is therefore inappropriate to use arbitrary 
> //CharSequence//instances as elements in a set or as keys in a map."/
>
> The excuse for such usage is the specification of String.subSequence 
> method:
>
> /"Returns a new character sequence that is a subsequence of this 
> sequence. /
>
> /An invocation of this method of the form /
>
>    /  str.subSequence(begin, end)/
>
> /behaves in exactly the same way as the invocation /
>
>    /  str.substring(begin, end)/
>
> /This method is defined so that the //String//class can implement the 
> //|CharSequence| 
> <http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html>//interface. 
> "/
>
>
> So this proposal actually breaks this specification. It tries to break 
> it in a way that would keep some of the usages still behave like 
> before, but it can't entirely succeed. The following code is valid 
> now, but will throw CCE with this proposal:
>
> String s = ...;
> String s0 = (String) s.subSequence(0, 1);
>
> One can imagine other scenarios that would break (like custom 
> comparators casting to String, etc.).
>
> So to be entirely compatible, an escape-hatch would have to be 
> available (in the form of a System property for example) to restore 
> the old behaviour if requested.
>
> If there is an escape-hatch, the question arises: Why not breaking the 
> String.subSequence entirely? We could brake it so that the method 
> would clearly specify the returned CharSequence behaviour regarding 
> underlying value[] array sharing, hashCode(), equals(), subSequence() 
> and Comparable only in terms of CharSequence instances returned from 
> the method and not cross String <-> String.SubSequence.
>
> This would encourage migration of behaviourally incompatible code to 
> forms that are compatible with both pre JDK8 and post JDK8 
> String.subSequence specification instead of keeping the status quo.
>
> There's also a third option - adding new method (back-porting it to 
> JDK7):
>
> public String.SubSequence stringSubSequence(int, int);
>
> ...and keeping subSequence as is.
>
> Regards, Peter
>
> On 02/19/2013 06:27 AM, Mike Duigou wrote:
>> Hello all;
>>
>> JDK 7u6 included a significant change to java.lang.String. The change 
>> was internal to the String implementation and didn't result in any 
>> API changes but it does have a significant impact on the performance 
>> of some uses cases. (See 
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html 
>> for an earlier discussion)
>>
>> Prior to 7u6 String maintained two fields "count" and "offset" which 
>> described the location within the character array of the String's 
>> characters. In 7u6 the count and offset fields were removed and 
>> String instances no longer share their character arrays with other 
>> String instances. Each substring(), subSequence() or split() now 
>> creates entirely new String instances for it's results. Before 7u6 
>> two or more instances of String could share the same backing 
>> character array. A number of String operations; clone(), split(), 
>> subString(), subSequence() did not copy the backing character array 
>> but created new String instances pointing at their character array 
>> with appropriate "count" and "offset" values.
>>
>> As with most sharing techniques, there are tradeoffs. The 
>> "count/offset" approach works reasonably well for cases where the 
>> substrings have a shorter lifetime than the original. Frequently it 
>> was found though that the large character arrays "leaked" through 
>> small Strings derived from larger String instances. Examples would be 
>> tokens parsed with substring() from HTTP headers being used as keys 
>> in Maps. This caused the entire header character array from the 
>> original header String to not be garbage collected (or worse the 
>> entire set of headers for a request).
>>
>> Our benchmarking and application testing prior to changing the String 
>> implementation in 7u6 showed that it was a net benefit to not share 
>> the character array instances. The benchmarking and performance 
>> testing for this change actually began in 2007 and was very 
>> extensive. Benchmarking and performance analysis since the release of 
>> 7u6 continues to indicate that removal of sharing is the better 
>> approach. It is extremely unlikely that we would consider returning 
>> to the pre-7u6 implementation (in case you were thinking of 
>> suggesting that).
>>
>> There are some cases where the new approach can result in significant 
>> performance penalties. This is a "For Your Consideration" review not 
>> a pre-push changeset. The review changeset is a weakening of the 
>> "never share the String character array" rule and it means that it 
>> would suffer from exactly the same weakness. Few applications 
>> currently use subSequence() most currently use subString(). 
>> Applications which would benefit from this change would have to 
>> switch to using subSequence(). Apps can safely switch to subSequence 
>> in anticipation of this change because currently subSequence() is 
>> identical to substring(). This means that should this changeset not 
>> be integrated app code would suffer no penalty and if this change is 
>> eventually integrated then app performance will improve.
>>
>> http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
>>
>>  From our current testing we found that applications currently using 
>> subSequence() failed if the equals(), hashCode() and toString() 
>> implementations did not exactly match String. Additionally we had to 
>> change String.equals() so that it recognizes can return "true" for 
>> matching instances of String.SubSequence.
>>
>> You will see some unfortunate potential usage patterns in the 
>> presented implementation--most specifically, calling toString() on 
>> the result of a String.subSequence() results in a new String instance 
>> being created (ouch!). I would like to eliminate the caching of the 
>> hashCode() result but it appears that it is frequently used and 
>> failing to cache the hash code results in greatly decreased 
>> performance for the relevant applications. Currently TreeSet and 
>> TreeMap which use natural order fail for data sets of mixed String 
>> and String.SubSequence. I believe it is necessary for natural order 
>> sorting to work for mixed collections of String and 
>> String.SubSequence instances.
>>
>> Would this proposal cause your applications any problems? Is this 
>> proposal absolutely necessary for your application to have adequate 
>> performance? Have you already made other accommodations for the 
>> altered performance behaviour of Strings introduced in 7u6? Other 
>> thoughts?
>>
>> Mike
>