FYC : JDK-7197183 : Improve copying behaviour of String.subSequence()
Remi Forax
forax at univ-mlv.fr
Tue Feb 19 14:11:06 UTC 2013
Hi Mike,
I agree with Peter,
a new method is better than changing subSequence to return a
CharSequence implementation that try to fake itself as a String.
And as a guys that have written several parsers (for HTTP servers or not),
ByteBuffer/CharBuffer was introduced in 1.4 to write effective parsers
in Java,
no one should try write a Java parser using String buffers anymore.
Rémi
On 02/19/2013 10:28 AM, Peter Levart wrote:
> Hi Mike,
>
> Regarding the implementation details: I think it would be better to
> reference just the String's value[] array in the String.SubSequence
> instead of the String instance itself. Two reasons come to mind:
>
> - eliminate another indirection when accessing the array
> - when referencing the String instance, code like this would leak
> implementation details (again):
>
> String s = new String("...");
> CharSequence cs = s.subSequence(0, s.length()-1);
> WeakReference<String> wr = new WeakReference<>(s);
> // ... if we keep a reference to cs, wr will never be cleared...
>
> Regarding the strategy: It's very unfortunate that code exists that
> uses String.subSequence result in a way treating it as an object with
> value semantics for equals() and hashCode() despite the specification
> for CharSequence clearly stating:
>
> /"This interface does not refine the general contracts of the
> //|equals|
> <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#equals%28java.lang.Object%29>//and
> //|hashCode|
> <http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode%28%29>//methods.
> The result of comparing two objects that implement //CharSequence//is
> therefore, in general, undefined. Each object may be implemented by a
> different class, and there is no guarantee that each class will be
> capable of testing its instances for equality with those of the other.
> It is therefore inappropriate to use arbitrary
> //CharSequence//instances as elements in a set or as keys in a map."/
>
> The excuse for such usage is the specification of String.subSequence
> method:
>
> /"Returns a new character sequence that is a subsequence of this
> sequence. /
>
> /An invocation of this method of the form /
>
> / str.subSequence(begin, end)/
>
> /behaves in exactly the same way as the invocation /
>
> / str.substring(begin, end)/
>
> /This method is defined so that the //String//class can implement the
> //|CharSequence|
> <http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html>//interface.
> "/
>
>
> So this proposal actually breaks this specification. It tries to break
> it in a way that would keep some of the usages still behave like
> before, but it can't entirely succeed. The following code is valid
> now, but will throw CCE with this proposal:
>
> String s = ...;
> String s0 = (String) s.subSequence(0, 1);
>
> One can imagine other scenarios that would break (like custom
> comparators casting to String, etc.).
>
> So to be entirely compatible, an escape-hatch would have to be
> available (in the form of a System property for example) to restore
> the old behaviour if requested.
>
> If there is an escape-hatch, the question arises: Why not breaking the
> String.subSequence entirely? We could brake it so that the method
> would clearly specify the returned CharSequence behaviour regarding
> underlying value[] array sharing, hashCode(), equals(), subSequence()
> and Comparable only in terms of CharSequence instances returned from
> the method and not cross String <-> String.SubSequence.
>
> This would encourage migration of behaviourally incompatible code to
> forms that are compatible with both pre JDK8 and post JDK8
> String.subSequence specification instead of keeping the status quo.
>
> There's also a third option - adding new method (back-porting it to
> JDK7):
>
> public String.SubSequence stringSubSequence(int, int);
>
> ...and keeping subSequence as is.
>
> Regards, Peter
>
> On 02/19/2013 06:27 AM, Mike Duigou wrote:
>> Hello all;
>>
>> JDK 7u6 included a significant change to java.lang.String. The change
>> was internal to the String implementation and didn't result in any
>> API changes but it does have a significant impact on the performance
>> of some uses cases. (See
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2012-June/010509.html
>> for an earlier discussion)
>>
>> Prior to 7u6 String maintained two fields "count" and "offset" which
>> described the location within the character array of the String's
>> characters. In 7u6 the count and offset fields were removed and
>> String instances no longer share their character arrays with other
>> String instances. Each substring(), subSequence() or split() now
>> creates entirely new String instances for it's results. Before 7u6
>> two or more instances of String could share the same backing
>> character array. A number of String operations; clone(), split(),
>> subString(), subSequence() did not copy the backing character array
>> but created new String instances pointing at their character array
>> with appropriate "count" and "offset" values.
>>
>> As with most sharing techniques, there are tradeoffs. The
>> "count/offset" approach works reasonably well for cases where the
>> substrings have a shorter lifetime than the original. Frequently it
>> was found though that the large character arrays "leaked" through
>> small Strings derived from larger String instances. Examples would be
>> tokens parsed with substring() from HTTP headers being used as keys
>> in Maps. This caused the entire header character array from the
>> original header String to not be garbage collected (or worse the
>> entire set of headers for a request).
>>
>> Our benchmarking and application testing prior to changing the String
>> implementation in 7u6 showed that it was a net benefit to not share
>> the character array instances. The benchmarking and performance
>> testing for this change actually began in 2007 and was very
>> extensive. Benchmarking and performance analysis since the release of
>> 7u6 continues to indicate that removal of sharing is the better
>> approach. It is extremely unlikely that we would consider returning
>> to the pre-7u6 implementation (in case you were thinking of
>> suggesting that).
>>
>> There are some cases where the new approach can result in significant
>> performance penalties. This is a "For Your Consideration" review not
>> a pre-push changeset. The review changeset is a weakening of the
>> "never share the String character array" rule and it means that it
>> would suffer from exactly the same weakness. Few applications
>> currently use subSequence() most currently use subString().
>> Applications which would benefit from this change would have to
>> switch to using subSequence(). Apps can safely switch to subSequence
>> in anticipation of this change because currently subSequence() is
>> identical to substring(). This means that should this changeset not
>> be integrated app code would suffer no penalty and if this change is
>> eventually integrated then app performance will improve.
>>
>> http://cr.openjdk.java.net/~mduigou/JDK-7197183/0/
>>
>> From our current testing we found that applications currently using
>> subSequence() failed if the equals(), hashCode() and toString()
>> implementations did not exactly match String. Additionally we had to
>> change String.equals() so that it recognizes can return "true" for
>> matching instances of String.SubSequence.
>>
>> You will see some unfortunate potential usage patterns in the
>> presented implementation--most specifically, calling toString() on
>> the result of a String.subSequence() results in a new String instance
>> being created (ouch!). I would like to eliminate the caching of the
>> hashCode() result but it appears that it is frequently used and
>> failing to cache the hash code results in greatly decreased
>> performance for the relevant applications. Currently TreeSet and
>> TreeMap which use natural order fail for data sets of mixed String
>> and String.SubSequence. I believe it is necessary for natural order
>> sorting to work for mixed collections of String and
>> String.SubSequence instances.
>>
>> Would this proposal cause your applications any problems? Is this
>> proposal absolutely necessary for your application to have adequate
>> performance? Have you already made other accommodations for the
>> altered performance behaviour of Strings introduced in 7u6? Other
>> thoughts?
>>
>> Mike
>
More information about the core-libs-dev
mailing list