question regarding the java.lang.String design

Fri Jan 30 23:30:21 PST 2009

I wouldn't add a new field but (in the interests of keeping bloat  
under control) would look for ways to encode corner cases in existing  
fields. E.g. if the length field is zero then disregard offset and go  
straight to the value array. Or, with a type test on the value array,
allow it to be a UTF-8 byte[]. Martin's idea of representation
swapping could be gated by a type test on the value field alone.
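
A minimal sketch of the two encodings suggested above, assuming a
hypothetical class CompactText (this is not the JDK's String, and the
field layout is an assumption for illustration). One Object field holds
either a char[] or a UTF-8 byte[], and a zero count is read as "ignore
offset and use the whole array":

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the JDK's String: one Object field holds either
// a char[] (UTF-16) or a byte[] (UTF-8), and a type test picks the path.
final class CompactText {
    private final Object value;   // char[] or byte[]
    private final int offset;     // consulted only on the char[] path
    private final int count;      // 0 is read as "use the whole array"

    CompactText(char[] chars, int offset, int count) {
        this.value = chars;
        this.offset = offset;
        this.count = count;
    }

    CompactText(byte[] utf8) {
        this.value = utf8;
        this.offset = 0;
        this.count = 0;
    }

    // The representation is discovered by a type test, with no extra flag field.
    boolean isUtf8() {
        return value instanceof byte[];
    }

    @Override
    public String toString() {
        if (value instanceof byte[]) {
            return new String((byte[]) value, StandardCharsets.UTF_8);
        }
        char[] chars = (char[]) value;
        if (count == 0) {
            return new String(chars);            // corner case: disregard offset
        }
        return new String(chars, offset, count); // shared-array slice
    }
}
```

No new field is added in this sketch; the cost is a type test and a
branch on every access, which is the part that would need measuring.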

We've also discussed GC support in the VM group over the years. My  
favorite idea is Unsafe.chopArray, which would be used in  
StringBuffer.toString to tease apart the used and unused parts of the  
buffer in O(1) instructions, leaving the unused part for reuse as a  
slightly shorter buffer. Lots of ideas floating around... As Christian
says, the hard part is prototyping, and then characterizing the result
requires an insightful choice of benchmarks.
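
To make the chopArray idea concrete, here is a rough model of its
intended result. No such method exists in sun.misc.Unsafe; the class
ChopSketch and the method chop are hypothetical names. A real version
would split one heap array in place in O(1), whereas this sketch copies,
so it shows only what the caller would get back, not the cost:

```java
import java.util.Arrays;

// Hypothetical model of the chopArray idea. A VM-level version would split
// the array in place; this sketch copies and so is O(n), not O(1).
final class ChopSketch {
    /** Returns { used prefix, unused remainder } of the given buffer. */
    static char[][] chop(char[] buffer, int used) {
        if (used < 0 || used > buffer.length) {
            throw new IllegalArgumentException("used out of range: " + used);
        }
        char[] head = Arrays.copyOfRange(buffer, 0, used);             // becomes the String's value
        char[] tail = Arrays.copyOfRange(buffer, used, buffer.length); // left over for reuse
        return new char[][] { head, tail };
    }
}
```

In an in-place version the leftover buffer would presumably come out
slightly shorter than buffer.length - used, since the second array needs
its own header carved out of the unused space.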

-- John  (on my iPhone)

On Jan 30, 2009, at 10:20 PM, Xiaobin Lu <Xiaobin.Lu at Sun.COM> wrote:

> Hi David,
>
> I was ignoring the fact that substring could use the offset & count
> for sharing purposes. I am wondering whether we should have a flag
> like "isCharArrayShared" which would be set to true only for those
> strings returned from a substring call. That way, for many other
> methods in String, we could skip loading the offset & count fields,
> which are mostly set to 0 and val.length anyway (val is the
> character array).
>
> Regards,
> -Xiaobin
>
> David Holmes - Sun Microsystems wrote:
>> Hi Xiaobin,
>>
>> As you've probably gleaned by now the count and offset fields are  
>> to allow sharing of the underlying char[] - which is a safe thing  
>> to do exactly because a string is immutable. I've often thought  
>> this particular optimization was under-utilized.
>>
>> As others have said optimization of strings has been a recurring  
>> theme for many years now - there was even a paper on it at last  
>> year's ACM OOPSLA conference. IBM Research's Tokyo labs do a lot in  
>> this area - see for example "RT0750 A Quantitative Analysis of  
>> Space Waste from Java Strings and its Elimination at GC Time".
>>
>> I've occasionally thought that with all the duplicate strings that  
>> readily occur in Java it might be an option to have a few large  
>> tables of "text" containing all the characters, and then to define  
>> a String as one or more pairs of indices into these tables. But  
>> that's as far as I've thought about it :)
>>
>> Cheers,
>> David Holmes
>>
>>
>> Xiaobin Lu said the following on 01/31/09 04:42:
>>> Resending the email to hotspot-dev at openjdk.java.net.
>>> -Xiaobin
>>>
>>> Xiaobin Lu wrote:
>>>> Hi folks,
>>>>
>>>> While looking at the java.lang.String implementation, I
>>>> noticed that it has "offset" and "count" fields. For the offset
>>>> field, I only found two places which set that field, but I
>>>> believe they can be got rid of too. The two places are
>>>> String(StringBuffer buffer) & String(StringBuilder builder).
>>>>
>>>> My question is: if String is immutable, why do we need to
>>>> carry these two fields? String could be more compact without
>>>> these two fields. The equals method could be implemented more
>>>> efficiently by just calling java.util.Arrays.equals(v1, v2), which
>>>> is intrinsified on x86 at least.
>>>>
>>>> Another crazy thought is that we could compact the character array
>>>> to a byte array if we don't have any characters other than ASCII
>>>> (which we might indicate with a boolean flag).
>>>>
>>>> I'd appreciate your insight on this.
>>>>
>>>> -Xiaobin
>>>>
>>>>
>>>>
>>>
>