question regarding the java.lang.String design

Mon Feb 2 07:34:25 PST 2009


David Holmes - Sun Microsystems wrote:
> Hi Xiaobin,
>
> As you've probably gleaned by now the count and offset fields are to 
> allow sharing of the underlying char[] - which is a safe thing to do 
> exactly because a string is immutable. I've often thought this 
> particular optimization was under-utilized.
>
> As others have said optimization of strings has been a recurring theme 
> for many years now - there was even a paper on it at last year's ACM 
> OOPSLA conference. IBM Research's Tokyo labs do a lot in this area - 
> see for example "RT0750 A Quantitative Analysis of Space Waste from 
> Java Strings and its Elimination at GC Time".
FWIW, one of the optimizations they present in their paper is actually 
unsafe. If two strings (str1 and str2, say) have the same contents, a 
young collection might get rid of one (say str2) and replace it with the 
other (all references that pointed to str2 now point to str1). So, 
two objects that have distinct identities (i.e., str1 == str2 would 
return false) might suddenly become the same object (i.e., str1 == str2 
would now return true). I don't think this is allowed by the Java spec. 
The duplicate string should instead have been modified to share the 
other string's char array, keeping the two String objects distinct.
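
A minimal sketch of that safer variant, in Java. Str is a hypothetical 
stand-in for java.lang.String, with a non-final array reference so that 
plain code can simulate what the collector would do; it is neither the 
real class layout nor the paper's implementation.

```java
import java.util.Arrays;

public class DedupSketch {
    // Hypothetical stand-in for java.lang.String: an object with its own
    // identity that points at a backing char array.
    static final class Str {
        char[] value; // not final here, so the sketch can simulate the collector

        Str(String s) {
            this.value = s.toCharArray();
        }

        boolean sameContents(Str other) {
            return Arrays.equals(value, other.value);
        }
    }

    // Safe variant: keep both objects alive and only make the duplicate
    // point at the other object's backing array.
    static void shareBackingArray(Str keep, Str dup) {
        if (keep.sameContents(dup)) {
            dup.value = keep.value;
        }
    }

    public static void main(String[] args) {
        Str str1 = new Str("hello");
        Str str2 = new Str("hello");
        System.out.println(str1 == str2);             // false: distinct objects
        shareBackingArray(str1, str2);
        System.out.println(str1 == str2);             // false: identity unchanged
        System.out.println(str1.value == str2.value); // true: one array is shared
    }
}
```

The result of str1 == str2 is the same before and after, which is the 
property the reference-merging variant would break.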

Tony
> I've occasionally thought that with all the duplicate strings that 
> readily occur in Java it might be an option to have a few large tables 
> of "text" containing all the characters, and then to define a String 
> as one or more pairs of indices into these tables. But that's as far 
> as I've thought about it :)
>
> Cheers,
> David Holmes
>
>
> Xiaobin Lu said the following on 01/31/09 04:42:
>> Resend the email to hotspot-dev at openjdk.java.net.
>> -Xiaobin
>>
>> Xiaobin Lu wrote:
>>> Hi folks,
>>>
>>> While I am looking at the java.lang.String implementation, I noticed 
>>> that it has "offset" and "count" fields in java.lang.String. For the 
>>> offset field, I only found two places which set that field, but I 
>>> believe they can be got rid of too. The two places are 
>>> String(StringBuffer buffer) & String(StringBuilder builder).
>>>
>>> My question is that if String is immutable, why do we need to carry 
>>> these two fields? String could be more compact without these two 
>>> fields. The equals method could be implemented more efficiently by 
>>> just calling java.util.Arrays.equals(v1, v2), which is intrinsified 
>>> on x86 at least.
>>>
>>> Another crazy thought is that we can compact the character array to 
>>> a byte array if we don't have any characters other than ASCII (we 
>>> might use a boolean flag to indicate that).
>>>
>>> I'd appreciate your insight on this.
>>>
>>> -Xiaobin
>>>
>>>
>>>
>>
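
For reference, the char[] sharing that the offset and count fields allow 
(discussed in the quoted messages above) can be sketched roughly as 
follows. View is a hypothetical class written for illustration, not the 
JDK source; the only assumption carried over from the thread is that a 
string is an immutable (array, offset, count) triple.

```java
public class SharedCharsSketch {
    // Hypothetical immutable string: a window of `count` chars starting at
    // `offset` inside a backing array that other instances may share.
    static final class View {
        private final char[] value;
        private final int offset;
        private final int count;

        View(String s) {
            this(s.toCharArray(), 0, s.length());
        }

        private View(char[] value, int offset, int count) {
            this.value = value;
            this.offset = offset;
            this.count = count;
        }

        // Constant-time substring: no characters are copied, the new
        // instance only records a different window on the same array.
        View substring(int begin, int end) {
            if (begin < 0 || end > count || begin > end) {
                throw new StringIndexOutOfBoundsException();
            }
            return new View(value, offset + begin, end - begin);
        }

        boolean sharesArrayWith(View other) {
            return value == other.value;
        }

        @Override
        public String toString() {
            return new String(value, offset, count);
        }
    }

    public static void main(String[] args) {
        View whole = new View("hello world");
        View word = whole.substring(6, 11);
        System.out.println(word);                        // world
        System.out.println(word.sharesArrayWith(whole)); // true
    }
}
```

Sharing is safe here only because nothing can write to the array after 
construction. The cost is that a short substring keeps the whole backing 
array reachable.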

-- 
---------------------------------------------------------------------
| Tony Printezis, Staff Engineer   | Sun Microsystems Inc.          |
|                                  | MS UBUR02-311                  |
| e-mail: tony.printezis at sun.com   | 35 Network Drive               |
| office: +1 781 442 0998 (x20998) | Burlington, MA 01803-2756, USA |
---------------------------------------------------------------------
e-mail client: Thunderbird (Linux)