RFR: String Density/Compact String JEP 254 (update)

Tue Nov 3 17:58:58 UTC 2015

Hi,

This is a significant body of impressive work; well done to all who worked on it.

String
------

 148      * The instance field value is generally opaque to optimizing JIT
 149      * compilers. Therefore, in performance-sensitive place, an explicit
 150      * check of the static boolean {@code COMPACT_STRINGS} is done first
 151      * before checking the {@code coder} field since the static boolean
 152      * {@code COMPACT_STRINGS} would be constant folded away by an
 153      * optimizing JIT compiler. The idioms for these cases are as follows.
...
 172      * @implNote
 173      * The actual value for this field is injected by JVM. The static
 174      * initialization block is used to set the value here to communicate
 175      * that this static final field is not statically foldable, and to
 176      * avoid any possible circular dependency during vm initialization.
 177      */
 178     static final boolean COMPACT_STRINGS;

For those less familiar with these matters, the impl note may appear to contradict what is stated earlier about constant folding.

You might want to clarify that you don’t want the field to be directly initialized to a constant expression, since javac would then replace its usages at compile time with the value of that expression.
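To illustrate the distinction, here is a minimal sketch. The class and field names are mine and hypothetical, not from the webrev. A static final field initialized directly with a constant expression is a constant variable, so javac inlines it at each use and reading it does not even trigger class initialization. A field assigned in a static initializer block is read with a real field access at run time.

```java
import java.util.ArrayList;
import java.util.List;

public class ConstantFoldingSketch {
    // Records which holder classes have run their static initializer.
    static final List<String> initLog = new ArrayList<>();

    // Hypothetical holder: FLAG is a constant variable, inlined by javac.
    static class Direct {
        static final boolean FLAG = true;
        static { initLog.add("Direct"); }
    }

    // Hypothetical holder: FLAG is assigned in a static block, so it is
    // not a compile-time constant and is read with a field access.
    static class ViaInit {
        static final boolean FLAG;
        static {
            FLAG = true;
            initLog.add("ViaInit");
        }
    }

    // Compiled as 'return true'; Direct is never initialized by this read.
    static boolean readDirect() { return Direct.FLAG; }

    // Compiled as a field read; the first call initializes ViaInit.
    static boolean readViaInit() { return ViaInit.FLAG; }

    public static void main(String[] args) {
        System.out.println(readDirect());
        System.out.println(readViaInit());
        System.out.println(initLog); // only ViaInit was initialized
    }
}
```

This only shows the javac side. Whether the JIT later folds the run-time read is a separate question, which is what the first part of the String comment is about.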

I notice that in some comparisons of this string with another string, such as String.startsWith, you take a shortcut when the coders differ (this coder is LATIN1 and the other coder is UTF16), on the assumption that the contents must differ. In other cases, such as compareTo or regionMatches, there is no such shortcut. Why the difference?

  String a = … // LATIN1 encoding
  String b = … // UTF16 encoding, sharing a common prefix with a, and some additional UTF16 chars afterwards

  a.startsWith(b.substring(0, a.length())); // false
  b.startsWith(a); // true

?

Ah, I see that you are compressing in package private constructors, so in effect normalizing to LATIN1 where possible:

3008      * Package private constructor. Trailing Void argument is there for
3009      * disambiguating it against other (public) constructors.
3010      *
3011      * Stores the char[] value into a byte[] that each byte represents
3012      * the8 low-order bits of the corresponding character, if the char[]
3013      * contains only latin1 character. Or a byte[] that stores all
3014      * characters in their byte sequences defined by the {@code StringUTF16}.
3015      */
3016     String(char[] value, int off, int len, Void sig) {

Thus I think my example above will actually work as expected, since the substring will result in a new string using a LATIN1 coder.

But does my point still stand that the coder-checking logic is applied inconsistently?
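As a sanity check on that reasoning, here is a minimal sketch of the normalization as I understand it from the constructor comment. The compress helper and the class name are hypothetical stand-ins of mine, not the webrev's code. The startsWith calls in main only use the public String API, so their results do not depend on the internal coder.

```java
public class NormalizationSketch {
    // Hypothetical stand-in for the compression step: returns one byte per
    // char if every char is below 0x100, or null if any char needs UTF16.
    static byte[] compress(char[] value, int off, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = value[off + i];
            if (c > 0xFF) {
                return null; // caller would fall back to the two-byte form
            }
            out[i] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "abc";             // Latin-1 content only
        String b = "abc\u0416\u0417"; // same prefix, then non-Latin-1 chars

        // The prefix of b holds only Latin-1 chars, so it is compressible,
        // while b as a whole is not.
        char[] prefix = b.substring(0, a.length()).toCharArray();
        System.out.println(compress(prefix, 0, prefix.length) != null);        // true
        System.out.println(compress(b.toCharArray(), 0, b.length()) != null);  // false

        System.out.println(a.startsWith(b.substring(0, a.length()))); // true
        System.out.println(b.startsWith(a));                           // true
    }
}
```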

I cannot quite tell if the normalization is consistently applied or only for certain operations. Perhaps there needs to be some asserts judiciously placed that verify the string content to better catch cases where normalization is unintentionally not applied?
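The kind of check I have in mind could look roughly like the sketch below. Everything here is an assumption of mine for illustration: the helper names, the coder constant values, and the big-endian layout of the two-byte units (the real layout presumably follows the platform byte order). It also only makes sense when compact strings are enabled, since with the option switched off every string uses the UTF16 form.

```java
public class CoderInvariantSketch {
    // Assumed coder values, for illustration only.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Hypothetical helper: true if every two-byte unit has a zero high byte,
    // i.e. the content could have been stored in the LATIN1 form.
    // Assumes big-endian units (high byte at the even index).
    static boolean isCompressible(byte[] value) {
        for (int i = 0; i + 1 < value.length; i += 2) {
            if (value[i] != 0) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical invariant: a UTF16-coded value must hold at least one
    // char that does not fit in Latin-1, otherwise normalization was missed.
    static boolean isNormalized(byte[] value, byte coder) {
        return coder == LATIN1
            || (value.length % 2 == 0 && !isCompressible(value));
    }

    public static void main(String[] args) {
        byte[] latin = { 0x41 };                  // "A" in the LATIN1 form
        byte[] missed = { 0x00, 0x41 };           // "A" wrongly left as UTF16
        byte[] cyrillic = { 0x04, 0x16 };         // U+0416 as UTF16
        System.out.println(isNormalized(latin, LATIN1));    // true
        System.out.println(isNormalized(missed, UTF16));    // false
        System.out.println(isNormalized(cyrillic, UTF16));  // true
    }
}
```

A call such as assert isNormalized(value, coder) at the end of the package private constructors and builder conversions would then flag any path that skips compression.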

For new tests that use Random you should add the following jtreg tag:

 * @key randomness
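For example, a test header could look like the sketch below. The test class, its summary and its helper method are hypothetical and not taken from the webrev. Printing the seed is just a convenience so that a failing run can be reproduced.

```java
/*
 * @test
 * @summary Hypothetical example of a randomized String test
 * @key randomness
 * @run main RandomizedStringTest
 */
import java.util.Random;

public class RandomizedStringTest {
    // Builds a string of the given length from chars below 0x100.
    static String randomLatin1(Random rnd, int len) {
        char[] cs = new char[len];
        for (int i = 0; i < len; i++) {
            cs[i] = (char) rnd.nextInt(0x100);
        }
        return new String(cs);
    }

    public static void main(String[] args) {
        // An optional first argument replays a previously printed seed.
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("Seed: " + seed);
        Random rnd = new Random(seed);

        String s = randomLatin1(rnd, 16);
        String t = new String(s.toCharArray());
        if (!s.equals(t) || !s.startsWith(t)) {
            throw new AssertionError("Round trip failed, seed " + seed);
        }
    }
}
```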

I stopped there for now.

Paul.

P.S. I hope at some point in the future we can revisit the HotSpot intrinsics for byte[]/char[] equality and comparison with future array mismatch work.

> On 30 Oct 2015, at 22:30, Xueming Shen <xueming.shen at oracle.com> wrote:
> 
> Hi,
> 
> Thanks for the comments/suggestions. Here are the updated webrevs with minor changes here
> and there based on the feedback.
> 
> http://cr.openjdk.java.net/~sherman/8054307/jdk/
> http://cr.openjdk.java.net/~thartmann/compact_strings/webrev/hotspot/
> 
> [closed, Oracle internal only]
> http://javaweb.us.oracle.com/~tohartma/compact_strings/hotspot/
> http://javaweb.us.oracle.com/~tohartma/compact_strings/hotspot_test_closed/
> 
> The code is ready for integration. The current plan is to integrate via the hotspot repo in coming
> week if it passes the PIT.
> 
> Thanks
> -Sherman
> 
> On 10/5/15 8:30 AM, Xueming Shen wrote:
>> (resent to hotspot-dev at openjdk.java.net)
>> 
>> Hi,
>> 
>> Please review the change for JEP 254/Compact String project.
>> 
>> JEP 254: http://openjdk.java.net/jeps/254
>> Issue:   https://bugs.openjdk.java.net/browse/JDK-8054307
>> Webrevs: http://cr.openjdk.java.net/~sherman/8054307/jdk/
>> http://cr.openjdk.java.net/~thartmann/compact_strings/webrev/hotspot
>> 
>> Description:
>> 
>>  String Density project is to change the internal representation of the
>>  String class from a UTF-16 char array to a byte array plus an encoding
>>  flag field. The new String class stores characters encoded either as
>>  ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
>>  per character), based upon the contents of the string. The encoding
>>  flag indicates which encoding is used. It offers reduced memory footprint
>>  while maintaining throughput performance. See JEP 254 for additional
>>  information.
>> 
>> Implementation repo/try out:
>>  http://hg.openjdk.java.net/jdk9/sandbox/  branch: JDK-8054307-branch
>> 
>>  $ hg clone http://hg.openjdk.java.net/jdk9/sandbox/
>>  $ cd sandbox
>>  $ sh ./get_source.sh
>>  $ sh ./common/bin/hgforest.sh up -r JDK-8054307-branch
>>  $ make configure
>>  $ make images
>> 
>> Implementation Notes:
>> 
>> - To change the internal representation of the String and the String
>>   builder classes (AbstractStringBuilder, StringBuilder and StringBuffer)
>>   from a UTF-16 char array to a byte array plus an encoding flag field.
>> 
>>   The new representation stores the String characters in a single byte
>>   format using the lower 8-bit of character's 16-bit UTF16 value, and
>>   sets the encoding flag as LATIN1, if all characters of the String
>>   object are Unicode Latin1 characters (with its UTF16 value < \u0100)
>> 
>>   It stores the String characters in 2-byte format with their UTF-16 value
>>   and sets the flag as UTF16, if any of the character inside the String
>>   object is NOT Unicode latin1 character.
>> 
>> - To change the method implementation of the String class and its builders
>>   to function on the new internal character storage, mainly to delegate to
>>   two implementation classes StringUTF16 and StringLatin1
>> 
>> - To update the StringCoding class to decode/encode the String between
>>   String.byte[]/coder(LATIN1/UTF16) <-> byte[](native encoding) instead
>>   of the original String.char[] <-> byte[] (native encoding)
>> 
>> - To update the HotSpot compiler (new and updated intrinsics), GC (String
>>   Deduplication mods) and Runtime to work with the new internal "byte[] +
>>   coder flag" representation.
>> 
>>   See Tobias's note for details of the hotspot changes:
>> http://cr.openjdk.java.net/~thartmann/compact_strings/hotspot-impl-note
>> 
>> - To add a vm option "CompactStrings" (default is true) to provide a
>>   switch-off mechanism to always store the String characters in UTF16
>>   encoding (always 2 bytes, but still in a byte[], instead of the
>>   original char[]).
>> 
>> 
>> Supporting performance artifacts:
>> 
>> - Report(s) on memory footprint impact
>> 
>> http://cr.openjdk.java.net/~shade/density/string-density-report.pdf
>> 
>>   Latest SPECjbb2005 footprint reduction and throughput numbers for both
>>   Intel (Linux) and SPARC, in which it shows the Compact String binaries
>>   use less memory and have higher throughput.
>> 
>>   latest:http://cr.openjdk.java.net/~sherman/8054307/specjbb2005
>>   old: http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf
>> 
>> - Throughput performance impact via String API micro-benchmarks
>> 
>> http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/Haswell_090915.pdf
>> http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/IvyBridge_090915.pdf
>> http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/Sparc_090915.pdf
>>   http://cr.openjdk.java.net/~sherman/8054307/string-coding.txt
>> 
>> Thanks,
>> Sherman
>