RFR: String Density/Compact String JEP 254

Fri Oct 2 21:02:18 UTC 2015

Hi,

Please review the change for JEP 254/Compact String project.

JEP 254: http://openjdk.java.net/jeps/254
Issue:   https://bugs.openjdk.java.net/browse/JDK-8054307
Webrevs: http://cr.openjdk.java.net/~sherman/8054307/jdk/
          http://cr.openjdk.java.net/~thartmann/compact_strings/webrev/hotspot

Description:

   The String Density project changes the internal representation of the
   String class from a UTF-16 char array to a byte array plus an encoding
   flag field. The new String class stores characters encoded either as
   ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes
   per character), based upon the contents of the string. The encoding
   flag indicates which encoding is used. This reduces the memory footprint
   while maintaining throughput performance. See JEP 254 for additional
   information.

Implementation repo to try out:
   http://hg.openjdk.java.net/jdk9/sandbox/  branch: JDK-8054307-branch

   $ hg clone http://hg.openjdk.java.net/jdk9/sandbox/
   $ cd sandbox
   $ sh ./get_source.sh
   $ sh ./common/bin/hgforest.sh up -r JDK-8054307-branch
   $ make configure
   $ make images

Implementation Notes:

  - To change the internal representation of the String and the String
    builder classes (AbstractStringBuilder, StringBuilder and StringBuffer)
    from a UTF-16 char array to a byte array plus an encoding flag field.

    The new representation stores the String characters in a single-byte
    format, using the lower 8 bits of each character's 16-bit UTF-16 value,
    and sets the encoding flag to LATIN1 if all characters of the String
    object are Unicode Latin-1 characters (UTF-16 value < \u0100).

    It stores the String characters in a 2-byte format with their UTF-16
    value and sets the flag to UTF16 if any character in the String
    object is NOT a Unicode Latin-1 character.

  - To change the method implementations of the String class and its
    builders to operate on the new internal character storage, mainly by
    delegating to two implementation classes, StringUTF16 and StringLatin1.

  - To update the StringCoding class to decode/encode the String between
    String.byte[]/coder(LATIN1/UTF16) <-> byte[] (native encoding) instead
    of the original String.char[] <-> byte[] (native encoding).

  - To update the HotSpot compiler (new and updated intrinsics), GC (String
    Deduplication mods) and Runtime to work with the new internal "byte[] +
    coder flag" representation.

    See Tobias's note for details of the hotspot changes:
    http://cr.openjdk.java.net/~thartmann/compact_strings/hotspot-impl-note

  - To add a VM option "CompactStrings" (default is true) to provide a
    switch-off mechanism to always store the String characters in UTF16
    encoding (always 2 bytes, but still in a byte[], instead of the
    original char[]).
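
    For readers new to the design, the storage rule and the coder-based
    dispatch described in the notes above can be sketched roughly as
    follows. This is only an illustration and is NOT the code in the webrev:
    the class name CompactSketch, the helper names compress and
    toUtf16Bytes, the flag values 0/1, and the big-endian byte order of
    the 2-byte form are all assumptions made for the example.

```java
// Illustrative sketch only; names, flag values and byte order are assumptions.
final class CompactSketch {
    static final byte LATIN1 = 0;   // assumed flag value for the 1-byte form
    static final byte UTF16  = 1;   // assumed flag value for the 2-byte form

    final byte[] value;             // character storage
    final byte coder;               // encoding flag for 'value'

    CompactSketch(char[] chars) {
        byte[] latin1 = compress(chars);
        if (latin1 != null) {       // every char was < \u0100
            value = latin1;
            coder = LATIN1;
        } else {                    // at least one char needs two bytes
            value = toUtf16Bytes(chars);
            coder = UTF16;
        }
    }

    // Returns one byte per char, or null if any char is >= \u0100.
    static byte[] compress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c >= 0x100) {
                return null;
            }
            out[i] = (byte) c;      // keep the lower 8 bits
        }
        return out;
    }

    // Stores each char as two bytes (big-endian here, chosen for simplicity).
    static byte[] toUtf16Bytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i]     = (byte) (chars[i] >> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    // Each method branches on the coder, mirroring the idea of delegating
    // to a Latin-1 or a UTF-16 implementation.
    int length() {
        return coder == LATIN1 ? value.length : value.length >> 1;
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xff);
        }
        int hi = value[2 * index] & 0xff;
        int lo = value[2 * index + 1] & 0xff;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        CompactSketch a = new CompactSketch("hello".toCharArray());
        CompactSketch b = new CompactSketch("h\u20acllo".toCharArray());
        System.out.println("coder=" + a.coder + " bytes=" + a.value.length);
        System.out.println("coder=" + b.coder + " bytes=" + b.value.length);
    }
}
```

    In this sketch a string of five Latin-1 characters occupies five bytes,
    while a five-character string containing one non-Latin-1 character
    occupies ten; with the switch-off option the 2-byte path would always
    be taken.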

Supporting performance artifacts:

  - Report(s) on memory footprint impact

    http://cr.openjdk.java.net/~shade/density/string-density-report.pdf

    Latest SPECjbb2005 footprint reduction and throughput numbers for both
    Intel (Linux) and SPARC, which show that the Compact String binaries
    use less memory and have higher throughput.

    latest: http://cr.openjdk.java.net/~sherman/8054307/specjbb2005
    old: http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf

  - Throughput performance impact via String API micro-benchmarks

    http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/Haswell_090915.pdf
    http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/IvyBridge_090915.pdf
    http://cr.openjdk.java.net/~thartmann/compact_strings/microbenchmarks/Sparc_090915.pdf
    http://cr.openjdk.java.net/~sherman/8054307/string-coding.txt

Thanks,
Sherman