String concatenation tweaks

Wed Mar 11 21:01:26 UTC 2015

OpenJDK's implementation of String concatenation compiles

   "foo" + bar + "quux" + baz

into essentially the same bytecode as

  new StringBuilder()
    .append("foo")
    .append(bar)
    .append("quux")
    .append(baz)
    .toString()

At Google we have been experimenting successfully with presizing the
StringBuilder to avoid the need for rebuffering, in extensive
consultation with martinrb@ and cushon@.  I have not yet ported the patch
to head, but wanted to bounce the idea off this list before doing so.  Some
key points of interest:

   - It suffices to provide an upper bound on the size, as long as it is
   not much bigger than the real length.  For primitives, we use the maximum
   length of that primitive type's toString: for example, a boolean is
   treated as having length bounded at 5 (the length of "false").
   - Nonconstant Objects, including CharSequences, have their toString
   stored in a local.  For example, "foo" + myStringBuilder would be compiled
   to approximately

   String myStringBuilderToString = myStringBuilder.toString();
   return new StringBuilder(3 + myStringBuilderToString.length())
     .append("foo")
     .append(myStringBuilderToString)
     .toString();

   This is necessary to deal with the possibility of mutation
   mid-expression.  (Nonconstant primitives are also stored in a local to
   preserve evaluation order and avoid mutation, but not converted to
   Strings.  There might be some room for optimization here for primitive
   values coming from final fields or locals.)
   - Some mostly-redundant null checking is necessary to deal with the evil
   edge case where toString() returns null.
   - Taking all the above into account, our benchmarks showed 15% CPU
   improvements and 25% fewer bytes allocated relative to the status quo,
   independent of -XX:+OptimizeStringConcat.
   - While we were at it, in the case of two arguments that are statically
   known to be Strings, our benchmarks show String.concat to be firmly more
   efficient than the StringBuilder, even in the presence of flags like
   -XX:+OptimizeStringConcat.  This is arguably a separate optimization, but
   nonetheless effective; our benchmarks at the time suggested 40% CPU
   improvements and 60% fewer bytes allocated relative to the status quo.
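
To make the upper-bound idea for primitives concrete, here is a minimal
sketch that derives the bound from the longest toString of each integral
primitive type.  The class and method names (PrimitiveBounds,
maxToStringLength) are hypothetical and are not taken from our patch;
float and double are left out because their bounds are not stated above.

```java
final class PrimitiveBounds {
    // Hypothetical helper: upper bound on the length of the toString of a
    // value of the given primitive type.  Not part of the actual patch.
    static int maxToStringLength(Class<?> type) {
        if (type == boolean.class) return String.valueOf(false).length();
        if (type == char.class)    return 1;
        // For integral types the longest rendering is the minimum value,
        // since it has the most digits plus a leading minus sign.
        if (type == byte.class)    return String.valueOf(Byte.MIN_VALUE).length();
        if (type == short.class)   return String.valueOf(Short.MIN_VALUE).length();
        if (type == int.class)     return String.valueOf(Integer.MIN_VALUE).length();
        if (type == long.class)    return String.valueOf(Long.MIN_VALUE).length();
        throw new IllegalArgumentException("no bound defined in this sketch: " + type);
    }

    public static void main(String[] args) {
        Class<?>[] types = {
            boolean.class, char.class, byte.class,
            short.class, int.class, long.class
        };
        for (Class<?> type : types) {
            System.out.println(type + " -> " + maxToStringLength(type));
        }
    }
}
```

The bounds for boolean (5) and int (11) match the figures used elsewhere in
this message.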

So for example, "foo" + myInt + myString + "bar" + myObj would be compiled
to the equivalent of

int myIntTmp = myInt;
String myStringTmp = String.valueOf(myString); // defend against null
// defend against evil toString implementations returning null
String myObjTmp = String.valueOf(String.valueOf(myObj));

return new StringBuilder(
     17 // length of "foo" (3) + max length of myInt (11) + length of "bar" (3)
     + myStringTmp.length()
     + myObjTmp.length())
   .append("foo")
   .append(myIntTmp)
   .append(myStringTmp)
   .append("bar")
   .append(myObjTmp)
   .toString();
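
For anyone who wants to try this by hand, the expansion above can be
wrapped in a small runnable class and compared against ordinary
concatenation.  This is only a sketch: the names PresizedConcatSketch,
EvilToString, presized and plain are hypothetical, and the code is written
out manually rather than generated by the patched compiler.

```java
final class PresizedConcatSketch {
    // Hypothetical class exhibiting the evil edge case: toString() returns null.
    static final class EvilToString {
        @Override
        public String toString() {
            return null;
        }
    }

    // Hand-written equivalent of: "foo" + myInt + myString + "bar" + myObj
    static String presized(int myInt, String myString, Object myObj) {
        int myIntTmp = myInt;
        // defend against null
        String myStringTmp = String.valueOf(myString);
        // defend against toString implementations returning null
        String myObjTmp = String.valueOf(String.valueOf(myObj));

        return new StringBuilder(
                17 // "foo" (3) + max length of an int (11) + "bar" (3)
                + myStringTmp.length()
                + myObjTmp.length())
            .append("foo")
            .append(myIntTmp)
            .append(myStringTmp)
            .append("bar")
            .append(myObjTmp)
            .toString();
    }

    // The same expression as the compiler sees it today, for comparison.
    static String plain(int myInt, String myString, Object myObj) {
        return "foo" + myInt + myString + "bar" + myObj;
    }

    public static void main(String[] args) {
        Object[] objs = { "obj", null, new EvilToString() };
        for (Object obj : objs) {
            String a = presized(42, null, obj);
            String b = plain(42, null, obj);
            System.out.println(a + " equal=" + a.equals(b));
        }
    }
}
```

Both methods should agree on every input, including a null String, a null
Object, and an Object whose toString() returns null; only the initial
capacity of the StringBuilder differs.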

As far as language constraints go, the JLS is (apparently deliberately)
vague about how string concatenation is implemented.  "An implementation
may choose to perform conversion and concatenation in one step to avoid
creating and then discarding an intermediate String object. To increase the
performance of repeated string concatenation, a Java compiler may use the
StringBuffer class or a similar technique to reduce the number of
intermediate String objects that are created by evaluation of an
expression."  We see no reason this approach would not qualify as a
"similar technique."

If these suggestions (and performance numbers) are of interest, I can port
our patch for upstream use.