StringBuilder version of java.util.regex.Matcher.append*

Jeremy Manson jeremymanson at google.com
Tue Mar 25 21:07:59 UTC 2014


Okay.  Thanks, Sherman.  Here's an updated version.

I've diverged a bit from Peter's version.  In this version,
appendExpandedReplacement takes a StringBuilder.  The implications is that
In the StringBuilder case, it saves creating a new StringBuilder object.
 In the StringBuffer case, it creates a new StringBuilder, but it doesn't
have to acquire and release all of those locks.

I also noticed a redundant cast to (int), which I removed.

Jeremy


On Fri, Mar 21, 2014 at 7:34 PM, Xueming Shen <xueming.shen at oracle.com>wrote:

>  let's add the StringBuilder method(s), if you can provide an updated
> version, I can run the rest (since it's
> to add new api, there is an internal CCC process need to go through).
>
> -Sherman
>
>
> On 3/21/14 5:18 PM, Jeremy Manson wrote:
>
> So, this is all a little opaque to me.  How do we make the go/no-go
> decision on something like this?  Everyone who has chimed in seems to think
> it is a good idea.
>
>  Jeremy
>
>
> On Thu, Mar 20, 2014 at 10:38 AM, Jeremy Manson <jeremymanson at google.com>wrote:
>
>> Sherman,
>>
>>  If you had released it then (which you wouldn't have been able to do,
>> because you would have to wait another two years for Java 7), you would
>> have found that it improved performance even with C2.  It is only
>> post-escape-analysis that the performance in C2 equalized.
>>
>>  Anyway, I think adding the StringBuilder variant and deferring /
>> dealing with the Appendable differently is the right approach, FWIW.
>>
>>  Jeremy
>>
>>
>> On Thu, Mar 20, 2014 at 10:25 AM, Xueming Shen <xueming.shen at oracle.com>wrote:
>>
>>> 2009? I do have something similar back to 2009 :-)
>>>
>>> http://cr.openjdk.java.net/~sherman/regex_replace/webrev/
>>>
>>> Then the ball was dropped around the discussion of whether or not
>>> the IOE should be thrown.
>>>
>>> But if we are going to/have to have explicit StringBuilder/Buffer pair
>>> anyway, then we can keep the Appendable version as private for now
>>> and deal with the StringBuilder and Appendable as two separate
>>> issues.
>>>
>>> -Sherman
>>>
>>>
>>> On 03/20/2014 09:52 AM, Jeremy Manson wrote:
>>>
>>>> That's definitely an improvement - I think that when I wrote this (circa
>>>> 2009), I didn't think about Appendable.
>>>>
>>>> I take it my argument convinced someone?  :)
>>>>
>>>> Jeremy
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 1:32 AM, Peter Levart<peter.levart at gmail.com
>>>> >wrote:
>>>>
>>>>  On 03/19/2014 06:51 PM, Jeremy Manson wrote:
>>>>>
>>>>>  I'm told that the diff didn't make it.  I've put it in a Google drive
>>>>>> folder...
>>>>>>
>>>>>> https://drive.google.com/file/d/0B_GaXa6O4K5LY3Y0aHpranM3aEU/
>>>>>> edit?usp=sharing
>>>>>>
>>>>>> Jeremy
>>>>>>
>>>>>>  Hi Jeremy,
>>>>>
>>>>> Your factoring-out of expandReplacement() method exposed an
>>>>> opportunity to
>>>>> further optimize the code. Instead of creating intermediate
>>>>> StringBuilder
>>>>> instance for each expandReplacement() call, this method could append
>>>>> directly to resulting StringBuffer/StringBuilder, like in the
>>>>> following:
>>>>>
>>>>> http://cr.openjdk.java.net/~plevart/jdk9-dev/MatcherWithStringBuilder/
>>>>> webrev.01/
>>>>>
>>>>>
>>>>> Regards, Peter
>>>>>
>>>>>
>>>>>
>>>>>  On Wed, Mar 19, 2014 at 10:41 AM, Jeremy Manson<
>>>>>> jeremymanson at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>   Hi folks,
>>>>>>
>>>>>>> We've had this internally for a while, and I keep meaning to bring
>>>>>>> it up
>>>>>>> here.  The Matcher class has a few public methods that take
>>>>>>> StringBuffers,
>>>>>>> and we've found it useful to add similar versions that take
>>>>>>> StringBuilders.
>>>>>>>
>>>>>>> It has two benefits:
>>>>>>>
>>>>>>> - Users don't have to convert from one to the other when they want
>>>>>>> to use
>>>>>>> the method in question.  The symmetry is nice.
>>>>>>>
>>>>>>> - The StringBuilder variants are faster (if lock optimizations don't
>>>>>>> kick
>>>>>>> in, which happens in the interpreter and the client compiler).  For
>>>>>>> interpreted / client-compiled code, we saw something like a 25%
>>>>>>> speedup
>>>>>>> on
>>>>>>> String.replaceAll(), which calls into this code.
>>>>>>>
>>>>>>> Any interest?  Diff attached.
>>>>>>>
>>>>>>> Jeremy
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>
>
>
-------------- next part --------------
diff --git a/src/share/classes/java/util/regex/Matcher.java b/src/share/classes/java/util/regex/Matcher.java
--- a/src/share/classes/java/util/regex/Matcher.java
+++ b/src/share/classes/java/util/regex/Matcher.java
@@ -65,7 +65,8 @@
  * new strings whose contents can, if desired, be computed from the match
  * result.  The {@link #appendReplacement appendReplacement} and {@link
  * #appendTail appendTail} methods can be used in tandem in order to collect
- * the result into an existing string buffer, or the more convenient {@link
+ * the result into an existing string buffer or
+ * string builder.  Alternatively, the more convenient {@link
  * #replaceAll replaceAll} method can be used to create a string in which every
  * matching subsequence in the input sequence is replaced.
  *
@@ -792,14 +793,118 @@
      *          that does not exist in the pattern
      */
     public Matcher appendReplacement(StringBuffer sb, String replacement) {
-
         // If no match, return error
         if (first < 0)
             throw new IllegalStateException("No match available");
 
-        // Process substitution string to replace group references with groups
+        StringBuilder result = new StringBuilder();
+        appendExpandedReplacement(replacement, result);
+
+        // Append the intervening text
+        sb.append(text, lastAppendPosition, first);
+        // Append the match substitution
+        sb.append(result);
+
+        lastAppendPosition = last;
+        return this;
+    }
+
+    /**
+     * Implements a non-terminal append-and-replace step.
+     *
+     * <p> This method performs the following actions: </p>
+     *
+     * <ol>
+     *
+     *   <li><p> It reads characters from the input sequence, starting at the
+     *   append position, and appends them to the given string builder.  It
+     *   stops after reading the last character preceding the previous match,
+     *   that is, the character at index {@link
+     *   #start()} <tt>-</tt> <tt>1</tt>.  </p></li>
+     *
+     *   <li><p> It appends the given replacement string to the string builder.
+     *   </p></li>
+     *
+     *   <li><p> It sets the append position of this matcher to the index of
+     *   the last character matched, plus one, that is, to {@link #end()}.
+     *   </p></li>
+     *
+     * </ol>
+     *
+     * <p> The replacement string may contain references to subsequences
+     * captured during the previous match: Each occurrence of
+     * <tt>$</tt><i>g</i><tt></tt> will be replaced by the result of
+     * evaluating {@link #group(int) group}<tt>(</tt><i>g</i><tt>)</tt>.
+     * The first number after the <tt>$</tt> is always treated as part of
+     * the group reference. Subsequent numbers are incorporated into g if
+     * they would form a legal group reference. Only the numerals '0'
+     * through '9' are considered as potential components of the group
+     * reference. If the second group matched the string <tt>"foo"</tt>, for
+     * example, then passing the replacement string <tt>"$2bar"</tt> would
+     * cause <tt>"foobar"</tt> to be appended to the string builder. A dollar
+     * sign (<tt>$</tt>) may be included as a literal in the replacement
+     * string by preceding it with a backslash (<tt>\$</tt>).
+     *
+     * <p> Note that backslashes (<tt>\</tt>) and dollar signs (<tt>$</tt>) in
+     * the replacement string may cause the results to be different than if it
+     * were being treated as a literal replacement string. Dollar signs may be
+     * treated as references to captured subsequences as described above, and
+     * backslashes are used to escape literal characters in the replacement
+     * string.
+     *
+     * <p> This method is intended to be used in a loop together with the
+     * {@link #appendTail appendTail} and {@link #find find} methods.  The
+     * following code, for example, writes <tt>one dog two dogs in the
+     * yard</tt> to the standard-output stream: </p>
+     *
+     * <blockquote><pre>
+     * Pattern p = Pattern.compile("cat");
+     * Matcher m = p.matcher("one cat two cats in the yard");
+     * StringBuilder sb = new StringBuilder();
+     * while (m.find()) {
+     *     m.appendReplacement(sb, "dog");
+     * }
+     * m.appendTail(sb);
+     * System.out.println(sb.toString());</pre></blockquote>
+     *
+     * @param  sb
+     *         The target string builder
+     *
+     * @param  replacement
+     *         The replacement string
+     *
+     * @return  This matcher
+     *
+     * @throws  IllegalStateException
+     *          If no match has yet been attempted,
+     *          or if the previous match operation failed
+     *
+     * @throws  IndexOutOfBoundsException
+     *          If the replacement string refers to a capturing group
+     *          that does not exist in the pattern
+     */
+    public Matcher appendReplacement(StringBuilder sb, String replacement) {
+        // If no match, return error
+        if (first < 0)
+            throw new IllegalStateException("No match available");
+
+        // Append the intervening text
+        sb.append(text, lastAppendPosition, first);
+
+        // Append the match substitution
+        appendExpandedReplacement(replacement, sb);
+
+        lastAppendPosition = last;
+        return this;
+    }
+
+    /**
+     * Processes replacement string to replace group references with
+     * groups.
+     */
+    private StringBuilder appendExpandedReplacement(
+        String replacement, StringBuilder result) {
         int cursor = 0;
-        StringBuilder result = new StringBuilder();
 
         while (cursor < replacement.length()) {
             char nextChar = replacement.charAt(cursor);
@@ -852,8 +957,8 @@
                     cursor++;
                 } else {
                     // The first number is always a group
-                    refNum = (int)nextChar - '0';
-                    if ((refNum < 0)||(refNum > 9))
+                    refNum = nextChar - '0';
+                    if ((refNum < 0) || (refNum > 9))
                         throw new IllegalArgumentException(
                             "Illegal group reference");
                     cursor++;
@@ -864,7 +969,7 @@
                             break;
                         }
                         int nextDigit = replacement.charAt(cursor) - '0';
-                        if ((nextDigit < 0)||(nextDigit > 9)) { // not a number
+                        if ((nextDigit < 0) || (nextDigit > 9)) { // not a number
                             break;
                         }
                         int newRefNum = (refNum * 10) + nextDigit;
@@ -884,13 +989,7 @@
                 cursor++;
             }
         }
-        // Append the intervening text
-        sb.append(text, lastAppendPosition, first);
-        // Append the match substitution
-        sb.append(result);
-
-        lastAppendPosition = last;
-        return this;
+        return result;
     }
 
     /**
@@ -913,6 +1012,25 @@
     }
 
     /**
+     * Implements a terminal append-and-replace step.
+     *
+     * <p> This method reads characters from the input sequence, starting at
+     * the append position, and appends them to the given string builder.  It is
+     * intended to be invoked after one or more invocations of the {@link
+     * #appendReplacement appendReplacement} method in order to copy the
+     * remainder of the input sequence.  </p>
+     *
+     * @param  sb
+     *         The target string builder
+     *
+     * @return  The target string builder
+     */
+    public StringBuilder appendTail(StringBuilder sb) {
+        sb.append(text, lastAppendPosition, getTextLength());
+        return sb;
+    }
+
+    /**
      * Replaces every subsequence of the input sequence that matches the
      * pattern with the given replacement string.
      *
@@ -950,7 +1068,7 @@
         reset();
         boolean result = find();
         if (result) {
-            StringBuffer sb = new StringBuffer();
+            StringBuilder sb = new StringBuilder();
             do {
                 appendReplacement(sb, replacement);
                 result = find();
@@ -1000,7 +1118,7 @@
         reset();
         if (!find())
             return text.toString();
-        StringBuffer sb = new StringBuffer();
+        StringBuilder sb = new StringBuilder();
         appendReplacement(sb, replacement);
         appendTail(sb);
         return sb.toString();


More information about the core-libs-dev mailing list