RFR: 8215017: Improve String::equals warmup characteristics
Hi, following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals. Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017 - folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels - Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination. - what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-) This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level. I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2]. This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think. Thanks! /Claes [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht... [2] ========== Baseline ================= -Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op -XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op -XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op ========== Fix ====================== -Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op -XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op -XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
Hi, Tobias weighed in on this in another thread[1], and while he thinks the proposed patch is semantically correct, concerns was raised that maybe the UTF16 intrinsics could be superior (on some platforms). I ran the microbenchmark below as well as existing string-density- benchmark[2] suite, noting no statistically significant differences for peak performance on x64_86 (Windows, Linux, Mac) and SPARC T4 through M7. Warmup improvements are similar across all platforms. So from our point of view things look green. However, I have no means of testing the intrinsics on other platforms (S390, aarch64, ppc), so it'd be much appreciated if performance could be verified on those platforms using the proposed patch and benchmark. By inspecting code it seems the difference should be negligible or even positive, e.g., on aarch64 there's a trailing comparison that is elided when treating the byte[] as a char[] - overhead that is possibly offset entirely by removing an extra branch before going into the intrinsics. Thanks! /Claes [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057240.ht... [2] http://cr.openjdk.java.net/~shade/density/string-density-bench.zip (had to remove hg maven plugin, reference to sun.misc.Version and update JMH version for this to build and run on latest JDK) On 2018-12-08 01:11, Claes Redestad wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
Claes, make sure you've mind-melded with Aleksey on this, e.g. via https://www.youtube.com/watch?v=wIyeOaitmWM On Fri, Dec 7, 2018 at 4:07 PM Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
Hi, please review this improved version, which drops the new sameCoder() method and insteads inlines that test in the one place where its use is performance sensitive. Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.01/ Microbenchmark show perfectly matching results, except an additional 8-20% improvement in interpreter results (which means a more noticeable improvement in startup/warmup tests): Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 5 740.164 ± 2.586 ns/op StringEquals.almostEqualUTF16 avgt 5 707.969 ± 2.796 ns/op StringEquals.different avgt 5 357.230 ± 9.751 ns/op StringEquals.differentCoders avgt 5 287.026 ± 2.112 ns/op StringEquals.equal avgt 5 762.012 ± 5.500 ns/op StringEquals.equalsUTF16 avgt 5 767.058 ± 4.619 ns/op Thanks! /Claes On 2018-12-08 01:11, Claes Redestad wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
I'm confused. This change seems to remove the one and only call to StringUTF16.equals, making that dead code? Looking more closely at StringUTF16.equals and StringLatin1.equals, both seem to boil down to equivalent """do the byte arrays contain the same bytes?"""? Which suggests we can coalesce these two methods and have String.equals simply check that the two coders and bytes are the same? Arrays.equals(byte[],byte[]) is also an intrinsic but has more checks. Hmmmm ... there's been performance work on ArraysSupport.mismatch. Maybe we're not using it for Strings because we expect Strings to be short? I don't know if hotspot these days needs to keep intrinsics around for library code that has been deleted. On Thu, Apr 11, 2019 at 7:09 AM Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
please review this improved version, which drops the new sameCoder() method and insteads inlines that test in the one place where its use is performance sensitive.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.01/
Microbenchmark show perfectly matching results, except an additional 8-20% improvement in interpreter results (which means a more noticeable improvement in startup/warmup tests):
Benchmark Mode Cnt Score Error Units StringEquals.almostEqual avgt 5 740.164 ± 2.586 ns/op StringEquals.almostEqualUTF16 avgt 5 707.969 ± 2.796 ns/op StringEquals.different avgt 5 357.230 ± 9.751 ns/op StringEquals.differentCoders avgt 5 287.026 ± 2.112 ns/op StringEquals.equal avgt 5 762.012 ± 5.500 ns/op StringEquals.equalsUTF16 avgt 5 767.058 ± 4.619 ns/op
Thanks!
/Claes
On 2018-12-08 01:11, Claes Redestad wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
On 2019-04-11 18:06, Martin Buchholz wrote:
I'm confused. This change seems to remove the one and only call to StringUTF16.equals, making that dead code? Looking more closely at StringUTF16.equals and StringLatin1.equals, both seem to boil down to equivalent """do the byte arrays contain the same bytes?"""? Which suggests we can coalesce these two methods and have String.equals simply check that the two coders and bytes are the same? Arrays.equals(byte[],byte[]) is also an intrinsic but has more checks. Hmmmm ... there's been performance work on ArraysSupport.mismatch. Maybe we're not using it for Strings because we expect Strings to be short? I don't know if hotspot these days needs to keep intrinsics around for library code that has been deleted.
Yes, this'd effectively kill off the need for StringUTF16.equals. I don't think we need to hold on to the intrinsic, but would prefer deferring that cleanup to a follow-up. I noted earlier that Arrays.equals would work, but has quite different peak and warmup behavior - some of it good, some of it bad (especially during startup/warmup). As this is a performance-sensitive area I'd prefer taking steps that are all-round improvements. /Claes
On Dec 7, 2018, at 4:11 PM, Claes Redestad <claes.redestad@oracle.com> wrote:
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
If they are semantically neutral then they belong in the same place as the neutral intrinsics, which is jdk.internal.util.ArraysSupport. This in turn calls the question of how the operation differs from the existing mismatch intrinsic. I'll go out on a limb here and say that the standard ArraysSupport.mismatch intrinsic can probably be tuned to cover string compares without hurting its other customers. Often such intrinsics are only engineered for throughput on long inputs. Later on we may find they can be additionally tuned for other workloads, by adjusting the setup logic, or (alternatively) they can be factored into multiple entry points for specific use cases, with complete sharing of assembly-level code except for differing setup logic for the different use cases. This sort of thing is the case with our arraycopy intrinsics (used by the JIT) and may well also be the case for array matching intrinsics. I agree the above questions can be answered in a separate investigation. — John
Array copy and array compare seem like very similar optimization problems and could even share some infrastructure. On Thu, Apr 11, 2019 at 9:23 AM John Rose <john.r.rose@oracle.com> wrote:
On Dec 7, 2018, at 4:11 PM, Claes Redestad <claes.redestad@oracle.com> wrote:
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
If they are semantically neutral then they belong in the same place as the neutral intrinsics, which is jdk.internal.util.ArraysSupport. This in turn calls the question of how the operation differs from the existing mismatch intrinsic.
I'll go out on a limb here and say that the standard ArraysSupport.mismatch intrinsic can probably be tuned to cover string compares without hurting its other customers. Often such intrinsics are only engineered for throughput on long inputs. Later on we may find they can be additionally tuned for other workloads, by adjusting the setup logic, or (alternatively) they can be factored into multiple entry points for specific use cases, with complete sharing of assembly-level code except for differing setup logic for the different use cases. This sort of thing is the case with our arraycopy intrinsics (used by the JIT) and may well also be the case for array matching intrinsics.
I agree the above questions can be answered in a separate investigation.
— John
+1
On Dec 7, 2018, at 8:11 PM, Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
+1
On Dec 7, 2018, at 8:11 PM, Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
following up from discussions during review of JDK-8214971[1], I examined the startup and peak performance of a few different variant of writing String::equals.
Webrev: http://cr.openjdk.java.net/~redestad/8215017/jdk.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8215017
- folding coder() == aString.coder() into sameCoder(aString) helps interpreter without adversely affecting higher optimization levels
- Jim's proposal to use Arrays.equals is _interesting_: it improves peak performance on some inputs but regress it on others. I'll defer that to a future RFE as it needs a more thorough examination.
- what we can do is simplify to only use StringLatin1.equals. If I'm not completely mistaken these are all semantically neutral (and StringUTF16.equals effectively redundant). If I'm completely wrong here I'll welcome the learning opportunity. :-)
This removes a branch and two method calls, and for UTF16 Strings we'll use a simpler algorithm early, which turns out to be beneficial during interpreter and C1 level.
I added a simple microbenchmark to explore this, results show 1.2-2.5x improvements in interpreter performance, while remaining perfectly neutral results for optimized code on this simple micro[2].
This could be extended to clean up and move StringLatin1.equals back into String and remove StringUTF16, but we'd also need to rearrange the intrinsics on the VM side. Let me know what you think.
Thanks!
/Claes
[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-December/057162.ht...
[2] ========== Baseline =================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 968.640 ± 1.337 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 2082.007 ± 5.303 ns/op StringEquals.equalsDifferent avgt 4 583.166 ± 29.461 ns/op StringEquals.equalsDifferentCoders avgt 4 422.993 ± 1.291 ns/op StringEquals.equalsEqual avgt 4 988.671 ± 1.492 ns/op StringEquals.equalsEqualsUTF16 avgt 4 2103.060 ± 5.705 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.896 ± 0.089 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.935 ± 0.562 ns/op StringEquals.equalsDifferent avgt 4 15.086 ± 0.044 ns/op StringEquals.equalsDifferentCoders avgt 4 12.572 ± 0.008 ns/op StringEquals.equalsEqual avgt 4 25.143 ± 0.025 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.148 ± 0.021 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.539 ± 0.127 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.638 ± 0.047 ns/op StringEquals.equalsDifferent avgt 4 13.930 ± 0.835 ns/op StringEquals.equalsDifferentCoders avgt 4 13.836 ± 0.025 ns/op StringEquals.equalsEqual avgt 4 26.420 ± 0.020 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.889 ± 0.037 ns/op
========== Fix ======================
-Xint Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 811.859 ± 8.663 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 802.784 ± 352.884 ns/op StringEquals.equalsDifferent avgt 4 431.837 ± 1.884 ns/op StringEquals.equalsDifferentCoders avgt 4 358.244 ± 1.208 ns/op StringEquals.equalsEqual avgt 4 832.056 ± 3.541 ns/op StringEquals.equalsEqualsUTF16 avgt 4 832.434 ± 3.516 ns/op
-XX:+CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 23.906 ± 0.151 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 23.905 ± 0.123 ns/op StringEquals.equalsDifferent avgt 4 15.088 ± 0.023 ns/op StringEquals.equalsDifferentCoders avgt 4 12.575 ± 0.030 ns/op StringEquals.equalsEqual avgt 4 25.149 ± 0.059 ns/op StringEquals.equalsEqualsUTF16 avgt 4 25.149 ± 0.033 ns/op
-XX:-CompactStrings Benchmark Mode Cnt Score Error Units StringEquals.equalsAlmostEqual avgt 4 24.521 ± 0.050 ns/op StringEquals.equalsAlmostEqualUTF16 avgt 4 22.639 ± 0.035 ns/op StringEquals.equalsDifferent avgt 4 13.831 ± 0.020 ns/op StringEquals.equalsDifferentCoders avgt 4 13.884 ± 0.345 ns/op StringEquals.equalsEqual avgt 4 26.395 ± 0.066 ns/op StringEquals.equalsEqualsUTF16 avgt 4 23.904 ± 0.112 ns/op
participants (5)
-
Claes Redestad
-
James Laskey
-
Jim Laskey
-
John Rose
-
Martin Buchholz