RFR: 8148937: (str) Adapt StringJoiner for Compact Strings
Сергей Цыпанов
github.com+10835776+stsypanov at openjdk.java.net
Thu Feb 18 20:16:39 UTC 2021
On Thu, 18 Feb 2021 15:19:43 GMT, Claes Redestad <redestad at openjdk.org> wrote:
>> Hello,
>>
>> as of now `java.util.StringJoiner` still uses `char[]` as a storage for joined Strings.
>>
>> This applies for the cases when all joined Strings as well as delimiter, prefix and suffix contain only ASCII symbols.
>>
>> As a result when `StringJoiner.toString()` is called `byte[]` stored in Strings is inflated in order to fill in `char[]` and after that `char[]` is compressed when constructor of String is called:
>> String delimiter = this.delimiter;
>> char[] chars = new char[this.len + addLen];
>> int k = getChars(this.prefix, chars, 0);
>> if (size > 0) {
>> k += getChars(elts[0], chars, k); // inflate byte[]
>>
>> for(int i = 1; i < size; ++i) {
>> k += getChars(delimiter, chars, k);
>> k += getChars(elts[i], chars, k);
>> }
>> }
>>
>> k += getChars(this.suffix, chars, k);
>> return new String(chars); // compress char[] -> byte[]
>> This can be improved by utilizing new method `String.getBytes(byte[], int, int, byte, int)` [introduced](https://github.com/openjdk/jdk/pull/402) in [JDK-8224986](https://bugs.openjdk.java.net/browse/JDK-8254082)
>> covering both cases when resulting String is Latin1 or UTF-16
>>
>> I've prepared a patch along with benchmark proving that this change is correct and brings improvement.
>>
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> @Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
>> public class StringJoinerBenchmark {
>>
>> @Benchmark
>> public String stringJoiner(Data data) {
>> String[] stringArray = data.stringArray;
>> return Joiner.joinWithStringJoiner(stringArray);
>> }
>>
>> @State(Scope.Thread)
>> public static class Data {
>>
>> @Param({"latin", "cyrillic", "mixed"})
>> private String mode;
>>
>> @Param({"8", "32", "64"})
>> private int length;
>>
>> @Param({"5", "10", "100"})
>> private int count;
>>
>> private String[] stringArray;
>>
>> @Setup
>> public void setup() {
>> RandomStringGenerator generator = new RandomStringGenerator();
>>
>> stringArray = new String[count];
>>
>> for (int i = 0; i < count; i++) {
>> String alphabet = getAlphabet(i, mode);
>> stringArray[i] = generator.randomString(alphabet, length);
>> }
>> }
>>
>> private static String getAlphabet(int index, String mode) {
>> var latin = "abcdefghijklmnopqrstuvwxyz"; //English
>> var cyrillic = "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian
>>
>> String alphabet;
>> switch (mode) {
>> case "mixed" -> alphabet = index % 2 == 0 ? cyrillic : latin;
>> case "latin" -> alphabet = latin;
>> case "cyrillic" -> alphabet = cyrillic;
>> default -> throw new RuntimeException("Illegal mode " + mode);
>> }
>> return alphabet;
>> }
>> }
>> }
>>
>>
>> (count) (length) (mode) Java 14 patched Units
>> stringJoiner 5 8 latin 78.836 ± 0.208 67.546 ± 0.500 ns/op
>> stringJoiner 5 32 latin 92.877 ± 0.422 66.760 ± 0.498 ns/op
>> stringJoiner 5 64 latin 115.423 ± 0.883 73.224 ± 0.289 ns/op
>> stringJoiner 10 8 latin 152.587 ± 0.429 161.427 ± 0.635 ns/op
>> stringJoiner 10 32 latin 189.998 ± 0.478 164.099 ± 0.963 ns/op
>> stringJoiner 10 64 latin 238.679 ± 1.419 176.825 ± 0.533 ns/op
>> stringJoiner 100 8 latin 1215.612 ± 17.413 1541.802 ± 126.166 ns/op
>> stringJoiner 100 32 latin 1699.998 ± 28.407 1563.341 ± 4.439 ns/op
>> stringJoiner 100 64 latin 2289.388 ± 45.319 2215.931 ± 137.583 ns/op
>> stringJoiner 5 8 cyrillic 96.692 ± 0.947 80.946 ± 0.371 ns/op
>> stringJoiner 5 32 cyrillic 107.806 ± 0.429 84.717 ± 0.541 ns/op
>> stringJoiner 5 64 cyrillic 150.762 ± 2.267 96.214 ± 1.251 ns/op
>> stringJoiner 10 8 cyrillic 190.570 ± 0.381 182.754 ± 0.678 ns/op
>> stringJoiner 10 32 cyrillic 240.239 ± 1.110 187.991 ± 1.575 ns/op
>> stringJoiner 10 64 cyrillic 382.493 ± 2.692 219.358 ± 0.967 ns/op
>> stringJoiner 100 8 cyrillic 1737.435 ± 16.171 1658.379 ± 56.225 ns/op
>> stringJoiner 100 32 cyrillic 2321.106 ± 13.583 1942.028 ± 3.418 ns/op
>> stringJoiner 100 64 cyrillic 3189.563 ± 10.135 2484.796 ± 7.572 ns/op
>> stringJoiner 5 8 mixed 88.743 ± 0.709 76.049 ± 0.685 ns/op
>> stringJoiner 5 32 mixed 103.425 ± 0.745 81.006 ± 0.184 ns/op
>> stringJoiner 5 64 mixed 146.441 ± 0.763 93.071 ± 0.858 ns/op
>> stringJoiner 10 8 mixed 169.262 ± 0.482 158.890 ± 0.598 ns/op
>> stringJoiner 10 32 mixed 218.566 ± 0.945 170.415 ± 1.172 ns/op
>> stringJoiner 10 64 mixed 361.412 ± 1.546 200.374 ± 0.683 ns/op
>> stringJoiner 100 8 mixed 1594.544 ± 5.761 1476.689 ± 2.004 ns/op
>> stringJoiner 100 32 mixed 2221.333 ± 28.320 1879.071 ± 52.729 ns/op
>> stringJoiner 100 64 mixed 3115.786 ± 22.485 2400.644 ± 4.401 ns/op
>>
>> stringJoiner:·gc.alloc.rate.norm 5 8 latin 248.023 ± 0.001 136.015 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 32 latin 608.045 ± 0.001 256.022 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 64 latin 1088.067 ± 0.002 416.030 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 8 latin 464.044 ± 0.001 264.029 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 32 latin 1184.085 ± 0.003 504.044 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 64 latin 2144.123 ± 0.005 824.065 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 8 latin 3872.367 ± 8.002 2024.269 ± 8.016 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 32 latin 11048.803 ± 7.988 4416.393 ± 0.004 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 64 latin 20657.194 ± 9.775 7640.642 ± 9.819 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 8 cyrillic 360.029 ± 0.001 184.020 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 32 cyrillic 960.060 ± 0.002 424.032 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 64 cyrillic 1760.088 ± 0.003 744.043 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 8 cyrillic 664.063 ± 0.001 352.038 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 32 cyrillic 1864.128 ± 0.003 832.067 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 64 cyrillic 3464.206 ± 0.007 1472.094 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 8 cyrillic 5704.541 ± 0.006 2928.317 ± 8.002 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 32 cyrillic 17705.090 ± 0.045 7720.658 ± 0.007 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 64 cyrillic 33689.855 ± 9.786 14121.042 ± 0.015 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 8 mixed 360.028 ± 0.001 184.016 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 32 mixed 960.059 ± 0.002 424.031 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 5 64 mixed 1760.088 ± 0.003 744.041 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 8 mixed 664.052 ± 0.001 352.035 ± 0.002 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 32 mixed 1864.116 ± 0.005 832.064 ± 0.001 B/op
>> stringJoiner:·gc.alloc.rate.norm 10 64 mixed 3464.196 ± 0.006 1472.088 ± 0.002 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 8 mixed 5672.448 ± 8.002 2920.314 ± 0.006 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 32 mixed 17681.092 ± 9.824 7728.653 ± 8.008 B/op
>> stringJoiner:·gc.alloc.rate.norm 100 64 mixed 33697.818 ± 8.011 14120.977 ± 0.034 B/op
>
> Some of these changes conflict with #2334, which suggest removing the `coder` and `isLatin1` methods from `String`.
>
> As a more general point I think it would be good to explore options that does not increase leakage of the implementation detail that `Strings` are latin1- or utf16-encoded outside of java.lang.
Hi @cl4es,
> Some of these changes conflict with #2334, which suggest removing the `coder` and `isLatin1` methods from `String`.
I've checked out Aleksey's branch and applied my changes onto it, the only thing that I changed to make it work is replacing
public boolean isLatin1(String str) {
return str.isLatin1();
}
with
public boolean isLatin1(String str) {
return str.coder == String.LATIN1;
}
The rest of the code was left intact. `jdk:tier1` is OK after the change.
> As a more general point I think it would be good to explore options that does not increase leakage of the implementation detail that `Strings` are latin1- or utf16-encoded outside of java.lang.
Apart from `JavaLangAccess` the only thing that comes to my mind is reflection, but it will destroy all the improvement. Otherwise I cannot figure out any other way to access somehow package-private latin/non-latin functionality of `j.l.String` in `java.util` package. I wonder, whether I'm missing any other opportunities?
-------------
PR: https://git.openjdk.java.net/jdk/pull/2627
More information about the security-dev
mailing list