RFR: 8265029: Preserve SIZED characteristics on slice operations (skip, limit)

Sat Apr 10 14:22:36 UTC 2021

On Sat, 10 Apr 2021 06:30:33 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:

> With the introduction of `toList()`, preserving the SIZED characteristics in more cases becomes more important. This patch preserves SIZED on `skip()` and `limit()` operations, so now every combination of `map/mapToX/boxed/asXyzStream/skip/limit/sorted` preserves size, and `toList()`, `toArray()` and `count()` may benefit from this. E. g., `LongStream.range(0, 10_000_000_000L).skip(1).count()` returns result instantly with this patch.
> 
> Some microbenchmarks added that confirm the reduced memory allocation in `toList()` and `toArray()` cases. Before patch:
> ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10   40235,534 ±     0,984    B/op
> ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  106431,101 ±     0,198    B/op
> ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  106544,977 ±     1,983    B/op
> value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10   40121,878 ±     0,247    B/op
> value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  106317,693 ±     1,083    B/op
> value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  106430,954 ±     0,136    B/op
> 
> After patch:
> ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10  40235,648 ±     1,354    B/op
> ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  40355,784 ±     1,288    B/op
> ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  40476,032 ±     2,855    B/op
> value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10  40121,830 ±     0,308    B/op
> value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  40242,554 ±     0,443    B/op
> value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  40363,674 ±     1,576    B/op
> 
> Time improvements are less exciting. It's likely that inlining and vectorizing dominate in these tests over array allocations and unnecessary copying. Still, I notice a significant improvement in SliceToArray.seq_limit case (2x) and mild improvement (+12..16%) in other slice tests. No significant change in parallel execution time, though its performance is much less stable and I didn't run enough tests.
> 
> Before patch:
> Benchmark                         (size)   Mode  Cnt      Score     Error  Units
> ref.SliceToList.par_baseline       10000  thrpt   30  14876,723 ±  99,770  ops/s
> ref.SliceToList.par_limit          10000  thrpt   30  14856,841 ± 215,089  ops/s
> ref.SliceToList.par_skipLimit      10000  thrpt   30   9555,818 ± 991,335  ops/s
> ref.SliceToList.seq_baseline       10000  thrpt   30  23732,290 ± 444,162  ops/s
> ref.SliceToList.seq_limit          10000  thrpt   30  14894,040 ± 176,496  ops/s
> ref.SliceToList.seq_skipLimit      10000  thrpt   30  10646,929 ±  36,469  ops/s
> value.SliceToArray.par_baseline    10000  thrpt   30  25093,141 ± 376,402  ops/s
> value.SliceToArray.par_limit       10000  thrpt   30  24798,889 ± 760,762  ops/s
> value.SliceToArray.par_skipLimit   10000  thrpt   30  16456,310 ± 926,882  ops/s
> value.SliceToArray.seq_baseline    10000  thrpt   30  69669,787 ± 494,562  ops/s
> value.SliceToArray.seq_limit       10000  thrpt   30  21097,081 ± 117,338  ops/s
> value.SliceToArray.seq_skipLimit   10000  thrpt   30  15522,871 ± 112,557  ops/s
> 
> After patch:
> Benchmark                         (size)   Mode  Cnt      Score      Error  Units
> ref.SliceToList.par_baseline       10000  thrpt   30  14793,373 ±   64,905  ops/s
> ref.SliceToList.par_limit          10000  thrpt   30  13301,024 ± 1300,431  ops/s
> ref.SliceToList.par_skipLimit      10000  thrpt   30  11131,698 ± 1769,932  ops/s
> ref.SliceToList.seq_baseline       10000  thrpt   30  24101,048 ±  263,528  ops/s
> ref.SliceToList.seq_limit          10000  thrpt   30  16872,168 ±   76,696  ops/s
> ref.SliceToList.seq_skipLimit      10000  thrpt   30  11953,253 ±  105,231  ops/s
> value.SliceToArray.par_baseline    10000  thrpt   30  25442,442 ±  455,554  ops/s
> value.SliceToArray.par_limit       10000  thrpt   30  23111,730 ± 2246,086  ops/s
> value.SliceToArray.par_skipLimit   10000  thrpt   30  17980,750 ± 2329,077  ops/s
> value.SliceToArray.seq_baseline    10000  thrpt   30  66512,898 ± 1001,042  ops/s
> value.SliceToArray.seq_limit       10000  thrpt   30  41792,549 ± 1085,547  ops/s
> value.SliceToArray.seq_skipLimit   10000  thrpt   30  18007,613 ±  141,716  ops/s
> 
> I also modernized SliceOps a little bit, using switch expression (with no explicit default!) and diamonds on anonymous classes.

test/jdk/java/util/stream/test/org/openjdk/tests/java/util/stream/CountTest.java line 195:

> 193:             assertEquals(Stream.of(1, 2, 3, 4).peek(e -> ai.getAndIncrement())
> 194:                 .parallel().skip(1).limit(2).skip(1).count(), 1);
> 195:             assertEquals(ai.get(), 0);

Does it make sense to extract the method here and launch it with different input streams?

test/jdk/java/util/stream/test/org/openjdk/tests/java/util/stream/SliceOpTest.java line 370:

> 368:         assertNotNull(prefix);
> 369:         assertTrue(parSpliterator.hasCharacteristics(Spliterator.SIZED));
> 370:         assertTrue(parSpliterator.hasCharacteristics(Spliterator.SUBSIZED));

It would be great to integrate the code comment to the assertion message. Then test failure would print the message making it easier to understand the failure.

It would be great to extract `assertHasCharacteristics` method so the failure printed the actual characteristics rather than "expected true got false"

test/micro/org/openjdk/bench/java/util/stream/ops/ref/SliceToList.java line 50:

> 48:     @Benchmark
> 49:     public List<String> seq_baseline() {
> 50:         return IntStream.range(0, size)

Typically you want to move all the constants to state fields to avoid constant folding by the compiler.
The compiler might accidentally use the fact that range start is always 0 and produce a dedicated optimized code for it. 

See https://shipilev.net/blog/2014/java-scala-divided-we-fail/

-------------

PR: https://git.openjdk.java.net/jdk/pull/3427