RFR: 8265029: Preserve SIZED characteristics on slice operations (skip, limit) [v7]

Mon May 24 21:57:08 UTC 2021

On Mon, 24 May 2021 04:35:42 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:

>> With the introduction of `toList()`, preserving the SIZED characteristic in more cases becomes more important. This patch preserves SIZED on `skip()` and `limit()` operations, so now every combination of `map/mapToX/boxed/asXyzStream/skip/limit/sorted` preserves size, and `toList()`, `toArray()` and `count()` may benefit from this. E.g., `LongStream.range(0, 10_000_000_000L).skip(1).count()` returns the result instantly with this patch.
>> 
>> Some microbenchmarks added that confirm the reduced memory allocation in `toList()` and `toArray()` cases. Before patch:
>> 
>> ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10   40235,534 ±     0,984    B/op
>> ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  106431,101 ±     0,198    B/op
>> ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  106544,977 ±     1,983    B/op
>> value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10   40121,878 ±     0,247    B/op
>> value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  106317,693 ±     1,083    B/op
>> value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  106430,954 ±     0,136    B/op
>> 
>> 
>> After patch:
>> 
>> ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10  40235,648 ±     1,354    B/op
>> ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  40355,784 ±     1,288    B/op
>> ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  40476,032 ±     2,855    B/op
>> value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10  40121,830 ±     0,308    B/op
>> value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  40242,554 ±     0,443    B/op
>> value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  40363,674 ±     1,576    B/op
>> 
>> 
>> Time improvements are less exciting. It's likely that inlining and vectorizing dominate in these tests over array allocations and unnecessary copying. Still, I notice a significant improvement in the SliceToArray.seq_limit case (2x) and a mild improvement (+12..16%) in the other slice tests. There is no significant change in parallel execution time, though its performance is much less stable and I didn't run enough tests.
>> 
>> Before patch:
>> 
>> Benchmark                         (size)   Mode  Cnt      Score     Error  Units
>> ref.SliceToList.par_baseline       10000  thrpt   30  14876,723 ±  99,770  ops/s
>> ref.SliceToList.par_limit          10000  thrpt   30  14856,841 ± 215,089  ops/s
>> ref.SliceToList.par_skipLimit      10000  thrpt   30   9555,818 ± 991,335  ops/s
>> ref.SliceToList.seq_baseline       10000  thrpt   30  23732,290 ± 444,162  ops/s
>> ref.SliceToList.seq_limit          10000  thrpt   30  14894,040 ± 176,496  ops/s
>> ref.SliceToList.seq_skipLimit      10000  thrpt   30  10646,929 ±  36,469  ops/s
>> value.SliceToArray.par_baseline    10000  thrpt   30  25093,141 ± 376,402  ops/s
>> value.SliceToArray.par_limit       10000  thrpt   30  24798,889 ± 760,762  ops/s
>> value.SliceToArray.par_skipLimit   10000  thrpt   30  16456,310 ± 926,882  ops/s
>> value.SliceToArray.seq_baseline    10000  thrpt   30  69669,787 ± 494,562  ops/s
>> value.SliceToArray.seq_limit       10000  thrpt   30  21097,081 ± 117,338  ops/s
>> value.SliceToArray.seq_skipLimit   10000  thrpt   30  15522,871 ± 112,557  ops/s
>> 
>> 
>> After patch:
>> 
>> Benchmark                         (size)   Mode  Cnt      Score      Error  Units
>> ref.SliceToList.par_baseline       10000  thrpt   30  14793,373 ±   64,905  ops/s
>> ref.SliceToList.par_limit          10000  thrpt   30  13301,024 ± 1300,431  ops/s
>> ref.SliceToList.par_skipLimit      10000  thrpt   30  11131,698 ± 1769,932  ops/s
>> ref.SliceToList.seq_baseline       10000  thrpt   30  24101,048 ±  263,528  ops/s
>> ref.SliceToList.seq_limit          10000  thrpt   30  16872,168 ±   76,696  ops/s
>> ref.SliceToList.seq_skipLimit      10000  thrpt   30  11953,253 ±  105,231  ops/s
>> value.SliceToArray.par_baseline    10000  thrpt   30  25442,442 ±  455,554  ops/s
>> value.SliceToArray.par_limit       10000  thrpt   30  23111,730 ± 2246,086  ops/s
>> value.SliceToArray.par_skipLimit   10000  thrpt   30  17980,750 ± 2329,077  ops/s
>> value.SliceToArray.seq_baseline    10000  thrpt   30  66512,898 ± 1001,042  ops/s
>> value.SliceToArray.seq_limit       10000  thrpt   30  41792,549 ± 1085,547  ops/s
>> value.SliceToArray.seq_skipLimit   10000  thrpt   30  18007,613 ±  141,716  ops/s
>> 
>> 
>> I also modernized SliceOps a little bit, using switch expression (with no explicit default!) and diamonds on anonymous classes.
>
> Tagir F. Valeev has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Trailing whitespace removed

Very good. Thanks for making the adjustments. Architecturally, I think we are in a better place. I just have some comments, mostly around code comments.

src/java.base/share/classes/java/util/stream/AbstractPipeline.java line 471:

> 469:         int flags = getStreamAndOpFlags();
> 470:         long size = StreamOpFlag.SIZED.isKnown(flags) ? spliterator.getExactSizeIfKnown() : -1;
> 471:         if (size != -1 && StreamOpFlag.SIZE_ADJUSTING.isKnown(flags) && !isParallel()) {

Very nice. Supporting this only for sequential streams is a good compromise, since we currently have no size-adjusting stateless intermediate op. If we had one, we would need to step back through the pipeline until the depth was zero, then step forward. I think it is worth a comment here to inform our future selves if we ever add such an operation.

Strictly speaking, we only need to call `exactOutputSize` if the stage is size adjusting. I'm not sure it really matters perf-wise. If we leave it as is, maybe add a comment.
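
To illustrate the point about calling `exactOutputSize` only on size-adjusting stages, here is a minimal sketch of the idea. It is not the code in `AbstractPipeline`: the `Stage` interface, the `slice` and `sizePreserving` helpers, and `exactOutputSizeIfKnown` are hypothetical names invented for this example, and a negative `limit` is assumed to mean "no limit".

```java
import java.util.List;

class ExactSizeSketch {
    /** Hypothetical stand-in for a pipeline stage; not the JDK's AbstractPipeline. */
    interface Stage {
        boolean isSizeAdjusting();
        long exactOutputSize(long previousSize);
    }

    /** A skip/limit stage; a negative limit is assumed to mean "no limit". */
    static Stage slice(long skip, long limit) {
        return new Stage() {
            public boolean isSizeAdjusting() { return true; }
            public long exactOutputSize(long previousSize) {
                long remaining = Math.max(0, previousSize - skip);
                return limit >= 0 ? Math.min(limit, remaining) : remaining;
            }
        };
    }

    /** A stage such as map() that leaves the element count unchanged. */
    static Stage sizePreserving() {
        return new Stage() {
            public boolean isSizeAdjusting() { return false; }
            public long exactOutputSize(long previousSize) { return previousSize; }
        };
    }

    /** Walks the stages in order, source first; returns -1 if the source size is unknown. */
    static long exactOutputSizeIfKnown(long sourceSize, List<Stage> stages) {
        if (sourceSize == -1) {
            return -1;
        }
        long size = sourceSize;
        for (Stage stage : stages) {
            // Only size-adjusting stages need to be consulted.
            if (stage.isSizeAdjusting()) {
                size = stage.exactOutputSize(size);
            }
        }
        return size;
    }
}
```

In this model, skipping the call for non-adjusting stages only saves a virtual call per stage, which is why I doubt it matters in practice.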

src/java.base/share/classes/java/util/stream/StreamOpFlag.java line 331:

> 329: 
> 330:     /**
> 331:      * Characteristic value signifying that an operation may adjust the

I think we need to add two additional constraints to the documentation:
1. The flag, if present, is only valid when SIZED is present; and
2. The flag is only valid for sequential streams.
The latter is a good compromise given we currently have no size adjusting stateless intermediate op.
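
The two constraints could be stated as a single predicate, mirroring the condition at line 471 of `AbstractPipeline.java` quoted above. This is only a sketch: the bit values and the `canAdjustSize` helper are hypothetical and do not reflect the actual encoding used by `StreamOpFlag`.

```java
class SizeAdjustingFlagSketch {
    // Hypothetical bit values for illustration only; StreamOpFlag encodes flags differently.
    static final int SIZED = 1;
    static final int SIZE_ADJUSTING = 1 << 1;

    /**
     * The size-adjusting flag is honoured only when SIZED is also present
     * and the stream is sequential.
     */
    static boolean canAdjustSize(int flags, boolean parallel) {
        boolean sized = (flags & SIZED) != 0;
        boolean adjusting = (flags & SIZE_ADJUSTING) != 0;
        return sized && adjusting && !parallel;
    }
}
```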

-------------

PR: https://git.openjdk.java.net/jdk/pull/3427