RFR: 8278518: String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination [v3]

Thu Jan 13 18:21:25 UTC 2022

On Thu, 13 Jan 2022 10:24:15 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> In the micro benchmark for this bug, the bytecode has a single loop
>> head with 2 backedges. One branch is taken a lot more frequently than
>> the other (for instance, profiling reports 3820938 for the most
>> common, 134068 for the least common). When ciTypeFlow builds its loop
>> tree, the least common loop ends up as the inner loop (ciTypeFlow uses
>> the tail block number to order loops with a shared header). Then
>> ciTypeFlow clones the inner loop head.
>> 
>> The results I get then is:
>> 
>> StringBenchmark.safeDecoding    avgt    5  141.965 ± 53.511  ns/op
>> StringBenchmark.unsafeDecoding  avgt    5  123.051 ± 77.855  ns/op
>> 
>> The unsafe version uses unsafe instead of standard array accesses. The
>> reason for the unsafe version is to demonstrate that a missed range
>> check elimination in the common path hurts performance (the range
>> check is not performed on the loop iv but another phi so can't be
>> eliminated).
>> 
>> The patch I propose takes back branch count (from profile data) into
>> account when building the ciTypeFlow tree. As a consequence, the inner
>> loop (the one for which the loop head is cloned), is the most most
>> frequent one. The range check in the common path is also hoisted in
>> that case (it's now performed on the iv). Running the micro benchmark
>> again, I get:
>> 
>> StringBenchmark.safeDecoding    avgt    5  53.368 ± 2.232  ns/op
>> StringBenchmark.unsafeDecoding  avgt    5  55.565 ± 5.124  ns/op
>> 
>> Both are improved (2-3x) and the unsafe version doesn't perform better.
>
> Roland Westrelin has updated the pull request incrementally with four additional commits since the last revision:
> 
>  - whitespaces
>  - more review
>  - simple benchmark
>  - benchmark

This is a good improvement.  I agree that the type-flow pass "plays dumb" about graph structure; it was originally designed even "dumber".  The idea was that following phases could arrange loop structure all by themselves. But this gave up some early optimizations in the parsing phase, which is too early for loop inference.  So type-flow has gotten "smarter" over the years, to give the parsing phase more foresight about loop structure.  This new improvement is appropriate to that design.

As a matter of style and readability I suggest threading the `ciMethod` pointer (which is the root for getting profile info out) through instance variables rather than random extra arguments.  The problem with random extra arguments is they could be any (random) `ciMethod` but we know it's only the statically relevant method to the whole `ciTypeFlow` pass.  You can't tell from reading the code (except as a whole) what that random argument is, but you *can* if the `ciMethod` in question is the one stored in `ciTypeFlow::_method`.

So, I'd use the inner/outer class design pattern, where `Loop` gets an `_outer` link that points to the containing `ciTypeFlow`, which in turn has the `_method` field you need.  Just add a constructor argument to Loop to set up the `_outer` link and you don't need the ad hoc extra argument to `sorted_merge`.

I notice that you cleaned up `clone_loop_heads` in the same way, bravo.

I don't fully follow the open-coded sorting logic in `sorted_merge`, and I wish it could be made more clear.  The expression `current_count == lp_count` is particularly puzzling to me; either it should be a `<=` or `>=` comparison, or it is some sort of tie-breaking logic.  The role of all of those comparisons would be easier to follow if there were (I don't know if it's possible?) a separately factored (partial) ordering function, as a static helper function.  The problem with open-coding the comparison logic in an insertion sort algorithm, as you have here, is that if you have multiple sorting keys, it's really hard to verify that they are correctly applied.

Also (minor nit) you use the pattern `if (pred1) { break; }; if (pred2) { break; }` and also the equivalent pattern `if (pred1) { break; } else if (pred2) { break; }`.  I prefer the first, and would find the code more readable if you just picked one or the other in any case.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7034