RFR: 8319423: Improve Year.isLeap by checking divisibility by 16 [v2]

Cassio Neri duke at openjdk.org
Sun Nov 5 23:36:07 UTC 2023


On Sun, 5 Nov 2023 22:14:50 GMT, Claes Redestad <redestad at openjdk.org> wrote:

>> Thanks for your interest in my work. I'd love to assist porting our algorithms to Java. Notice, however, that I'm not a Java programmer and I don't have the complete picture of what goes on in the JVM. What follows is based on my experience with C/C++ compilers but, I reckon, most of it might apply to Java as well.
>> 
>> When determining whether `year` is leap, it's very reasonable to believe that checking divisibility by 4 first is the best strategy. As I said in [my talk](https://www.youtube.com/watch?v=0s9F4QWAl-E), virtually every implementation I found made that assumption. However, thanks to modern branch predictors, this is not the case, at least on x86_64 CPUs. My experiments showed that checking divisibility by 100 first is the best way:
>> 
>>     if (year % 100 != 0)
>>       return year % 4 == 0;
>>     return year % 400 == 0;
>> 
>> Mathematical results show that, in leap-year calculations, it's correct to replace `year % 400 == 0` with the cheaper expression `year % 16 == 0` (once `year % 100 == 0` is known to hold, divisibility by 400 and divisibility by 16 coincide). Except for [compiler bugs](https://bugs.llvm.org/show_bug.cgi?id=50334), this should always be done. Hence, the implementation that @cl4es  mentioned:
>> 
>>     public static boolean isLeap(long year) {
>>         int d = year % 100 != 0 ? 4 : 16;
>>         return (year & (d - 1)) == 0;
>>     }
>> 
>> In my talk I also said that I did other optimisations around `year % 100 == 0`, but that was another story. Let me tell that story now.
>> 
>> Similar mathematical arguments as above show that it's also correct to replace `year % 100 == 0` with `year % 25 == 0` and [the latter requires one fewer assembly instruction](https://godbolt.org/z/zKza1Yc83). (I've also discussed this topic in a [PR](https://github.com/nakedible/datealgo-rs/discussions/21#discussioncomment-6907992) for the Rust implementation.) However, contrary to the case of 400-and-16, it's not always profitable to replace `year % 100 == 0` with `year % 25 == 0`. It depends on whether the code executed by the CPU contains branches or not. (Despite the usage of the ternary operator `?`, some CPUs support conditional moves which allow the elimination of some branches.)
>> 
>> If there's no branch, then this is probably the best option:
>> 
>>     public static boolean is_leap_v0(long year) {
>>       int d = year % 25 != 0 ? 4 : 16;
>>       return (year & (d - 1)) == 0;
>>     }
>> 
>> 
>> If there's a branch, then I'd expect this to perform better:
>> 
>>     public static boolean is_leap_v0(long year) {
>>       int d = year % 100 != 0 ? 4 ...
>
> Thank you for your comments, @cassioneri 
> 
> For reference I included the following variant in the `isLeapYearNS` microbenchmark:
> 
>         int d = year % 100 != 0 ? 4 : 16;
>         return (year & (d - 1)) == 0;
> 
> and as shown above this underperforms even the baseline implementation when executed on the HotSpot JVM with C2: `0,558 ± 0,026 ops/us` vs `0,743 ± 0,009` for the baseline. My experiments then explored how we could coax the C2 JIT into improving on the baseline, and the patch suggested in this PR is what I ended up with. 
> 
> This might suggest a deficiency in the C2 JIT rather than an issue with your code. But let's dig a bit deeper and analyze the ASM generated on x86. 
> 
> With `isLeapYearNS` we seem to generate something like this for the actual calculation: 
> 
>    ...
>    0.88%   │  │││ ││││││  0x00007f603062d39c:   movabs $0xa3d70a3d70a3d70b,%rax
>    1.47%   │  │││ ││││││  0x00007f603062d3a6:   imul   %rbp
>    1.29%   │  │││ ││││││  0x00007f603062d3a9:   add    %rbp,%rdx
>    1.23%   │  │││ ││││││  0x00007f603062d3ac:   mov    %rbp,%rcx
>    0.41%   │  │││ ││││││  0x00007f603062d3af:   sar    $0x3f,%rcx
>    0.43%   │  │││ ││││││  0x00007f603062d3b3:   sar    $0x6,%rdx
>    1.63%   │  │││ ││││││  0x00007f603062d3b7:   sub    %rcx,%rdx
>    2.70%   │  │││ ││││││  0x00007f603062d3ba:   imul   $0x64,%rdx,%rcx
>    9.91%   │  │││ ││││││  0x00007f603062d3be:   sub    %rcx,%rbp
>    ...
> 
> ... so, a sequence of multiplications, shifts and subtractions, not dissimilar to the code you'd expect from Lemire's calculus.
> 
> Testing this PR's version, the inner calculation becomes... exactly the same?
> 
>    0.70%   │  │││ ││││││  0x00007f8fe062ad1c:   movabs $0xa3d70a3d70a3d70b,%rax
>    1.37%   │  │││ ││││││  0x00007f8fe062ad26:   imul   %rbp
>    2.03%   │  │││ ││││││  0x00007f8fe062ad29:   add    %rbp,%rdx
>    1.80%   │  │││ ││││││  0x00007f8fe062ad2c:   mov    %rbp,%rcx
>    0.59%   │  │││ ││││││  0x00007f8fe062ad2f:   sar    $0x3f,%rcx
>    0.66%   │  │││ ││││││  0x00007f8fe062ad33:   sar    $0x6,%rdx
>    1.66%   │  │││ ││││││  0x00007f8fe062ad37:   sub    %rcx,%rdx
>    2.27%   │  │││ ││││││  0x00007f8fe062ad3a:   imul   $0x64,%rdx,%rcx
>   10.05%   │  │││ ││││││  0x00007f8fe062ad3e:   sub    %rcx,%rbp
> 
> 
> Yet it's faster... The difference seems to boil down to how the JIT is better able to unroll and vectorize the benchmark loop with my PR'd code. While not an irrelevant property, this means the improvement I'm seeing might be a bit overstated for more typical cases, and I'll need t...

Thanks @cl4es for the ASM listing. Now I can better understand what is going on at the lowest level.

The compiler is replacing `y % 100 == 0` with `y == 100 * (y / 100)` and is using the traditional Granlund-Montgomery (GM) optimisation for the quotient `q = y / 100`, that is, `q` is obtained as `(y * M) >> 70` where `M = ceil(2^70 / 100)`. This is confirmed by the appearance of the hex constant `0xa3d70a3d70a3d70b` (this is `M`) and the two `imul` instructions (one for `y * M` and one for `100 * q`). (This is not Daniel Lemire's technique, but it is related.)
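
For anyone who wants to see that sequence in source form, here is my attempt at transcribing the listing above into Java with `Math.multiplyHigh` (my own sketch, not the PR's code, so please double check):

```java
// My transcription of the GM sequence above (a sketch, please verify):
// q = y / 100 is computed as ((high64(y * M) + y) >> 6) - (y >> 63),
// where M = 0xa3d70a3d70a3d70b is ceil(2^70 / 100) as a 64-bit pattern,
// and divisibility is then tested as y == 100 * q.
public class Gm100 {
    static final long M = 0xa3d70a3d70a3d70bL;

    static long div100(long y) {
        // Math.multiplyHigh gives the signed high 64 bits, matching
        // `imul %rbp`; the add and the two shifts mirror `add %rbp,%rdx`,
        // `sar $0x6` and the sign fix-up `sar $0x3f`.
        return ((Math.multiplyHigh(y, M) + y) >> 6) - (y >> 63);
    }

    static boolean isMultipleOf100(long y) {
        return y == 100 * div100(y);  // the second imul in the listing
    }

    public static void main(String[] args) {
        System.out.println(div100(1999));           // 19
        System.out.println(isMultipleOf100(1900));  // true
        System.out.println(isMultipleOf100(-300));  // true
        System.out.println(isMultipleOf100(2023));  // false
    }
}
```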

There's a more recent technique that can evaluate `y % 100 == 0` with just one multiplication (again, not Daniel Lemire's), which gcc started using in v9. See the difference [here](https://godbolt.org/z/48YP1Ynv4). This is what I called "minverse" in my first article on quick modular calculations (referenced earlier).

I think it would be very useful if the Java compiler also implemented this technique, not only for this piece of code but for every occurrence of `n % d == 0` with a constant `d` in source code.
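
For whoever might attempt that in the compiler, the required constant `d^-1 mod 2^64` (for odd `d`) is cheap to obtain with the usual Newton iteration. A sketch of mine (identifier names are made up):

```java
// Sketch: computing the minverse constant for an odd divisor d.
// Each Newton step x <- x * (2 - d * x) doubles the number of correct
// low bits; x = d is already correct to 3 bits, so 5 steps give 64+.
public class Minverse {
    static long inverse(long d) {    // d must be odd
        long x = d;
        for (int i = 0; i < 5; i++) {
            x *= 2 - d * x;          // wraps mod 2^64, as intended
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(inverse(25)));
        // 8f5c28f5c28f5c29 -- and indeed 25 * 0x8F5C28F5C28F5C29 wraps to 1
        System.out.println(25 * 0x8F5C28F5C28F5C29L);  // 1
    }
}
```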

For this particular piece of code, though, one could perform the minverse optimisation manually. Recall from my previous posts that in leap-year calculation one can replace `100` with `25`. This is helpful since, as seen in the link above, the latter doesn't need the `ror` instruction (which has no equivalent in C/C++ and, I suppose, Java). Hence, `y % 100 == 0` can be replaced by `y % 25 == 0` which, in turn, can be replaced by

    (y * 0x8F5C28F5C28F5C29 + 0x51EB851EB851EB8) < 0xA3D70A3D70A3D70


(I'm deducing this code from the godbolt link and I hope I got it right; if you're interested, I can double-check another time.) It's **very important** for the calculations above to be done in unsigned arithmetic. I don't know how this is done in Java but, FWIW, in C it would be:

    (((uint64_t) y) * 0x8F5C28F5C28F5C29ull + 0x51EB851EB851EB8ull) < 0xA3D70A3D70A3D70ull;
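
If my understanding of Java is correct, `long` multiplication and addition already wrap modulo 2^64 exactly like `uint64_t`, so only the final comparison needs `Long.compareUnsigned`. My sketch (untested by me, please double check):

```java
// Sketch of the C expression above in Java. Overflowing long arithmetic
// wraps modulo 2^64, so only the comparison must be done unsigned.
public class LeapMod25 {
    static boolean isMultipleOf25(long y) {
        long t = y * 0x8F5C28F5C28F5C29L + 0x51EB851EB851EB8L;
        return Long.compareUnsigned(t, 0xA3D70A3D70A3D70L) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isMultipleOf25(2000));  // true
        System.out.println(isMultipleOf25(1999));  // false
        System.out.println(isMultipleOf25(-75));   // true
    }
}
```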

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16491#issuecomment-1793882054

