RFR: 8319423: Improve Year.isLeap by checking divisibility by 16 [v2]

Claes Redestad redestad at openjdk.org
Sun Nov 5 22:18:09 UTC 2023


On Sat, 4 Nov 2023 20:12:58 GMT, Cassio Neri <duke at openjdk.org> wrote:

>> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Apply similar optimization to GregorianCalendar, sun.util.calendar
>
> Thanks for your interest in my work. I'd love to assist in porting our algorithms to Java. Note, however, that I'm not a Java programmer and I don't have the complete picture of what goes on in the JVM. What follows is based on my experience with C/C++ compilers but, I reckon, most of it should apply to Java as well.
> 
> When determining whether `year` is leap or not, it's very reasonable to believe that checking divisibility by 4 first is the best strategy. As I said in [my talk](https://www.youtube.com/watch?v=0s9F4QWAl-E), virtually every implementation that I found made that assumption. However, this is not the case, thanks to modern branch predictors, at least on x86_64 CPUs. My experiments showed that checking divisibility by 100 first is the best way:
> 
>     if (year % 100 != 0)
>       return year % 4 == 0;
>     return year % 400 == 0;
> 
> The maths shows that, in leap-year calculations (where `year % 400` is only consulted once `year % 100 == 0` is established), it's correct to replace `year % 400 == 0` with the cheaper expression `year % 16 == 0`. Except for [compiler bugs](https://bugs.llvm.org/show_bug.cgi?id=50334), this should always be done. Hence, the implementation that @cl4es mentioned:
> 
>     public static boolean isLeap(long year) {
>         int d = year % 100 != 0 ? 4 : 16;
>         return (year & (d - 1)) == 0;
>     }
> 
> In my talk I also said that I did other optimisations around `year % 100 == 0`, but that was another story. Let me tell that story now.
> 
> Similar mathematical arguments as above show that it's also correct to replace `year % 100 == 0` with `year % 25 == 0`, and [the latter requires one fewer assembly instruction](https://godbolt.org/z/zKza1Yc83). (I've also discussed this topic in a [PR](https://github.com/nakedible/datealgo-rs/discussions/21#discussioncomment-6907992) for the Rust implementation.) However, contrary to the 400-and-16 case, it's not always profitable to replace `year % 100 == 0` with `year % 25 == 0`. It depends on whether the code executed by the CPU contains branches or not. (Despite the use of the ternary operator `?`, compilers may emit conditional moves on CPUs that support them, which eliminates the branch.)
> 
> If there's no branch, then this is probably the best option:
> 
>     public static boolean is_leap_v0(long year) {
>       int d = year % 25 != 0 ? 4 : 16;
>       return (year & (d - 1)) == 0;
>     }
> 
> 
> If there's a branch, then I'd expect this to perform better:
> 
>     public static boolean is_leap_v1(long year) {
>       int d = year % 100 != 0 ? 4 : 16;
>       return (year & (d - 1)) == 0;
>     }
> 
> The reason is...

Thank you for your comments, @cassioneri 

For reference, I included the following variant in the `isLeapYearNS` microbenchmark:

        int d = year % 100 != 0 ? 4 : 16;
        return (year & (d - 1)) == 0;

and, as shown above, this underperforms even the baseline implementation when executed on the HotSpot JVM with C2: `0,558 ± 0,026 ops/us` vs `0,743 ± 0,009 ops/us` for the baseline. My experiments then explored how we could coax the C2 JIT into improving on the baseline, and the patch suggested in this PR is what I ended up with.
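
For completeness, the variant runs as a regular JMH method, roughly along the lines of this sketch (the class name and constant state field here are illustrative only, not the actual benchmark sources):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class LeapYearSketch {        // illustrative name only
        long year = 2023;                // illustrative input

        @Benchmark
        public boolean isLeapYearNS() {
            int d = year % 100 != 0 ? 4 : 16;
            return (year & (d - 1)) == 0;
        }
    }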

This might suggest a deficiency in the C2 JIT rather than an issue with your code. But let's dig a bit deeper and analyze the ASM generated on x86. 

With `isLeapYearNS` we seem to generate something like this for the actual calculation: 

   ...
   0.88%   │  │││ ││││││  0x00007f603062d39c:   movabs $0xa3d70a3d70a3d70b,%rax
   1.47%   │  │││ ││││││  0x00007f603062d3a6:   imul   %rbp
   1.29%   │  │││ ││││││  0x00007f603062d3a9:   add    %rbp,%rdx
   1.23%   │  │││ ││││││  0x00007f603062d3ac:   mov    %rbp,%rcx
   0.41%   │  │││ ││││││  0x00007f603062d3af:   sar    $0x3f,%rcx
   0.43%   │  │││ ││││││  0x00007f603062d3b3:   sar    $0x6,%rdx
   1.63%   │  │││ ││││││  0x00007f603062d3b7:   sub    %rcx,%rdx
   2.70%   │  │││ ││││││  0x00007f603062d3ba:   imul   $0x64,%rdx,%rcx
   9.91%   │  │││ ││││││  0x00007f603062d3be:   sub    %rcx,%rbp
   ...

... so, a sequence of multiplications, shifts and subtractions, not dissimilar to the code you'd expect from Lemire-style division-by-constant arithmetic.
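
For readers who want to decode it: my reading is that the sequence above computes `year % 100` with the usual multiply-high reciprocal trick, roughly like this Java sketch (the helper name is mine, just for illustration):

    // What the emitted code above appears to compute: year % 100 without a
    // division instruction, using a rounded reciprocal of 100.
    static long mod100(long year) {
        long magic = 0xa3d70a3d70a3d70bL;            // movabs: rounded reciprocal of 100
        long hi = Math.multiplyHigh(magic, year);    // imul: high 64 bits of the product
        hi += year;                                  // add: compensates for the "negative" magic constant
        long q = (hi >> 6) - (year >> 63);           // sar 6; sar 63 + sub: quotient, truncated toward zero
        return year - q * 100;                       // imul 0x64; sub: the remainder
    }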

Testing the patched code from this PR, the inner calculation becomes... exactly the same?

   0.70%   │  │││ ││││││  0x00007f8fe062ad1c:   movabs $0xa3d70a3d70a3d70b,%rax
   1.37%   │  │││ ││││││  0x00007f8fe062ad26:   imul   %rbp
   2.03%   │  │││ ││││││  0x00007f8fe062ad29:   add    %rbp,%rdx
   1.80%   │  │││ ││││││  0x00007f8fe062ad2c:   mov    %rbp,%rcx
   0.59%   │  │││ ││││││  0x00007f8fe062ad2f:   sar    $0x3f,%rcx
   0.66%   │  │││ ││││││  0x00007f8fe062ad33:   sar    $0x6,%rdx
   1.66%   │  │││ ││││││  0x00007f8fe062ad37:   sub    %rcx,%rdx
   2.27%   │  │││ ││││││  0x00007f8fe062ad3a:   imul   $0x64,%rdx,%rcx
  10.05%   │  │││ ││││││  0x00007f8fe062ad3e:   sub    %rcx,%rbp


Yet it's faster... The difference seems to boil down to the JIT being better able to unroll and vectorize the benchmark loop with my PR'd code. While not an irrelevant property, this means the improvement I'm seeing might be a bit overstated for more typical cases, and I'll need to see what happens if I set the microbenchmark up not to inline and unroll so heavily.
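
Something along these lines is what I have in mind; a sketch only, and the exact flags and setup may change (`@CompilerControl(DONT_INLINE)` to force a real call, plus JVM flags to curb C2's unrolling and vectorization):

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class NoInlineSketch {                    // illustrative name only
        long year = 2023;                            // illustrative input

        @Benchmark
        @Fork(jvmArgsAppend = {"-XX:LoopUnrollLimit=1", "-XX:-UseSuperWord"})
        public boolean isLeapNoInline() {
            return leap(year);
        }

        @CompilerControl(CompilerControl.Mode.DONT_INLINE)  // keep the leap check out of the benchmark loop body
        private static boolean leap(long year) {
            int d = year % 100 != 0 ? 4 : 16;
            return (year & (d - 1)) == 0;
        }
    }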

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16491#issuecomment-1793862266

