RFR: 8339573: Update CodeCacheSegmentSize and CodeEntryAlignment for ARM [v3]

Tue Oct 8 22:02:01 UTC 2024

On Tue, 8 Oct 2024 21:31:36 GMT, Boris Ulasevich <bulasevich at openjdk.org> wrote:

>>> There are reasons against the change for N1/N2
>>> 
>>>     * Vladimir saw a regression when testing -XX:CodeEntryAlignment=16 on Ampere (Neoverse N1)
>> 
>> The regression is for 16, not 32. Maybe 32 won't cause regressions.
>> 
>>> 
>>>     * in a native code experiment I found that a 64B code entry alignment is preferable on G2 (Neoverse N1)
>> 
>> Could you share more data?
> 
> This is a simple native benchmark: https://cr.openjdk.org/~bulasevich/AlignBench.cpp
> 
> The program tests the performance of code starting from different alignments 
> using a generated block of code that primarily consists of NOP (no operation)
> and RET (return from subroutine) instructions.
> 
> In this experiment (though it may not be meaningful for real-world scenarios),
> execution time on G2 increases with misalignment from the 64-byte boundary: the
> best performance is at exact 64-byte alignment, with slower times as the offset grows.
> 
> align_64+0: 391.457ms 391.157ms 391.481ms
> align_64+4: 392.245ms 392.057ms 391.792ms
> align_64+8: 392.154ms 392.962ms 392.65ms
> align_64+12: 393.485ms 393.307ms 393.658ms
> align_64+16: 394.434ms 394.679ms 394.284ms
> align_64+20: 395.681ms 395.709ms 394.843ms
> align_64+24: 395.799ms 396.396ms 395.977ms
> align_64+28: 397.379ms 397.278ms 397.359ms
> align_64+32: 397.677ms 397.82ms 397.95ms
> align_64+36: 399.08ms 398.88ms 399.075ms
> align_64+40: 399.829ms 400.118ms 399.981ms
> align_64+44: 400.916ms 400.747ms 401.241ms
> align_64+48: 401.736ms 401.831ms 402.54ms
> align_64+52: 402.705ms 402.569ms 402.446ms
> align_64+56: 403.718ms 403.822ms 403.535ms
> align_64+60: 404.722ms 404.824ms 404.726ms
> align_64+64: 390.852ms 390.669ms 391.051ms

This benchmark certainly has some limitations: NOPs are special and can be discarded, so 64B alignment means the core can quickly fetch and throw away as many NOPs as possible. I'm not sure that it's particularly representative of real output, or of the benefits that code density can bring here.
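
To put the size of the effect in perspective, here is a rough sketch that summarizes the quoted numbers. The timings are copied from the results above; the variable names and the choice of a simple mean over the three runs are my own assumptions, not part of AlignBench.

```python
# Timings in ms, copied from the quoted AlignBench output (three runs per offset).
# Keys are the byte offset from a 64-byte boundary.
timings_ms = {
    0: (391.457, 391.157, 391.481),
    4: (392.245, 392.057, 391.792),
    8: (392.154, 392.962, 392.65),
    12: (393.485, 393.307, 393.658),
    16: (394.434, 394.679, 394.284),
    20: (395.681, 395.709, 394.843),
    24: (395.799, 396.396, 395.977),
    28: (397.379, 397.278, 397.359),
    32: (397.677, 397.82, 397.95),
    36: (399.08, 398.88, 399.075),
    40: (399.829, 400.118, 399.981),
    44: (400.916, 400.747, 401.241),
    48: (401.736, 401.831, 402.54),
    52: (402.705, 402.569, 402.446),
    56: (403.718, 403.822, 403.535),
    60: (404.722, 404.824, 404.726),
    64: (390.852, 390.669, 391.051),
}

# Mean time per offset (assumption: a plain mean is a fair summary of 3 runs).
means = {off: sum(runs) / len(runs) for off, runs in timings_ms.items()}

baseline = means[0]                        # exact 64-byte alignment
worst_offset = max(means, key=means.get)   # offset with the slowest mean
slowdown_pct = (means[worst_offset] / baseline - 1) * 100

# Average extra time per byte of misalignment between offsets 0 and 60.
ms_per_byte = (means[60] - means[0]) / 60

print(f"worst offset: +{worst_offset} bytes")
print(f"slowdown vs. aligned: {slowdown_pct:.2f}%")
print(f"average cost: {ms_per_byte:.3f} ms per byte of offset")
```

The spread between the best and worst alignment is a few percent, and it comes from a NOP-dominated block, so it probably should not be read as the gain to expect from real generated code.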

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20864#discussion_r1792557931