RFR: 8295023: Interpreter(AArch64): Implement -XX:+PrintBytecodeHistogram and -XX:+PrintBytecodePairHistogram options [v3]

Andrew Haley aph at openjdk.org
Wed Oct 12 13:52:03 UTC 2022


On Wed, 12 Oct 2022 10:37:18 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> In this patch, we implement the functions histogram_bytecode() and histogram_bytecode_pair() for the AArch64 interpreter. As with count_bytecode(), we use atomic operations to update the counters.
>> 
>> Below is part of the output produced with the -XX:+PrintBytecodeHistogram and -XX:+PrintBytecodePairHistogram options after this patch.
>> 
>> 
>> $ java -XX:+PrintBytecodeHistogram --version | head -20
>> openjdk 20-internal 2023-03-21
>> OpenJDK Runtime Environment (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev)
>> OpenJDK 64-Bit Server VM (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev, mixed mode)
>> 
>> Histogram of 5004099 executed bytecodes:
>> 
>>   absolute  relative  code    name
>> ----------------------------------------------------------------------
>>     319124     6.38%    dc    fast_aload_0
>>     313397     6.26%    e0    fast_iload
>>     251436     5.02%    b6    invokevirtual
>>     227428     4.54%    19    aload
>>     166054     3.32%    a7    goto
>>     159167     3.18%    2b    aload_1
>>     151803     3.03%    de    fast_aaccess_0
>>     136787     2.73%    1b    iload_1
>>     124037     2.48%    36    istore
>>     118791     2.37%    84    iinc
>>     118121     2.36%    1c    iload_2
>>     110484     2.21%    a2    if_icmpge
>> 
>> $ java -XX:+PrintBytecodePairHistogram --version | head -20
>> openjdk 20-internal 2023-03-21
>> OpenJDK Runtime Environment (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev)
>> OpenJDK 64-Bit Server VM (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev, mixed mode)
>> 
>> Histogram of 4804441 executed bytecode pairs:
>> 
>>   absolute  relative    codes    1st bytecode        2nd bytecode
>> ----------------------------------------------------------------------
>>      77602    1.615%    84 a7    iinc                goto
>>      49749    1.035%    36 e0    istore              fast_iload
>>      48931    1.018%    e0 10    fast_iload          bipush
>>      46294    0.964%    e0 b6    fast_iload          invokevirtual
>>      42661    0.888%    a7 e0    goto                fast_iload
>>      42243    0.879%    3a 19    astore              aload
>>      40138    0.835%    19 b9    aload               invokeinterface
>>      36617    0.762%    dc 2b    fast_aload_0        aload_1
>>      35745    0.744%    b7 dc    invokespecial       fast_aload_0
>>      35384    0.736%    19 b6    aload               invokevirtual
>>      35035    0.729%    b6 de    invokevirtual       fast_aaccess_0
>>      34667    0.722%    dc b6    fast_aload_0        invokevirtual
>> 
>> 
>> To verify correctness, I used the trace information produced by -XX:+TraceBytecodes as a cross-reference. The hit counts for bytecodes and bytecode pairs can be obtained by parsing the trace. I then compared those hit counts with the corresponding "absolute" columns. I randomly selected several bytecodes and bytecode pairs, and the manual comparison showed that the "absolute" columns are correct.
>> 
>> Note-1: count_bytecode() is updated. 1) caller-saved registers are used as temporary registers and there is no need to save/restore them. 2) atomic_addw() should be used since the counter is of int type.
>> 
>> Note-2: As shown by the update in file templateInterpreterGenerator.cpp, function histogram_bytecode() should be invoked only inside !PRODUCT scope.
>
> Hao Sun has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Remove rscratch3 for count_bytecode() and histogram_bytecode()

src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp line 2004:

> 2002:                  /* kind */ Assembler::LSR,
> 2003:                  /* shift */ BytecodePairHistogram::log2_number_of_codes);
> 2004: 

I've had another look at this. `_index` is a two-element queue of bytecodes, but it is shared between all threads. If two threads access `_index` racily the result will be invalid, regardless of whether the OR into memory is atomic.
None of the other ports access `_index` atomically, and neither need we. Dropping the atomic access will make this PR much simpler. On the other hand, bumping the bytecode counter atomically is fine.
We could make a thread-local `_index`, but it's too much code to be worthwhile.
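
To illustrate the point about the shared `_index`, here is a minimal sketch, not the HotSpot code. It replays one fixed interleaving of two logical threads in a single real thread, so the example itself has no data race. The names `PairCounters`, `record`, `count` and `iinc_goto_pairs` are hypothetical, and the encoding (previous bytecode in the low bits, current bytecode shifted up by `log2_number_of_codes`, assumed here to be 8) is an assumption modelled on the LSR in the quoted diff.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed sizes, standing in for BytecodePairHistogram's constants.
constexpr uint32_t kLog2NumberOfCodes = 8;
constexpr uint32_t kNumberOfCodes = 1u << kLog2NumberOfCodes;

// Hypothetical model: counters are atomic, the pair index is passed in.
struct PairCounters {
    std::vector<std::atomic<uint32_t>> counters;

    PairCounters() : counters(kNumberOfCodes * kNumberOfCodes) {}

    // Shift the previous bytecode down, OR in the current one, bump the pair.
    void record(uint32_t& index, uint32_t bytecode) {
        assert(bytecode < kNumberOfCodes);
        index = (index >> kLog2NumberOfCodes) | (bytecode << kLog2NumberOfCodes);
        counters[index].fetch_add(1, std::memory_order_relaxed);
    }

    uint32_t count(uint32_t first, uint32_t second) const {
        return counters[first | (second << kLog2NumberOfCodes)]
            .load(std::memory_order_relaxed);
    }
};

// Logical thread A executes iinc (0x84) then goto (0xa7); logical thread B
// executes aload (0x19) in between. Returns how many (iinc, goto) pairs
// were recorded.
uint32_t iinc_goto_pairs(bool shared_index) {
    PairCounters h;
    uint32_t index_a = 0;
    uint32_t index_b = 0;
    uint32_t& a = index_a;
    uint32_t& b = shared_index ? index_a : index_b;
    h.record(a, 0x84);  // A: iinc
    h.record(b, 0x19);  // B: aload
    h.record(a, 0xa7);  // A: goto
    return h.count(0x84, 0xa7);
}
```

With a shared index the (iinc, goto) pair that thread A really executed is never counted, even though every counter update is atomic; the histogram records (iinc, aload) and (aload, goto) instead. With one index per logical thread the pair is counted once.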

-------------

PR: https://git.openjdk.org/jdk/pull/10642

