RFR: 8295023: Interpreter(AArch64): Implement -XX:+PrintBytecodeHistogram and -XX:+PrintBytecodePairHistogram options [v3]
Andrew Haley
aph at openjdk.org
Wed Oct 12 13:52:03 UTC 2022
On Wed, 12 Oct 2022 10:37:18 GMT, Hao Sun <haosun at openjdk.org> wrote:
>> In this patch, we implement the functions histogram_bytecode() and histogram_bytecode_pair() for the AArch64 part of the interpreter. As in count_bytecode(), we use atomic operations to update the counters.
>>
>> Below is part of the output produced with the -XX:+PrintBytecodeHistogram and -XX:+PrintBytecodePairHistogram options after this patch.
>>
>>
>> $ java -XX:+PrintBytecodeHistogram --version | head -20
>> openjdk 20-internal 2023-03-21
>> OpenJDK Runtime Environment (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev)
>> OpenJDK 64-Bit Server VM (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev, mixed mode)
>>
>> Histogram of 5004099 executed bytecodes:
>>
>>  absolute  relative  code    name
>> ----------------------------------------------------------------------
>>    319124     6.38%    dc    fast_aload_0
>>    313397     6.26%    e0    fast_iload
>>    251436     5.02%    b6    invokevirtual
>>    227428     4.54%    19    aload
>>    166054     3.32%    a7    goto
>>    159167     3.18%    2b    aload_1
>>    151803     3.03%    de    fast_aaccess_0
>>    136787     2.73%    1b    iload_1
>>    124037     2.48%    36    istore
>>    118791     2.37%    84    iinc
>>    118121     2.36%    1c    iload_2
>>    110484     2.21%    a2    if_icmpge
>>
>> $ java -XX:+PrintBytecodePairHistogram --version | head -20
>> openjdk 20-internal 2023-03-21
>> OpenJDK Runtime Environment (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev)
>> OpenJDK 64-Bit Server VM (fastdebug build 20-internal-adhoc.haosun.jdk-src-dev, mixed mode)
>>
>> Histogram of 4804441 executed bytecode pairs:
>>
>>  absolute  relative   codes    1st bytecode        2nd bytecode
>> ----------------------------------------------------------------------
>>     77602    1.615%   84 a7    iinc                goto
>>     49749    1.035%   36 e0    istore              fast_iload
>>     48931    1.018%   e0 10    fast_iload          bipush
>>     46294    0.964%   e0 b6    fast_iload          invokevirtual
>>     42661    0.888%   a7 e0    goto                fast_iload
>>     42243    0.879%   3a 19    astore              aload
>>     40138    0.835%   19 b9    aload               invokeinterface
>>     36617    0.762%   dc 2b    fast_aload_0        aload_1
>>     35745    0.744%   b7 dc    invokespecial       fast_aload_0
>>     35384    0.736%   19 b6    aload               invokevirtual
>>     35035    0.729%   b6 de    invokevirtual       fast_aaccess_0
>>     34667    0.722%   dc b6    fast_aload_0        invokevirtual
>>
>>
>> To verify correctness, I used the trace produced by -XX:+TraceBytecodes as a cross-reference. The hit counts for individual bytecodes and bytecode pairs can be obtained by parsing the trace. I then compared those hit counts with the corresponding "absolute" columns. I randomly selected several bytecodes and bytecode pairs, and the manual comparison showed that the "absolute" columns are correct.
>>
>> Note-1: count_bytecode() is updated. 1) caller-saved registers are used as temporary registers and there is no need to save/restore them. 2) atomic_addw() should be used since the counter is of int type.
>>
>> Note-2: As shown by the update in file templateInterpreterGenerator.cpp, function histogram_bytecode() should be invoked only inside !PRODUCT scope.
>
> Hao Sun has updated the pull request incrementally with one additional commit since the last revision:
>
> Remove rscratch3 for count_bytecode() and histogram_bytecode()
src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp line 2004:
> 2002: /* kind */ Assembler::LSR,
> 2003: /* shift */ BytecodePairHistogram::log2_number_of_codes);
> 2004:
I've had another look at this. `_index` is a two-element queue of bytecodes, but it is shared between all threads. If two threads access `_index` racily the result will be invalid, regardless of whether the OR into memory is atomic, so there is no point in making that update atomic. Dropping the atomic update will make this PR much simpler.
None of the other ports access `_index` atomically, nor need we. On the other hand, bumping the bytecode counter atomically is fine.
We could make a thread-local `_index`, but it's too much code to be worthwhile.
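To illustrate the point about `_index`, here is a minimal C++ sketch of the pair-histogram update. It is not the HotSpot code: `PairHistogramSketch`, `record()`, and the member names are hypothetical, and the value of `log2_number_of_codes` (8) is an assumption. The index is read, shifted, and written back as separate steps, so the update as a whole is not atomic, while the counter bump is.

```cpp
#include <atomic>

// Hypothetical stand-in for BytecodePairHistogram; all names here are
// illustrative assumptions, not the HotSpot definitions.
struct PairHistogramSketch {
  static constexpr int log2_number_of_codes = 8;  // assumed value
  static constexpr int number_of_codes = 1 << log2_number_of_codes;
  static constexpr int number_of_pairs = number_of_codes * number_of_codes;

  // Two-element queue shared by all threads:
  // low bits hold the previous bytecode, high bits the current one.
  std::atomic<int> index{0};
  std::atomic<int> counters[number_of_pairs];

  PairHistogramSketch() {
    for (auto& c : counters) c.store(0, std::memory_order_relaxed);
  }

  // Shift the new bytecode into the queue and bump the pair counter.
  // The load and store of `index` are two separate operations, so another
  // thread can slip in between them. Even a fully atomic update would not
  // help: the "previous" bytecode may simply belong to a different thread.
  int record(int bytecode) {
    int idx = index.load(std::memory_order_relaxed);
    idx = (idx >> log2_number_of_codes) | (bytecode << log2_number_of_codes);
    index.store(idx, std::memory_order_relaxed);
    // The counter increment itself is atomic, so no counts are lost.
    counters[idx].fetch_add(1, std::memory_order_relaxed);
    return idx;
  }
};
```

With a single thread, recording `0x84` (iinc) and then `0xa7` (goto) bumps the counter at index `0xa784`, which corresponds to the pair "84 a7". With two threads interleaving their calls, the recorded pair can combine a bytecode from each thread, however the index is written.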
-------------
PR: https://git.openjdk.org/jdk/pull/10642