From duke at openjdk.java.net Wed Dec 1 00:02:14 2021
From: duke at openjdk.java.net (Scott Gibbons)
Date: Wed, 1 Dec 2021 00:02:14 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID:

> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>
> 5986.947899319073 MB/s => 24041.05203089616 MB/s
> 5840.02689336947 MB/s => 24898.781468710356 MB/s
>
> ********** Original ***********
>
> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000
> offset = 0
> msgSize = 512 bytes
> iters = 20000000
> -------------------------------------------------------
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> CRC32C.update(byte[]) runtime = 1.710387358 seconds
> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> -------------------------------------------------------
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds
> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> -------------------------------------------------------
>
> *********** With my changes: *************
>
> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000
> offset = 0
> msgSize = 512 bytes
> iters = 20000000
> -------------------------------------------------------
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> CRC32C.update(byte[]) runtime = 0.425938099 seconds
> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> -------------------------------------------------------
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds
> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s
> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
> -------------------------------------------------------

Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:

  Adding CRC32-C microbenchmark.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6595/files
  - new: https://git.openjdk.java.net/jdk/pull/6595/files/10aeaec6..fd87bb92

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=00-01

  Stats: 62 lines in 1 file changed: 62 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6595.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6595/head:pull/6595

PR: https://git.openjdk.java.net/jdk/pull/6595

From duke at openjdk.java.net Wed Dec 1 00:13:27 2021
From: duke at openjdk.java.net (Scott Gibbons)
Date: Wed, 1 Dec 2021 00:13:27 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote:

>> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.

Hi, Eric.
I added a microbenchmark for CRC32-C. I'm waiting for full completion, but it looks like somewhere around 40GB/s throughput on average. I'll post the results once completed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From sviswanathan at openjdk.java.net Wed Dec 1 00:49:47 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 00:49:47 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2]
In-Reply-To:
References:
Message-ID:

On Sun, 28 Nov 2021 18:37:40 GMT, Jatin Bhateja wrote:

>> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>>
>> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>>
>> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3 (baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
>> -- | -- | -- | -- | -- | -- | -- | --
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8277793: Further optimizing instruction sequence.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4067:

> 4065: * b) Choose fast path if none of the result vector lane contains 0x80000000 value.
> 4066: * It signifies that source value could be any of the special floating point
> 4067: * values(NaN,-Inf,Int,Max,-Min).

I think you meant here (NaN, -Inf, Inf, Max, -Min).

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4077:

> 4075: Label done;
> 4076: evcvttpd2qq(dst, src, vec_enc);
> 4077: evmovdqul(xtmp1, k0, double_sign_flip, true, vec_enc, scratch);

Merge masking should be false here.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4087:

> 4085:
> 4086: kxorwl(ktmp1, ktmp1, ktmp2);
> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);

We should use a nonsignaling comparison here (NLT_UQ instead of NLT_US). The same applies in vector_castF2I_evex.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4088:

> 4086: kxorwl(ktmp1, ktmp1, ktmp2);
> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);
> 4088: vpternlogq(xtmp2, 0x11, xtmp1, xtmp1, vec_enc);

Consider moving the vpternlog instruction earlier, after line 4082, using xtmp1 as the destination:
vpternlogq(xtmp1, 0x01, xtmp2, xtmp2, vec_enc);
Then xtmp1 can be used in the following evmovdquq. This will help to absorb the latency of vpternlogq.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4098:

> 4096: Label done;
> 4097: vcvttps2dq(dst, src, vec_enc);
> 4098: vmovdqu(xtmp1, float_sign_flip, scratch);

We will be loading 256 bits here even for 128 bit vector length.
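
For readers following the fast-path/slow-path comments above, here is a minimal scalar sketch of the idea, not the code from the PR. It assumes only two well-known facts: the x86 truncating conversions return 0x80000000 for NaN and out-of-range inputs, and a Java `(int)` cast of a `float` yields 0 for NaN and saturates otherwise. The class and method names (`F2ICastSemantics`, `rawConvert`, `fixup`) are hypothetical.

```java
final class F2ICastSemantics {
    // Result the x86 truncating conversion gives for NaN and out-of-range inputs.
    static final int INTEGER_INDEFINITE = 0x80000000;

    /** Hypothetical scalar model of one lane of the raw hardware conversion. */
    static int rawConvert(float src) {
        if (Float.isNaN(src) || src >= 0x1.0p31f || src < -0x1.0p31f) {
            return INTEGER_INDEFINITE;
        }
        return (int) src; // in range: truncate toward zero
    }

    /** Hypothetical fix-up that turns the raw result into Java cast semantics. */
    static int fixup(float src, int raw) {
        if (raw != INTEGER_INDEFINITE) {
            return raw;                       // fast path: no special value in this lane
        }
        if (Float.isNaN(src)) {
            return 0;                         // Java: NaN converts to 0
        }
        // Positive overflow saturates to MAX_VALUE; negative overflow
        // (and an exact -2^31) gives MIN_VALUE.
        return src > 0 ? Integer.MAX_VALUE : Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        float[] samples = {1.9f, -1.9f, Float.NaN, Float.POSITIVE_INFINITY,
                           Float.NEGATIVE_INFINITY, 3e9f, -3e9f, -0x1.0p31f};
        for (float f : samples) {
            System.out.println(f + " -> " + fixup(f, rawConvert(f))
                    + " (Java cast: " + (int) f + ")");
        }
    }
}
```

The vector code under review applies the same decision to all lanes at once with masks; this sketch only illustrates why a lane holding 0x80000000 needs a second look.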
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4109: > 4107: vpxor(xtmp2, xtmp2, xtmp3, vec_enc); > 4108: vpand(xtmp4, xtmp2, src, vec_enc); > 4109: vpxor(xtmp3, xtmp2, xtmp4, vec_enc); Some comments here would be good. I understand that we are creating a mask for values in src that cause positive overflow. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4112: > 4110: > 4111: vpcmpeqd(xtmp4, xtmp4, xtmp4, vec_enc); > 4112: vpxor(xtmp1, xtmp1, xtmp4, vec_enc); vpcmpeqd is a high latency instruction. This constant (0x7FFF...) can be formed earlier immediately after 4099, when xtmp1 becomes available. ------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From dlong at openjdk.java.net Wed Dec 1 00:57:29 2021 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 1 Dec 2021 00:57:29 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 22:44:59 GMT, Zang, Zhiqiang wrote: >> Suggest two new optimizations that can be done in SubINode::Ideal. > > Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision: > > add ok_to_convert to the condition. Marked as reviewed by dlong (Reviewer). It looks reasonable to me. I'll run some testing. @vnkozlov, could you please take a look also? 
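
As a side note on JDK-8277882: the rewrite named in the subject line, "c0 - (x + c1)" into "(c0 - c1) - x", is an identity under Java's wrapping `int` arithmetic, including when intermediate results overflow. The sketch below checks only that arithmetic on a few edge values; it says nothing about the C2 conditions (such as `ok_to_convert`) under which the transformation is actually applied. The class name is hypothetical.

```java
final class SubIdealIdentity {
    static int original(int c0, int x, int c1) {
        return c0 - (x + c1);
    }

    static int rewritten(int c0, int x, int c1) {
        return (c0 - c1) - x; // (c0 - c1) can be folded to a constant at compile time
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        int checked = 0;
        for (int c0 : samples) {
            for (int x : samples) {
                for (int c1 : samples) {
                    if (original(c0, x, c1) != rewritten(c0, x, c1)) {
                        throw new AssertionError("mismatch for " + c0 + ", " + x + ", " + c1);
                    }
                    checked++;
                }
            }
        }
        System.out.println("identity held for " + checked + " combinations");
    }
}
```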
------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From dholmes at openjdk.java.net Wed Dec 1 01:32:28 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Wed, 1 Dec 2021 01:32:28 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v4] In-Reply-To: References: <7GuyRoJ653qrQDv-vEnRU7JMcZU6qZJi0j7Ty1b5PE4=.c7d00b3d-c23b-47f9-bfb6-258623c2faae@github.com> Message-ID: On Tue, 30 Nov 2021 17:22:28 GMT, Roman Kennke wrote: > > IIUC you are only making UseHeavyMonitors work properly on x86_64, but in that case you cannot convert UseFastLocks to UseHeavyMonitors on all platforms as it won't work correctly on those other platforms. > > Cheers, David > > It would not break as such on other platforms. It would only be partially implemented, that is C1 would emit calls to runtime for and only use monitors while interpreter and C2 would still emit stack locks. That is ok - and that is roughly what +UseFastLocking used to do. Sorry but I don't see how having the interpreter+C2 use stack-locks while C1 ignores them can possibly be correct. ??? ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From duke at openjdk.java.net Wed Dec 1 01:44:31 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Wed, 1 Dec 2021 01:44:31 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v2] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. 
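
The throughput figures in the TestCRC32C runs quoted in this thread can be cross-checked from the posted message size, iteration count, and runtime. This is a small sketch that assumes the test reports total bytes divided by elapsed seconds with 1 MB = 10^6 bytes; that is inferred from the numbers, since the test source is not shown here. The class and method names are hypothetical.

```java
final class Crc32cThroughput {
    /** Throughput in MB/s, taking 1 MB = 10^6 bytes (an assumption inferred from the posted output). */
    static double megabytesPerSecond(long msgSizeBytes, long iterations, double seconds) {
        return (double) msgSizeBytes * iterations / seconds / 1_000_000.0;
    }

    public static void main(String[] args) {
        // msgSize = 512 bytes, iters = 20000000, runtimes taken from the byte[] runs.
        double before = megabytesPerSecond(512, 20_000_000L, 1.710387358);
        double after = megabytesPerSecond(512, 20_000_000L, 0.425938099);
        System.out.printf("before  = %.3f MB/s%n", before);
        System.out.printf("after   = %.3f MB/s%n", after);
        System.out.printf("speedup = %.2fx%n", after / before);
    }
}
```

Under that assumption the computed values agree with the reported 5986.9 MB/s and 24041.1 MB/s, and their ratio is consistent with the "~4x" claim.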
Benchmark results:

Benchmark                    (count)  Mode  Cnt  Score   Error  Units
TestCRC32C.testCRC32CUpdate       64  avgt    6  0.021 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate      128  avgt    6  0.031 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate      256  avgt    6  0.023 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate      512  avgt    6  0.026 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate     1024  avgt    6  0.035 ± 0.002  us/op
TestCRC32C.testCRC32CUpdate     2048  avgt    6  0.052 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate     4096  avgt    6  0.092 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate     8192  avgt    6  0.174 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate    16384  avgt    6  0.337 ± 0.001  us/op
TestCRC32C.testCRC32CUpdate    32768  avgt    6  0.663 ± 0.002  us/op
TestCRC32C.testCRC32CUpdate    65536  avgt    6  1.317 ± 0.004  us/op
Finished running test 'micro:java.util.TestCRC32C'

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From sviswanathan at openjdk.java.net Wed Dec 1 02:00:27 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 02:00:27 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID: <2IzSrEz0FHnhJDfgbC4vdzW5yIsVCUs_EY8V_AuQyJI=.fe82513c-fe98-49e1-a622-c7805dbdd23e@github.com>

On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote:

>> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6588:

> 6586: __ push(y);
> 6587: __ push(z);
> 6588: #endif

a, j, y and z are only required on the crc32c_ipl_alg2_alt2() path, so should be initialized and saved/restored only there.
This will also help you to pick a save on call register like c_rarg3 for table without having to push/pop.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From sviswanathan at openjdk.java.net Wed Dec 1 02:06:28 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 02:06:28 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote:

>> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
>   Adding CRC32-C microbenchmark.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 7218:

> 7216: // context for the registers used, where all instructions below are using 128-bit mode
> 7217: // On EVEX without VL and BW, these instructions will all be AVX.
> 7218: notl(crc);

We could do this notl(crc) in generate_updateBytesCRC32(), thereby removing the need to do the double notl for crc32c.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From dholmes at openjdk.java.net Wed Dec 1 02:36:27 2021
From: dholmes at openjdk.java.net (David Holmes)
Date: Wed, 1 Dec 2021 02:36:27 GMT
Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2]
In-Reply-To:
References:
Message-ID: <72-LBIFmV82Z0GRieH-cxjyrub8Z-VVC4izGsW_bc4A=.1e021a98-3a4e-4ccc-87a7-c3e01c26d5e5@github.com>

On Tue, 30 Nov 2021 20:44:43 GMT, Aleksey Shipilev wrote:

>> I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there.
>>
>> Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors.
But these tests are rather fast themselves and cover important compiler features. >> >> We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. >> >> Sample times for new subgroups (think about this as "How much time they add to existing tiers"): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 >> ============================== >> >> real 2m16.518s >> user 35m40.839s >> sys 1m35.334s >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 >> ============================== >> >> real 4m31.935s >> user 71m54.617s >> sys 2m13.073s > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Filter out tier1/2 groups too @shipilev again I think we need to examine this in terms of impact to our CI. We run different platforms and configurations in different tiers so the costs are not as simple as looking at one run. Thanks, David ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From duke at openjdk.java.net Wed Dec 1 02:42:29 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Wed, 1 Dec 2021 02:42:29 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v2] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. 
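
For context on the "double notl" being discussed: CRC-32C is conventionally computed by inverting the register on entry and again on exit, so two inversions placed back to back cancel. The following is a minimal bitwise sketch of that convention, checked against `java.util.zip.CRC32C`; it is not the vpclmulqdq kernel from the PR, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

final class Crc32cInversion {
    // Castagnoli polynomial in bit-reflected form.
    private static final int POLY_REFLECTED = 0x82F63B78;

    /** Core loop on the raw register, with no inversion at all. */
    static int rawUpdate(int reg, byte[] data) {
        for (byte b : data) {
            reg ^= (b & 0xFF);
            for (int i = 0; i < 8; i++) {
                // Shift right; XOR in the polynomial when the bit shifted out was 1.
                reg = (reg >>> 1) ^ (POLY_REFLECTED & -(reg & 1));
            }
        }
        return reg;
    }

    /** Conventional update: invert on entry, invert on exit. */
    static int update(int crc, byte[] data) {
        return ~rawUpdate(~crc, data);
    }

    public static void main(String[] args) {
        byte[] msg = "123456789".getBytes(StandardCharsets.US_ASCII);
        CRC32C reference = new CRC32C();
        reference.update(msg);
        System.out.printf("sketch = %08x, reference = %08x%n",
                update(0, msg), (int) reference.getValue());
    }
}
```

Chaining `update(update(0, a), b)` inverts the register twice between the two calls, which is a no-op; this is the redundancy the review comment is about, and where the inversions live decides which caller pays for them.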

Doesn't this just move the double nots to a different generator? I'm not comfortable with the cost/benefit of this change. I don't want to impact CRC32 for the sake of CRC32C. I'll do it if you think it's worth it. Please let me know.

From: sviswa7 ***@***.***>
Sent: Tuesday, November 30, 2021 6:04 PM
To: openjdk/jdk ***@***.***>
Cc: Gibbons, Scott ***@***.***>; Mention ***@***.***>
Subject: Re: [openjdk/jdk] 8277358: Accelerate CRC32-C (PR #6595)

@sviswa7 commented on this pull request.

________________________________

In src/hotspot/cpu/x86/macroAssembler_x86.cpp:

> @@ -7210,7 +7215,6 @@ void MacroAssembler::kernel_crc32_avx512(Register crc, Register buf, Register le
   // For EVEX with VL and BW, provide a standard mask, VL = 128 will guide the merge
   // context for the registers used, where all instructions below are using 128-bit mode
   // On EVEX without VL and BW, these instructions will all be AVX.
-  lea(key, ExternalAddress(StubRoutines::x86::crc_table_avx512_addr()));
   notl(crc);

We could do this notl(crc) in generate_updateBytesCRC32() thereby remove the need to do the double notl for crc32c.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From duke at openjdk.java.net Wed Dec 1 02:50:39 2021
From: duke at openjdk.java.net (Scott Gibbons)
Date: Wed, 1 Dec 2021 02:50:39 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 02:03:29 GMT, Sandhya Viswanathan wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Adding CRC32-C microbenchmark.
>
> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 7218:
>
>> 7216: // context for the registers used, where all instructions below are using 128-bit mode
>> 7217: // On EVEX without VL and BW, these instructions will all be AVX.
>> 7218: notl(crc);
>
> We could do this notl(crc) in generate_updateBytesCRC32() thereby remove the need to do the double notl for crc32c.

Moved the `notl(crc)` calls to `generate_updateBytesCRC32()` as requested.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6588:
>
>> 6586: __ push(y);
>> 6587: __ push(z);
>> 6588: #endif
>
> a, j, y and z are only required on the crc32c_ipl_alg2_alt2() path, so should be initialized and saved/restored only there.
> This will also help you to pick a save on call register like c_rarg3 for table without having to push/pop.

Done. Changed to use `j` instead of save/restore of `r14`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From duke at openjdk.java.net Wed Dec 1 02:56:06 2021
From: duke at openjdk.java.net (Scott Gibbons)
Date: Wed, 1 Dec 2021 02:56:06 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v3]
In-Reply-To:
References:
Message-ID: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com>

> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
> > 5986.947899319073 MB/s => 24041.05203089616 MB/s > 5840.02689336947 MB/s => 24898.781468710356 MB/s > > ********** Original *********** > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 1.710387358 seconds > CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds > CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > > > > > *********** With my changes: ************* > > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 0.425938099 seconds > CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds > CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: Fixing review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6595/files - new: 
https://git.openjdk.java.net/jdk/pull/6595/files/fd87bb92..92b4b9fc Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=01-02 Stats: 32 lines in 2 files changed: 11 ins; 16 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6595.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6595/head:pull/6595 PR: https://git.openjdk.java.net/jdk/pull/6595 From sviswanathan at openjdk.java.net Wed Dec 1 03:22:29 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Dec 2021 03:22:29 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: <1KoRjoyObIS32kwNcojcLdIdUkdqpL1Pon6-IIn-H94=.a986a7bb-a14b-4df8-9ab2-9c66650e6d1b@github.com> References: <1KoRjoyObIS32kwNcojcLdIdUkdqpL1Pon6-IIn-H94=.a986a7bb-a14b-4df8-9ab2-9c66650e6d1b@github.com> Message-ID: <40r83G1b4hCgLKRXjIYAR66Ie7MkXj2mAAOe1_XkyJc=.c6255006-6adf-41f2-816b-548eae911090@github.com> On Tue, 23 Nov 2021 06:49:07 GMT, David Holmes wrote: >> @dholmes-ora I have implemented your review comments. > > Sorry @sviswa7 but could you explain in the comment why/how `avx3_threshold` reporting zero impacts the use 64-byte load/store - the connection is not at all obvious for anyone not fully conversant with AVX3 and how it is used by the code. Thanks. @dholmes-ora @neliasso Please do approve the patch if it looks ok to you. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From jiefu at openjdk.java.net Wed Dec 1 03:41:30 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 03:41:30 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: Message-ID: On Tue, 23 Nov 2021 06:05:48 GMT, Jie Fu wrote: >> @sviswa7 that further restriction and an explanatory comment would be appreciated. Thanks. > >> @dholmes-ora We see about 25% gain on a micro on our latest platform. 
There is no cpuid bit for this, so the closest was to check for the new serialize ISA supported on this platform. > > It would be better to add a jmh test for this opt. > Thanks. > @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java. So how about posting the detailed perf data before and after this patch? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From aoqi at openjdk.java.net Wed Dec 1 03:44:52 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Wed, 1 Dec 2021 03:44:52 GMT Subject: RFR: 8278037: Removed PPC32 dead code in opConvert Message-ID: `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? ------------- Commit messages: - 8278037: Removed PPC32 dead code in opConvert Changes: https://git.openjdk.java.net/jdk/pull/6625/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6625&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278037 Stats: 10 lines in 1 file changed: 0 ins; 10 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6625.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6625/head:pull/6625 PR: https://git.openjdk.java.net/jdk/pull/6625 From duke at openjdk.java.net Wed Dec 1 03:57:36 2021 From: duke at openjdk.java.net (duke) Date: Wed, 1 Dec 2021 03:57:36 GMT Subject: Withdrawn: 8267265: Use new IR Test Framework to create tests for C2 IGV transformations In-Reply-To: References: Message-ID: <2d_WMxO5r1d9GBXIK50u9wQUeIBEWeeu8_rnDMU1_g8=.13dd278a-5f98-47a3-92e3-bd7eb6642ff8@github.com> On Tue, 17 Aug 2021 00:20:13 GMT, John Tortugo wrote: > Hi, can I please get some reviews for this Pull Request? 
Here is a summary of the changes:
>
> - Add tests, using the new IR-based test framework, for several of the Ideal transformations on Add, Sub, Mul, Div, Loop nodes and some simple Scalar Replacement transformations.
> - Add more default IR regexes to IR-based test framework.
> - Changes to Sub, Div and Add Ideal nodes so that transformations on Int and Long types are the same whenever possible.
> - Changes to Sub*Node, Div*Node and Add*Node Ideal methods to fix some bugs and include new transformations.
> - New JTREG "ir_transformations" test group under test/hotspot/jtreg.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5135

From jiefu at openjdk.java.net  Wed Dec  1 04:01:30 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Wed, 1 Dec 2021 04:01:30 GMT
Subject: RFR: 8278037: Removed PPC32 dead code in opConvert
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 03:13:38 GMT, Ao Qi wrote:

> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used.
>
> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform?

It looks reasonable to me.

-------------

Marked as reviewed by jiefu (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6625

From sviswanathan at openjdk.java.net  Wed Dec  1 04:14:27 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 04:14:27 GMT
Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 03:38:00 GMT, Jie Fu wrote:

> > @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java.
>
> So how about posting the detailed perf data before and after this patch? Thanks.

Before:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  19.538 ± 0.073  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  20.513 ± 0.104  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  15.919 ± 0.652  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.669 ± 0.359  ns/op

After:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  16.957 ± 0.584  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  17.221 ± 0.036  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  12.952 ± 0.068  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  13.562 ± 0.124  ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/6512

From stuefe at openjdk.java.net  Wed Dec  1 06:24:28 2021
From: stuefe at openjdk.java.net (Thomas Stuefe)
Date: Wed, 1 Dec 2021 06:24:28 GMT
Subject: RFR: 8278037: Removed PPC32 dead code in opConvert
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 03:13:38 GMT, Ao Qi wrote:

> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used.
>
> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform?

ppc32 is not supported.

-------------

Marked as reviewed by stuefe (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6625

From jbhateja at openjdk.java.net  Wed Dec  1 06:38:32 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 1 Dec 2021 06:38:32 GMT
Subject: RFR: 8277777: [Vector API] assert(r->is_XMMRegister()) failed: must be in x86_32.ad
In-Reply-To:
References:
Message-ID:

On Wed, 24 Nov 2021 11:56:44 GMT, Jie Fu wrote:

> Hi all,
>
> The following vector api tests fail on x86_32/AVX512 with `assert(r->is_XMMRegister()) failed: must be`.
> > jdk/incubator/vector/Byte64VectorLoadStoreTests.java > jdk/incubator/vector/Byte256VectorLoadStoreTests.java > jdk/incubator/vector/Byte128VectorLoadStoreTests.java > jdk/incubator/vector/ByteMaxVectorLoadStoreTests.java > jdk/incubator/vector/Double256VectorTests.java > jdk/incubator/vector/Double512VectorTests.java > jdk/incubator/vector/DoubleMaxVectorTests.java > jdk/incubator/vector/Float512VectorTests.java > jdk/incubator/vector/Float256VectorTests.java > jdk/incubator/vector/FloatMaxVectorTests.java > jdk/incubator/vector/Float128VectorTests.java > jdk/incubator/vector/Short128VectorLoadStoreTests.java > jdk/incubator/vector/Short256VectorLoadStoreTests.java > jdk/incubator/vector/Short64VectorLoadStoreTests.java > jdk/incubator/vector/ShortMaxVectorLoadStoreTests.java > > > The reason is that `static enum RC rc_class( OptoReg::Name reg )` [1] missed the case for KRegister. > And the AVX-512 opmask specific spilling code [2] should be located before the size assert [3]. > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L747 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1272 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1252 Marked as reviewed by jbhateja (Committer). src/hotspot/cpu/x86/x86_32.ad line 758: > 756: return rc_float; > 757: } > 758: if (r->is_KRegister()) return rc_kreg; Thanks , this looks ok to me. src/hotspot/cpu/x86/x86_32.ad line 1309: > 1307: if( dst_second_rc == rc_int && src_second_rc == rc_stack ) > 1308: return impl_helper(cbuf,do_size,true ,ra_->reg2offset(src_second),dst_second,0x8B,"MOV ",size, st); > 1309: This change looks unrelated to opmask spilling. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6535 From jiefu at openjdk.java.net Wed Dec 1 06:53:29 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 06:53:29 GMT Subject: RFR: 8277777: [Vector API] assert(r->is_XMMRegister()) failed: must be in x86_32.ad In-Reply-To: References: Message-ID: <8FL58O487np3dgZnxRy7F4jXuQtvkW98mxju7CyxvLI=.61e8058e-5dcb-4d16-a5d5-92561de2248f@github.com> On Wed, 1 Dec 2021 06:34:41 GMT, Jatin Bhateja wrote: >> Hi all, >> >> The following vector api tests fail on x86_32/AVX512 with `assert(r->is_XMMRegister()) failed: must be`. >> >> jdk/incubator/vector/Byte64VectorLoadStoreTests.java >> jdk/incubator/vector/Byte256VectorLoadStoreTests.java >> jdk/incubator/vector/Byte128VectorLoadStoreTests.java >> jdk/incubator/vector/ByteMaxVectorLoadStoreTests.java >> jdk/incubator/vector/Double256VectorTests.java >> jdk/incubator/vector/Double512VectorTests.java >> jdk/incubator/vector/DoubleMaxVectorTests.java >> jdk/incubator/vector/Float512VectorTests.java >> jdk/incubator/vector/Float256VectorTests.java >> jdk/incubator/vector/FloatMaxVectorTests.java >> jdk/incubator/vector/Float128VectorTests.java >> jdk/incubator/vector/Short128VectorLoadStoreTests.java >> jdk/incubator/vector/Short256VectorLoadStoreTests.java >> jdk/incubator/vector/Short64VectorLoadStoreTests.java >> jdk/incubator/vector/ShortMaxVectorLoadStoreTests.java >> >> >> The reason is that `static enum RC rc_class( OptoReg::Name reg )` [1] missed the case for KRegister. >> And the AVX-512 opmask specific spilling code [2] should be located before the size assert [3]. >> >> Thanks. 
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L747 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1272 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1252 > > src/hotspot/cpu/x86/x86_32.ad line 1309: > >> 1307: if( dst_second_rc == rc_int && src_second_rc == rc_stack ) >> 1308: return impl_helper(cbuf,do_size,true ,ra_->reg2offset(src_second),dst_second,0x8B,"MOV ",size, st); >> 1309: > > This change looks unrelated to opmask spilling. Thanks @jatin-bhateja for your review. Actually, the part of change just moves the AVX-512 opmask specific spilling code [1] before the size assert [2]. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1272 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1252 ------------- PR: https://git.openjdk.java.net/jdk/pull/6535 From jiefu at openjdk.java.net Wed Dec 1 07:11:27 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 07:11:27 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: Message-ID: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> On Wed, 1 Dec 2021 03:38:00 GMT, Jie Fu wrote: >>> @dholmes-ora We see about 25% gain on a micro on our latest platform. There is no cpuid bit for this, so the closest was to check for the new serialize ISA supported on this platform. >> >> It would be better to add a jmh test for this opt. >> Thanks. > >> @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java. > > So how about posting the detailed perf data before and after this patch? > Thanks. > > > @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java. > > > > > > So how about posting the detailed perf data before and after this patch? Thanks. 
>
> Before:
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  19.538 ± 0.073  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  20.513 ± 0.104  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  15.919 ± 0.652  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.669 ± 0.359  ns/op
>
> After:
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  16.957 ± 0.584  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  17.221 ± 0.036  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  12.952 ± 0.068  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  13.562 ± 0.124  ns/op

Thanks @sviswa7 for sharing.
So the performance numbers look good on Intel's latest AVX512 platform.

We don't use the 64-byte instructions by default on Intel's old AVX512 platforms, right?
If so, could this patch cause a performance regression on the old platforms?
Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6512

From jiefu at openjdk.java.net  Wed Dec  1 07:23:31 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Wed, 1 Dec 2021 07:23:31 GMT
Subject: RFR: 8277777: [Vector API] assert(r->is_XMMRegister()) failed: must be in x86_32.ad
In-Reply-To:
References:
Message-ID:

On Wed, 24 Nov 2021 11:56:44 GMT, Jie Fu wrote:

> Hi all,
>
> The following vector api tests fail on x86_32/AVX512 with `assert(r->is_XMMRegister()) failed: must be`.
> > jdk/incubator/vector/Byte64VectorLoadStoreTests.java > jdk/incubator/vector/Byte256VectorLoadStoreTests.java > jdk/incubator/vector/Byte128VectorLoadStoreTests.java > jdk/incubator/vector/ByteMaxVectorLoadStoreTests.java > jdk/incubator/vector/Double256VectorTests.java > jdk/incubator/vector/Double512VectorTests.java > jdk/incubator/vector/DoubleMaxVectorTests.java > jdk/incubator/vector/Float512VectorTests.java > jdk/incubator/vector/Float256VectorTests.java > jdk/incubator/vector/FloatMaxVectorTests.java > jdk/incubator/vector/Float128VectorTests.java > jdk/incubator/vector/Short128VectorLoadStoreTests.java > jdk/incubator/vector/Short256VectorLoadStoreTests.java > jdk/incubator/vector/Short64VectorLoadStoreTests.java > jdk/incubator/vector/ShortMaxVectorLoadStoreTests.java > > > The reason is that `static enum RC rc_class( OptoReg::Name reg )` [1] missed the case for KRegister. > And the AVX-512 opmask specific spilling code [2] should be located before the size assert [3]. > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L747 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1272 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1252 The build failure of Windows x64 has nothing to do with this change. So integrate it. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6535 From jiefu at openjdk.java.net Wed Dec 1 07:23:31 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 07:23:31 GMT Subject: Integrated: 8277777: [Vector API] assert(r->is_XMMRegister()) failed: must be in x86_32.ad In-Reply-To: References: Message-ID: On Wed, 24 Nov 2021 11:56:44 GMT, Jie Fu wrote: > Hi all, > > The following vector api tests fail on x86_32/AVX512 with `assert(r->is_XMMRegister()) failed: must be`. 
> > jdk/incubator/vector/Byte64VectorLoadStoreTests.java > jdk/incubator/vector/Byte256VectorLoadStoreTests.java > jdk/incubator/vector/Byte128VectorLoadStoreTests.java > jdk/incubator/vector/ByteMaxVectorLoadStoreTests.java > jdk/incubator/vector/Double256VectorTests.java > jdk/incubator/vector/Double512VectorTests.java > jdk/incubator/vector/DoubleMaxVectorTests.java > jdk/incubator/vector/Float512VectorTests.java > jdk/incubator/vector/Float256VectorTests.java > jdk/incubator/vector/FloatMaxVectorTests.java > jdk/incubator/vector/Float128VectorTests.java > jdk/incubator/vector/Short128VectorLoadStoreTests.java > jdk/incubator/vector/Short256VectorLoadStoreTests.java > jdk/incubator/vector/Short64VectorLoadStoreTests.java > jdk/incubator/vector/ShortMaxVectorLoadStoreTests.java > > > The reason is that `static enum RC rc_class( OptoReg::Name reg )` [1] missed the case for KRegister. > And the AVX-512 opmask specific spilling code [2] should be located before the size assert [3]. > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L747 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1272 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86_32.ad#L1252 This pull request has now been integrated. 
Changeset: 349328c9 Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/349328c929ccad242a344da69585404e4fea087f Stats: 41 lines in 1 file changed: 21 ins; 20 del; 0 mod 8277777: [Vector API] assert(r->is_XMMRegister()) failed: must be in x86_32.ad Reviewed-by: thartmann, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/6535 From shade at openjdk.java.net Wed Dec 1 08:23:31 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 1 Dec 2021 08:23:31 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: <72-LBIFmV82Z0GRieH-cxjyrub8Z-VVC4izGsW_bc4A=.1e021a98-3a4e-4ccc-87a7-c3e01c26d5e5@github.com> References: <72-LBIFmV82Z0GRieH-cxjyrub8Z-VVC4izGsW_bc4A=.1e021a98-3a4e-4ccc-87a7-c3e01c26d5e5@github.com> Message-ID: On Wed, 1 Dec 2021 02:33:18 GMT, David Holmes wrote: > @shipilev again I think we need to examine this in terms of impact to our CI. We run different platforms and configurations in different tiers so the costs are not as simple as looking at one run. Again, I can wait for those who have more insight in Oracle testing pipelines check their workflows with this change. I have no insight in Oracle infra, so somebody else have to do it. Now that Igor left, who should we be talking to? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From chagedorn at openjdk.java.net Wed Dec 1 08:27:33 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 1 Dec 2021 08:27:33 GMT Subject: RFR: 8275326: C2: assert(no_dead_loop) failed: dead loop detected [v4] In-Reply-To: <8ZBmN3KXGkCgYPCmtrkM1lYGopX_j9nc4pN21w5s4CA=.1cc50fcc-bec2-4553-8aca-7de9b04b5120@github.com> References: <8ZBmN3KXGkCgYPCmtrkM1lYGopX_j9nc4pN21w5s4CA=.1cc50fcc-bec2-4553-8aca-7de9b04b5120@github.com> Message-ID: On Mon, 29 Nov 2021 11:23:35 GMT, Christian Hagedorn wrote: >> In the test case, we apply the following optimization in `PhiNode::Ideal()` for the memory phi 989 that is on a dead path but still has both its inputs set to non-top nodes: >> https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2269-L2270 >> ![Screenshot from 2021-11-05 11-57-49](https://user-images.githubusercontent.com/17833009/140502849-9f00fd62-9714-4f54-8f98-f22f74d11430.png) >> >> In this process, we create `11853 Phi` for the new `11850 MergeMem` which is going to replace `989 Phi` (`this`). We then transform `11853 Phi` before returning: >> https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2314 >> >> During `Ideal()` for `11853 Phi`, we transform `11769 MergeMem` into top (because the base memory is top) and use this as new input instead: >> https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2230-L2240 >> >> But even if the `MergeMem` node would not be transformed into top, the slice itself could be top (L2237) and we would still replace the phi input with top. This replacement by top will fold the `11853 Phi` and we will build a cycle `11850 MergeMem` <-> `1064 StoreB` because `989 Phi` will be replaced by `11850 MergeMem`. This results in the assertion failure. 
>> >> I tried some approaches by marking `11853 Phi` and/or `989 Phi` to specially treat them during the optimizations in `Ideal()` (e.g. skipping `989 Phi` during the dead loop detection etc.) or to improve the dead loop detection before applying the `MergeMem` optimization in `Ideal()`. But that seemed rather complicated/fragile. >> >> I therefore propose to simply not transform the newly created phi nodes directly but wait instead for IGVN to revisit them again. This allows the `this` phi to be replaced with the new `MergeMem` node and the dead loop detection will work correctly when processing the new phis again later in IGVN. >> >> I could only reproduce this bug with the replay file for the attached test case in the JBS issue. The test case itself did not trigger with repeated runs with `StressIGVN` + `RepeatCompilation`. >> >> Thanks, >> Christian > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > Remove igvn checks Confirmed that previously observed regression with the first fix is not occurring anymore. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6276 From chagedorn at openjdk.java.net Wed Dec 1 08:27:33 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 1 Dec 2021 08:27:33 GMT Subject: Integrated: 8275326: C2: assert(no_dead_loop) failed: dead loop detected In-Reply-To: References: Message-ID: On Fri, 5 Nov 2021 13:00:00 GMT, Christian Hagedorn wrote: > In the test case, we apply the following optimization in `PhiNode::Ideal()` for the memory phi 989 that is on a dead path but still has both its inputs set to non-top nodes: > https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2269-L2270 > ![Screenshot from 2021-11-05 11-57-49](https://user-images.githubusercontent.com/17833009/140502849-9f00fd62-9714-4f54-8f98-f22f74d11430.png) > > In this process, we create `11853 Phi` for the new `11850 MergeMem` which is going to replace `989 Phi` (`this`). We then transform `11853 Phi` before returning: > https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2314 > > During `Ideal()` for `11853 Phi`, we transform `11769 MergeMem` into top (because the base memory is top) and use this as new input instead: > https://github.com/openjdk/jdk/blob/3c0faa73522bd004b66cb9e477f43e15a29842e6/src/hotspot/share/opto/cfgnode.cpp#L2230-L2240 > > But even if the `MergeMem` node would not be transformed into top, the slice itself could be top (L2237) and we would still replace the phi input with top. This replacement by top will fold the `11853 Phi` and we will build a cycle `11850 MergeMem` <-> `1064 StoreB` because `989 Phi` will be replaced by `11850 MergeMem`. This results in the assertion failure. > > I tried some approaches by marking `11853 Phi` and/or `989 Phi` to specially treat them during the optimizations in `Ideal()` (e.g. skipping `989 Phi` during the dead loop detection etc.) 
or to improve the dead loop detection before applying the `MergeMem` optimization in `Ideal()`. But that seemed rather complicated/fragile. > > I therefore propose to simply not transform the newly created phi nodes directly but wait instead for IGVN to revisit them again. This allows the `this` phi to be replaced with the new `MergeMem` node and the dead loop detection will work correctly when processing the new phis again later in IGVN. > > I could only reproduce this bug with the replay file for the attached test case in the JBS issue. The test case itself did not trigger with repeated runs with `StressIGVN` + `RepeatCompilation`. > > Thanks, > Christian This pull request has now been integrated. Changeset: 70d5dffb Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/70d5dffb4e7110902b59b56efaef31614916148c Stats: 16 lines in 1 file changed: 8 ins; 3 del; 5 mod 8275326: C2: assert(no_dead_loop) failed: dead loop detected Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/6276 From shade at openjdk.java.net Wed Dec 1 08:35:29 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 1 Dec 2021 08:35:29 GMT Subject: RFR: 8278037: Removed PPC32 dead code in opConvert In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 03:13:38 GMT, Ao Qi wrote: > `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. > > This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? So, shouldn't we just clean the entire C1 of PPC32 macros? PPC32 C1/C2 is not supported. builds.shipilev.net only have binaries for ppc(32)-zero, which are not affected by any of this. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From aoqi at openjdk.java.net Wed Dec 1 08:50:27 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Wed, 1 Dec 2021 08:50:27 GMT Subject: RFR: 8278037: Removed PPC32 dead code in opConvert In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 08:32:44 GMT, Aleksey Shipilev wrote: >> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. >> >> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? > > So, shouldn't we just clean the entire C1 of PPC32 macros? > > PPC32 C1/C2 is not supported. builds.shipilev.net only have binaries for ppc(32)-zero, which are not affected by any of this. @shipilev, yes, there are also some other PPC32 macros in C1. Before this patch, I didn't know "ppc32 is not supported.". I plan to remove other PPC32 macros in C1 in a new patch. Do you think I should do it in this patch? ------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From shade at openjdk.java.net Wed Dec 1 09:04:29 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 1 Dec 2021 09:04:29 GMT Subject: RFR: 8278037: Removed PPC32 dead code in opConvert In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 08:32:44 GMT, Aleksey Shipilev wrote: >> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. >> >> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? > > So, shouldn't we just clean the entire C1 of PPC32 macros? > > PPC32 C1/C2 is not supported. 
builds.shipilev.net only have binaries for ppc(32)-zero, which are not affected by any of this.

> @shipilev, yes, there are also some other PPC32 macros in C1. Before this patch, I didn't know "ppc32 is not supported.". I plan to remove other PPC32 macros in C1 in a new patch. Do you think I should do it in this patch?

I see no reason to do it separately, but whatever is simpler for you.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6625

From david.holmes at oracle.com  Wed Dec  1 09:11:46 2021
From: david.holmes at oracle.com (David Holmes)
Date: Wed, 1 Dec 2021 19:11:46 +1000
Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs
In-Reply-To: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com>
References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com>
Message-ID: <23d6d736-3559-16fe-f1f4-88efebe4dd41@oracle.com>

On 1/12/2021 5:11 pm, Jie Fu wrote:
> On Wed, 1 Dec 2021 03:38:00 GMT, Jie Fu wrote:
>
>>>> @dholmes-ora We see about 25% gain on a micro on our latest platform. There is no cpuid bit for this, so the closest was to check for the new serialize ISA supported on this platform.
>>>
>>> It would be better to add a jmh test for this opt.
>>> Thanks.
>>
>>> @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java.
>>
>> So how about posting the detailed perf data before and after this patch?
>> Thanks.
>
>>>> @DamonFool There are jmh tests for Arraycopy in test/micro/org/openjdk/bench/java/lang/Arraycopy.java.
>>>
>>>
>>> So how about posting the detailed perf data before and after this patch? Thanks.
>>
>> Before:
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> ArrayCopy.arrayCopyObject                    avgt    5  19.538 ± 0.073  ns/op
>> ArrayCopy.arrayCopyObjectNonConst            avgt    5  20.513 ± 0.104  ns/op
>> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  15.919 ± 0.652  ns/op
>> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.669 ± 0.359  ns/op
>>
>> After:
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> ArrayCopy.arrayCopyObject                    avgt    5  16.957 ± 0.584  ns/op
>> ArrayCopy.arrayCopyObjectNonConst            avgt    5  17.221 ± 0.036  ns/op
>> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  12.952 ± 0.068  ns/op
>> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  13.562 ± 0.124  ns/op
>
> Thanks @sviswa7 for sharing.
> So the performance numbers look good on Intel's latest AVX512 platform.
>
> We don't use the 64-byte instructions by default on Intel's old AVX512 platforms, right?
> If so, could this patch cause a performance regression on the old platforms?
> Thanks.

The old platforms, for which serialize() is not true, will just use AVX3Threshold as they do today.

David
----

> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/6512
>

From shade at openjdk.java.net  Wed Dec  1 09:13:36 2021
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Wed, 1 Dec 2021 09:13:36 GMT
Subject: RFR: 8277893: Arraycopy stress tests [v2]
In-Reply-To:
References:
Message-ID: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com>

> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable.
> > A brief tour of these tests: > > - Tests all data types; > - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; > - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; > - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; > - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; > > My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. > > Test times: > > > # x86_64 (TR 3970X) > real 6m37.855s > user 56m23.004s > sys 0m20.148s > > # x86_32 (TR 3970X) > real 11m22.877s > user 168m8.137s > sys 5m7.037s > > # x86_64 (i5-11500) > real 15m55.424s > user 118m0.969s > sys 0m12.039s > > # AArch64 (ThunderX2) > real 4m5.177s > user 32m7.295s > sys 0m19.689s > > > Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. 
> > Additional testing: > - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` > - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` > - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: - Separate test group and hooks into hotspot_slow_compiler - Trim down MAX_SIZE and explain the choice ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6594/files - new: https://git.openjdk.java.net/jdk/pull/6594/files/678086a7..da7ed51e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=00-01 Stats: 17 lines in 2 files changed: 12 ins; 1 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/6594.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6594/head:pull/6594 PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Wed Dec 1 09:13:36 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 1 Dec 2021 09:13:36 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: References: <86VRDdE8F6Q0b4CNj2otyPX2z07QA1fBlV0TN0Vn1cs=.43f1fcde-4727-4925-a7fd-51afca9d30cf@github.com> Message-ID: On Tue, 30 Nov 2021 20:45:35 GMT, Vladimir Kozlov wrote: >> Yes, we can. Actually, working on #6622, I realized these test groups would be introduced anyway. So these new arraycopy tests should probably go to `hotspot_slow_compiler` group, along with other `stress` tests. This would hook arraycopy tests into `hotspot:tier3` automatically if #6622 lands. Tell me if you still want a completely separate test group, or `hotspot_slow_compiler` is enough for current Oracle testing infra. > > Please, create separate test group and add it to `hotspot_slow_compiler`. We would not need to change infra settings if more testing is added to this new group later. Done in new commit. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Wed Dec 1 09:13:37 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 1 Dec 2021 09:13:37 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <9L5CHY8n-6csbW9jfsnXt4pSqnabXH5R7dt2pZFDmdA=.e53d343e-f7fd-46b1-a8af-02dba3fad3ec@github.com> References: <86VRDdE8F6Q0b4CNj2otyPX2z07QA1fBlV0TN0Vn1cs=.43f1fcde-4727-4925-a7fd-51afca9d30cf@github.com> <9L5CHY8n-6csbW9jfsnXt4pSqnabXH5R7dt2pZFDmdA=.e53d343e-f7fd-46b1-a8af-02dba3fad3ec@github.com> Message-ID: On Tue, 30 Nov 2021 21:22:09 GMT, Aleksey Shipilev wrote: >> Okay. I was concern because of times you show. I am fine with running tests upto 10-15 mins but not this: >> >> # x86_64 (i5-11500) >> real 41m32.622s >> user 447m19.986s >> sys 0m21.026s >> >> >> Do you know why it takes so much time on it? > > That small machine has very slow memory compared to other ones. The parallelism in stress tests (9 types, 2 forked VMs each) puts that machine on its knees. There is a blurb about that effect here: https://github.com/openjdk/jdk/pull/6594/files#diff-f72fee20a49daaf4e05002372e93f426407ecd429a227393e2ec79e821042c90R40-R47 -- I don't think it would matter much if we trim `MAX_SIZE`, but I'll try tomorrow. Edit: I also remembered that machine also the only AVX-512 capable one in the mix, so the power/frequency mess that AVX-512 is probably does not help, will look into it tomorrow too. All right. `MAX_SIZE` actually makes a lot of difference for that machine. I trimmed it down to 128K to cater for 64K pages, and added some explanation for the choice. See new commit. Also updated the PR body with new timings. 
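
For readers following the thread, the size selection described in the PR body (fuzzing around powers of two and powers of ten, capped at the trimmed 128K `MAX_SIZE`) could be sketched roughly as below. This is only an illustration under stated assumptions: the class name `FuzzedSizes`, the fuzz radius of 2, and the helper methods are hypothetical and are not taken from the PR, and the real tests cover all data types and VM modes rather than a single `int[]` copy.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class FuzzedSizes {
    // Assumption: mirrors the 128K cap mentioned in the thread.
    static final int MAX_SIZE = 128 * 1024;
    // Hypothetical fuzz radius: how far around each power we probe.
    static final int FUZZ = 2;

    // Collects sizes within +/- fuzz of every power of 2 and power of 10 up to maxSize.
    static TreeSet<Integer> sizes(int maxSize, int fuzz) {
        TreeSet<Integer> out = new TreeSet<>();
        for (int base : new int[] {2, 10}) {
            for (long p = 1; p <= maxSize; p *= base) {
                for (long s = p - fuzz; s <= p + fuzz; s++) {
                    if (s >= 0 && s <= maxSize) {
                        out.add((int) s);
                    }
                }
            }
        }
        return out;
    }

    // Disjoint copy of an int[] of the given size, verified element by element.
    static boolean copyMatches(int size) {
        int[] src = new int[size];
        for (int i = 0; i < size; i++) {
            src[i] = i;
        }
        int[] dst = new int[size];
        System.arraycopy(src, 0, dst, 0, size);
        return Arrays.equals(src, dst);
    }

    public static void main(String[] args) {
        for (int size : sizes(MAX_SIZE, FUZZ)) {
            if (!copyMatches(size)) {
                throw new AssertionError("Mismatch at size " + size);
            }
        }
        System.out.println("All fuzzed sizes copied correctly");
    }
}
```

Probing a few elements on either side of each power is meant to hit the edges where a copy stub switches between its vector loop and its tail handling; a smaller cap mainly bounds the memory traffic per run.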
------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From neliasso at openjdk.java.net Wed Dec 1 09:23:28 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 1 Dec 2021 09:23:28 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> Message-ID: On Wed, 1 Dec 2021 07:08:33 GMT, Jie Fu wrote: > > We don't use the 64-byte instructions by default on Intel's old AVX512 platforms, right? If so, is a performance regression possible on the old platforms after this patch? Thanks. As I understand it - old AVX512 platforms will continue to work as before. The new case is that new platforms (that have avx_threshold set to 0) will use 64 byte instructions. But it would be nice to have some benchmarks that verify that there are no regressions on old AVX512 hardware. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From neliasso at openjdk.java.net Wed Dec 1 09:29:28 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 1 Dec 2021 09:29:28 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 00:10:39 GMT, Sandhya Viswanathan wrote: >> Currently 32-byte instructions are used for small array copy and clear. >> This can be optimized by using 64-byte instructions. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace I am happy with the change but would like to see some benchmarks that verify that there are no regressions - [before/after]*[avx2/old avx512/new avx512]. You have already posted some of them - please complete with the missing ones. ------------- Changes requested by neliasso (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6512 From jiefu at openjdk.java.net Wed Dec 1 09:42:30 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 09:42:30 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> Message-ID: <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> On Wed, 1 Dec 2021 09:20:48 GMT, Nils Eliasson wrote: > As I understand it - old AVX512 platforms will continue to work as before. According to @sviswa7 's comments (no CPUID bit for the latest ISA), `is_intel_family_core() && supports_serialize()` can't distinguish all the old AVX512 platforms from the latest ones. So I think it may be possible that some old AVX512 machines will behave differently after this opt. @sviswa7 , can you further explain what the difference in the 64-byte instructions is between Intel's old and latest AVX512 platforms? Why can't we enable them by default on old platforms? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From duke at openjdk.java.net Wed Dec 1 10:13:54 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Wed, 1 Dec 2021 10:13:54 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v2] In-Reply-To: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> Message-ID: <_UPeVsHsTFqDNA9DJ5zfJzW1WO9lIXbVJxQPqw-GnF4=.3d05c1e0-4492-451a-b836-80af2e8e9e90@github.com> > Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code.
> > With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts that the two lists correspond are no longer needed. > > While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. > > This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). > > This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. > > This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. > > The commit history has been polished to attempt to guide the reader through the changes. > > hs-tier1 and hs-tier2 tests pass. 
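The "single place" idea in the description above can be pictured with a small model. This is a simplified sketch written in Java purely for illustration; C1 itself is C++, and the class names, method names and error handling below are assumptions, not the actual `BlockBegin`/`BlockEnd` API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified model for illustration only; names and shapes are assumptions.
final class BlockEndModel {
    // The only successor list that exists in this model.
    private final List<BlockBeginModel> sux = new ArrayList<>();

    void addSux(BlockBeginModel successor) {
        sux.add(successor);
    }

    List<BlockBeginModel> sux() {
        return Collections.unmodifiableList(sux);
    }
}

final class BlockBeginModel {
    private final int bci;
    private BlockEndModel end; // stays null until the block has been completed

    BlockBeginModel(int bci) {
        this.bci = bci;
    }

    int bci() {
        return bci;
    }

    void setEnd(BlockEndModel end) {
        this.end = end;
    }

    boolean hasEnd() {
        return end != null;
    }

    // No second list here: every successor query is forwarded to the end node.
    int numberOfSux() {
        return requireEnd().sux().size();
    }

    BlockBeginModel suxAt(int index) {
        return requireEnd().sux().get(index);
    }

    private BlockEndModel requireEnd() {
        if (end == null) {
            throw new IllegalStateException("block at bci " + bci + " has no end yet");
        }
        return end;
    }
}
```

Because the begin node only forwards to the end node, a change made through the end node is visible immediately and there is nothing to keep in sync. The price is that successors cannot be queried before an end is assigned, which is the gap the description fills with a temporary lookup structure during graph building.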
Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: remove stray comment ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6614/files - new: https://git.openjdk.java.net/jdk/pull/6614/files/284cc91a..ec61358b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6614&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6614&range=00-01 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6614.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6614/head:pull/6614 PR: https://git.openjdk.java.net/jdk/pull/6614 From duke at openjdk.java.net Wed Dec 1 10:13:58 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Wed, 1 Dec 2021 10:13:58 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v2] In-Reply-To: <2qmMSA8n_RWx9P5NhUOZUOzEhjJgeMgItBMMt_jC_aI=.56ae7470-a415-4506-a0dc-206ffaffb2c3@github.com> References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> <2qmMSA8n_RWx9P5NhUOZUOzEhjJgeMgItBMMt_jC_aI=.56ae7470-a415-4506-a0dc-206ffaffb2c3@github.com> Message-ID: On Tue, 30 Nov 2021 15:52:32 GMT, Nils Eliasson wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> remove stray comment > > src/hotspot/share/c1/c1_GraphBuilder.cpp line 3397: > >> 3395: #endif >> 3396: >> 3397: // JANIUK: If we iterate all the blocks in _blocks, some of them have end NULL. > > Left over comment? Indeed! Removed it ------------- PR: https://git.openjdk.java.net/jdk/pull/6614 From roland at openjdk.java.net Wed Dec 1 10:45:02 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 1 Dec 2021 10:45:02 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP Message-ID: This failure occurs because the iv phi of a long counted loop has the wrong type after CCP. 
That happens because, during CCP, after the type of the limit of the counted loop is updated, the type of the iv phi is not recomputed. The fix is to apply to long counted loops the logic that already exists for int counted loops. ------------- Commit messages: - test - fix Changes: https://git.openjdk.java.net/jdk/pull/6632/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6632&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277906 Stats: 66 lines in 2 files changed: 61 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6632.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6632/head:pull/6632 PR: https://git.openjdk.java.net/jdk/pull/6632 From simonis at openjdk.java.net Wed Dec 1 11:09:55 2021 From: simonis at openjdk.java.net (Volker Simonis) Date: Wed, 1 Dec 2021 11:09:55 GMT Subject: RFR: 8273563: Improve performance of implicit exceptions with -XX:-OmitStackTraceInFastThrow [v11] In-Reply-To: References: Message-ID: > Currently, if running with `-XX:-OmitStackTraceInFastThrow`, C2 has no way to create implicit exceptions like AIOOBE, NullPointerExceptions, etc. in compiled code. This means that such methods will always be deoptimized and re-executed in the interpreter if such exceptions happen. > > If implicit exceptions are used for normal control flow, that can have a dramatic impact on performance. A prominent example of such code is [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274): > > public static boolean isAlpha(int c) { > try { > return IS_ALPHA[c]; > } catch (ArrayIndexOutOfBoundsException ex) { > return false; > } > } > > > ### Solution > > Instead of deoptimizing and resorting to the interpreter, we can generate code which allocates and initializes the corresponding exceptions right in compiled code. 
This results in a tenfold performance improvement for the above code: > > -XX:-OmitStackTraceInFastThrow -XX:-OptimizeImplicitExceptions > Benchmark (exceptionProbability) Mode Cnt Score Error Units > ImplicitExceptions.bench 0.0 avgt 5 1.430 ± 0.353 ns/op > ImplicitExceptions.bench 0.33 avgt 5 3563.038 ± 77.358 ns/op > ImplicitExceptions.bench 0.66 avgt 5 8609.693 ± 1205.104 ns/op > ImplicitExceptions.bench 1.00 avgt 5 12842.401 ± 1022.728 ns/op > > -XX:-OmitStackTraceInFastThrow -XX:+OptimizeImplicitExceptions > Benchmark (exceptionProbability) Mode Cnt Score Error Units > ImplicitExceptions.bench 0.0 avgt 5 1.432 ± 0.352 ns/op > ImplicitExceptions.bench 0.33 avgt 5 355.723 ± 16.641 ns/op > ImplicitExceptions.bench 0.66 avgt 5 887.068 ± 166.728 ns/op > ImplicitExceptions.bench 1.00 avgt 5 1274.418 ± 88.235 ns/op > > > ### Implementation details > > - The new optimization is guarded by the option `OptimizeImplicitExceptions` which is on by default. > - In `GraphKit::builtin_throw()` we can't simply use `CallGenerator::for_direct_call()` to create a `DirectCallGenerator` for the call to the exception's `<init>` function because `DirectCallGenerator` assumes in various places that calls are only issued at `invoke*` bytecodes. This is not true in general for bytecode which can cause an implicit exception. > - Instead, we manually wire up the call based on the code in `DirectCallGenerator::generate()`. > - We use a trick similar to the one for method handle intrinsics, where the callee from the bytecode is replaced by a direct call and this fact is recorded in the call's `_override_symbolic_info` field. For calling constructors of implicit exceptions I've introduced the new field `_implicit_exception_init`. This field is also used in various assertions to prevent queries for the bytecode's symbolic method information which doesn't exist because we're not at an `invoke*` bytecode at the place where we generate the call. 
> - The PR contains a micro-benchmark which compares the old and the new implementation for [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274). Except for the trivial case where the exception probability is 0 (i.e. no exceptions are happening at all) the new implementation is about 10 times faster. Volker Simonis has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Fix jit/t/t105/t105.java to also use -XX:-OptimizeImplicitExceptions in addition to -XX:-OmitStacktracesInFastThrow - Fix IR Framework test Traps::classCheck() which now behaves differently with -XX:+OptimizeImplicitExceptions - Added jtreg test and extended the Whitebox API to export decompile, deopt and trap counters Rebased on top of '8275908: Record null_check traps for calls and array_check traps in the interpreter' - Fix special case where we're creating an implicit exception for a regular invoke* bytecode - Minor updates as requested by @TheRealMDoerr - 8273563: Improve performance of implicit exceptions with -XX:-OmitStackTraceInFastThrow ------------- Changes: https://git.openjdk.java.net/jdk/pull/5488/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=5488&range=10 Stats: 238 lines in 13 files changed: 214 ins; 0 del; 24 mod Patch: https://git.openjdk.java.net/jdk/pull/5488.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/5488/head:pull/5488 PR: https://git.openjdk.java.net/jdk/pull/5488 From simonis at openjdk.java.net Wed Dec 1 11:12:34 2021 From: simonis at openjdk.java.net (Volker Simonis) Date: Wed, 1 Dec 2021 11:12:34 GMT Subject: RFR: 8273563: Improve performance of implicit exceptions with -XX:-OmitStackTraceInFastThrow [v10] In-Reply-To: <3DyX38fUwXmYfYuInLP-xhm1toijhtr2U7pHK2zhNqU=.b91e17bd-bea6-4323-96e0-03c59e3f0573@github.com> References: 
<3DyX38fUwXmYfYuInLP-xhm1toijhtr2U7pHK2zhNqU=.b91e17bd-bea6-4323-96e0-03c59e3f0573@github.com> Message-ID: On Thu, 18 Nov 2021 10:21:01 GMT, Volker Simonis wrote: >> Currently, if running with `-XX:-OmitStackTraceInFastThrow`, C2 has no way to create implicit exceptions like AIOOBE, NullPointerExceptions, etc. in compiled code. This means that such methods will always be deoptimized and re-executed in the interpreter if such exceptions happen. >> >> If implicit exceptions are used for normal control flow, that can have a dramatic impact on performance. A prominent example of such code is [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274): >> >> public static boolean isAlpha(int c) { >> try { >> return IS_ALPHA[c]; >> } catch (ArrayIndexOutOfBoundsException ex) { >> return false; >> } >> } >> >> >> ### Solution >> >> Instead of deoptimizing and resorting to the interpreter, we can generate code which allocates and initializes the corresponding exceptions right in compiled code. This results in a tenfold performance improvement for the above code: >> >> -XX:-OmitStackTraceInFastThrow -XX:-OptimizeImplicitExceptions >> Benchmark (exceptionProbability) Mode Cnt Score Error Units >> ImplicitExceptions.bench 0.0 avgt 5 1.430 ± 0.353 ns/op >> ImplicitExceptions.bench 0.33 avgt 5 3563.038 ± 77.358 ns/op >> ImplicitExceptions.bench 0.66 avgt 5 8609.693 ± 1205.104 ns/op >> ImplicitExceptions.bench 1.00 avgt 5 12842.401 ± 1022.728 ns/op >> >> -XX:-OmitStackTraceInFastThrow -XX:+OptimizeImplicitExceptions >> Benchmark (exceptionProbability) Mode Cnt Score Error Units >> ImplicitExceptions.bench 0.0 avgt 5 1.432 ± 0.352 ns/op >> ImplicitExceptions.bench 0.33 avgt 5 355.723 ± 16.641 ns/op >> ImplicitExceptions.bench 0.66 avgt 5 887.068 ± 166.728 ns/op >> ImplicitExceptions.bench 1.00 avgt 5 1274.418 ± 
88.235 ns/op >> >> >> ### Implementation details >> >> - The new optimization is guarded by the option `OptimizeImplicitExceptions` which is on by default. >> - In `GraphKit::builtin_throw()` we can't simply use `CallGenerator::for_direct_call()` to create a `DirectCallGenerator` for the call to the exception's `<init>` function because `DirectCallGenerator` assumes in various places that calls are only issued at `invoke*` bytecodes. This is not true in general for bytecode which can cause an implicit exception. >> - Instead, we manually wire up the call based on the code in `DirectCallGenerator::generate()`. >> - We use a trick similar to the one for method handle intrinsics, where the callee from the bytecode is replaced by a direct call and this fact is recorded in the call's `_override_symbolic_info` field. For calling constructors of implicit exceptions I've introduced the new field `_implicit_exception_init`. This field is also used in various assertions to prevent queries for the bytecode's symbolic method information which doesn't exist because we're not at an `invoke*` bytecode at the place where we generate the call. >> - The PR contains a micro-benchmark which compares the old and the new implementation for [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274). Except for the trivial case where the exception probability is 0 (i.e. no exceptions are happening at all) the new implementation is about 10 times faster. > > Volker Simonis has updated the pull request with a new target base due to a merge or a rebase. I've rebased the PR on top of [JDK-8275908](https://bugs.openjdk.java.net/browse/JDK-8275908) which already adds the required WhiteBox functionality and considerably simplifies the test for this change. Awaiting @vnkozlov review. 
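To see the control-flow pattern under discussion in isolation, here is a minimal stand-alone sketch. It is not Tomcat's code: the class name, the contents of `IS_ALPHA` (ASCII letters only) and the explicit-check variant are assumptions added for comparison.

```java
// Hypothetical stand-alone illustration; not Tomcat's actual HttpParser.
final class ImplicitExceptionDemo {
    // Assumed lookup table: true for ASCII letters only.
    private static final boolean[] IS_ALPHA = new boolean[128];

    static {
        for (int c = 'a'; c <= 'z'; c++) {
            IS_ALPHA[c] = true;
        }
        for (int c = 'A'; c <= 'Z'; c++) {
            IS_ALPHA[c] = true;
        }
    }

    // Same shape as the quoted method: out-of-range input is handled by
    // catching the implicit ArrayIndexOutOfBoundsException.
    static boolean isAlphaImplicit(int c) {
        try {
            return IS_ALPHA[c];
        } catch (ArrayIndexOutOfBoundsException ex) {
            return false;
        }
    }

    // Equivalent explicit range check, which never throws.
    static boolean isAlphaExplicit(int c) {
        return c >= 0 && c < IS_ALPHA.length && IS_ALPHA[c];
    }

    public static void main(String[] args) {
        int[] samples = {'a', 'Z', '1', -1, 1000};
        for (int c : samples) {
            System.out.println(c + " -> " + isAlphaImplicit(c) + " / " + isAlphaExplicit(c));
        }
    }
}
```

Both variants return the same answers; they differ only in how out-of-range input is rejected. The implicit variant is the one whose cost depends on how the JIT handles the thrown exception, which is what the benchmark numbers quoted above measure.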
------------- PR: https://git.openjdk.java.net/jdk/pull/5488 From jbhateja at openjdk.java.net Wed Dec 1 11:35:55 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 1 Dec 2021 11:35:55 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3] In-Reply-To: References: Message-ID: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> > - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive types have the same size. > - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets. > > Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java) > > System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt) > -- | -- | -- | -- | -- | -- | -- | -- > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00 > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00 > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00 > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52 > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66 > TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27 > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8277793: Review comments resolution ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6544/files - new: https://git.openjdk.java.net/jdk/pull/6544/files/d01d938a..9f784eb9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6544&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6544&range=01-02 Stats: 25 lines in 3 files changed: 19 ins; 3 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/6544.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6544/head:pull/6544 PR: https://git.openjdk.java.net/jdk/pull/6544 From jbhateja at openjdk.java.net Wed Dec 1 11:36:01 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 1 Dec 2021 11:36:01 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 21:22:44 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8277793: Further optimizing instruction sequence. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4077: > >> 4075: Label done; >> 4076: evcvttpd2qq(dst, src, vec_enc); >> 4077: evmovdqul(xtmp1, k0, double_sign_flip, true, vec_enc, scratch); > > merge masking should be false here. The K0 register will enable all the lanes, hence the true/false value will not change the semantics. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4088: > >> 4086: kxorwl(ktmp1, ktmp1, ktmp2); >> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc); >> 4088: vpternlogq(xtmp2, 0x11, xtmp1, xtmp1, vec_enc); > > Consider moving the vpternlog instruction earlier after line 4082 using xtmp1 as the destination. > vpternlogq(xtmp1, 0x01, xtmp2, xtmp2, vec_enc); > Then xtmp1 can be used in the following evmovdquq. 
> > This will help to absorb the latency of vpternlogq. evcmppd and vpternlog should be issued in parallel to the execution ports, given that there is no dependency between them. Since the succeeding instruction has a data dependency on both of these instructions, it can be issued only once both of its operands are ready. Since evcmppd has the higher latency, it will mask the latency of vpternlog. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4112: > >> 4110: >> 4111: vpcmpeqd(xtmp4, xtmp4, xtmp4, vec_enc); >> 4112: vpxor(xtmp1, xtmp1, xtmp4, vec_enc); > > vpcmpeqd is a high latency instruction. This constant (0x7FFF...) can be formed earlier immediately after 4099, when xtmp1 becomes available. This is on a slow path which handles special values; moving it prior to 4099 will penalize the fast path. ------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From jbhateja at openjdk.java.net Wed Dec 1 11:39:29 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 1 Dec 2021 11:39:29 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3] In-Reply-To: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> Message-ID: On Wed, 1 Dec 2021 11:35:55 GMT, Jatin Bhateja wrote: >> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive types have the same size. >> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets. 
>> >> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java) >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt) >> -- | -- | -- | -- | -- | -- | -- | -- >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277793: Review comments resolution @sviswa7 , @neliasso , all your outstanding comments addressed. 
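As background for the fast-path/slow-path discussion in this thread, the scalar behaviour that any vectorized cast has to reproduce can be shown with plain Java loops. The sketch below is not the JMH benchmark from the PR; the class and method names are hypothetical. The cast rules themselves come from the Java language: NaN converts to 0, out-of-range values saturate to the minimum or maximum of the target type, and everything else rounds toward zero. Those inputs are presumably the "special values" the slow path refers to.

```java
import java.util.Arrays;

// Scalar reference loops; class and method names are hypothetical.
final class CastSemanticsSketch {
    // float -> int, same element size (4 bytes) on both sides.
    static void f2i(float[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (int) src[i];
        }
    }

    // double -> long, same element size (8 bytes) on both sides.
    static void d2l(double[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (long) src[i];
        }
    }

    public static void main(String[] args) {
        float[] floats = {1.9f, -1.9f, Float.NaN,
                          Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, 3e9f};
        int[] ints = new int[floats.length];
        f2i(floats, ints);
        System.out.println(Arrays.toString(ints));

        double[] doubles = {2.5, Double.NaN, 1e30, -1e30};
        long[] longs = new long[doubles.length];
        d2l(doubles, longs);
        System.out.println(Arrays.toString(longs));
    }
}
```

Ordinary in-range inputs only need truncation, which is why it is attractive to keep the special-value handling off the common path, as argued in the review comments.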
------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From jbhateja at openjdk.java.net Wed Dec 1 11:39:30 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 1 Dec 2021 11:39:30 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 11:31:57 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4112: >> >>> 4110: >>> 4111: vpcmpeqd(xtmp4, xtmp4, xtmp4, vec_enc); >>> 4112: vpxor(xtmp1, xtmp1, xtmp4, vec_enc); >> >> vpcmpeqd is a high latency instruction. This constant (0x7FFF...) can be formed earlier immediately after 4099, when xtmp1 becomes available. > > This is on a slow path which handles special values, moving it prior to 4099 will penalize fast path. Moving flipping pattern after fast path. ------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From neliasso at openjdk.java.net Wed Dec 1 12:13:29 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 1 Dec 2021 12:13:29 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v2] In-Reply-To: <_UPeVsHsTFqDNA9DJ5zfJzW1WO9lIXbVJxQPqw-GnF4=.3d05c1e0-4492-451a-b836-80af2e8e9e90@github.com> References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> <_UPeVsHsTFqDNA9DJ5zfJzW1WO9lIXbVJxQPqw-GnF4=.3d05c1e0-4492-451a-b836-80af2e8e9e90@github.com> Message-ID: On Wed, 1 Dec 2021 10:13:54 GMT, Ludvig Janiuk wrote: >> Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code. >> >> With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. 
When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts that the two lists correspond are no longer needed. >> >> While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. >> >> This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). >> >> This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. >> >> This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. >> >> The commit history has been polished to attempt to guide the reader through the changes. >> >> hs-tier1 and hs-tier2 tests pass. > > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > remove stray comment src/hotspot/share/c1/c1_LIR.cpp line 1589: > 1587: } > 1588: > 1589: if (end != NULL) Missing braces on if statement ------------- PR: https://git.openjdk.java.net/jdk/pull/6614 From neliasso at openjdk.java.net Wed Dec 1 12:24:27 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 1 Dec 2021 12:24:27 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3] In-Reply-To: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> Message-ID: On Wed, 1 Dec 2021 11:35:55 GMT, Jatin Bhateja wrote: >> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive types have the same size. 
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets. >> >> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java) >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt) >> -- | -- | -- | -- | -- | -- | -- | -- >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277793: Review comments resolution Looks good! ------------- Marked as reviewed by neliasso (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6544 From rkennke at openjdk.java.net Wed Dec 1 12:36:02 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Wed, 1 Dec 2021 12:36:02 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: References: Message-ID: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> > The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. > I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. > > The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors a develop flag (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work. 
> > Testing: > - [x] tier1 > - [x] tier2 > - [x] tier3 > - [ ] tier4 Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Use heavy monitors in runtime only on supported architectures ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6320/files - new: https://git.openjdk.java.net/jdk/pull/6320/files/d1ec5b65..79090ed1 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6320&range=09 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6320&range=08-09 Stats: 10 lines in 1 file changed: 8 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6320.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6320/head:pull/6320 PR: https://git.openjdk.java.net/jdk/pull/6320 From rkennke at openjdk.java.net Wed Dec 1 12:38:26 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Wed, 1 Dec 2021 12:38:26 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v4] In-Reply-To: References: <7GuyRoJ653qrQDv-vEnRU7JMcZU6qZJi0j7Ty1b5PE4=.c7d00b3d-c23b-47f9-bfb6-258623c2faae@github.com> Message-ID: <245yKlIBwKaEFaRmxc3abYF3EveGUgezLqc0Vd7PY54=.912ff7ec-3bf1-46e5-a573-51ac9df83246@github.com> On Wed, 1 Dec 2021 01:28:58 GMT, David Holmes wrote: > > > IIUC you are only making UseHeavyMonitors work properly on x86_64, but in that case you cannot convert UseFastLocks to UseHeavyMonitors on all platforms as it won't work correctly on those other platforms. > > > Cheers, David > > > > > > It would not break as such on other platforms. It would only be partially implemented, that is C1 would emit calls to the runtime and only use monitors while the interpreter and C2 would still emit stack locks. That is ok - and that is roughly what +UseFastLocking used to do. > > Sorry but I don't see how having the interpreter+C2 use stack-locks while C1 ignores them can possibly be correct. ??? Ok, right. 
It worked before, because -UseFastLocking (C1) and +UseHeavyMonitors (interpreter) would generate runtime calls (instead of fast stack locking paths), and the runtime implementation would still do stack-locking. For arches where UseHeavyMonitors is not (fully) supported, I am fixing this by letting the runtime do stack-locks. TBH, it would be nice if this change could be properly implemented on remaining arches... (ping @TheRealMDoerr for PPC, not sure who could do arm or s390). ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From dholmes at openjdk.java.net Wed Dec 1 12:44:26 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Wed, 1 Dec 2021 12:44:26 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> Message-ID: On Wed, 1 Dec 2021 09:39:18 GMT, Jie Fu wrote: > > As I understand it - old AVX512 platforms will continue to work as before. > > According to @sviswa7 's comments (no cpuid bit for the latest ISA), `is_intel_family_core() && supports_serialize()` can't distinguish all the old AVX512 platforms from the latest ones. So I think it may be possible some old AVX512 machines will behave differently after this opt. I do not see such comments. From my previous questions on this it was indicated that any CPU that supports `serialize` has the improved performance. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From thartmann at openjdk.java.net Wed Dec 1 13:01:22 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 1 Dec 2021 13:01:22 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 10:35:32 GMT, Roland Westrelin wrote: > This failure occurs because the iv phi of a long counted loop has the > wrong type after CCP. That happens because, during CCP, after the type > of the limit of the counted loop is updated, the type of the iv phi is > not recomputed. The fix is to apply to long counted loops the logic > that already exists for int counted loop. Looks good to me but what about this code? https://github.com/openjdk/jdk/blob/37ff7f3b66eaa74d62d6a93f2f34ec744db21834/src/hotspot/share/opto/phaseX.cpp#L1573-L1575 ------------- PR: https://git.openjdk.java.net/jdk/pull/6632 From roland at openjdk.java.net Wed Dec 1 13:14:26 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 1 Dec 2021 13:14:26 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v2] In-Reply-To: References: <0GbtDPLsXoyXYNJv2ZJ6vwTSdbzJOXqWUNENCTCrZmA=.d0b24bd8-be69-435d-8db1-d291afcd7f62@github.com> Message-ID: On Tue, 30 Nov 2021 18:49:17 GMT, Dean Long wrote: > Out of curiosity, I checked when the "Keep its stack, for now" comment was added, and it was for JDK-4432078. The comment in the bug says: > > "Missing stack contents in graphkit-based exceptions make it impossible to re-run the trapping bytecode in the interpreter. Fix is to retain stack information a little longer." Thanks for investigating that further. Where would deoptimization occur then? In Parse::catch_inline_exceptions() when the right exception handlers is picked with subtype checks? I don't see any uncommon trap there. Maybe the requirement to keep the stack is no longer necessary? 
The reason I didn't go with Vladimir's work around for 8273165 is that I think it could have a performance impact that would be more likely to be noticed than in the case of 8273165 (because late inlining of method handle has been around for many releases and is likely something that's relied on). We could extend Vladimir's work around by checking that the receiver may be null and that the null check would cause the exception to be thrown rather than deoptimization. Another way to deal with this could be to pop the stack if the current method has no exception handlers because then the exception is passed on to the caller and the entire frame is popped anyway. That would work nicely for this case as AFAIU, the method handle invoker can only be inlined from a lambda form that wouldn't have exception handlers. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From roland at openjdk.java.net Wed Dec 1 13:15:24 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 1 Dec 2021 13:15:24 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 10:35:32 GMT, Roland Westrelin wrote: > This failure occurs because the iv phi of a long counted loop has the > wrong type after CCP. That happens because, during CCP, after the type > of the limit of the counted loop is updated, the type of the iv phi is > not recomputed. The fix is to apply to long counted loops the logic > that already exists for int counted loop. Thanks for the review. > Looks good to me but what about this code? Good catch. I didn't notice that one. I think it is specific to int counted loops when the pre/main/post loops are created. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6632 From duke at openjdk.java.net Wed Dec 1 13:20:49 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Wed, 1 Dec 2021 13:20:49 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v3] In-Reply-To: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> Message-ID: > Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code. > > With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts on the two lists corresponding no longer need to be made. > > While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. > > This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). > > This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. > > This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. > > The commit history has been polished to attempt to guide the reader through the changes. > > hs-tier1 and hs-tier2 tests pass. 
Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: changed if statement style ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6614/files - new: https://git.openjdk.java.net/jdk/pull/6614/files/ec61358b..dbe13012 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6614&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6614&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6614.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6614/head:pull/6614 PR: https://git.openjdk.java.net/jdk/pull/6614 From duke at openjdk.java.net Wed Dec 1 13:20:53 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Wed, 1 Dec 2021 13:20:53 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v2] In-Reply-To: References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> <_UPeVsHsTFqDNA9DJ5zfJzW1WO9lIXbVJxQPqw-GnF4=.3d05c1e0-4492-451a-b836-80af2e8e9e90@github.com> Message-ID: On Wed, 1 Dec 2021 12:10:01 GMT, Nils Eliasson wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> remove stray comment > > src/hotspot/share/c1/c1_LIR.cpp line 1589: > >> 1587: } >> 1588: >> 1589: if (end != NULL) > > Missing braces on if statement Corrected it ------------- PR: https://git.openjdk.java.net/jdk/pull/6614 From aph at openjdk.java.net Wed Dec 1 13:28:28 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 1 Dec 2021 13:28:28 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). 
> > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. MD5 has been proven insecure, and its weaknesses have been exploited in the field. It is disabled in many systems. I am surprised that we are thinking of accelerating it for possible future use, and that we're adding a worse-than-useless crypto algorithm to the AArch64 startup. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From thartmann at openjdk.java.net Wed Dec 1 13:33:26 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 1 Dec 2021 13:33:26 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 10:35:32 GMT, Roland Westrelin wrote: > This failure occurs because the iv phi of a long counted loop has the > wrong type after CCP. That happens because, during CCP, after the type > of the limit of the counted loop is updated, the type of the iv phi is > not recomputed. The fix is to apply to long counted loops the logic > that already exists for int counted loop. Okay, makes sense. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6632 From duke at openjdk.java.net Wed Dec 1 13:37:27 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Wed, 1 Dec 2021 13:37:27 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 22:44:59 GMT, Zang, Zhiqiang wrote: >> Suggest two new optimizations that can be done in SubINode::Ideal. > > Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision: > > add ok_to_convert to the condition. src/hotspot/share/opto/subnode.cpp line 195: > 193: if (in2->Opcode() == Op_AddI > 194: && phase->type(in1)->isa_int() != NULL > 195: && phase->type(in1)->isa_int()->is_con() Lines 194 and 195 can be expressed as `in1->Opcode() == Op_ConI`, same for the next 2 lines. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From phedlin at openjdk.java.net Wed Dec 1 13:44:30 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Wed, 1 Dec 2021 13:44:30 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). > > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. Fair point. But does that also rule out all uses, as long as it's supported. Not all hashes have to be exposed and on the upside, it's rather fast (well, it's also one of its down sides). 
Who should make the choice to use it or not? ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From neliasso at openjdk.java.net Wed Dec 1 13:46:22 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 1 Dec 2021 13:46:22 GMT Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v3] In-Reply-To: References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> Message-ID: On Wed, 1 Dec 2021 13:20:49 GMT, Ludvig Janiuk wrote: >> Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code. >> >> With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts on the two lists corresponding no longer need to be made. >> >> While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. >> >> This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). >> >> This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. >> >> This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. >> >> The commit history has been polished to attempt to guide the reader through the changes. >> >> hs-tier1 and hs-tier2 tests pass. 
> > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > changed if statement style Looks good! Thanks for fixing! ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6614 From duke at openjdk.java.net Wed Dec 1 13:47:25 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Wed, 1 Dec 2021 13:47:25 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 22:44:59 GMT, Zang, Zhiqiang wrote: >> Suggest two new optimizations that can be done in SubINode::Ideal. > > Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision: > > add ok_to_convert to the condition. src/hotspot/share/opto/subnode.cpp line 198: > 196: && phase->type(in2->in(2))->isa_int() != NULL > 197: && phase->type(in2->in(2))->isa_int()->is_con() > 198: && ok_to_convert(in2)) { `ok_to_convert` should go right after `in2->Opcode() == Op_AddI` for uniformity. Also you could reuse the method `ok_to_convert(Node*, Node*)` for this, just feed `in1` as the second argument similar to the check in line 218. 
test/hotspot/jtreg/compiler/c2/TestSubIdeal.java line 1: > 1: /* You missed the copyright header here, and the same goes for the microbenchmark. ------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From jiefu at openjdk.java.net Wed Dec 1 13:54:22 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 13:54:22 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> Message-ID: On Wed, 1 Dec 2021 12:41:14 GMT, David Holmes wrote: > From my previous questions on this it was indicated that any CPU that supports `serialize` has the improved performance. If so, CPUs that don't support `serialize` would behave as before. Then there shouldn't be any performance regression. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From duke at openjdk.java.net Wed Dec 1 13:57:27 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Wed, 1 Dec 2021 13:57:27 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 22:44:59 GMT, Zang, Zhiqiang wrote: >> Suggest two new optimizations that can be done in SubINode::Ideal. > > Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision: > > add ok_to_convert to the condition. I think this could be merged with the transformation on line 216 (the one that converts "x - (y+c0)" into "(x-y) - c0"), branching on the constness of `in1` to determine the appropriate term of `in2` to do the transformation. 
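The two rewrites discussed in this thread, "c0 - (x + c1)" into "(c0 - c1) - x" and "x - (y+c0)" into "(x-y) - c0", can be sanity-checked outside the compiler. The following is only a minimal sketch and not the `SubINode::Ideal` code: the class and method names (`SubIdealSketch`, `before`, `after`, `agree`) are made up for illustration. It relies on Java `int` arithmetic wrapping around on overflow, which is why both shapes agree even when an intermediate result overflows.

```java
public class SubIdealSketch {
    // Original shape: a - (b + c). Covers both "c0 - (x + c1)" and "x - (y + c0)".
    static int before(int a, int b, int c) {
        return a - (b + c);
    }

    // Rewritten shape: (a - c) - b. When a and c are constants, a - c can be folded.
    static int after(int a, int b, int c) {
        return (a - c) - b;
    }

    // Returns true if both shapes give the same result for every sampled triple,
    // including values that overflow in the intermediate sum or difference.
    static boolean agree() {
        int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            for (int b : samples) {
                for (int c : samples) {
                    if (before(a, b, c) != after(a, b, c)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("shapes agree on all samples: " + agree());
    }
}
```

This only shows that the algebra is safe for 32-bit wrapping arithmetic. Whether the rewrite is profitable, and when `ok_to_convert` should block it, is a separate question that the sketch does not address.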
------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From luhenry at openjdk.java.net Wed Dec 1 15:27:29 2021 From: luhenry at openjdk.java.net (Ludovic Henry) Date: Wed, 1 Dec 2021 15:27:29 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 13:25:49 GMT, Andrew Haley wrote: > MD5 has been proven insecure, and its weaknesses have been exploited in the field. It is disabled in many systems. I am surprised that we are thinking of accelerating it for possible future use, and that we're adding a worse-then-useless crypto algorithm to the AArch64 startup. I wholeheartedly agree with your take. Unfortunately, it's still used on many systems, like for verifying the integrity of downloads ([Azure Blob Storage](https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.blob.blobproperties.contentmd5?view=azure-dotnet-legacy) for example). ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From redestad at openjdk.java.net Wed Dec 1 15:36:25 2021 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 1 Dec 2021 15:36:25 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). 
> > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. While I think it's good that distributions are free to omit MD5 now, there's still various non-cryptographic uses that warrant continued support and enhancements. Checksumming, JDK APIs such as `UUID.nameUUIDFromBytes`, etc.. Perhaps there should be a build flag to omit all of this, though? For startup it remains a sore point that stubs like these are generated eagerly on bootstrap. I hope we'll be able to make this lazy in the near future ([JDK-8231349](https://bugs.openjdk.java.net/browse/JDK-8231349)) to make adding intrinsics come with fewer trade-offs. 
This particular stub is very simple and likely adds unnoticeably to bootstrap, but in accumulation it's grown to be a bit of a concern in places, especially on x86 with large AVX-512 intrinsics. I'm not sure if there's been any progress on this recently, though. @vnkozlov? (I'm not qualified to review AArch64 code, but this contribution looks ok to me.) ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From home.josef.lehner at gmail.com Wed Dec 1 15:35:57 2021 From: home.josef.lehner at gmail.com (Josef Lehner) Date: Wed, 1 Dec 2021 16:35:57 +0100 Subject: JDK-8231460: Java update from 11.0.11 to 11.0.13 changes JVM code cache behavior and results in more process cpu usage and unexpected profiled nmethods memory usage Message-ID: Dear OpenJDK team, as described in this StackOverflow question, I want to reach out to you and question whether the JVM code cache / codeheap still works as designed. https://stackoverflow.com/questions/70086548/java-update-from-11-0-11-to-11-0-13-changes-jvm-code-cache-behavior-and-results What we experience in our huge application is that with Java 11.0.13 and -XX:ReservedCodeCacheSize=375m the code cache / codeheap 'profiled nmethods' (C1 optimized, ~ 180 MB) drops after a very short time to a very low level (less than 50 MB) and stays at this level forever while 'non-profiled nmethods' (C2 optimized) is already at its limits. After we tripled the -XX:ReservedCodeCacheSize to 1024 MB, both areas 'profiled nmethods' and 'non-profiled nmethods' have stayed at a much higher constant level (~ 258 MB) for over a week now instead of dropping to less than 50 MB after 15 min (-XX:ReservedCodeCacheSize=375m) or dropping after 3 hours (-XX:ReservedCodeCacheSize=512m). From my point of view as a non-expert I would expect that the C1 optimized code does not get removed (or at least not so much) from 'profiled nmethods' as there is no space left in 'non-profiled nmethods' to optimize it further. What do you think? 
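For readers who want to watch the code heap segments mentioned above from inside a running application, here is a minimal sketch using the standard `java.lang.management` API. It is not taken from the report. The class name `CodeHeapUsage` is made up, and the sketch assumes that HotSpot exposes the code cache as memory pools whose names contain "Code" (for example one pool per segment when the code cache is segmented); other JVMs may name their pools differently or not expose them at all.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class CodeHeapUsage {
    // Builds one line of text per memory pool whose name mentions the code cache.
    // The "Code" name filter is an assumption about HotSpot pool naming.
    static List<String> describeCodePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (!name.contains("Code")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined, so print raw bytes.
            lines.add(name + ": used=" + usage.getUsed()
                    + " committed=" + usage.getCommitted()
                    + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeCodePools()) {
            System.out.println(line);
        }
    }
}
```

Calling `describeCodePools()` periodically, for instance from a scheduled task, would give a time series of the kind described in the report without attaching external tooling.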
Important changes in 11.0.12: https://bugs.openjdk.java.net/browse/JDK-8223444 Improve CodeHeap Free Space Management https://bugs.openjdk.java.net/browse/JDK-8231460 Performance issue (CodeHeap) with large free blocks Best regards Josef Lehner From ddong at openjdk.java.net Wed Dec 1 16:05:46 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Wed, 1 Dec 2021 16:05:46 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp Message-ID: Hi, Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? Thanks, Denghui ------------- Commit messages: - 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp Changes: https://git.openjdk.java.net/jdk/pull/6639/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6639&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278079 Stats: 4 lines in 1 file changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/6639.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6639/head:pull/6639 PR: https://git.openjdk.java.net/jdk/pull/6639 From duke at openjdk.java.net Wed Dec 1 16:25:40 2021 From: duke at openjdk.java.net (Danil Bubnov) Date: Wed, 1 Dec 2021 16:25:40 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> Message-ID: This is a fix for the aarch64 jvmci calling convention. On MacOS/aarch64 "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only word-size or bigger slots, which is incorrect (that is why tests were failing [4]). Now arguments consume the right amount of bytes. Another problem is that the current approach doesn't enforce 16-byte alignment of the Stack Pointer [1][2][3]. However, tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, which is why 16-byte alignment comes automatically. 
But if you try to add or delete one argument, tests will fail with SIGBUS. I've tested this patch on MacOS/aarch64 and Linux/aarch64, all tests have passed. Also, I don't understand why the current tests (NativeCallTest) use only int, long, float and double as argument types. Is it possible to add functions with other types like byte or short? I tried, but it fails on every platform. [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack [4] https://bugs.openjdk.java.net/browse/JDK-8262901 ------------- Commit messages: - Remove NativeCallTest from ProblemList for macosx-aarch64 - Add stack alignment and MacOS specific code Changes: https://git.openjdk.java.net/jdk/pull/6641/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6641&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8262901 Stats: 25 lines in 4 files changed: 18 ins; 1 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/6641.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6641/head:pull/6641 PR: https://git.openjdk.java.net/jdk/pull/6641 From mdoerr at openjdk.java.net Wed Dec 1 16:44:28 2021 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 1 Dec 2021 16:44:28 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> Message-ID: On Wed, 1 Dec 2021 12:36:02 GMT, Roman Kennke wrote: >> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. 
Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. >> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. >> >> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors a develop flag (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work. >> >> Testing: >> - [x] tier1 >> - [x] tier2 >> - [x] tier3 >> - [ ] tier4 > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Use heavy monitors in runtime only on supported architectures PPC64 could be implemented like this:

diff --git a/src/hotspot/cpu/ppc/ppc.ad b/src/hotspot/cpu/ppc/ppc.ad
index 958059e1ca2..dc96bd15836 100644
--- a/src/hotspot/cpu/ppc/ppc.ad
+++ b/src/hotspot/cpu/ppc/ppc.ad
@@ -12132,7 +12132,7 @@ instruct partialSubtypeCheck(iRegPdst result, iRegP_N2P subklass, iRegP_N2P supe
 instruct cmpFastLock(flagsReg crx, iRegPdst oop, iRegPdst box, iRegPdst tmp1, iRegPdst tmp2) %{
   match(Set crx (FastLock oop box));
   effect(TEMP tmp1, TEMP tmp2);
-  predicate(!Compile::current()->use_rtm());
+  predicate(!Compile::current()->use_rtm() && !UseHeavyMonitors);
 
   format %{ "FASTLOCK $oop, $box, $tmp1, $tmp2" %}
   ins_encode %{
@@ -12149,7 +12149,7 @@ instruct cmpFastLock(flagsReg crx, iRegPdst oop, iRegPdst box, iRegPdst tmp1, iR
 instruct cmpFastLock_tm(flagsReg crx, iRegPdst oop, rarg2RegP box, iRegPdst tmp1, iRegPdst tmp2, iRegPdst tmp3) %{
   match(Set crx (FastLock oop box));
   effect(TEMP tmp1, TEMP tmp2, TEMP tmp3, USE_KILL box);
-  predicate(Compile::current()->use_rtm());
+  predicate(Compile::current()->use_rtm() && !UseHeavyMonitors);
 
   format %{ "FASTLOCK $oop, $box, $tmp1, $tmp2, $tmp3 (TM)" %}
   ins_encode %{
@@ -12165,6 +12165,18 @@ instruct cmpFastLock_tm(flagsReg crx, iRegPdst oop, rarg2RegP box, iRegPdst tmp1
   ins_pipe(pipe_class_compare);
 %}
 
+instruct cmpFastLock_hm(flagsReg crx, iRegPdst oop, rarg2RegP box) %{
+  match(Set crx (FastLock oop box));
+  predicate(UseHeavyMonitors);
+
+  format %{ "FASTLOCK $oop, $box (HM)" %}
+  ins_encode %{
+    // Set NE to indicate 'failure' -> take slow-path.
+    __ crandc($crx$$CondRegister, Assembler::equal, $crx$$CondRegister, Assembler::equal);
+  %}
+  ins_pipe(pipe_class_compare);
+%}
+
 instruct cmpFastUnlock(flagsReg crx, iRegPdst oop, iRegPdst box, iRegPdst tmp1, iRegPdst tmp2, iRegPdst tmp3) %{
   match(Set crx (FastUnlock oop box));
   effect(TEMP tmp1, TEMP tmp2, TEMP tmp3);
diff --git a/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp b/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp
index a834fa1af36..bac8ef164f8 100644
--- a/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp
+++ b/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp
@@ -2014,8 +2014,10 @@ nmethod *SharedRuntime::generate_native_wrapper(MacroAssembler *masm,
 
   // Try fastpath for locking.
   // fast_lock kills r_temp_1, r_temp_2, r_temp_3.
-  __ compiler_fast_lock_object(r_flag, r_oop, r_box, r_temp_1, r_temp_2, r_temp_3);
-  __ beq(r_flag, locked);
+  if (!UseHeavyMonitors) {
+    __ compiler_fast_lock_object(r_flag, r_oop, r_box, r_temp_1, r_temp_2, r_temp_3);
+    __ beq(r_flag, locked);
+  }
 
   // None of the above fast optimizations worked so we have to get into the
   // slow case of monitor enter. Inline a special case of call_VM that
diff --git a/src/hotspot/share/runtime/synchronizer.cpp b/src/hotspot/share/runtime/synchronizer.cpp
index 4c5ea4a6e40..4f9c7c21a9b 100644
--- a/src/hotspot/share/runtime/synchronizer.cpp
+++ b/src/hotspot/share/runtime/synchronizer.cpp
@@ -418,7 +418,7 @@ void ObjectSynchronizer::handle_sync_on_value_based_class(Handle obj, JavaThread
 }
 
 static bool useHeavyMonitors() {
-#if defined(X86) || defined(AARCH64)
+#if defined(X86) || defined(AARCH64) || defined(PPC64)
   return UseHeavyMonitors;
 #else
   return false;

I don't like hacking the regular assembler implementations. Better would be to change C2 such that it doesn't generate FastLockNodes. But that may be a bit cumbersome. ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From aph at openjdk.java.net Wed Dec 1 17:01:20 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 1 Dec 2021 17:01:20 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: <-j7mfpZ2dqvbuODtPI7RYpIX8HQVLwYhuLnGR-LnQN4=.e5119f03-3fd0-47db-94ec-3d240a236f49@github.com> On Wed, 1 Dec 2021 15:24:40 GMT, Ludovic Henry wrote: > > MD5 has been proven insecure, and its weaknesses have been exploited in the field. It is disabled in many systems. I am surprised that we are thinking of accelerating it for possible future use, and that we're adding a worse-than-useless crypto algorithm to the AArch64 startup. > > I wholeheartedly agree with your take. Unfortunately, it's still used on many systems, like for verifying the integrity of downloads ([Azure Blob Storage](https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.blob.blobproperties.contentmd5?view=azure-dotnet-legacy) for example). Ha! OK. This seems like a really weird time to be adding MD5 support, almost four years after MD5 was disabled for jarfile signing, and 15 years after the first practical break. 
But I guess it's harmless enough, even though I hate having to carry such baggage around.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6628

From aph at openjdk.java.net Wed Dec 1 17:04:22 2021
From: aph at openjdk.java.net (Andrew Haley)
Date: Wed, 1 Dec 2021 17:04:22 GMT
Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10>
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 16:17:21 GMT, Danil Bubnov wrote:

> This is the fix of aarch64 jvmci calling convention.
>
> On MacOS/aarch64 "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only word-size or bigger slots, which is incorrect (that is why tests were failing [4]). Now arguments consume the right number of bytes.
>
> Another problem is that the current approach doesn't keep the Stack Pointer 16-byte aligned [1][2][3]. However, the tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, so 16-byte alignment comes automatically. But if you add or delete one argument, the tests will fail with SIGBUS.
>
> I've tested this patch on MacOS/aarch64 and Linux/aarch64, all tests have passed.
>
> Also I don't understand why the current tests (NativeCallTest) use only int, long, float and double as argument types. Is it possible to add functions with other types like byte or short? I tried, but it fails on every platform.
>
> [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
> [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack
> [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack
> [4] https://bugs.openjdk.java.net/browse/JDK-8262901

Thanks for this contribution. Which set of tests have you been using?
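
The slot-size and alignment arithmetic described in the quoted text can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the JVMCI code from the patch: the class `StackArgLayout` and its methods are hypothetical names, only stack-passed arguments are modelled, and `minSlot` is an invented parameter (8 for a convention where every stack argument takes at least an 8-byte slot, 1 for a convention where arguments take only their own size at natural alignment, as the Apple document [1] describes).

```java
public class StackArgLayout {
    /** Rounds value up to the next multiple of alignment (alignment must be a power of two). */
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) & -alignment;
    }

    /** Bytes used by the stack arguments before the final 16-byte rounding. */
    static int rawBytes(int[] argSizes, int minSlot) {
        int offset = 0;
        for (int size : argSizes) {
            int slot = Math.max(size, minSlot);      // slot size, also used as the slot alignment
            offset = alignUp(offset, slot) + slot;   // place the argument, then advance past it
        }
        return offset;
    }

    /** Bytes to reserve so that the stack pointer stays 16-byte aligned. */
    static int stackBytes(int[] argSizes, int minSlot) {
        return alignUp(rawBytes(argSizes, minSlot), 16);
    }

    public static void main(String[] args) {
        int[] threeLongs = {8, 8, 8};
        int[] intLongInt = {4, 8, 4};
        // An odd number of 8-byte slots is not a multiple of 16 until it is rounded up.
        System.out.println(rawBytes(threeLongs, 8) + " -> " + stackBytes(threeLongs, 8));
        // Natural-size slots pack more tightly than 8-byte slots.
        System.out.println(rawBytes(intLongInt, 1) + " vs " + rawBytes(intLongInt, 8));
    }
}
```

With three 8-byte stack arguments the raw size is 24 bytes, which is why an odd argument count exposes a missing 16-byte rounding while an even count hides it.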
------------- PR: https://git.openjdk.java.net/jdk/pull/6641 From aoqi at openjdk.java.net Wed Dec 1 17:14:58 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Wed, 1 Dec 2021 17:14:58 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v2] In-Reply-To: References: Message-ID: > `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. > > This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? Ao Qi has updated the pull request incrementally with one additional commit since the last revision: 8278037: Removed PPC32 C1 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6625/files - new: https://git.openjdk.java.net/jdk/pull/6625/files/90f402a6..27c3dc91 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6625&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6625&range=00-01 Stats: 57 lines in 3 files changed: 0 ins; 53 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/6625.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6625/head:pull/6625 PR: https://git.openjdk.java.net/jdk/pull/6625 From rkennke at openjdk.java.net Wed Dec 1 17:18:29 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Wed, 1 Dec 2021 17:18:29 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> Message-ID: <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com> On Wed, 1 Dec 2021 12:36:02 GMT, Roman Kennke wrote: >> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated 
monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all.
>> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful.
>>
>> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work.
>>
>> Testing:
>> - [x] tier1
>> - [x] tier2
>> - [x] tier3
>> - [ ] tier4
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   Use heavy monitors in runtime only on supported architectures

Thank you, Martin!

> I don't like hacking the regular assembler implementations. Better would be to change C2 such that it doesn't generate FastLockNodes. But that may be a bit cumbersome.

That is a good suggestion, and would help ease the work in the backend. I believe you still have to change something in sharedRuntime_ppc.cpp, similar to what I did in, e.g., sharedRuntime_aarch64.cpp.
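
For anyone wanting to exercise the flag behaviour described above, a small locking program along the following lines might serve as a smoke test. It is a sketch only: `HeavyMonitorSmoke` is a made-up name, it is not the test case from the PR, and whether the JVM accepts `-XX:+UseHeavyMonitors` and `-XX:+VerifyHeavyMonitors` depends on the build (the PR proposes making the former a develop flag).

```java
public class HeavyMonitorSmoke {
    private static final Object LOCK = new Object();
    private static long counter;

    /** Increments a shared counter under one monitor from several threads and returns the total. */
    static long run(int threads, int itersPerThread) {
        counter = 0;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < itersPerThread; j++) {
                    synchronized (LOCK) {   // every iteration enters and exits the monitor
                        counter++;
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();                   // join makes the final counter value visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        // Possible invocation (flag availability is an assumption about the build):
        //   java -XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors HeavyMonitorSmoke
        long total = run(4, 1_000_000);
        System.out.println(total == 4_000_000L ? "ok" : "MISMATCH: " + total);
    }
}
```

The program is correct with or without the flags; the flags only change which locking path the VM takes underneath.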
------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From aoqi at openjdk.java.net Wed Dec 1 17:22:53 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Wed, 1 Dec 2021 17:22:53 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: > `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. > > This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? Ao Qi has updated the pull request incrementally with one additional commit since the last revision: missing ")" ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6625/files - new: https://git.openjdk.java.net/jdk/pull/6625/files/27c3dc91..2e815956 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6625&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6625&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6625.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6625/head:pull/6625 PR: https://git.openjdk.java.net/jdk/pull/6625 From sviswanathan at openjdk.java.net Wed Dec 1 17:23:24 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Dec 2021 17:23:24 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 11:31:10 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4077: >> >>> 4075: Label done; >>> 4076: evcvttpd2qq(dst, src, vec_enc); >>> 4077: evmovdqul(xtmp1, k0, double_sign_flip, true, vec_enc, scratch); >> >> merge masking should be false here. > > K0 register will enable all the lanes hence true/false value will not change the semantics. 
In vector_castF2I_evex we are using false, and here we are using true for a similar usage. Consistency would be good.

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4088:
>>
>>> 4086: kxorwl(ktmp1, ktmp1, ktmp2);
>>> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);
>>> 4088: vpternlogq(xtmp2, 0x11, xtmp1, xtmp1, vec_enc);
>>
>> Consider moving the vpternlog instruction earlier after line 4082 using xtmp1 as the destination.
>> vpternlogq(xtmp1, 0x01, xtmp2, xtmp2, vec_enc);
>> Then xtmp1 can be used in the following evmovdquq.
>>
>> This will help to absorb the latency of vpternlogq.
>
> evcmppd and vpternlog should be issued in parallel to the execution ports, given that there is no dependency between them. Since the succeeding instruction has a data dependency on both of these instructions, it can be issued only once both of its operands are ready. Since evcmppd has the higher latency, it will mask the latency of vpternlog.

Sounds good.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6544

From sviswanathan at openjdk.java.net Wed Dec 1 17:23:25 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 17:23:25 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3]
In-Reply-To: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com>
References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com>
Message-ID:

On Wed, 1 Dec 2021 11:35:55 GMT, Jatin Bhateja wrote:

>> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>>
>> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>>
>> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
>> -- | -- | -- | -- | -- | -- | -- | --
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8277793: Review comments resolution

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4137:

> 4135:
> 4136: kxorwl(ktmp1, ktmp1, ktmp2);
> 4137: evcmpps(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);

This should be non signaling comparison NLT_UQ as well.
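
For context on why the code under review needs compare-and-fix-up steps at all: Java's narrowing conversions define a result for NaN and out-of-range inputs, while the raw x86 truncating-convert instructions produce a single "integer indefinite" value in those cases. The scalar loops below are a minimal sketch of the semantics any vectorized version has to preserve; `CastSpecialCases` and its methods are illustrative names, not code from the patch.

```java
public class CastSpecialCases {
    /** Scalar float-to-int loop; each element follows the Java narrowing rules. */
    static int[] f2i(float[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (int) src[i];   // NaN -> 0, too large -> MAX_VALUE, too small -> MIN_VALUE
        }
        return dst;
    }

    /** Scalar double-to-long loop with the same rules at 64-bit width. */
    static long[] d2l(double[] src) {
        long[] dst = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = (long) src[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] in = {Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, 1.9f, -1.9f};
        for (int v : f2i(in)) {
            System.out.println(v);
        }
    }
}
```

Loops of this shape are what the auto-vectorizer work mentioned in the PR description targets, so the special inputs above make a quick check that a vectorized cast still matches the scalar result.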
------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From mdoerr at openjdk.java.net Wed Dec 1 17:29:30 2021 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 1 Dec 2021 17:29:30 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com> References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com> Message-ID: On Wed, 1 Dec 2021 17:15:33 GMT, Roman Kennke wrote: > I believe you still have to change something in sharedRuntime_ppc.cpp, similar to what I did in, e.g., sharedRuntime_aarch64.cpp. You mean in `generate_native_wrapper`? I already did. It uses the same assembler function as C2 on PPC64. Did I miss anything else? I think hacking `unlock` is optional. The additional checks don't really disturb. ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From rkennke at openjdk.java.net Wed Dec 1 18:00:28 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Wed, 1 Dec 2021 18:00:28 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com> Message-ID: On Wed, 1 Dec 2021 17:26:06 GMT, Martin Doerr wrote: > > I believe you still have to change something in sharedRuntime_ppc.cpp, similar to what I did in, e.g., sharedRuntime_aarch64.cpp. > > You mean in `generate_native_wrapper`? I already did. It uses the same assembler function as C2 on PPC64. Did I miss anything else? I think hacking `unlock` is optional. The additional checks don't really disturb. Ah I haven't seen it, sorry. 
It turns out, I cannot avoid emitting FastLockNode, some backends (x86 and aarch64) also generate fast-path code that deals with ObjectMonitor, and we want this even when running with +UseHeavyMonitors. Can you verify the new testcase, and perhaps some test programs that do some locking with -XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors ? You also need to include PPC in arguments.cpp and synchronizer.cpp changes to enable that stuff on PPC: ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From mdoerr at openjdk.java.net Wed Dec 1 18:04:27 2021 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 1 Dec 2021 18:04:27 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 17:22:53 GMT, Ao Qi wrote: >> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. >> >> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing ")" I appreciate to see this go away. Thanks! ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6625 From duke at openjdk.java.net Wed Dec 1 18:05:25 2021 From: duke at openjdk.java.net (Danil Bubnov) Date: Wed, 1 Dec 2021 18:05:25 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 17:01:13 GMT, Andrew Haley wrote: > Thanks for this contribution. Which set of tests have you been using? 
Tier1 and tests from compiler/jvmci/jdk.vm.ci.code.test (maybe they are included in Tier1, but I also run them separately).

-------------

PR: https://git.openjdk.java.net/jdk/pull/6641

From jbhateja at openjdk.java.net Wed Dec 1 18:06:58 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 1 Dec 2021 18:06:58 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v4]
In-Reply-To:
References:
Message-ID:

> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>
> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>
> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>
> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
> -- | -- | -- | -- | -- | -- | -- | --
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8277793: Review comments resolution.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6544/files
  - new: https://git.openjdk.java.net/jdk/pull/6544/files/9f784eb9..95ea1812

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6544&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6544&range=02-03

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6544.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6544/head:pull/6544

PR: https://git.openjdk.java.net/jdk/pull/6544

From sviswanathan at openjdk.java.net Wed Dec 1 18:10:27 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 18:10:27 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 18:06:58 GMT, Jatin Bhateja wrote:

>> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>>
>> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>>
>> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
>> -- | -- | -- | -- | -- | -- | -- | --
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8277793: Review comments resolution.

Looks good to me.

-------------

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6544

From aph at openjdk.java.net Wed Dec 1 18:19:28 2021
From: aph at openjdk.java.net (Andrew Haley)
Date: Wed, 1 Dec 2021 18:19:28 GMT
Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64
In-Reply-To:
References:
Message-ID: <7FVlxJ3bpkRZS43P7VdQJzXh-p7fh7xIOtRo56Mf2qg=.7da106b9-4638-4ff4-a717-5ecf9c4bc2d2@github.com>

On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote:

> Implementation of MD5 intrinsic support for AArch64.
> > Contributed by Ludovic Henry (@luhenry). > > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From aph at openjdk.java.net Wed Dec 1 18:20:36 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 1 Dec 2021 18:20:36 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> In-Reply-To: References: Message-ID: <-Wzb3RXPFeEY0E7HDT-7lWw1_2mxReE-J8FPeI3UKl8=.4211598e-f032-46b8-8aa4-6f5254785193@github.com> On Wed, 1 Dec 2021 16:17:21 GMT, Danil Bubnov wrote: > This is the fix of aarch64 jvmci calling convention. 
>
> On MacOS/aarch64 "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only word-size or bigger slots, which is incorrect (that is why tests were failing [4]). Now arguments consume the right number of bytes.
>
> Another problem is that the current approach doesn't keep the Stack Pointer 16-byte aligned [1][2][3]. However, the tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, so 16-byte alignment comes automatically. But if you add or delete one argument, the tests will fail with SIGBUS.
>
> I've tested this patch on MacOS/aarch64 and Linux/aarch64, all tests have passed.
>
> Also I don't understand why the current tests (NativeCallTest) use only int, long, float and double as argument types. Is it possible to add functions with other types like byte or short? I tried, but it fails on every platform.
>
> [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
> [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack
> [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack
> [4] https://bugs.openjdk.java.net/browse/JDK-8262901

Please add some test with an odd number of arguments.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6641

From jbhateja at openjdk.java.net Wed Dec 1 18:29:31 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 1 Dec 2021 18:29:31 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3]
In-Reply-To:
References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com>
Message-ID:

On Wed, 1 Dec 2021 12:20:55 GMT, Nils Eliasson wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8277793: Review comments resolution
>
> Looks good!
Hi @neliasso, could you kindly do a final run of the patch through your regression suite?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6544

From jbhateja at openjdk.java.net Wed Dec 1 18:30:48 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 1 Dec 2021 18:30:48 GMT
Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API
Message-ID:

Summary of changes:

1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes.
2) X86 backend support for AVX512 and AVX2 targets.
3) New IR transformation to handle following patterns:-
   a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal)
   b) Long2Mask + Mask2Long -> Long
4) Following performance data is collected for new JMH micro included with the patch:-

System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)

Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor
-- | -- | -- | -- | -- | -- | --
MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287
MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352
MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693
MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338
MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969
MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971
MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897
MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669
MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529
MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602
MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714
MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797
MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191
MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231
MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036
MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426

Kindly review and share feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - 8277997: Adding needed JVM arguments in micro.
 - 8277997: Intrinsic creation for VectorMask.fromLong API

Changes: https://git.openjdk.java.net/jdk/pull/6646/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6646&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8277997
  Stats: 516 lines in 77 files changed: 371 ins; 27 del; 118 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6646.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6646/head:pull/6646

PR: https://git.openjdk.java.net/jdk/pull/6646

From sviswanathan at openjdk.java.net Wed Dec 1 18:44:30 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Wed, 1 Dec 2021 18:44:30 GMT
Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6]
In-Reply-To:
References:
Message-ID:

On Tue, 30 Nov 2021 00:10:39 GMT, Sandhya Viswanathan wrote:

>> Currently 32-byte instructions are used for small array copy and clear.
>> This can be optimized by using 64-byte instructions.
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix whitespace

Yes, the patch doesn't change behavior on AVX2 and older AVX512 systems. The additional performance numbers with the patch requested by Nils are as below:

Old AVX512

Before:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  18.650 ± 2.773  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  20.241 ± 1.398  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  16.252 ± 0.076  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.965 ± 0.172  ns/op

After:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  17.701 ± 2.623  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  20.588 ± 0.775  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  16.219 ± 0.066  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.937 ± 0.185  ns/op

AVX2

Before:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  23.801 ± 0.090  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  24.376 ± 0.867  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  14.015 ± 0.016  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.355 ± 0.024  ns/op

After:
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  23.373 ± 0.629  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  24.390 ± 0.875  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  13.995 ± 0.056  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  15.383 ± 0.051  ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/6512

From dlong at openjdk.java.net Wed Dec 1 20:12:26 2021
From: dlong at openjdk.java.net (Dean Long)
Date: Wed, 1 Dec 2021 20:12:26 GMT
Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4]
In-Reply-To:
References:
Message-ID:

On Tue, 30 Nov 2021 22:44:59 GMT, Zang, Zhiqiang wrote:

>> Suggest two new optimizations that can be done in SubINode::Ideal.
>
> Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision:
>
>   add ok_to_convert to the condition.

Testing results for 5965a8cf70fec00355bf07315d9a450a24771ed2 look good.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6441

From kvn at openjdk.java.net Wed Dec 1 20:30:30 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 1 Dec 2021 20:30:30 GMT
Subject: RFR: 8277358: Accelerate CRC32-C [v3]
In-Reply-To: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com>
References: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com>
Message-ID:

On Wed, 1 Dec 2021 02:56:06 GMT, Scott Gibbons wrote:

>> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 0.425938099 seconds >> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- > > Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Nice work. Let me test it before approval. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6595 From kvn at openjdk.java.net Wed Dec 1 20:59:35 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 1 Dec 2021 20:59:35 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: <72-LBIFmV82Z0GRieH-cxjyrub8Z-VVC4izGsW_bc4A=.1e021a98-3a4e-4ccc-87a7-c3e01c26d5e5@github.com> Message-ID: On Wed, 1 Dec 2021 08:20:22 GMT, Aleksey Shipilev wrote: > > @shipilev again I think we need to examine this in terms of impact to our CI. We run different platforms and configurations in different tiers so the costs are not as simple as looking at one run. > > Again, I can wait for those who have more insight in Oracle testing pipelines check their workflows with this change. I have no insight in Oracle infra, so somebody else have to do it. Now that Igor left, who should we be talking to? @dholmes-ora I checked and this change does not interfere with our CI. `tier2` and `tier3` introduced by #5241 are not used by our CI. New `tier2_compiler` and `tier3_compiler` groups are also not used. We use different sets in CI. I am not sure how else it can affect our testing. I also submitted our testing. I will let you know results. ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From kvn at openjdk.java.net Wed Dec 1 21:03:30 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 1 Dec 2021 21:03:30 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> Message-ID: On Wed, 1 Dec 2021 09:13:36 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. 
These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 6m37.855s >> user 56m23.004s >> sys 0m20.148s >> >> # x86_32 (TR 3970X) >> real 11m22.877s >> user 168m8.137s >> sys 5m7.037s >> >> # x86_64 (i5-11500) >> real 15m55.424s >> user 118m0.969s >> sys 0m12.039s >> >> # AArch64 (ThunderX2) >> real 4m5.177s >> user 32m7.295s >> sys 0m19.689s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: > > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice Good. 
Let me test it before approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From sviswanathan at openjdk.java.net Wed Dec 1 22:10:29 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 1 Dec 2021 22:10:29 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v3] In-Reply-To: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com> References: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com> Message-ID: On Wed, 1 Dec 2021 02:56:06 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. >> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = 
ae10ee5a
>> CRC32C.update(byte[]) runtime = 0.425938099 seconds
>> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s
>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
>> -------------------------------------------------------
>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
>> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds
>> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s
>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a
>> -------------------------------------------------------
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fixing review comments

The patch looks good to me. Please wait for Vladimir Kozlov's testing and approval.

-------------

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6595

From eric.caspole at oracle.com Wed Dec 1 22:16:42 2021
From: eric.caspole at oracle.com (eric.caspole at oracle.com)
Date: Wed, 1 Dec 2021 17:16:42 -0500
Subject: RFR: 8277358: Accelerate CRC32-C [v2]
In-Reply-To:
References:
Message-ID: <5857282e-4067-d8f2-a6d3-d341ecd8940d@oracle.com>

Hi Scott,

Thanks for the JMH. I would like to use Mode.Throughput (i.e. 9368.786 ± 96.956 ops/ms) so the scores are not very tiny numbers, and just use the default iterations so the runs are about 35 minutes instead of 1h30. What do you think?
The iterations are very stable so the defaults are fine in my testing.
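To make the ops/ms unit concrete, here is a minimal plain-Java sketch, not the JMH benchmark from the PR. It only uses the public `java.util.zip.CRC32C` API; the class name `Crc32cThroughputSketch`, the helper names, and the iteration count are assumptions chosen for illustration, and a hand-rolled timing loop like this is far less reliable than JMH.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

// Illustrative only: class and method names are hypothetical, not from the PR.
final class Crc32cThroughputSketch {

    /** CRC32-C of the given bytes via the byte[] overload. */
    static long crcOfArray(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** CRC32-C of the same bytes via the ByteBuffer overload. */
    static long crcOfBuffer(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(ByteBuffer.wrap(data));
        return crc.getValue();
    }

    /** Rough operations-per-millisecond figure; no warmup control, so treat it as indicative. */
    static double opsPerMs(byte[] data, int iters) {
        CRC32C crc = new CRC32C();
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            crc.reset();
            crc.update(data, 0, data.length);
        }
        long elapsedNs = Math.max(1, System.nanoTime() - start);
        return iters / (elapsedNs / 1_000_000.0);
    }

    public static void main(String[] args) {
        byte[] msg = new byte[512]; // same message size as the quoted test run
        for (int i = 0; i < msg.length; i++) {
            msg[i] = (byte) i;
        }
        System.out.printf("byte[] crc = %08x, ByteBuffer crc = %08x%n",
                crcOfArray(msg), crcOfBuffer(msg));
        System.out.printf("approx. %.1f ops/ms%n", opsPerMs(msg, 1_000_000));
    }
}
```

With a throughput mode the score grows as the code gets faster, which is why the reported numbers stop being "very tiny" compared with an average-time score in microseconds.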
Regards,
Eric

diff --git a/test/micro/org/openjdk/bench/java/util/TestCRC32C.java b/test/micro/org/openjdk/bench/java/util/TestCRC32C.java
index 10681e19bbf..0c3b39fc59a 100644
--- a/test/micro/org/openjdk/bench/java/util/TestCRC32C.java
+++ b/test/micro/org/openjdk/bench/java/util/TestCRC32C.java
@@ -27,12 +27,10 @@ import java.util.concurrent.TimeUnit;
 import java.util.zip.CRC32C;
 import org.openjdk.jmh.annotations.*;
-@BenchmarkMode(Mode.AverageTime)
-@OutputTimeUnit(TimeUnit.MICROSECONDS)
+@BenchmarkMode(Mode.Throughput)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
 @State(Scope.Benchmark)
 @Fork(value = 2)
-@Warmup(iterations = 2, time = 30, timeUnit = TimeUnit.SECONDS)
-@Measurement(iterations = 3, time = 60, timeUnit = TimeUnit.SECONDS)
 public class TestCRC32C {

On 11/30/21 7:13 PM, Scott Gibbons wrote:
> On Wed, 1 Dec 2021 00:02:14 GMT, Scott Gibbons wrote:
>
>>> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>>> >>> 5986.947899319073 MB/s => 24041.05203089616 MB/s >>> 5840.02689336947 MB/s => 24898.781468710356 MB/s >>> >>> ********** Original *********** >>> >>> >>> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >>> offset = 0 >>> msgSize = 512 bytes >>> iters = 20000000 >>> ------------------------------------------------------- >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> CRC32C.update(byte[]) runtime = 1.710387358 seconds >>> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> ------------------------------------------------------- >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >>> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> ------------------------------------------------------- >>> >>> >>> >>> >>> *********** With my changes: ************* >>> >>> >>> >>> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >>> offset = 0 >>> msgSize = 512 bytes >>> iters = 20000000 >>> ------------------------------------------------------- >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> CRC32C.update(byte[]) runtime = 0.425938099 seconds >>> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> ------------------------------------------------------- >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >>> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >>> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >>> ------------------------------------------------------- >> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding CRC32-C 
microbenchmark.
> Hi, Eric. I added a microbenchmark for CRC32-C. I'm waiting for full completion, but it looks like somewhere around 40GB/s throughput on average. I'll post the results once completed.
>
> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/6595

From psandoz at openjdk.java.net Wed Dec 1 23:09:26 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Wed, 1 Dec 2021 23:09:26 GMT
Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 18:23:27 GMT, Jatin Bhateja wrote:

> Summary of changes:
>
> 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes.
> 2) X86 backend support for AVX512 and AVX2 targets.
> 3) New IR transformation to handle following patterns:-
>    a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal)
>    b) Long2Mask + Mask2Long -> Long
> 4) Following performance data is collected for new JMH micro included with the patch:-
>
> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>
> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain factor
> -- | -- | -- | -- | -- | -- | --
> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287
> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352
> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693
> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338
> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969
> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971
> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897
> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669
> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529
> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602
> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714
> MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797
> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191
> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231
> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036
> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426
>
> Kindly review and share feedback.
>
> Best Regards,
> Jatin

Arguably broadcasting is not the correct term to associate with conversion of a long value to a mask, but it is very convenient to reuse `VectorSupport.broadcastCoerced` and I don't have a better solution in that regard. The addition of a new intrinsic seems overly heavy.

We could rename it to `fromBitsCoerced`; then the `bitwise` parameter can be renamed `mode`. Can we define named constants on the Java and HotSpot side: `0` for broadcasting and `1` for mask conversion, e.g. `BITS_COERCED_BROADCAST = 0`, `BITS_COERCED_MASK_TO_LONG = 1`?
This potentially allows for future modes such as broadcast only to the first lane. ------------- PR: https://git.openjdk.java.net/jdk/pull/6646 From jiefu at openjdk.java.net Wed Dec 1 23:22:28 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 1 Dec 2021 23:22:28 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 18:41:33 GMT, Sandhya Viswanathan wrote: > Yes, the patch doesn't change behavior on AVX2 and older AVX512 systems. Thanks for your clarification. But it still remains unknown why the 64-byte instructions shouldn't be used on CPUs which don't support `serialize`. I will test the 64-byte instructions on older AVX512 systems today and feedback here. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From duke at openjdk.java.net Thu Dec 2 00:20:52 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 2 Dec 2021 00:20:52 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: References: Message-ID: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> > Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. 
> > 5986.947899319073 MB/s => 24041.05203089616 MB/s > 5840.02689336947 MB/s => 24898.781468710356 MB/s > > ********** Original *********** > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 1.710387358 seconds > CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds > CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > > > > > *********** With my changes: ************* > > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 0.425938099 seconds > CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds > CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: - MICRO to MILLI as requested. - Fixing benchmark to throughput with default iterations. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6595/files - new: https://git.openjdk.java.net/jdk/pull/6595/files/92b4b9fc..906a57d6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6595&range=02-03 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6595.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6595/head:pull/6595 PR: https://git.openjdk.java.net/jdk/pull/6595 From duke at openjdk.java.net Thu Dec 2 00:34:27 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 2 Dec 2021 00:34:27 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> References: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> Message-ID: <8Bf2Ltih4gJC_tPCSzDfm-ANaovXOJhMJWP204N76D8=.1dab2f77-f62e-4041-8c60-d68532347bfa@github.com> On Thu, 2 Dec 2021 00:20:52 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. 
>> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 0.425938099 seconds >> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- > > Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: > > - MICRO to MILLI as requested. 
>  - Fixing benchmark to throughput with default iterations.

Hi, Eric.

Thanks for the suggestions. I've made the changes.

Thanks,
--Scott Gibbons
Software Development Engineer, Runtime Engineering
Intel Corporation | www.intel.com

-------------

PR: https://git.openjdk.java.net/jdk/pull/6595

From mdoerr at openjdk.java.net Thu Dec 2 00:58:28 2021
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Thu, 2 Dec 2021 00:58:28 GMT
Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10]
In-Reply-To:
References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com>
Message-ID:

On Wed, 1 Dec 2021 17:57:11 GMT, Roman Kennke wrote:

> It turns out, I cannot avoid emitting FastLockNode, some backends (x86 and aarch64) also generate fast-path code that deals with ObjectMonitor, and we want this even when running with +UseHeavyMonitors.

Ok, thanks for checking. You have convinced me that your version is fine.
We should do it the same way on PPC64:

diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index 98565003691..cb58e775422 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2660,27 +2660,32 @@ void MacroAssembler::compiler_fast_lock_object(ConditionRegister flag, Register
   andi_(temp, displaced_header, markWord::monitor_value);
   bne(CCR0, object_has_monitor);
-  // Set displaced_header to be (markWord of object | UNLOCK_VALUE).
-  ori(displaced_header, displaced_header, markWord::unlocked_value);
-
-  // Load Compare Value application register.
-
-  // Initialize the box. (Must happen before we update the object mark!)
-  std(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box);
-
-  // Must fence, otherwise, preceding store(s) may float below cmpxchg.
-  // Compare object markWord with mark and if equal exchange scratch1 with object markWord.
-  cmpxchgd(/*flag=*/flag,
-           /*current_value=*/current_header,
-           /*compare_value=*/displaced_header,
-           /*exchange_value=*/box,
-           /*where=*/oop,
-           MacroAssembler::MemBarRel | MacroAssembler::MemBarAcq,
-           MacroAssembler::cmpxchgx_hint_acquire_lock(),
-           noreg,
-           &cas_failed,
-           /*check without membar and ldarx first*/true);
-  assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
+  if (!UseHeavyMonitors) {
+    // Set displaced_header to be (markWord of object | UNLOCK_VALUE).
+    ori(displaced_header, displaced_header, markWord::unlocked_value);
+
+    // Load Compare Value application register.
+
+    // Initialize the box. (Must happen before we update the object mark!)
+    std(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box);
+
+    // Must fence, otherwise, preceding store(s) may float below cmpxchg.
+    // Compare object markWord with mark and if equal exchange scratch1 with object markWord.
+    cmpxchgd(/*flag=*/flag,
+             /*current_value=*/current_header,
+             /*compare_value=*/displaced_header,
+             /*exchange_value=*/box,
+             /*where=*/oop,
+             MacroAssembler::MemBarRel | MacroAssembler::MemBarAcq,
+             MacroAssembler::cmpxchgx_hint_acquire_lock(),
+             noreg,
+             &cas_failed,
+             /*check without membar and ldarx first*/true);
+    assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
+  } else {
+    // Set NE to indicate 'failure' -> take slow-path.
+    crandc(flag, Assembler::equal, flag, Assembler::equal);
+  }
   // If the compare-and-exchange succeeded, then we found an unlocked
   // object and we have now locked it.
@@ -2768,12 +2773,14 @@ void MacroAssembler::compiler_fast_unlock_object(ConditionRegister flag, Registe
   }
 #endif
-  // Find the lock address and load the displaced header from the stack.
-  ld(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box);
+  if (!UseHeavyMonitors) {
+    // Find the lock address and load the displaced header from the stack.
+    ld(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box);
-  // If the displaced header is 0, we have a recursive unlock.
-  cmpdi(flag, displaced_header, 0);
-  beq(flag, cont);
+    // If the displaced header is 0, we have a recursive unlock.
+    cmpdi(flag, displaced_header, 0);
+    beq(flag, cont);
+  }
   // Handle existing monitor.
   // The object has an existing monitor iff (mark & monitor_value) != 0.
@@ -2782,20 +2789,24 @@ void MacroAssembler::compiler_fast_unlock_object(ConditionRegister flag, Registe
   andi_(R0, current_header, markWord::monitor_value);
   bne(CCR0, object_has_monitor);
-  // Check if it is still a light weight lock, this is is true if we see
-  // the stack address of the basicLock in the markWord of the object.
-  // Cmpxchg sets flag to cmpd(current_header, box).
-  cmpxchgd(/*flag=*/flag,
-           /*current_value=*/current_header,
-           /*compare_value=*/box,
-           /*exchange_value=*/displaced_header,
-           /*where=*/oop,
-           MacroAssembler::MemBarRel,
-           MacroAssembler::cmpxchgx_hint_release_lock(),
-           noreg,
-           &cont);
-
-  assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
+  if (!UseHeavyMonitors) {
+    // Check if it is still a light weight lock, this is is true if we see
+    // the stack address of the basicLock in the markWord of the object.
+    // Cmpxchg sets flag to cmpd(current_header, box).
+    cmpxchgd(/*flag=*/flag,
+             /*current_value=*/current_header,
+             /*compare_value=*/box,
+             /*exchange_value=*/displaced_header,
+             /*where=*/oop,
+             MacroAssembler::MemBarRel,
+             MacroAssembler::cmpxchgx_hint_release_lock(),
+             noreg,
+             &cont);
+    assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0");
+  } else {
+    // Set NE to indicate 'failure' -> take slow-path.
+    crandc(flag, Assembler::equal, flag, Assembler::equal);
+  }
   // Handle existing monitor.
   b(cont);
diff --git a/src/hotspot/share/runtime/arguments.cpp b/src/hotspot/share/runtime/arguments.cpp
index 3396adc0799..969c8e82b91 100644
--- a/src/hotspot/share/runtime/arguments.cpp
+++ b/src/hotspot/share/runtime/arguments.cpp
@@ -2021,12 +2021,12 @@ bool Arguments::check_vm_args_consistency() {
   }
 #endif
-#if !defined(X86) && !defined(AARCH64)
+#if !defined(X86) && !defined(AARCH64) && !defined(PPC64)
   if (UseHeavyMonitors) {
     warning("UseHeavyMonitors is not fully implemented on this architecture");
   }
 #endif
-#ifdef X86
+#if defined(X86) || defined(PPC64)
   if (UseHeavyMonitors && UseRTMForStackLocks) {
     fatal("-XX:+UseHeavyMonitors and -XX:+UseRTMForStackLocks are mutually exclusive");
   }
diff --git a/src/hotspot/share/runtime/synchronizer.cpp b/src/hotspot/share/runtime/synchronizer.cpp
index 4c5ea4a6e40..4f9c7c21a9b 100644
--- a/src/hotspot/share/runtime/synchronizer.cpp
+++ b/src/hotspot/share/runtime/synchronizer.cpp
@@ -418,7 +418,7 @@ void ObjectSynchronizer::handle_sync_on_value_based_class(Handle obj, JavaThread
 }
 static bool useHeavyMonitors() {
-#if defined(X86) || defined(AARCH64)
+#if defined(X86) || defined(AARCH64) || defined(PPC64)
   return UseHeavyMonitors;
 #else
   return false;
diff --git a/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java b/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java
index cd32e222f68..922b18836dd 100644
--- a/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java
+++ b/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java
@@ -48,7 +48,7 @@
 /*
  * @test
  * @summary Exercise multithreaded maps, using only heavy monitors.
- * @requires os.arch=="x86" | os.arch=="i386" | os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64"
+ * @requires os.arch=="x86" | os.arch=="i386" | os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" | os.arch == "ppc64" | os.arch == "ppc64le"
  * @library /test/lib
  * @run main/othervm/timeout=1600 -XX:+IgnoreUnrecognizedVMOptions -XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors MapLoops
  */

Note that this version no longer requires changes in sharedRuntime_ppc because the native wrapper generator uses the same code as C2.
The test case has passed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6320

From kvn at openjdk.java.net Thu Dec 2 00:58:32 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 2 Dec 2021 00:58:32 GMT
Subject: RFR: 8273563: Improve performance of implicit exceptions with -XX:-OmitStackTraceInFastThrow [v11]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 11:09:55 GMT, Volker Simonis wrote:

>> Currently, if running with `-XX:-OmitStackTraceInFastThrow`, C2 has no possibility to create implicit exceptions like AIOOBE, NullPointerExceptions, etc. in compiled code. This means that such methods will always be deoptimized and re-executed in the interpreter if such exceptions are happening.
>>
>> If implicit exceptions are used for normal control flow, that can have a dramatic impact on performance. A prominent example for such code is [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274):
>>
>>     public static boolean isAlpha(int c) {
>>         try {
>>             return IS_ALPHA[c];
>>         } catch (ArrayIndexOutOfBoundsException ex) {
>>             return false;
>>         }
>>     }
>>
>>
>> ### Solution
>>
>> Instead of deoptimizing and resorting to the interpreter, we can generate code which allocates and initializes the corresponding exceptions right in compiled code.
This results in a tenfold performance improvement for the above code:
>>
>> -XX:-OmitStackTraceInFastThrow -XX:-OptimizeImplicitExceptions
>> Benchmark                 (exceptionProbability)  Mode  Cnt      Score      Error  Units
>> ImplicitExceptions.bench                     0.0  avgt    5      1.430 ±    0.353  ns/op
>> ImplicitExceptions.bench                    0.33  avgt    5   3563.038 ±   77.358  ns/op
>> ImplicitExceptions.bench                    0.66  avgt    5   8609.693 ± 1205.104  ns/op
>> ImplicitExceptions.bench                    1.00  avgt    5  12842.401 ± 1022.728  ns/op
>>
>> -XX:-OmitStackTraceInFastThrow -XX:+OptimizeImplicitExceptions
>> Benchmark                 (exceptionProbability)  Mode  Cnt      Score      Error  Units
>> ImplicitExceptions.bench                     0.0  avgt    5      1.432 ±    0.352  ns/op
>> ImplicitExceptions.bench                    0.33  avgt    5    355.723 ±   16.641  ns/op
>> ImplicitExceptions.bench                    0.66  avgt    5    887.068 ±  166.728  ns/op
>> ImplicitExceptions.bench                    1.00  avgt    5   1274.418 ±   88.235  ns/op
>>
>>
>> ### Implementation details
>>
>> - The new optimization is guarded by the option `OptimizeImplicitExceptions`, which is on by default.
>> - In `GraphKit::builtin_throw()` we can't simply use `CallGenerator::for_direct_call()` to create a `DirectCallGenerator` for the call to the exception's `` function because `DirectCallGenerator` assumes in various places that calls are only issued at `invoke*` bytecodes. This is not true in general for bytecodes which can cause an implicit exception.
>> - Instead, we manually wire up the call based on the code in `DirectCallGenerator::generate()`.
>> - We use a trick similar to the one for method handle intrinsics, where the callee from the bytecode is replaced by a direct call and this fact is recorded in the call's `_override_symbolic_info` field. For calling constructors of implicit exceptions I've introduced the new field `_implicit_exception_init`. This field is also used in various assertions to prevent queries for the bytecode's symbolic method information which doesn't exist because we're not at an `invoke*` bytecode at the place where we generate the call.
>> - The PR contains a micro-benchmark which compares the old and the new implementation for [Tomcat's `HttpParser::isAlpha()` method](https://github.com/apache/tomcat/blob/26ba86cdbd40ca718e43b82e62b3eb49d004c3d6/java/org/apache/tomcat/util/http/parser/HttpParser.java#L266-L274). Except for the trivial case where the exception probability is 0 (i.e. no exceptions happen at all) the new implementation is about 10 times faster. > > Volker Simonis has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Fix jit/t/t105/t105.java to also use -XX:-OptimizeImplicitExceptions in addition to -XX:-OmitStacktracesInFastThrow > - Fix IR Framework test Traps::classCheck() which now behaves differently with -XX:+OptimizeImplicitExceptions > - Added jtreg test and extended the Whitebox API to export decompile, deopt and trap counters > Rebased on top of '8275908: Record null_check traps for calls and array_check traps in the interpreter' > - Fix special case where we're creating an implicit exception for a regular invoke* bytecode > - Minor updates as requested by @TheRealMDoerr > - 8273563: Improve performance of implicit exceptions with -XX:-OmitStackTraceInFastThrow Volker, what does the MDO (bytecodes and counters) look like for your test case method (-XX:CompileCommand=print,ImplicitException.isAlphaWithException)? src/hotspot/share/opto/graphKit.cpp line 627: > 625: const TypeKlassPtr *ex_type = TypeKlassPtr::make(ex_ciInstKlass); > 626: kill_dead_locals(); > 627: Node* ex_node = new_instance(makecon(ex_type), NULL, NULL, true); What happens if deoptimization happens during this allocation (which is a safepoint)? Which bytecode will be executed in the interpreter after deopt?
src/hotspot/share/opto/graphKit.cpp line 629: > 627: Node* ex_node = new_instance(makecon(ex_type), NULL, NULL, true); > 628: set_argument(0, ex_node); > 629: ciMethod* init = ex_ciInstKlass->find_method(ciSymbol::make(""), ciSymbol::make("()V")); I know that all exception classes have such a constructor, but in general you need to check for `nullptr`. I think it could be moved before the check at line 624. src/hotspot/share/opto/graphKit.cpp line 640: > 638: address target = SharedRuntime::get_resolve_opt_virtual_call_stub(); > 639: > 640: CallStaticJavaNode *call = new CallStaticJavaNode(kit.C, TypeFunc::make(init), target, init); At the end `()` will call native `fillInStackTrace()` and nothing else: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Throwable.java#L255 Should we optimize it by inlining it here so that EA can eliminate the above allocation if it does not escape? ------------- PR: https://git.openjdk.java.net/jdk/pull/5488 From aoqi at openjdk.java.net Thu Dec 2 01:10:24 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Thu, 2 Dec 2021 01:10:24 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 08:32:44 GMT, Aleksey Shipilev wrote: > So, shouldn't we just clean the entire C1 of PPC32 macros? Done. @DamonFool @tstuefe, I have updated the pull request. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From jiefu at openjdk.java.net Thu Dec 2 01:18:24 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 01:18:24 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 17:22:53 GMT, Ao Qi wrote: >> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. >> >> This should cause a build error.
I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing ")" Marked as reviewed by jiefu (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From dholmes at openjdk.java.net Thu Dec 2 02:10:27 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 2 Dec 2021 02:10:27 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: Message-ID: <7LuaRE7ggJLUwX-IWbm_HR4jzqTb02z9-jqDFcLtz5M=.634c995b-3616-4145-911c-f639aee68b21@github.com> On Tue, 30 Nov 2021 20:44:43 GMT, Aleksey Shipilev wrote: >> I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there. >> >> Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors. But these tests are rather fast themselves and cover important compiler features. >> >> We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. 
>> >> Sample times for new subgroups (think about this as "How much time they add to existing tiers"): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 >> ============================== >> >> real 2m16.518s >> user 35m40.839s >> sys 1m35.334s >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 >> ============================== >> >> real 4m31.935s >> user 71m54.617s >> sys 2m13.073s > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Filter out tier1/2 groups too Marked as reviewed by dholmes (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From dholmes at openjdk.java.net Thu Dec 2 02:10:28 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 2 Dec 2021 02:10:28 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: <72-LBIFmV82Z0GRieH-cxjyrub8Z-VVC4izGsW_bc4A=.1e021a98-3a4e-4ccc-87a7-c3e01c26d5e5@github.com> Message-ID: <1VQI2YTryJdLrkpNHAlnHJ9Mh6AlcQb0EoxQvVZkc2k=.6dff5cf5-7d24-4949-a245-e308bb5a2934@github.com> On Wed, 1 Dec 2021 20:56:26 GMT, Vladimir Kozlov wrote: >>> @shipilev again I think we need to examine this in terms of impact to our CI. We run different platforms and configurations in different tiers so the costs are not as simple as looking at one run. >> >> Again, I can wait for those who have more insight in Oracle testing pipelines check their workflows with this change. I have no insight in Oracle infra, so somebody else have to do it. Now that Igor left, who should we be talking to? > >> > @shipilev again I think we need to examine this in terms of impact to our CI. 
We run different platforms and configurations in different tiers so the costs are not as simple as looking at one run. >> >> Again, I can wait for those who have more insight in Oracle testing pipelines check their workflows with this change. I have no insight in Oracle infra, so somebody else have to do it. Now that Igor left, who should we be talking to? > > @dholmes-ora I checked and this change does not interfere with our CI. `tier2` and `tier3` introduced by #5241 are not used by our CI. New `tier2_compiler` and `tier3_compiler` groups are also not used. We use different sets in CI. I am not sure how else it can affect our testing. > > I also submitted our testing. I will let you know results. @vnkozlov thanks for that! I didn't realize the HS testing was so isolated from the jtreg group definitions. Thanks for your patience @shipilev . ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From jiefu at openjdk.java.net Thu Dec 2 02:32:29 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 02:32:29 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: <3scFWQBWPLI0rgMH_D2n02JqJdXEGtlz_5s2Co6re40=.57cbe0a5-6f91-4ec5-958d-a3a5a3bddec8@github.com> On Wed, 1 Dec 2021 23:19:47 GMT, Jie Fu wrote: > > Yes, the patch doesn't change behavior on AVX2 and older AVX512 systems. > > Thanks for your clarification. But it still remains unknown why the 64-byte instructions shouldn't be used on CPUs which don't support `serialize`. > > I will test the 64-byte instructions on older AVX512 systems today and feedback here. Here is the performance data on our older AVX512 platform which doesn't support `serialize`. Even without `serialize` , the performance has been improved with 64-byte instructions. E.g., for `ArrayCopy.arrayCopyObjectNonConst`, it has been improved by ~15%. So it seems unfair only enable 64-byte instructions for the latest Intel AVX512 platforms. 
Still, I would like to know why we don't use 64-byte instructions on platforms without `serialize` support.
Thanks.

---------------------------------------------------

Results with 32-byte instructions.

==> perf32-1.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  24.070 ± 0.013  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.517 ± 0.023  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  21.127 ± 0.008  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.934 ± 0.009  ns/op

==> perf32-2.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  24.511 ± 0.027  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.240 ± 0.034  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  21.065 ± 0.013  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.956 ± 0.161  ns/op

==> perf32-3.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  25.357 ± 0.006  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.513 ± 1.468  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.984 ± 0.024  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  20.945 ± 1.346  ns/op

Results with 64-byte instructions.

==> perf64-1.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  23.425 ± 0.003  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  23.530 ± 0.002  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.174 ± 0.074  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  19.942 ± 0.134  ns/op

==> perf64-2.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  22.429 ± 0.012  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  25.189 ± 0.031  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.093 ± 0.004  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  20.400 ± 1.213  ns/op

==> perf64-3.log <==
Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt    5  23.472 ± 0.002  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt    5  23.534 ±
0.031 ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.232 ± 0.150  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.921 ± 0.008  ns/op

------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From david.holmes at oracle.com Thu Dec 2 02:46:56 2021 From: david.holmes at oracle.com (David Holmes) Date: Thu, 2 Dec 2021 12:46:56 +1000 Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> Message-ID: <6180c804-3396-774d-8cda-5c3900c8a0f4@oracle.com> On 1/12/2021 11:54 pm, Jie Fu wrote: > On Wed, 1 Dec 2021 12:41:14 GMT, David Holmes wrote: > >> From my previous questions on this it was indicated that any CPU that supports `serialize` has the improved performance. > > If so, CPUs that don't support `serialize` would behave as before. > Then there shouldn't be any performance regression. Yes, which is exactly why we have been saying this should not affect "old" CPUs. David > ------------- > > PR: https://git.openjdk.java.net/jdk/pull/6512 > From david.holmes at oracle.com Thu Dec 2 02:51:46 2021 From: david.holmes at oracle.com (David Holmes) Date: Thu, 2 Dec 2021 12:51:46 +1000 Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: <3scFWQBWPLI0rgMH_D2n02JqJdXEGtlz_5s2Co6re40=.57cbe0a5-6f91-4ec5-958d-a3a5a3bddec8@github.com> References: <3scFWQBWPLI0rgMH_D2n02JqJdXEGtlz_5s2Co6re40=.57cbe0a5-6f91-4ec5-958d-a3a5a3bddec8@github.com> Message-ID: <0bae5279-90e5-3f13-61ff-820684c049f9@oracle.com> On 2/12/2021 12:32 pm, Jie Fu wrote: > On Wed, 1 Dec 2021 23:19:47 GMT, Jie Fu wrote: > >>> Yes, the patch doesn't change behavior on AVX2 and older AVX512 systems. >> >> Thanks for your clarification.
But it still remains unknown why the 64-byte instructions shouldn't be used on CPUs which don't support `serialize`. >> >> I will test the 64-byte instructions on older AVX512 systems today and feedback here. > > > Here is the performance data on our older AVX512 platform which doesn't support `serialize`. > > Even without `serialize` , the performance has been improved with 64-byte instructions. > E.g., for `ArrayCopy.arrayCopyObjectNonConst`, it has been improved by ~15%. > > So it seems unfair only enable 64-byte instructions for the latest Intel AVX512 platforms. > > Still, I would like to know why we don't use 64-byte instructions on platforms without `serialize` support. Because, as previously stated, there is no actual way to identify those CPUs. But we know that if they support serialize then they also support the faster 64-bit ops. But that doesn't mean that if they don't support serialize that they don't support the faster 64-bit ops. So all that is available for choosing whether to use them or not is whether serialize is supported. David
> Thanks.
>
> ---------------------------------------------------
>
> Results with 32-byte instructions.
>
> ==> perf32-1.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  24.070 ± 0.013  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.517 ± 0.023  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  21.127 ± 0.008  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.934 ± 0.009  ns/op
>
> ==> perf32-2.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  24.511 ± 0.027  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.240 ± 0.034  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  21.065 ± 0.013  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.956 ± 0.161  ns/op
>
> ==> perf32-3.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  25.357 ±
0.006 ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  27.513 ± 1.468  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.984 ± 0.024  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  20.945 ± 1.346  ns/op
>
>
>
> Results with 64-byte instructions.
>
> ==> perf64-1.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  23.425 ± 0.003  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  23.530 ± 0.002  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.174 ± 0.074  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  19.942 ± 0.134  ns/op
>
> ==> perf64-2.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  22.429 ± 0.012  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  25.189 ± 0.031  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.093 ± 0.004  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  20.400 ± 1.213  ns/op
>
> ==> perf64-3.log <==
> Benchmark                                    Mode  Cnt   Score   Error  Units
> ArrayCopy.arrayCopyObject                    avgt    5  23.472 ± 0.002  ns/op
> ArrayCopy.arrayCopyObjectNonConst            avgt    5  23.534 ± 0.031  ns/op
> ArrayCopy.arrayCopyObjectSameArraysBackward  avgt    5  20.232 ± 0.150  ns/op
> ArrayCopy.arrayCopyObjectSameArraysForward   avgt    5  21.921 ± 0.008  ns/op
>
> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/6512
>
From duke at openjdk.java.net Thu Dec 2 03:06:28 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 2 Dec 2021 03:06:28 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v4] In-Reply-To: References: Message-ID: <3Wb0saiWk-0Q6W6Yj8CkHPj9t__a3jwCAvua0O_sXKs=.234ac139-8d7c-4590-9430-b02e4abbc924@github.com> On Wed, 1 Dec 2021 22:32:21 GMT, Zang, Zhiqiang wrote: >> test/hotspot/jtreg/compiler/c2/TestSubIdeal.java line 1: >> >>> 1: /* >> >> You missed the copyright header here, the same for the microbenchmark > > Thank you for reviewing.
This is my first pull request for OpenJDK. I was wondering what the format for the copyright should look like. I am a PhD student; should I state my university there? Thanks. I believe it depends on your OCA. If you signed it as an individual then the name appears there should be Oracle and/or its affiliates (read other files for the exact line), otherwise it would be the name of the company. Cheers. ------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From jiefu at openjdk.java.net Thu Dec 2 03:38:28 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 03:38:28 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 00:10:39 GMT, Sandhya Viswanathan wrote: >> Currently 32-byte instructions are used for small array copy and clear. >> This can be optimized by using 64-byte instructions. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace LGTM ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6512 From jiefu at openjdk.java.net Thu Dec 2 03:38:28 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 03:38:28 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: <0bae5279-90e5-3f13-61ff-820684c049f9@oracle.com> References: <0bae5279-90e5-3f13-61ff-820684c049f9@oracle.com> Message-ID: On Thu, 2 Dec 2021 02:53:52 GMT, David Holmes wrote: > Because, as previously stated, there is no actual way to identify those > CPUs. But we know that if they support serialize then they also support > the faster 64-bit ops. But that doesn't means that if they don't support > serialize that they don't support the faster 64-bit ops. So all that is > available for choosing whether to use them or not is whether serialize > is supported. OK, make sense! 
Since it won't make things worse for the "old" systems, I'm fine with it. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From dlong at openjdk.java.net Thu Dec 2 03:39:22 2021 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 2 Dec 2021 03:39:22 GMT Subject: RFR: 8277882: New subnode ideal optimization: converting "c0 - (x + c1)" into "(c0 - c1) - x" [v5] In-Reply-To: References: Message-ID: <-nkxszqjPlPFizSxBOffBqpFhByeR1CUaPHNBCQh-Yk=.2b6d56bf-cc26-4157-981a-cf3b3d466b6c@github.com> On Thu, 2 Dec 2021 02:38:09 GMT, Zang, Zhiqiang wrote: >> Suggest two new optimizations that can be done in SubINode::Ideal. > > Zang, Zhiqiang has updated the pull request incrementally with one additional commit since the last revision: > > address comments from code review. Do you want to make the same change to SubLNode for completeness, and adds tests for "long" like you did for "int"? If so, then you might find the refactoring that Roland is doing in 8262341 useful. ------------- PR: https://git.openjdk.java.net/jdk/pull/6441 From dlong at openjdk.java.net Thu Dec 2 05:20:21 2021 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 2 Dec 2021 05:20:21 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v2] In-Reply-To: References: <0GbtDPLsXoyXYNJv2ZJ6vwTSdbzJOXqWUNENCTCrZmA=.d0b24bd8-be69-435d-8db1-d291afcd7f62@github.com> Message-ID: <4ggmeDmX6-82KhLQC0_ijUHQqeCHrD-fHrTwGGhoneM=.36d2be35-5156-4bbd-b55c-07a15487062e@github.com> On Wed, 1 Dec 2021 13:11:00 GMT, Roland Westrelin wrote: > Where would deoptimization occur then? In Parse::catch_inline_exceptions() when the right exception handlers is picked with subtype checks? I don't see any uncommon trap there. Maybe the requirement to keep the stack is no longer necessary? At the end of the method I see: 1. a runtime call to the rethrow stub. The comment says "must not deoptimize", but I'm not sure what prevents that 2. 
catch_call_exceptions(), which does an uncommon trap for unloaded exception classes > The reason I didn't go with Vladimir's work around for 8273165 is that I think it could have a performance impact that would be more likely to be noticed than in the case of 8273165 (because late inlining of method handle has been around for many releases and is likely something that's relied on). We could extend Vladimir's work around by checking that the receiver may be null and that the null check would cause the exception to be thrown rather than deoptimization. But with possible performance impact, right? > Another way to deal with this could be to pop the stack if the current method has no exception handlers because then the exception is passed on to the caller and the entire frame is popped anyway. That would work nicely for this case as AFAIU, the method handle invoker can only be inlined from a lambda form that wouldn't have exception handlers. That sounds promising. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From jiefu at openjdk.java.net Thu Dec 2 06:59:30 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 06:59:30 GMT Subject: RFR: 8276162: Optimise unsigned comparison pattern [v4] In-Reply-To: References: Message-ID: On Sat, 13 Nov 2021 05:22:07 GMT, Mai Đặng Quân Anh wrote: >> This patch changes operations in the form `x +- Integer.MIN_VALUE <=> y +- Integer.MIN_VALUE`, which is a pattern used to do unsigned comparisons, into `x u<=> y`. >> >> In addition to being basic operations, they may be utilised to implement range checks such as the methods in `jdk.internal.util.Preconditions`, or in places where the compiler cannot deduce the non-negativeness of the bound as in `java.util.ArrayList`. >> >> Thank you very much.
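[Editorial illustration, not part of the original message.] The pattern in the quoted description can be sketched in plain Java. This is a minimal sketch, assuming only the standard `Integer.compareUnsigned` method; the class and method names (`UnsignedCompareSketch`, `lessThanBiased`, `lessThanUnsigned`, `inRange`) are hypothetical and do not come from the PR. Adding `Integer.MIN_VALUE` flips the sign bit of both operands, so a signed comparison of the biased values orders them the same way an unsigned comparison orders the originals.

```java
public class UnsignedCompareSketch {
    // The source-level idiom: bias both operands, then compare as signed ints.
    static boolean lessThanBiased(int x, int y) {
        return x + Integer.MIN_VALUE < y + Integer.MIN_VALUE;
    }

    // The same comparison expressed through the standard library.
    static boolean lessThanUnsigned(int x, int y) {
        return Integer.compareUnsigned(x, y) < 0;
    }

    // A range check as one unsigned comparison: for length >= 0 this is
    // equivalent to (0 <= index && index < length).
    static boolean inRange(int index, int length) {
        return Integer.compareUnsigned(index, length) < 0;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 42, Integer.MAX_VALUE};
        for (int x : samples) {
            for (int y : samples) {
                // Both forms must agree on every pair of samples.
                if (lessThanBiased(x, y) != lessThanUnsigned(x, y)) {
                    throw new AssertionError(x + " vs " + y);
                }
            }
        }
        System.out.println("biased and unsigned comparisons agree");
    }
}
```

A negative index such as -1 reads as a very large unsigned value, which is why the single comparison in `inRange` also rejects negative indices.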
> > Mai Đặng Quân Anh has updated the pull request incrementally with two additional commits since the last revision: > > - add tests cover constant comparison and calling library > - add eq/ne, add correction test, refine micro This introduced a regression on 32-bit x86 for Vector API tests. Could someone help review the fix? https://github.com/openjdk/jdk/pull/6533 Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6101 From duke at openjdk.java.net Thu Dec 2 07:00:52 2021 From: duke at openjdk.java.net (Danil Bubnov) Date: Thu, 2 Dec 2021 07:00:52 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> [v2] In-Reply-To: References: Message-ID: > This is a fix for the aarch64 JVMCI calling convention. > > On MacOS/aarch64 "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only word-size or bigger slots, which is incorrect (that is why the tests were failing [4]). Now arguments consume the right number of bytes. > > Another problem is that the current approach doesn't keep the stack pointer 16-byte aligned [1][2][3]. However, the tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, so 16-byte alignment comes automatically. But if you add or delete one argument, the tests fail with SIGBUS. > > I've tested this patch on MacOS/aarch64 and Linux/aarch64; all tests have passed. > > Also, I don't understand why the current tests (NativeCallTest) use only int, long, float and double as argument types. Is it possible to add functions with other types like byte or short? I tried, but it fails on every platform.
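[Editorial illustration, not part of the original message.] The two stack-layout points in the description above can be sketched as a small offset calculation. This is a rough sketch under stated assumptions, not the JVMCI code from the PR: `StackArgLayoutSketch`, `alignUp` and `stackBytes` are hypothetical names, only arguments that are already passed on the stack are modelled, and every argument size is assumed to be a power of two. With `packed` set, each argument takes its natural size and alignment (the Apple rule quoted in [1]); otherwise each argument takes at least an 8-byte slot. In both cases the total is rounded up so the stack pointer stays 16-byte aligned.

```java
public class StackArgLayoutSketch {
    // Round value up to the next multiple of alignment (alignment must be a power of two).
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) & -alignment;
    }

    // Total bytes reserved for stack-passed arguments of the given sizes.
    // packed == true : natural size and alignment per argument.
    // packed == false: every argument occupies at least one 8-byte slot.
    static int stackBytes(int[] argSizes, boolean packed) {
        int offset = 0;
        for (int size : argSizes) {
            int slot = packed ? size : Math.max(size, 8);
            offset = alignUp(offset, slot) + slot;
        }
        // Keep the stack pointer 16-byte aligned regardless of the argument count.
        return alignUp(offset, 16);
    }

    public static void main(String[] args) {
        int[] threeInts = {4, 4, 4};
        System.out.println("packed:       " + stackBytes(threeInts, true));
        System.out.println("8-byte slots: " + stackBytes(threeInts, false));
    }
}
```

With an odd number of 8-byte slots the raw total (for example 24 bytes) is not a multiple of 16, which matches the observation in the description that an odd argument count is what exposes the missing alignment.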
> > [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms > [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack > [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack > [4] https://bugs.openjdk.java.net/browse/JDK-8262901 Danil Bubnov has updated the pull request incrementally with one additional commit since the last revision: Add test with an odd number of arguments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6641/files - new: https://git.openjdk.java.net/jdk/pull/6641/files/c76331bc..2ff8c037 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6641&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6641&range=00-01 Stats: 73 lines in 2 files changed: 73 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6641.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6641/head:pull/6641 PR: https://git.openjdk.java.net/jdk/pull/6641 From kvn at openjdk.java.net Thu Dec 2 07:32:29 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 07:32:29 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: Message-ID: <_tpNEJ7RnKYConrel3BdMSsn7DRHOZX1i1lqX3yVZD8=.82735e32-7713-4d6f-8142-9bba01cbee48@github.com> On Tue, 30 Nov 2021 20:44:43 GMT, Aleksey Shipilev wrote: >> I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there. >> >> Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors. But these tests are rather fast themselves and cover important compiler features. >> >> We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. 
The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. >> >> Sample times for new subgroups (think about this as "How much time they add to existing tiers"): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 >> ============================== >> >> real 2m16.518s >> user 35m40.839s >> sys 1m35.334s >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 >> ============================== >> >> real 4m31.935s >> user 71m54.617s >> sys 2m13.073s > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Filter out tier1/2 groups too I don't see issues with these changes in my testing. I submitted our tier1,2,3 testing in internal infra. ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From duke at openjdk.java.net Thu Dec 2 07:33:23 2021 From: duke at openjdk.java.net (Danil Bubnov) Date: Thu, 2 Dec 2021 07:33:23 GMT Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> In-Reply-To: <-Wzb3RXPFeEY0E7HDT-7lWw1_2mxReE-J8FPeI3UKl8=.4211598e-f032-46b8-8aa4-6f5254785193@github.com> References: <-Wzb3RXPFeEY0E7HDT-7lWw1_2mxReE-J8FPeI3UKl8=.4211598e-f032-46b8-8aa4-6f5254785193@github.com> Message-ID: <14apY849fpxKvDBzwHwE8rGYA3Pov9oAPCJ7X6dh2CQ=.8ec2a540-b45c-46dc-889a-8c96027450b1@github.com> On Wed, 1 Dec 2021 18:17:41 GMT, Andrew Haley wrote: > Please add some test with an odd number of arguments. Done. I took the test I32SDILDS, copied it, replaced the last 6 arguments with 1 int argument, and changed the return type to int.
Without stack alignment from this patch this test fails with SIGBUS on Linux/aarch64 and MacOS/aarch64 (on MacOS/aarch64 even existing F32SDILDS fails) ------------- PR: https://git.openjdk.java.net/jdk/pull/6641 From kvn at openjdk.java.net Thu Dec 2 07:44:23 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 07:44:23 GMT Subject: RFR: 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 24 Nov 2021 08:52:28 GMT, Jie Fu wrote: > Hi all, > > The following vector api tests fail with "bad AD file" on x86_32. > > jdk/incubator/vector/Long128VectorTests.java > jdk/incubator/vector/Long256VectorTests.java > jdk/incubator/vector/Long512VectorTests.java > jdk/incubator/vector/LongMaxVectorTests.java > jdk/incubator/vector/Long64VectorTests.java > > > Let's fix it. > > Thanks. > Best regards, > Jie Good. @DamonFool In PR description please describe what was the problem. Yes, I see from changes that 32-bit .ad file was missing instruction definitions. But it should be said in description. And say about testing you did. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6533 From shade at openjdk.java.net Thu Dec 2 07:57:25 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 2 Dec 2021 07:57:25 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 17:22:53 GMT, Ao Qi wrote: >> `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. >> >> This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? 
> > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing ")" Marked as reviewed by shade (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From jiefu at openjdk.java.net Thu Dec 2 08:13:22 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 08:13:22 GMT Subject: RFR: 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 07:40:34 GMT, Vladimir Kozlov wrote: > @DamonFool In PR description please describe what was the problem. Yes, I see from changes that 32-bit .ad file was missing instruction definitions. But it should be said in description. > > And say about testing you did. Thanks @vnkozlov . Got it. Will do it next time. Testing: - tier1 ~ tier3 on linux/x86_32, no regression ------------- PR: https://git.openjdk.java.net/jdk/pull/6533 From neliasso at openjdk.java.net Thu Dec 2 09:11:21 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 2 Dec 2021 09:11:21 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). 
> > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. Looks good! ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6628 From phedlin at openjdk.java.net Thu Dec 2 09:28:30 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Thu, 2 Dec 2021 09:28:30 GMT Subject: RFR: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). 
> > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. Thank you for reviewing @aph and @neliasso. Thank you for contributing the patch @luhenry. Thank you for commenting on the issue @cl4es. ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From phedlin at openjdk.java.net Thu Dec 2 09:28:31 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Thu, 2 Dec 2021 09:28:31 GMT Subject: Integrated: 8251216: Implement MD5 intrinsics on AArch64 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:24:45 GMT, Patric Hedlin wrote: > Implementation of MD5 intrinsic support for AArch64. > > Contributed by Ludovic Henry (@luhenry). 
> > Speedup measured (in Aurora running Ampere Altra) as follows: > > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1048576-provider:...29.39% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2047-provider:.........28.91% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:2048-provider:.........28.81% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1023-provider:.........28.43% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:1024-provider:.........28.32% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:511-provider:...........27.78% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:512-provider:...........27.62% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:255-provider:...........26.52% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:256-provider:...........26.38% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:127-provider:...........25.41% > openjdk.bench.javax.crypto.full.MessageDigestBench.digest-algorithm:MD5-dataSize:128-provider:...........24.66% > > Testing tier1-7. This pull request has now been integrated. 
Changeset: 088b244e Author: Patric Hedlin URL: https://git.openjdk.java.net/jdk/commit/088b244ec6d9393a1fcd2233fa5b4cf46f9ae0dd Stats: 199 lines in 4 files changed: 193 ins; 1 del; 5 mod 8251216: Implement MD5 intrinsics on AArch64 Co-authored-by: Ludovic Henry Reviewed-by: aph, neliasso ------------- PR: https://git.openjdk.java.net/jdk/pull/6628 From duke at openjdk.java.net Thu Dec 2 11:27:41 2021 From: duke at openjdk.java.net (王超) Date: Thu, 2 Dec 2021 11:27:41 GMT Subject: RFR: JDK-8278135: Remove un-necessary null-check for get-static in c2 Message-ID: When running the following test, many unnecessary null-check deoptimizations happen. Small Test: public class CodeDependenciesTest { private Object obj; private String[] strs; private Object[][] objs; private static Class clzOne; private static Class clzTwo; public static void main(String[] args) throws Exception { CodeDependenciesTest codeDependenciesTest = new CodeDependenciesTest(); codeDependenciesTest.obj = new String("1"); for (int i = 0; i < 300000; i++) { codeDependenciesTest.foo(); } } public void foo() { objs = new Object[10][10]; for (int i = 0; i < 10; i++) { for (int j = 0; j < 10; j++) { objs[i][j] = new Object(); } } clzOne = InvokeTest.class; clzTwo = clzOne; } static class InvokeTest { public void bar(String i) { try { Thread.sleep(Long.valueOf(i)); } catch (Exception e) { e.printStackTrace(); } } } } The deoptimization log generated by `-XX:+TraceDeoptimization` is: Uncommon trap bci=63 pc=0x00007f0eafbe1e38, relative_pc=0x00000000000005f8, method=CodeDependenciesTest.foo()V, debug_id=0 Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=288 (@0x00007f0eafbe1e38) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec02348a0 mode 2 {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 
0x00007f0ec8a317c8 Uncommon trap bci=63 pc=0x00007f0eafbe0f34, relative_pc=0x0000000000000514, method=CodeDependenciesTest.foo()V, debug_id=0 Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=287 (@0x00007f0eafbe0f34) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec0235e40 mode 2 {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 The corresponding opto assembly is, 230 B20: # out( B56 B21 ) <- in( B58 B43 B41 B19 ) Freq: 0.999369 230 movl [RBX + #112 (8-bit)], narrowoop: java/lang/Class:exact * # compressed ptr ! Field: CodeDependenciesTest.clzOne 237 movq R10, RBX # ptr -> long 23a movq RBP, java/lang/Class:exact * # ptr 244 movq R11, RBP # ptr -> long 247 xorq R11, R10 # long 24a shrq R11, #22 24e testq R11, R11 251 je B56 P=0.000001 C=-1.000000 257 B21: # out( B56 B22 ) <- in( B20 ) Freq: 0.999368 257 shrq R10, #9 25b addq R14, R10 # ptr 25e cmpb [R14], #4 262 je B56 P=0.000001 C=-1.000000 5ed B56: # out( N780 ) <- in( B55 B21 B20 ) Freq: 3.02464e-06 5ed movl RSI, #-20 # int nop # 1 bytes pad for loops and calls 5f3 call,static wrapper for: uncommon_trap(reason='null_assert_or_unreached0' action='make_not_entrant' debug_id='0') # CodeDependenciesTest::foo @ bci:63 (line 23) L[0]=_ L[1]=_ L[2]=_ STK[0]=RBP # OopMap {rbp=Oop off=1528/0x5f8} 5f8 stop # ShouldNotReachHere C2 tries to generate a null-check for the get-static in `clzTwo = clzOne;`, because it thinks that ciKlass of `java.lang.Class` is not loaded. The ciKlass of `java.lang.Class` is generated by the following stack trace, (gdb) bt #0 SystemDictionary::find_instance_klass (class_name=0x800481080, class_loader=..., protection_domain=...) 
at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:778 #1 0x00007ffff610769d in SystemDictionary::find_instance_or_array_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:813 #2 0x00007ffff610a55c in SystemDictionary::find_constrained_instance_or_array_klass (current=0x7ffff020a5e0, class_name=0x800481080, class_loader=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:1760 #3 0x00007ffff55f1f29 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7ffff00334f8, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:519 #4 0x00007ffff55f1d81 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7fff8c004518, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:488 #5 0x00007ffff55f2466 in ciEnv::get_klass_by_index_impl (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:611 #6 0x00007ffff55f2632 in ciEnv::get_klass_by_index (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:658 #7 0x00007ffff55fb1bd in ciField::ciField (this=0x7fff8c0f9208, klass=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciField.cpp:101 #8 0x00007ffff55f344f in ciEnv::get_field_by_index_impl (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:798 #9 0x00007ffff55f3506 in ciEnv::get_field_by_index (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:811 #10 0x00007ffff56284ec in ciBytecodeStream::get_field (this=0x7fffc1204850, 
will_link=@0x7fffc12046cf: false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciStreams.cpp:274 #11 0x00007ffff562d10c in ciTypeFlow::StateVector::do_putstatic (this=0x7fff8c1bd078, str=0x7fffc1204850) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:798 #12 0x00007ffff562e960 in ciTypeFlow::StateVector::apply_one_bytecode (this=0x7fff8c1bd078, str=0x7fffc1204850) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:1457 #13 0x00007ffff563218d in ciTypeFlow::flow_block (this=0x7fff8c0f7a50, block=0x7fff8c0f8ec0, state=0x7fff8c1bd078, jsrs=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2364 #14 0x00007ffff563332d in ciTypeFlow::df_flow_types (this=0x7fff8c0f7a50, start=0x7fff8c0f80a8, do_flow=true, temp_vector=0x7fff8c1bd078, temp_set=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2675 #15 0x00007ffff563361f in ciTypeFlow::flow_types (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2725 #16 0x00007ffff5634081 in ciTypeFlow::do_flow (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2886 #17 0x00007ffff5605687 in ciMethod::get_flow_analysis (this=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciMethod.cpp:327 #18 0x00007ffff5499d34 in InlineTree::check_can_parse (callee=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/opto/bytecodeInfo.cpp:535 #19 0x00007ffff55a0806 in CallGenerator::for_osr (m=0x7fff8c0f6340, osr_bci=22) at /data/openjdk/jdk_dev/src/hotspot/share/opto/callGenerator.cpp:299 #20 0x00007ffff56aa2d9 in Compile::Compile (this=0x7fffc1205900, ci_env=0x7fffc12066c0, target=0x7fff8c0f6340, osr_bci=22, options=..., directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/compile.cpp:687 #21 0x00007ffff559da3e in C2Compiler::compile_method (this=0x7ffff0209f40, env=0x7fffc12066c0, target=0x7fff8c0f6340, entry_bci=22, install_code=true, directive=0x7ffff01588a0) at 
/data/openjdk/jdk_dev/src/hotspot/share/opto/c2compiler.cpp:108 #22 0x00007ffff56c7f6e in CompileBroker::invoke_compiler_on_method (task=0x7ffff02322d0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:2291 #23 0x00007ffff56c6aed in CompileBroker::compiler_thread_loop () at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:1966 #24 0x00007ffff56e6f9f in CompilerThread::thread_entry (thread=0x7ffff020a5e0, __the_thread__=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compilerThread.cpp:59 #25 0x00007ffff614a42f in JavaThread::thread_main_inner (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1297 #26 0x00007ffff614a2c5 in JavaThread::run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1280 #27 0x00007ffff6147b08 in Thread::call_run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:358 #28 0x00007ffff5eafa4f in thread_native_entry (thread=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/os/linux/os_linux.cpp:705 #29 0x00007ffff779cea5 in start_thread (arg=0x7fffc1207700) at pthread_create.c:307 #30 0x00007ffff72c19fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 When `CodeDependenciesTest.foo` is compiled, the classloader of the holder of this method is AppClassLoader, so it finds `java.lang.Class` in AppClassLoader, and finds nothing, so it thinks that `java.lang.Class` is not loaded, but `java.lang.Class` is a built-in class which is definitely loaded and initialized. The following patch is the first patch we implemented, it tries to find Klass in the parent classloader when the Klass can not be found in current classloader. But the patch has potential risk, because user-defined classloader may not follow the 'parent delegation' style of classloading. 
diff --git a/src/hotspot/share/ci/ciEnv.cpp b/src/hotspot/share/ci/ciEnv.cpp index e29b56a..3ceafc4 100644 --- a/src/hotspot/share/ci/ciEnv.cpp +++ b/src/hotspot/share/ci/ciEnv.cpp @@ -514,11 +514,17 @@ ciKlass* ciEnv::get_klass_by_name_impl(ciKlass* accessing_klass, { ttyUnlocker ttyul; // release tty lock to avoid ordering problems MutexLocker ml(current, Compile_lock); - Klass* kls; - if (!require_local) { - kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader); - } else { - kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain); + Klass* kls = NULL; + while (true) { + if (!require_local) { + kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader); + } else { + kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain); + } + if (kls != NULL || loader() == NULL) { + break; + } + loader = Handle(current, java_lang_ClassLoader::parent(loader())); } found_klass = kls; } When the Klass of the field is not loaded, the generated 'null check' does not help, so we think removing it is the right way to avoid the deoptimization. ------------- Commit messages: - Remove un-necessary null check Changes: https://git.openjdk.java.net/jdk/pull/6667/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6667&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278135 Stats: 27 lines in 1 file changed: 0 ins; 27 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6667.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6667/head:pull/6667 PR: https://git.openjdk.java.net/jdk/pull/6667 From thartmann at openjdk.java.net Thu Dec 2 12:20:21 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 12:20:21 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. 
In-Reply-To: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: On Tue, 30 Nov 2021 08:39:52 GMT, Roland Westrelin wrote: > AddINode::Ideal() and AddLNode::Ideal() are almost identical but the > same logic had to be duplicated because AddINode::Ideal() tests its > inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for > Op_AddL, Op_SubL etc. This patch refactors the code so the common > logic is in a single method parameterized by a BasicType argument. > > The way I've done this before in the context of int/long counted loops > was to use an extra virtual method operates_on(). So: > > n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) > > Working on this change made me realize that pattern doesn't work that well: > > - it's quite a bit more verbose and converting existing code is not as > mechanical as we would like to avoid conversion errors. > > - it breaks when a class has a subclass. For instance AddNode has > OrINode and OrLNode as subclasses so testing for n->is_Add() returns > true with an OrI node. > > Instead, this change introduces new functions. For instance, for > AddI/AddL: > > int Op_Add(BasicType bt) > > that returns either Op_AddI or Op_AddL depending on bt. This made > refactoring the AddINode::Ideal() logic straightforward. I removed all > use of operates_on() as well and converted existing code to the new > Op_XXX() functions. Very nice, looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
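Editorial illustration (not part of the patch under review): a minimal sketch of the `Op_Add(BasicType bt)` idea quoted above, assuming heavily reduced, hypothetical stand-ins for `BasicType`, the opcode constants, and `Node`. Only the `int Op_Add(BasicType bt)` signature comes from the message; `Op_Sub` and `is_add_or_sub` are assumed names for illustration. The point is that comparing against an opcode chosen by `bt` lets one function body serve both the int and long cases without matching unrelated subclasses such as OrI.

```cpp
#include <cassert>

// Hypothetical, heavily reduced stand-ins for HotSpot's BasicType and opcodes.
enum BasicType { T_INT, T_LONG };
enum { Op_AddI = 1, Op_AddL, Op_SubI, Op_SubL, Op_OrI, Op_OrL };

// Select the int or long flavor of an opcode from the basic type.
int Op_Add(BasicType bt) { return bt == T_INT ? Op_AddI : Op_AddL; }
int Op_Sub(BasicType bt) { return bt == T_INT ? Op_SubI : Op_SubL; }  // assumed sibling helper

// Minimal node model: only the opcode matters for this sketch.
struct Node {
  int opcode;
  int Opcode() const { return opcode; }
};

// One body serves both types: the caller passes T_INT or T_LONG.
// Unlike an is_Add()-style class test, an OrI node never matches here.
bool is_add_or_sub(const Node& n, BasicType bt) {
  return n.Opcode() == Op_Add(bt) || n.Opcode() == Op_Sub(bt);
}
```

A caller would write `is_add_or_sub(n, T_INT)` where it previously compared against `Op_AddI` and `Op_SubI` directly, and pass `T_LONG` for the long variant.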
PR: https://git.openjdk.java.net/jdk/pull/6607 From shade at openjdk.java.net Thu Dec 2 12:37:43 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 2 Dec 2021 12:37:43 GMT Subject: RFR: 8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op Message-ID: SonarCloud complains about code added in [JDK-8277417](https://bugs.openjdk.java.net/browse/JDK-8277417): Field "_info" shadows a field of the same name in base class "LIR_Op" class LIR_OpLoadKlass: public LIR_Op { friend class LIR_OpVisitState; private: LIR_Opr _obj; CodeEmitInfo* _info; <--- here From the look of it, it seems risky to have two inconsistent fields here. Depending on which base class we use to access it, we might have different `_info`-s referenced. @rkennke, that was not intentional, was it? I don't see the mentions of this oddity in the original PR. The fix is to push `CodeEmitInfo` to super-class `LIR_Op`, and use it from there. Additional testing: - [x] Linux x86_64 fastdebug `tier1` - [x] Linux x86_64 fastdebug `tier2` ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/6669/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6669&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278141 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6669.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6669/head:pull/6669 PR: https://git.openjdk.java.net/jdk/pull/6669 From thartmann at openjdk.java.net Thu Dec 2 12:44:25 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 12:44:25 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? 
> > Thanks, > Denghui But isn't the intention of this code to always go to the slow path when `DTraceAllocProbes` is true (hence the `expand_fast_path = false;`) and apply dtrace probes in the runtime? `ExtendedDTraceProbes` is then there to `Enable performance-impacting dtrace probes`, i.e. add dtrace probes to the fast allocation path in C2. And it's only used when `ExtendedDTraceProbes` is true. src/hotspot/share/opto/macro.cpp line 1650: > 1648: call->init_req(TypeFunc::Control, ctrl); > 1649: call->init_req(TypeFunc::I_O , top()); // does no i/o > 1650: call->init_req(TypeFunc::Memory , rawmem); Good catch, looks like this was introduced by JDK-8237581. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From thartmann at openjdk.java.net Thu Dec 2 12:53:25 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 12:53:25 GMT Subject: RFR: 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 24 Nov 2021 08:52:28 GMT, Jie Fu wrote: > Hi all, > > The following vector api tests fail with "bad AD file" on x86_32. > > jdk/incubator/vector/Long128VectorTests.java > jdk/incubator/vector/Long256VectorTests.java > jdk/incubator/vector/Long512VectorTests.java > jdk/incubator/vector/LongMaxVectorTests.java > jdk/incubator/vector/Long64VectorTests.java > > > The failure reason is that several unsigned long comparison instructs are missing for x86_32. > The fix just added the missing instructs rules. > > Testing: > - tier1 ~ tier3 on linux/x86_32, no regression. > > Thanks. > Best regards, > Jie Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6533 From thartmann at openjdk.java.net Thu Dec 2 13:02:25 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 13:02:25 GMT Subject: RFR: 8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:31:41 GMT, Aleksey Shipilev wrote: > SonarCloud complains about code added in [JDK-8277417](https://bugs.openjdk.java.net/browse/JDK-8277417): > Field "_info" shadows a field of the same name in base class "LIR_Op" > > > class LIR_OpLoadKlass: public LIR_Op { > friend class LIR_OpVisitState; > > private: > LIR_Opr _obj; > CodeEmitInfo* _info; <--- here > > > From the look of it, it seems risky to have two inconsistent fields here. Depending on which base class we use to access it, we might have different `_info`-s referenced. @rkennke, that was not intentional, was it? I don't see the mentions of this oddity in the original PR. > > The fix is to push `CodeEmitInfo` to super-class `LIR_Op`, and use it from there. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` Looks good to me. @rkennke should have a look as well. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6669 From ddong at openjdk.java.net Thu Dec 2 13:05:22 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Thu, 2 Dec 2021 13:05:22 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:41:15 GMT, Tobias Hartmann wrote: > But isn't the intention of this code to always go to the slow path when `DTraceAllocProbes` is true (hence the `expand_fast_path = false;`) and apply dtrace probes in the runtime? 
C->env()->dtrace_alloc_probes() returns true if ExtendedDTraceProbes is enabled (see ciEnv::cache_dtrace_flags()), which means PhaseMacroExpand::expand_dtrace_alloc_probe currently never takes effect. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From ddong at openjdk.java.net Thu Dec 2 13:08:23 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Thu, 2 Dec 2021 13:08:23 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui src/hotspot/share/opto/macro.cpp line 1245: > 1243: } > 1244: > 1245: if (C->env()->dtrace_alloc_probes() || This check was introduced in JDK-6788527. I'm not very clear about the root cause. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From ddong at openjdk.java.net Thu Dec 2 13:25:28 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Thu, 2 Dec 2021 13:25:28 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui src/hotspot/share/opto/macro.cpp line 1633: > 1631: void PhaseMacroExpand::expand_dtrace_alloc_probe(AllocateNode* alloc, Node* oop, > 1632: Node*& ctrl, Node*& rawmem) { > 1633: if (C->env()->dtrace_alloc_probes()) { IIUC, this probe is related to object allocation, so it should be expanded when ciEnv::dtrace_alloc_probes() returns true, just like the implementation in C1 (see C1_MacroAssembler::initialize_object()). 
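Editorial illustration (not HotSpot code): a small model of the flag interaction discussed in this thread, as one possible reading of it. The struct and function names are hypothetical. It assumes that the cached allocation-probe flag is implied by the extended flag, that allocation probes force the slow path, and that the pre-fix inline probe was guarded by the extended flag and only considered on the fast path. Under those assumptions no flag combination ever emits the inline probe.

```cpp
#include <cassert>

// Hypothetical stand-in for the two command-line flags discussed in the thread.
struct DTraceFlags {
  bool extended_dtrace_probes;  // models ExtendedDTraceProbes
  bool dtrace_alloc_probes;     // models DTraceAllocProbes
};

// Models the cached value described for ciEnv::cache_dtrace_flags():
// the extended flag implies the allocation-probe flag.
bool cached_alloc_probes(const DTraceFlags& f) {
  return f.extended_dtrace_probes || f.dtrace_alloc_probes;
}

// Assumed shape of the pre-fix logic: allocation probes force the slow path,
// and the inline probe is only considered on the fast path, guarded by the
// extended flag.
bool inline_probe_emitted(const DTraceFlags& f) {
  bool expand_fast_path = !cached_alloc_probes(f);
  bool guard = f.extended_dtrace_probes;
  return expand_fast_path && guard;
}

// Returns true if any of the four flag combinations emits the inline probe.
bool any_combination_emits_probe() {
  for (int bits = 0; bits < 4; ++bits) {
    DTraceFlags f{(bits & 1) != 0, (bits & 2) != 0};
    if (inline_probe_emitted(f)) return true;
  }
  return false;
}
```

In this model `any_combination_emits_probe()` is false, because enabling the guard flag also disables the fast path that the guard protects.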
------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From rkennke at openjdk.java.net Thu Dec 2 13:29:25 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Thu, 2 Dec 2021 13:29:25 GMT Subject: RFR: 8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:31:41 GMT, Aleksey Shipilev wrote: > SonarCloud complains about code added in [JDK-8277417](https://bugs.openjdk.java.net/browse/JDK-8277417): > Field "_info" shadows a field of the same name in base class "LIR_Op" > > > class LIR_OpLoadKlass: public LIR_Op { > friend class LIR_OpVisitState; > > private: > LIR_Opr _obj; > CodeEmitInfo* _info; <--- here > > > From the look of it, it seems risky to have two inconsistent fields here. Depending on which base class we use to access it, we might have different `_info`-s referenced. @rkennke, that was not intentional, was it? I don't see the mentions of this oddity in the original PR. > > The fix is to push `CodeEmitInfo` to super-class `LIR_Op`, and use it from there. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` Whoops. That went wrong when I moved it from Lilliput work repo to upstream. The fix is good. Thank you! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6669 From thartmann at openjdk.java.net Thu Dec 2 13:39:29 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 13:39:29 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: <6byawc9F8UgcR72iIo72qBXG-2Q8DQwWO17_QPyyidc=.73e491e8-8d73-4856-a434-8995ca50a69d@github.com> On Thu, 2 Dec 2021 13:05:42 GMT, Denghui Dong wrote: >> Hi, >> >> Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? 
>> >> Thanks, >> Denghui > > src/hotspot/share/opto/macro.cpp line 1245: > >> 1243: } >> 1244: >> 1245: if (C->env()->dtrace_alloc_probes() || > > This check was introduced in JDK-6788527. > I'm not very clear about the root cause. @vnkozlov do you remember? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From thartmann at openjdk.java.net Thu Dec 2 13:43:31 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 13:43:31 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui Thanks for the explanation but then the only real use of `dtrace_extended_probes()` is gone and the `ExtendedDTraceProbes` flag has no effect, right? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From chagedorn at openjdk.java.net Thu Dec 2 14:14:49 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 2 Dec 2021 14:14:49 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint Message-ID: This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). The fix rewires a control input of a split load node through a phi to the same control input as the corresponding memory input into the phi node: https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/memnode.cpp#L1629-L1637 This can lead to an illegal control input for the new load if `mem->in(i)` is a projection. This situation occurs in the test case: `mem->in(i)` is a projection of an `ArrayCopy` node. Having an `ArrayCopy` node as control input is unexpected and we later fail with an assertion (in product we crash with a segfault). 
My initial thought was to just not apply the improved rewiring for memory projections and fall back to the else case on L1636. However, this also does not work as we could then wrongly move a range check out of a loop and create a cyclic dependency (case described in the [review of JDK-8272574](https://github.com/openjdk/jdk/pull/5142)) and reproduced with `test2()` where we hit: https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/loopPredicate.cpp#L777-L778 During this analysis I came across the [fix for JDK-8146792](http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/2748d975045f) which should actually prevent loop predication if we have a data dependency on the projection into a loop. But this does not seem to work correctly as shown in the test case found for JDK-8272574. In the review of JDK-8272574, it was missed that the actual problem could have been traced back to JDK-8146792 and instead we went for an improvement for loads split through phis to not end up with cases supposed to be fixed by JDK-8146792 (which now also causes other problems). But the root problem is still there to be fixed. I therefore propose to completely undo the fix for JDK-8272574 for now and go for a fix on top of JDK-8146792 to fix JDK-8272574 and also prevent this bug. The assertion code added by JDK-8272574 is still good to have. I suggest to revisit the improved rewiring for loads done with JDK-8272574 in an RFE. I think it should still be beneficial but requires some more careful checking (to avoid the problems reported in this bug). Explanation of JDK-8272574 with regard to JDK-8146792 (covered by `test2-5()`): When deciding if we can apply loop predication to a check inside the loop, we call `IdealLoopTree::is_range_check()`. 
This method calls `PhaseIdealLoop::is_scaled_iv_plus_offset()` which can create new nodes for the offset: https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopTransform.cpp#L2586-L2588 `get_ctrl()` of these new nodes could also be the incoming projection into a loop. But these nodes are not marked as non-loop-invariant as done for the other nodes in `Invariant::Invariant()`: https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopPredicate.cpp#L663-L667 The proposed fix keeps track of when the above code was applied (`data_dependency_on` is set) and accordingly performs an additional check on `get_ctrl()` if a new node was created for the offset in `PhaseIdealLoop::is_scaled_iv_plus_offset()`. This should be fine as we are not using the newly created `offset` node anymore after doing that check. In discussions with Roland, we think that we should revisit this fix and the fix of JDK-8146792 and clean them up to directly do the invariant checks in `compute_invariance()/visit()` without special handling in the constructor. But given the deadline of 17.0.2 and the fork soon coming up for 18, we think the proposed fix is the safest way to go - if others agree. 
Thanks, Christian ------------- Commit messages: - 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint Changes: https://git.openjdk.java.net/jdk/pull/6670/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6670&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277529 Stats: 172 lines in 3 files changed: 160 ins; 7 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6670.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6670/head:pull/6670 PR: https://git.openjdk.java.net/jdk/pull/6670 From ecaspole at openjdk.java.net Thu Dec 2 14:33:23 2021 From: ecaspole at openjdk.java.net (Eric Caspole) Date: Thu, 2 Dec 2021 14:33:23 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> References: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> Message-ID: On Thu, 2 Dec 2021 00:20:52 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. 
>> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 0.425938099 seconds >> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- > > Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: > > - MICRO to MILLI as requested. 
> - Fixing benchmark to throughput with default iterations. The JMH part looks good. Thanks, Eric ------------- Marked as reviewed by ecaspole (Committer). PR: https://git.openjdk.java.net/jdk/pull/6595 From rkennke at openjdk.java.net Thu Dec 2 14:41:53 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Thu, 2 Dec 2021 14:41:53 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: References: Message-ID: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> > The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into the runtime, it would still happily stack-lock. Even worse, C1 uses another flag, UseFastLocking, to achieve something similar (with the same caveat that the runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. > I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. > > The change removes the C1 flag UseFastLocking, and replaces its uses with the equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors a develop flag (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work.
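As a rough illustration of the kind of small locking program one could run under these flags, here is a minimal Java sketch. It is not part of the PR; the class name and thread counts are arbitrary. It exercises contended and recursive `synchronized` blocks and checks that no increments are lost.

```java
class HeavyMonitorSmoke {
    private static final Object LOCK = new Object();
    private static int counter;

    // Starts the given number of threads, each incrementing the shared counter
    // under a (recursively entered) monitor, and returns the final count.
    static int run(int threads, int itersPerThread) {
        counter = 0;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < itersPerThread; i++) {
                    synchronized (LOCK) {       // contended monitor enter
                        synchronized (LOCK) {   // recursive enter on the same monitor
                            counter++;
                        }
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join makes the workers' writes visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        int expected = 4 * 100_000;
        int actual = run(4, 100_000);
        System.out.println("expected " + expected + ", got " + actual);
    }
}
```

Such a program would be launched with `-XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors`; whether both flags are accepted depends on the build type, so treat the exact command line as an assumption.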
> > Testing: > - [x] tier1 > - [x] tier2 > - [x] tier3 > - [ ] tier4 Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: PPC port by @TheRealMDoerr ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6320/files - new: https://git.openjdk.java.net/jdk/pull/6320/files/79090ed1..706d1e85 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6320&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6320&range=09-10 Stats: 55 lines in 4 files changed: 11 ins; 0 del; 44 mod Patch: https://git.openjdk.java.net/jdk/pull/6320.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6320/head:pull/6320 PR: https://git.openjdk.java.net/jdk/pull/6320 From rkennke at openjdk.java.net Thu Dec 2 14:41:54 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Thu, 2 Dec 2021 14:41:54 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v10] In-Reply-To: References: <___2nS2ZNEiggHv_C70jADW_cE-T2wUJKqRK_J5gIj0=.23da20f8-ba19-4279-b154-f0a67626ccf4@github.com> <5oeP5b_jS2eu5lsDpNkfIBINEPPT23A-kqeHwOSXrYQ=.e32eb661-903d-4778-a56f-2a1d525d1874@github.com> Message-ID: On Thu, 2 Dec 2021 00:55:02 GMT, Martin Doerr wrote: >>> > I believe you still have to change something in sharedRuntime_ppc.cpp, similar to what I did in, e.g., sharedRuntime_aarch64.cpp. >>> >>> You mean in `generate_native_wrapper`? I already did. It uses the same assembler function as C2 on PPC64. Did I miss anything else? I think hacking `unlock` is optional. The additional checks don't really disturb. >> >> Ah I haven't seen it, sorry. >> It turns out, I cannot avoid emitting FastLockNode, some backends (x86 and aarch64) also generate fast-path code that deals with ObjectMonitor, and we want this even when running with +UseHeavyMonitors. >> >> Can you verify the new testcase, and perhaps some test programs that do some locking with -XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors ? 
You also need to include PPC in arguments.cpp and synchronizer.cpp changes to enable that stuff on PPC: > >> It turns out, I cannot avoid emitting FastLockNode, some backends (x86 and aarch64) also generate fast-path code that deals with ObjectMonitor, and we want this even when running with +UseHeavyMonitors. > > Ok, thanks for checking. You have convinced me that your version is fine. We should do it the same way on PPC64: > > diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp > index 98565003691..cb58e775422 100644 > --- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp > +++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp > @@ -2660,27 +2660,32 @@ void MacroAssembler::compiler_fast_lock_object(ConditionRegister flag, Register > andi_(temp, displaced_header, markWord::monitor_value); > bne(CCR0, object_has_monitor); > > - // Set displaced_header to be (markWord of object | UNLOCK_VALUE). > - ori(displaced_header, displaced_header, markWord::unlocked_value); > - > - // Load Compare Value application register. > - > - // Initialize the box. (Must happen before we update the object mark!) > - std(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box); > - > - // Must fence, otherwise, preceding store(s) may float below cmpxchg. > - // Compare object markWord with mark and if equal exchange scratch1 with object markWord. > - cmpxchgd(/*flag=*/flag, > - /*current_value=*/current_header, > - /*compare_value=*/displaced_header, > - /*exchange_value=*/box, > - /*where=*/oop, > - MacroAssembler::MemBarRel | MacroAssembler::MemBarAcq, > - MacroAssembler::cmpxchgx_hint_acquire_lock(), > - noreg, > - &cas_failed, > - /*check without membar and ldarx first*/true); > - assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0"); > + if (!UseHeavyMonitors) { > + // Set displaced_header to be (markWord of object | UNLOCK_VALUE). 
> + ori(displaced_header, displaced_header, markWord::unlocked_value); > + > + // Load Compare Value application register. > + > + // Initialize the box. (Must happen before we update the object mark!) > + std(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box); > + > + // Must fence, otherwise, preceding store(s) may float below cmpxchg. > + // Compare object markWord with mark and if equal exchange scratch1 with object markWord. > + cmpxchgd(/*flag=*/flag, > + /*current_value=*/current_header, > + /*compare_value=*/displaced_header, > + /*exchange_value=*/box, > + /*where=*/oop, > + MacroAssembler::MemBarRel | MacroAssembler::MemBarAcq, > + MacroAssembler::cmpxchgx_hint_acquire_lock(), > + noreg, > + &cas_failed, > + /*check without membar and ldarx first*/true); > + assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0"); > + } else { > + // Set NE to indicate 'failure' -> take slow-path. > + crandc(flag, Assembler::equal, flag, Assembler::equal); > + } > > // If the compare-and-exchange succeeded, then we found an unlocked > // object and we have now locked it. > @@ -2768,12 +2773,14 @@ void MacroAssembler::compiler_fast_unlock_object(ConditionRegister flag, Registe > } > #endif > > - // Find the lock address and load the displaced header from the stack. > - ld(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box); > + if (!UseHeavyMonitors) { > + // Find the lock address and load the displaced header from the stack. > + ld(displaced_header, BasicLock::displaced_header_offset_in_bytes(), box); > > - // If the displaced header is 0, we have a recursive unlock. > - cmpdi(flag, displaced_header, 0); > - beq(flag, cont); > + // If the displaced header is 0, we have a recursive unlock. > + cmpdi(flag, displaced_header, 0); > + beq(flag, cont); > + } > > // Handle existing monitor. > // The object has an existing monitor iff (mark & monitor_value) != 0. 
> @@ -2782,20 +2789,24 @@ void MacroAssembler::compiler_fast_unlock_object(ConditionRegister flag, Registe > andi_(R0, current_header, markWord::monitor_value); > bne(CCR0, object_has_monitor); > > - // Check if it is still a light weight lock, this is is true if we see > - // the stack address of the basicLock in the markWord of the object. > - // Cmpxchg sets flag to cmpd(current_header, box). > - cmpxchgd(/*flag=*/flag, > - /*current_value=*/current_header, > - /*compare_value=*/box, > - /*exchange_value=*/displaced_header, > - /*where=*/oop, > - MacroAssembler::MemBarRel, > - MacroAssembler::cmpxchgx_hint_release_lock(), > - noreg, > - &cont); > - > - assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0"); > + if (!UseHeavyMonitors) { > + // Check if it is still a light weight lock, this is is true if we see > + // the stack address of the basicLock in the markWord of the object. > + // Cmpxchg sets flag to cmpd(current_header, box). > + cmpxchgd(/*flag=*/flag, > + /*current_value=*/current_header, > + /*compare_value=*/box, > + /*exchange_value=*/displaced_header, > + /*where=*/oop, > + MacroAssembler::MemBarRel, > + MacroAssembler::cmpxchgx_hint_release_lock(), > + noreg, > + &cont); > + assert(oopDesc::mark_offset_in_bytes() == 0, "offset of _mark is not 0"); > + } else { > + // Set NE to indicate 'failure' -> take slow-path. > + crandc(flag, Assembler::equal, flag, Assembler::equal); > + } > > // Handle existing monitor. 
> b(cont); > diff --git a/src/hotspot/share/runtime/arguments.cpp b/src/hotspot/share/runtime/arguments.cpp > index 3396adc0799..969c8e82b91 100644 > --- a/src/hotspot/share/runtime/arguments.cpp > +++ b/src/hotspot/share/runtime/arguments.cpp > @@ -2021,12 +2021,12 @@ bool Arguments::check_vm_args_consistency() { > } > #endif > > -#if !defined(X86) && !defined(AARCH64) > +#if !defined(X86) && !defined(AARCH64) && !defined(PPC64) > if (UseHeavyMonitors) { > warning("UseHeavyMonitors is not fully implemented on this architecture"); > } > #endif > -#ifdef X86 > +#if defined(X86) || defined(PPC64) > if (UseHeavyMonitors && UseRTMForStackLocks) { > fatal("-XX:+UseHeavyMonitors and -XX:+UseRTMForStackLocks are mutually exclusive"); > } > diff --git a/src/hotspot/share/runtime/synchronizer.cpp b/src/hotspot/share/runtime/synchronizer.cpp > index 4c5ea4a6e40..4f9c7c21a9b 100644 > --- a/src/hotspot/share/runtime/synchronizer.cpp > +++ b/src/hotspot/share/runtime/synchronizer.cpp > @@ -418,7 +418,7 @@ void ObjectSynchronizer::handle_sync_on_value_based_class(Handle obj, JavaThread > } > > static bool useHeavyMonitors() { > -#if defined(X86) || defined(AARCH64) > +#if defined(X86) || defined(AARCH64) || defined(PPC64) > return UseHeavyMonitors; > #else > return false; > diff --git a/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java b/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java > index cd32e222f68..922b18836dd 100644 > --- a/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java > +++ b/test/jdk/java/util/concurrent/ConcurrentHashMap/MapLoops.java > @@ -48,7 +48,7 @@ > /* > * @test > * @summary Exercise multithreaded maps, using only heavy monitors. 
> - * @requires os.arch=="x86" | os.arch=="i386" | os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" > + * @requires os.arch=="x86" | os.arch=="i386" | os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" | os.arch == "ppc64" | os.arch == "ppc64le" > * @library /test/lib > * @run main/othervm/timeout=1600 -XX:+IgnoreUnrecognizedVMOptions -XX:+UseHeavyMonitors -XX:+VerifyHeavyMonitors MapLoops > */ > > Note that this version no longer requires changes in sharedRuntime_ppc because the native wrapper generator uses the same code as C2. The test case has passed. Thanks, @TheRealMDoerr ! I've added your PPC port to this PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From ddong at openjdk.java.net Thu Dec 2 14:48:25 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Thu, 2 Dec 2021 14:48:25 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 13:40:05 GMT, Tobias Hartmann wrote: > Thanks for the explanation but then the only real use of `dtrace_extended_probes()` is gone and the `ExtendedDTraceProbes` flag has no effect, right? `dtrace_extended_probes()` is still used in `ciEnv::register_method`; it seems to be used to confirm that ExtendedDTraceProbes/DTraceMethodProbes/DTraceAllocProbes (declared as product flags, which I think should not change at run time) have not changed during the compilation. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From thartmann at openjdk.java.net Thu Dec 2 15:06:29 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 2 Dec 2021 15:06:29 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect?
> > Thanks, > Denghui Yes but that check is useless if the compilation does not even use the flag, right? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From chagedorn at openjdk.java.net Thu Dec 2 15:13:35 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 2 Dec 2021 15:13:35 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 10:35:32 GMT, Roland Westrelin wrote: > This failure occurs because the iv phi of a long counted loop has the > wrong type after CCP. That happens because, during CCP, after the type > of the limit of the counted loop is updated, the type of the iv phi is > not recomputed. The fix is to apply to long counted loops the logic > that already exists for int counted loop. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6632 From roland at openjdk.java.net Thu Dec 2 15:13:35 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 2 Dec 2021 15:13:35 GMT Subject: RFR: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 13:30:24 GMT, Tobias Hartmann wrote: >> This failure occurs because the iv phi of a long counted loop has the >> wrong type after CCP. That happens because, during CCP, after the type >> of the limit of the counted loop is updated, the type of the iv phi is >> not recomputed. The fix is to apply to long counted loops the logic >> that already exists for int counted loop. > > Okay, makes sense. @TobiHartmann @chhagedorn thanks for the reviews. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6632 From roland at openjdk.java.net Thu Dec 2 15:13:37 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 2 Dec 2021 15:13:37 GMT Subject: Integrated: 8277906: Incorrect type for IV phi of long counted loops after CCP In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 10:35:32 GMT, Roland Westrelin wrote: > This failure occurs because the iv phi of a long counted loop has the > wrong type after CCP. That happens because, during CCP, after the type > of the limit of the counted loop is updated, the type of the iv phi is > not recomputed. The fix is to apply to long counted loops the logic > that already exists for int counted loop. This pull request has now been integrated. Changeset: 3889af3f Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/3889af3f7debc4f8d75f620bb54134d1d11a6c83 Stats: 66 lines in 2 files changed: 61 ins; 0 del; 5 mod 8277906: Incorrect type for IV phi of long counted loops after CCP Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/6632 From ddong at openjdk.java.net Thu Dec 2 15:19:26 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Thu, 2 Dec 2021 15:19:26 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 15:03:43 GMT, Tobias Hartmann wrote: > Yes but that check is useless if the compilation does not even use the flag, right? Hmmm... I may not get your points. According to JDK-6788527 and the related patch: > Jvmti and DTrace threads may change global flags used by Compiler in the middle of compilation. These leads to inconsistent answers from MethodLiveness which leads to the failure during parsing. It also invalidates dependencies constructed during compilation. 
I think that explains why ciEnv needs to cache the DTrace flags before compilation, and `_dtrace_method_probes`/`_dtrace_alloc_probes` are enabled if ExtendedDTraceProbes is true, so I think ExtendedDTraceProbes may still be useful, right? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From jiefu at openjdk.java.net Thu Dec 2 15:19:31 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 15:19:31 GMT Subject: RFR: 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:50:08 GMT, Tobias Hartmann wrote: > Looks good. Thanks @TobiHartmann . ------------- PR: https://git.openjdk.java.net/jdk/pull/6533 From jiefu at openjdk.java.net Thu Dec 2 15:19:32 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 2 Dec 2021 15:19:32 GMT Subject: Integrated: 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 24 Nov 2021 08:52:28 GMT, Jie Fu wrote: > Hi all, > > The following vector api tests fail with "bad AD file" on x86_32. > > jdk/incubator/vector/Long128VectorTests.java > jdk/incubator/vector/Long256VectorTests.java > jdk/incubator/vector/Long512VectorTests.java > jdk/incubator/vector/LongMaxVectorTests.java > jdk/incubator/vector/Long64VectorTests.java > > > The reason for the failure is that several unsigned long comparison `instruct` rules are missing for x86_32. > The fix just adds the missing `instruct` rules. > > Testing: > - tier1 ~ tier3 on linux/x86_32, no regression. > > Thanks. > Best regards, > Jie This pull request has now been integrated.
Changeset: 65960f71 Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/65960f712ed6d4c4478d74f0842ce78d500d4229 Stats: 30 lines in 1 file changed: 30 ins; 0 del; 0 mod 8277753: Long*VectorTests.java fail with "bad AD file" on x86_32 after JDK-8276162 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/6533 From roland at openjdk.java.net Thu Dec 2 15:41:24 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 2 Dec 2021 15:41:24 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v2] In-Reply-To: <4ggmeDmX6-82KhLQC0_ijUHQqeCHrD-fHrTwGGhoneM=.36d2be35-5156-4bbd-b55c-07a15487062e@github.com> References: <0GbtDPLsXoyXYNJv2ZJ6vwTSdbzJOXqWUNENCTCrZmA=.d0b24bd8-be69-435d-8db1-d291afcd7f62@github.com> <4ggmeDmX6-82KhLQC0_ijUHQqeCHrD-fHrTwGGhoneM=.36d2be35-5156-4bbd-b55c-07a15487062e@github.com> Message-ID: On Thu, 2 Dec 2021 05:17:29 GMT, Dean Long wrote: > 1. a runtime call to the rethrow stub. The comment says "must not deoptimize", but I'm not sure what prevents that That one is a leaf call and doesn't capture debug info so I think it's safe to assume it can't deoptimize. > 2. catch_call_exceptions(), which does an uncommon trap for unloaded exception classes There's a push_ex_oop() right because the uncommon trap and that one clears the stack so it can't be the reason to preserve the stack. > > The reason I didn't go with Vladimir's work around for 8273165 is that I think it could have a performance impact that would be more likely to be noticed than in the case of 8273165 (because late inlining of method handle has been around for many releases and is likely something that's relied on). We could extend Vladimir's work around by checking that the receiver may be null and that the null check would cause the exception to be thrown rather than deoptimization. > > But with possible performance impact, right? Right. 
> > Another way to deal with this could be to pop the stack if the current method has no exception handlers because then the exception is passed on to the caller and the entire frame is popped anyway. That would work nicely for this case as AFAIU, the method handle invoker can only be inlined from a lambda form that wouldn't have exception handlers. > > That sounds promising. So you agree that if there's no handler then there's no need to preserve the stack? ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From sviswanathan at openjdk.java.net Thu Dec 2 17:46:17 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Dec 2021 17:46:17 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs In-Reply-To: References: <7At_ag4fpFbiKQ81BsVoM9MzmXJJCIPufYSoi1WgIpg=.e2085ff5-34d8-4c7a-a561-9b36100f14a4@github.com> <869rFsjOuFR2Yt9bfmz45l4s6uRaFQc5IkPW3eJyZ8E=.e1a6695f-258d-4ec6-a28a-9fcba5e53ce9@github.com> Message-ID: On Wed, 1 Dec 2021 12:41:14 GMT, David Holmes wrote: >>> As I understand it - old AVX512 platforms will continue to work as before. >> >> According to @sviswa7 's comments (no cupid bit for the latest ISA), `is_intel_family_core() && supports_serialize()` can't distinguish all the old AVX512 platforms from the latest ones. >> So I think it may be possible some old AVX512 machines will behave differently after this opt. >> >> @sviswa7 , can you further explain what's the difference of the 64-byte instructions between Intel's old and latest AVX512 platforms? >> Why can't we enable them as default on old platforms? >> Thanks. > >> > As I understand it - old AVX512 platforms will continue to work as before. >> >> According to @sviswa7 's comments (no cupid bit for the latest ISA), `is_intel_family_core() && supports_serialize()` can't distinguish all the old AVX512 platforms from the latest ones. So I think it may be possible some old AVX512 machines will behave differently after this opt. 
> > I do not see such comments. From my previous questions on this it was indicated that any CPU that supports `serialize` has the improved performance. @dholmes-ora @DamonFool @jatin-bhateja @neliasso Thanks a lot for the review. If no further objections, I plan to integrate this PR tomorrow (Friday 12/3). ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From neliasso at openjdk.java.net Thu Dec 2 18:13:30 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 2 Dec 2021 18:13:30 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 00:10:39 GMT, Sandhya Viswanathan wrote: >> Currently 32-byte instructions are used for small array copy and clear. >> This can be optimized by using 64-byte instructions. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Looks good. ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6512 From kvn at openjdk.java.net Thu Dec 2 18:24:15 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 18:24:15 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> Message-ID: <0XuwXvto0gl1o6RkEJ3Ui_4Jah0LzBfuFX13TUz6-Ow=.0347d85f-392b-4b7c-a7c3-65ff5573339b@github.com> On Wed, 1 Dec 2021 09:13:36 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. 
These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 6m37.855s >> user 56m23.004s >> sys 0m20.148s >> >> # x86_32 (TR 3970X) >> real 11m22.877s >> user 168m8.137s >> sys 5m7.037s >> >> # x86_64 (i5-11500) >> real 15m55.424s >> user 118m0.969s >> sys 0m12.039s >> >> # AArch64 (ThunderX2) >> real 4m5.177s >> user 32m7.295s >> sys 0m19.689s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. 
>> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: > > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice Most testing passed fine. I am still waiting for results on linux-aarch64. But I got 1 timeout failure running the following test on Windows-x64-debug with -XX:+UseZGC: `compiler/arraycopy/stress/TestStressObjectArrayCopy.java` reason: User specified action: run main/othervm/timeout=960 -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI StressArrayCopyDriver TestStressObjectArrayCopy Timeout information: elapsed time (seconds): 3920.744 A run with one flag combination can take up to 8 min and you have 20 of them: [2021-12-02T16:33:27.024914600Z] Waiting for completion for process 9764 [2021-12-02T16:41:27.136342300Z] Waiting for completion finished for process 9764 The Windows VM image has 12 cores, 46 GB of memory and a recent AMD CPU, so I don't think the hardware is the issue. But it could be something with the OS at that time. On linux-x64-debug it took: `main: 1411.585 seconds` running with ZGC. That is still more than your specified timeout `timeout=960`. Please check it. You can increase the timeout or split the testing.
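For context, the "small arrays exhaustively" idea from the quoted PR description could look roughly like the sketch below. This is not the code from the PR. It assumes `int[]` only, uses made-up names, and compares `System.arraycopy` against a simple reference copy for every combination of size, source position, destination position and length, covering both the conjoint (same array) and disjoint (different arrays) cases.

```java
import java.util.Arrays;

class SmallArrayCopyCheck {
    // Reference copy with the overlap-safe semantics that System.arraycopy specifies:
    // the source range is snapshotted before anything is written.
    static void referenceCopy(int[] src, int srcPos, int[] dst, int dstPos, int len) {
        int[] tmp = Arrays.copyOfRange(src, srcPos, srcPos + len);
        for (int i = 0; i < len; i++) {
            dst[dstPos + i] = tmp[i];
        }
    }

    // Array with distinct, non-zero contents so misplaced elements are visible.
    static int[] fresh(int size) {
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = i + 1;
        }
        return a;
    }

    // Checks every valid (size, srcPos, dstPos, len) combination up to maxSize and
    // returns how many combinations were checked; throws on the first mismatch.
    static int checkAll(int maxSize) {
        int checked = 0;
        for (int size = 0; size <= maxSize; size++) {
            for (int srcPos = 0; srcPos <= size; srcPos++) {
                for (int dstPos = 0; dstPos <= size; dstPos++) {
                    int maxLen = size - Math.max(srcPos, dstPos);
                    for (int len = 0; len <= maxLen; len++) {
                        // Conjoint case: source and destination are the same array.
                        int[] actual = fresh(size);
                        int[] expected = fresh(size);
                        System.arraycopy(actual, srcPos, actual, dstPos, len);
                        referenceCopy(expected, srcPos, expected, dstPos, len);
                        if (!Arrays.equals(actual, expected)) {
                            throw new AssertionError("conjoint mismatch: size=" + size
                                + " srcPos=" + srcPos + " dstPos=" + dstPos + " len=" + len);
                        }
                        // Disjoint case: copy into a separate, zero-filled array.
                        int[] src = fresh(size);
                        int[] dstActual = new int[size];
                        int[] dstExpected = new int[size];
                        System.arraycopy(src, srcPos, dstActual, dstPos, len);
                        referenceCopy(src, srcPos, dstExpected, dstPos, len);
                        if (!Arrays.equals(dstActual, dstExpected)) {
                            throw new AssertionError("disjoint mismatch: size=" + size
                                + " srcPos=" + srcPos + " dstPos=" + dstPos + " len=" + len);
                        }
                        checked++;
                    }
                }
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("combinations checked: " + checkAll(32));
    }
}
```

The number of combinations grows roughly with the fourth power of the maximum size, which gives some intuition for why a run per flag combination can take minutes and why trimming the maximum size or splitting the test affects the timeout.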
------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Thu Dec 2 18:33:23 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 2 Dec 2021 18:33:23 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <0XuwXvto0gl1o6RkEJ3Ui_4Jah0LzBfuFX13TUz6-Ow=.0347d85f-392b-4b7c-a7c3-65ff5573339b@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> <0XuwXvto0gl1o6RkEJ3Ui_4Jah0LzBfuFX13TUz6-Ow=.0347d85f-392b-4b7c-a7c3-65ff5573339b@github.com> Message-ID: On Thu, 2 Dec 2021 18:20:43 GMT, Vladimir Kozlov wrote: > Please, check it. You can increase timeout or split testing. Yes, thank you, that's exactly why I wanted to do these tests ahead of the actual arraycopy changes. I'll take a look at what can be done. FWIW, my Windows VM running on TR 3970X passed these tests in reasonable time, maybe it is Windows+ZGC-specific problem here. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From kvn at openjdk.java.net Thu Dec 2 18:49:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 18:49:16 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> Message-ID: <0-D8zfmV85KlFeIiWrC5LIb9Wu7Pj_VABLI-UuPS0t4=.d22226ad-54a7-4d48-bba1-7eabd6ed2c1c@github.com> On Wed, 1 Dec 2021 09:13:36 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
>> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 6m37.855s >> user 56m23.004s >> sys 0m20.148s >> >> # x86_32 (TR 3970X) >> real 11m22.877s >> user 168m8.137s >> sys 5m7.037s >> >> # x86_64 (i5-11500) >> real 15m55.424s >> user 118m0.969s >> sys 0m12.039s >> >> # AArch64 (ThunderX2) >> real 4m5.177s >> user 32m7.295s >> sys 0m19.689s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: > > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice Yes, ZGC is definitely affecting it. 
With ParallelGC on linux-x64 the time was: `main: 551.014 seconds` ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From aoqi at openjdk.java.net Thu Dec 2 19:02:20 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Thu, 2 Dec 2021 19:02:20 GMT Subject: Integrated: 8278037: Clean up PPC32 related code in C1 In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 03:13:38 GMT, Ao Qi wrote: > `_tmp1` and `_tmp2` were [removed](https://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/faaed259df37#l13.173) in [JDK-8160245](https://bugs.openjdk.java.net/browse/JDK-8160245), but they are still used. > > This should cause a build error. I don't have a ppc32 machine for the test. It's also not found at https://builds.shipilev.net. Is ppc32 a supported compiler1 platform? > > Update: PPC32 C1 is not supported. Clean up PPC32 related code in C1. This pull request has now been integrated. Changeset: 8f196a24 Author: Ao Qi Committer: Martin Doerr URL: https://git.openjdk.java.net/jdk/commit/8f196a2487982a0ae827cdef17243b8c64ba3217 Stats: 67 lines in 3 files changed: 0 ins; 63 del; 4 mod 8278037: Clean up PPC32 related code in C1 Reviewed-by: jiefu, stuefe, shade, mdoerr ------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From aoqi at openjdk.java.net Thu Dec 2 19:02:19 2021 From: aoqi at openjdk.java.net (Ao Qi) Date: Thu, 2 Dec 2021 19:02:19 GMT Subject: RFR: 8278037: Clean up PPC32 related code in C1 [v3] In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 18:01:41 GMT, Martin Doerr wrote: >> Ao Qi has updated the pull request incrementally with one additional commit since the last revision: >> >> missing ")" > > I appreciate to see this go away. Thanks! @TheRealMDoerr, thanks! 
------------- PR: https://git.openjdk.java.net/jdk/pull/6625 From kvn at openjdk.java.net Thu Dec 2 19:40:20 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 19:40:20 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui Looks good to me. How was it tested? ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6639 From kvn at openjdk.java.net Thu Dec 2 19:40:20 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 19:40:20 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: <6byawc9F8UgcR72iIo72qBXG-2Q8DQwWO17_QPyyidc=.73e491e8-8d73-4856-a434-8995ca50a69d@github.com> References: <6byawc9F8UgcR72iIo72qBXG-2Q8DQwWO17_QPyyidc=.73e491e8-8d73-4856-a434-8995ca50a69d@github.com> Message-ID: On Thu, 2 Dec 2021 13:36:26 GMT, Tobias Hartmann wrote: >> src/hotspot/share/opto/macro.cpp line 1245: >> >>> 1243: } >>> 1244: >>> 1245: if (C->env()->dtrace_alloc_probes() || >> >> This check was introduced in JDK-6788527. >> I'm not very clear about the root cause. > > @vnkozlov do you remember?
Yes, it replaced the `DTraceAllocProbes` flag check with a cached value: https://hg.openjdk.java.net/jdk7u/jdk7u-dev/hotspot/rev/c96bf21b756f We cache the JVMTI and DTrace flag values before compilation and check them again after compilation, so the result can be thrown out if the flags were changed: https://hg.openjdk.java.net/jdk7u/jdk7u-dev/hotspot/rev/c96bf21b756f#l8.38 ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From kvn at openjdk.java.net Thu Dec 2 19:40:21 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 19:40:21 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 13:22:45 GMT, Denghui Dong wrote: >> Hi, >> >> Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? >> >> Thanks, >> Denghui > > src/hotspot/share/opto/macro.cpp line 1633: > >> 1631: void PhaseMacroExpand::expand_dtrace_alloc_probe(AllocateNode* alloc, Node* oop, >> 1632: Node*& ctrl, Node*& rawmem) { >> 1633: if (C->env()->dtrace_alloc_probes()) { > > IIUC, this probe is related to object allocation, so it should be expanded when ciEnv::dtrace_alloc_probes() returns true, just like the implementation in C1 (see C1_MacroAssembler::initialize_object()). Originally it checked the `ExtendedDTraceProbes` flag here. But I think this change was missed when the `DTraceAllocProbes` flag was introduced by https://bugs.openjdk.java.net/browse/JDK-6346964 So I agree with this change.
------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From shade at openjdk.java.net Thu Dec 2 19:48:22 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 2 Dec 2021 19:48:22 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <0-D8zfmV85KlFeIiWrC5LIb9Wu7Pj_VABLI-UuPS0t4=.d22226ad-54a7-4d48-bba1-7eabd6ed2c1c@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> <0-D8zfmV85KlFeIiWrC5LIb9Wu7Pj_VABLI-UuPS0t4=.d22226ad-54a7-4d48-bba1-7eabd6ed2c1c@github.com> Message-ID: On Thu, 2 Dec 2021 18:46:02 GMT, Vladimir Kozlov wrote: > Yes, ZGC is definitely affecting it. With ParallelGC on linux-x64 the time was: `main: 551.014 seconds` I suspect arraycopy GC barriers, because Shenandoah is also quite a bit slow. In exhaustive tests for small arrays, runtime calls probably dominate. I'll figure it out. # UseParallelGC real 6m32.100s user 54m19.617s sys 0m36.336s # UseG1GC real 6m31.220s user 55m50.315s sys 0m19.156s # UseSerialGC real 5m53.627s user 52m59.540s sys 0m29.012s # UseShenandoahGC real 11m1.101s user 65m26.868s sys 0m34.472s # UseZGC real 15m15.289s user 73m15.533s sys 0m31.396s ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From duke at openjdk.java.net Thu Dec 2 19:49:16 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 2 Dec 2021 19:49:16 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v3] In-Reply-To: References: <_uLnNO847-foX9jDiBc2nSNfiwhgl5pV-0WV5NXViZg=.4c73d6cb-966a-42f2-aade-87edfc69a316@github.com> Message-ID: On Wed, 1 Dec 2021 20:27:19 GMT, Vladimir Kozlov wrote: >> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing review comments > > Nice work. Let me test it before approval. @vnkozlov Please let me know if the test passed. 
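For readers following the CRC32-C thread: the two code paths timed by the quoted benchmark are `CRC32C.update(byte[])` and `CRC32C.update(ByteBuffer)`. The sketch below is not the jtreg test itself; the class and helper names (`Crc32cSketch`, `crcOfArray`, `crcOfBuffer`, `throughputMBps`) are made up for illustration. It exercises both entry points of `java.util.zip.CRC32C` and reproduces the throughput arithmetic under the assumption, which matches the quoted numbers, that 1 MB is 10^6 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

public class Crc32cSketch {
    // CRC32-C over a whole byte[] (the update(byte[], int, int) path).
    static long crcOfArray(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // CRC32-C over the same bytes wrapped in a heap ByteBuffer.
    static long crcOfBuffer(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(ByteBuffer.wrap(data));
        return crc.getValue();
    }

    // Throughput in MB/s, assuming 1 MB = 10^6 bytes (hypothetical helper).
    static double throughputMBps(long msgSize, long iters, double seconds) {
        return (double) msgSize * iters / seconds / 1e6;
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xE3069283 is the standard CRC-32C check value for "123456789".
        System.out.printf("array  crc = %08x%n", crcOfArray(check));
        System.out.printf("buffer crc = %08x%n", crcOfBuffer(check));

        // Figures taken from the benchmark output quoted in this thread.
        double before = throughputMBps(512, 20_000_000L, 1.710387358);
        double after = throughputMBps(512, 20_000_000L, 0.425938099);
        System.out.printf("before = %.1f MB/s, after = %.1f MB/s, speedup = %.2fx%n",
                before, after, after / before);
    }
}
```

With the quoted runtimes, the ratio comes out at just over 4, which is where the "~4x" in the PR description comes from. Whether the intrinsic is actually used on a given machine depends on the CPU features HotSpot detects.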
------------- PR: https://git.openjdk.java.net/jdk/pull/6595 From kvn at openjdk.java.net Thu Dec 2 19:57:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 19:57:16 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:46:06 GMT, Christian Hagedorn wrote: > This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). The fix rewires a control input of a split load node through a phi to the same control input as the corresponding memory input into the phi node: > https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/memnode.cpp#L1629-L1637 > > This can lead to an illegal control input for the new load if `mem->in(i)` is a projection. This situation occurs in the test case: `mem->in(i)` is a projection of an `ArrayCopy` node. Having an `ArrayCopy` node as control input is unexpected and we later fail with an assertion (in product we crash with a segfault). > > My initial thought was to just not apply the improved rewiring for memory projections and fall back to the else case on L1636. However, this also does not work as we could then wrongly move a range check out of a loop and creating a cyclic dependency (case described in the [review of JDK-8272574](https://github.com/openjdk/jdk/pull/5142)) and reproduced with `test2()` where we hit: > https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/loopPredicate.cpp#L777-L778 > > During this analysis I came across the [fix for JDK-8146792](http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/2748d975045f) which should actually prevent loop predication if we have a data dependency on the projection into a loop. But this does not seem to work correctly as shown in the test case found for JDK-8272574. 
In the review of JDK-8272574, it was missed that the actual problem could have been traced back to JDK-8146792 and instead we went for an improvement for loads split through phis to not end up with cases supposed to be fixed by JDK-8146792 (which now also causes other problems). But the root problem is still there to be fixed. > > I therefore propose to completely undo the fix for JDK-8272574 for now and go for a fix on top of JDK-8146792 to fix JDK-8272574 and also prevent this bug. The assertion code added by JDK-8272574 is still good to have. I suggest revisiting the improved rewiring for loads done with JDK-8272574 in an RFE. I think it should still be beneficial but requires some more careful checking (to avoid the problems reported in this bug). > > > Explanation of JDK-8272574 with regard to JDK-8146792 (covered by `test2-5()`): > > When deciding if we can apply loop predication to a check inside the loop, we call `IdealLoopTree::is_range_check_if()`. This method calls `PhaseIdealLoop::is_scaled_iv_plus_offset()` which can create new nodes for the offset: > https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopTransform.cpp#L2586-L2588 > `get_ctrl()` of these new nodes could also be the incoming projection into a loop. But these nodes are not marked as non-loop-invariant as done for the other nodes in `Invariant::Invariant()`: > https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopPredicate.cpp#L663-L667 > > The proposed fix keeps track of when the above code was applied (`data_dependency_on` is set) and does an additional check of `get_ctrl()` accordingly if a new node was created for the offset in `PhaseIdealLoop::is_scaled_iv_plus_offset()`. This should be fine as we are not using the newly created `offset` node anymore after doing that check.
> > In discussions with Roland, we think that we should revisit this fix and the fix of JDK-8146792 and clean them up to directly do the invariant checks in `compute_invariance()/visit()` without special handling in the constructor. But given the deadline of 17.0.2 and the fork soon coming up for 18, we think the safest way is to go with the proposed fix - if others agree. > > Thanks, > Christian src/hotspot/share/opto/loopPredicate.cpp line 776: > 774: const uint old_unique_idx = C->unique(); > 775: if (is_range_check_if(iff, phase, T_INT, iv, range, offset, scale)) { > 776: if (!invar.is_invariant(range)) { First, the fix looks reasonable to me. My only complaint is about the original code flow. If we do `return false;` for each failed check, we should do the same for the `is_range_check_if()` check; by returning `false` only at the very end, you lose the logic explaining why it is `false`. ------------- PR: https://git.openjdk.java.net/jdk/pull/6670 From kvn at openjdk.java.net Thu Dec 2 20:02:17 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 20:02:17 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> References: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> Message-ID: On Thu, 2 Dec 2021 00:20:52 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 0.425938099 seconds >> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- > > Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: > > - MICRO to MILLI as requested. 
> - Fixing benchmark to throughput with default iterations. Testing finally passed (at least on x86; aarch64 is still running). It takes a long time to get results from several tiers. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6595 From sviswanathan at openjdk.java.net Thu Dec 2 20:09:21 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 2 Dec 2021 20:09:21 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: References: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> Message-ID: On Thu, 2 Dec 2021 19:58:52 GMT, Vladimir Kozlov wrote: >> Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: >> >> - MICRO to MILLI as requested. >> - Fixing benchmark to throughput with default iterations. > > Testing finally passed (at least on x86; aarch64 is still running). It takes a long time to get results from several tiers. Thanks a lot @vnkozlov. ------------- PR: https://git.openjdk.java.net/jdk/pull/6595 From duke at openjdk.java.net Thu Dec 2 20:09:21 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 2 Dec 2021 20:09:21 GMT Subject: RFR: 8277358: Accelerate CRC32-C [v4] In-Reply-To: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> References: <41Q1gB0N_YPjDB9nr4l74d2o0QKaYKsjC60s9xg4nk8=.45d4275e-fd64-44aa-8adf-13e480612ea1@github.com> Message-ID: On Thu, 2 Dec 2021 00:20:52 GMT, Scott Gibbons wrote: >> Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement.
>> >> 5986.947899319073 MB/s => 24041.05203089616 MB/s >> 5840.02689336947 MB/s => 24898.781468710356 MB/s >> >> ********** Original *********** >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 1.710387358 seconds >> CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds >> CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> >> >> >> >> *********** With my changes: ************* >> >> >> >> scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 >> offset = 0 >> msgSize = 512 bytes >> iters = 20000000 >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(byte[]) runtime = 0.425938099 seconds >> CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds >> CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s >> CRCs: crc = ae10ee5a, crcReference = ae10ee5a >> ------------------------------------------------------- > > Scott Gibbons has updated the pull request incrementally with two additional commits since the last revision: > > - MICRO to MILLI as requested. 
> - Fixing benchmark to throughput with default iterations. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/6595 From duke at openjdk.java.net Thu Dec 2 20:09:23 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 2 Dec 2021 20:09:23 GMT Subject: Integrated: 8277358: Accelerate CRC32-C In-Reply-To: References: Message-ID: On Mon, 29 Nov 2021 14:45:22 GMT, Scott Gibbons wrote: > Accelerates CRC32-C by utilizing vpclmulqdq similarly to CRC32. This change achieves ~4x throughput improvement. > > 5986.947899319073 MB/s => 24041.05203089616 MB/s > 5840.02689336947 MB/s => 24898.781468710356 MB/s > > ********** Original *********** > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 1.710387358 seconds > CRC32C.update(byte[]) throughput = 5986.947899319073 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 1.753416583 seconds > CRC32C.update(ByteBuffer) throughput = 5840.02689336947 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- > > > > > *********** With my changes: ************* > > > > scottgi at 96974-ICX32:~/crc/jdk (asgibbons-crc32c)$ java test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java 20000000 > offset = 0 > msgSize = 512 bytes > iters = 20000000 > ------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(byte[]) runtime = 0.425938099 seconds > CRC32C.update(byte[]) throughput = 24041.05203089616 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > 
------------------------------------------------------- > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > CRC32C.update(ByteBuffer) runtime = 0.411265106 seconds > CRC32C.update(ByteBuffer) throughput = 24898.781468710356 MB/s > CRCs: crc = ae10ee5a, crcReference = ae10ee5a > ------------------------------------------------------- This pull request has now been integrated. Changeset: e0f1fc78 Author: Scott Gibbons Committer: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk/commit/e0f1fc783cb492dd1eb18f2d56c57bdc160a410d Stats: 126 lines in 5 files changed: 98 ins; 4 del; 24 mod 8277358: Accelerate CRC32-C Co-authored-by: Greg Tucker Co-authored-by: Scott Gibbons Reviewed-by: kvn, sviswanathan, ecaspole ------------- PR: https://git.openjdk.java.net/jdk/pull/6595 From duke at openjdk.java.net Thu Dec 2 22:31:42 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 2 Dec 2021 22:31:42 GMT Subject: RFR: 8278171: [vectorapi] Mask incorrectly computed for zero extending cast In-Reply-To: <2MpshpgM8Q5on_qdNcAqweKT6Pe-431p0sKC4fJXRqk=.6129cf7b-79aa-4282-a1a0-2a080ded63b6@github.com> References: <2MpshpgM8Q5on_qdNcAqweKT6Pe-431p0sKC4fJXRqk=.6129cf7b-79aa-4282-a1a0-2a080ded63b6@github.com> Message-ID: On Thu, 2 Dec 2021 19:55:40 GMT, Paul Sandoz wrote: >> This patch implements vector unsigned upcast intrinsics for x86. I also fixed a bug in the current implementation where the zero extension masks are computed incorrectly and added relevant tests. >> >> Thank you very much. > > I am inclined to separate these out. Fix the bug, add the tests, and integrate for 18. Then enhance with the intrinsics for 19. > > If you agree to that I will create two bugs. @PaulSandoz Yes, I think that should be the case, thank you very much.
------------- PR: https://git.openjdk.java.net/jdk/pull/6634 From duke at openjdk.java.net Thu Dec 2 22:31:42 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 2 Dec 2021 22:31:42 GMT Subject: RFR: 8278171: [vectorapi] Mask incorrectly computed for zero extending cast In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 12:03:15 GMT, Mai Đặng Quân Anh wrote: > This patch implements vector unsigned upcast intrinsics for x86. I also fixed a bug in the current implementation where the zero extension masks are computed incorrectly and added relevant tests. > > Thank you very much. @PaulSandoz Could you take a look at this PR? Also, could you create an issue for this PR, please? Should this be split into two, with the first fixing the bug and adding tests and the second implementing the intrinsics? Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/6634 From psandoz at openjdk.java.net Thu Dec 2 22:31:41 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 2 Dec 2021 22:31:41 GMT Subject: RFR: 8278171: [vectorapi] Mask incorrectly computed for zero extending cast In-Reply-To: References: Message-ID: <2MpshpgM8Q5on_qdNcAqweKT6Pe-431p0sKC4fJXRqk=.6129cf7b-79aa-4282-a1a0-2a080ded63b6@github.com> On Wed, 1 Dec 2021 12:03:15 GMT, Mai Đặng Quân Anh wrote: > This patch implements vector unsigned upcast intrinsics for x86. I also fixed a bug in the current implementation where the zero extension masks are computed incorrectly and added relevant tests. > > Thank you very much. I am inclined to separate these out. Fix the bug, add the tests, and integrate for 18. Then enhance with the intrinsics for 19. If you agree to that I will create two bugs. src/hotspot/cpu/x86/x86.ad line 1819: > 1817: return false; > 1818: } > 1819: break; Collapse cases, since each has the same code?
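As background on the "zero extension masks" mentioned in the quoted description: a zero-extending (unsigned) upcast keeps only the low bits of the source lane, which in scalar code amounts to widening and then masking. The sketch below is a scalar illustration only. It is not the Vector API code or the HotSpot fix under review, the thread does not say what exactly was miscomputed, and the names (`ZeroExtendSketch`, `maskFor`, and the helpers) are hypothetical.

```java
public class ZeroExtendSketch {
    // Plain widening in Java sign-extends: the top bit is replicated.
    static int signExtendByte(byte b) {
        return b;
    }

    // Zero extension: widen, then keep only the low 8 bits.
    static int zeroExtendByte(byte b) {
        return b & 0xFF;
    }

    // Zero extension of a 16-bit value to 32 bits.
    static int zeroExtendShort(short s) {
        return s & 0xFFFF;
    }

    // Zero extension of a 32-bit value to 64 bits (note the long literal).
    static long zeroExtendInt(int i) {
        return i & 0xFFFFFFFFL;
    }

    // Mask with the low srcBits bits set, for a source lane of that width.
    static long maskFor(int srcBits) {
        return srcBits == 64 ? -1L : (1L << srcBits) - 1;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.println(signExtendByte(b));  // negative: sign bit replicated
        System.out.println(zeroExtendByte(b));  // same bits read as unsigned
        // The JDK's own helpers agree with the masked versions.
        System.out.println(zeroExtendByte(b) == Byte.toUnsignedInt(b));
        System.out.println(zeroExtendShort((short) -1) == Short.toUnsignedInt((short) -1));
        System.out.println(zeroExtendInt(-1) == Integer.toUnsignedLong(-1));
        System.out.println(Long.toHexString(maskFor(16)));
    }
}
```

The mask width has to follow the source lane size rather than the destination; picking the wrong width either leaves sign bits in place or drops data bits.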
------------- PR: https://git.openjdk.java.net/jdk/pull/6634 From duke at openjdk.java.net Thu Dec 2 22:31:39 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 2 Dec 2021 22:31:39 GMT Subject: RFR: 8278171: [vectorapi] Mask incorrectly computed for zero extending cast Message-ID: This patch implements vector unsigned upcast intrinsics for x86. I also fixed a bug in the current implementation where the zero extension masks are computed incorrectly and added relevant tests. Thank you very much. ------------- Commit messages: - revert intrinsics - Merge branch 'master' into vectorUnsignedCastIntrinsics - retain relevant changes Changes: https://git.openjdk.java.net/jdk/pull/6634/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6634&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278171 Stats: 322 lines in 2 files changed: 321 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6634.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6634/head:pull/6634 PR: https://git.openjdk.java.net/jdk/pull/6634 From psandoz at openjdk.java.net Thu Dec 2 22:31:43 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 2 Dec 2021 22:31:43 GMT Subject: RFR: 8278171: [vectorapi] Mask incorrectly computed for zero extending cast In-Reply-To: References: <2MpshpgM8Q5on_qdNcAqweKT6Pe-431p0sKC4fJXRqk=.6129cf7b-79aa-4282-a1a0-2a080ded63b6@github.com> Message-ID: On Thu, 2 Dec 2021 22:04:36 GMT, Mai Đặng Quân Anh wrote: >> I am inclined to separate these out. Fix the bug, add the tests, and integrate for 18. Then enhance with the intrinsics for 19. >> >> If you agree to that I will create two bugs. > > @PaulSandoz Yes, I think that should be the case, thank you very much.
@merykitty here you go: [vectorapi] Mask incorrectly computed for zero extending cast https://bugs.openjdk.java.net/browse/JDK-8278171 [vectorapi] Add x64 intrinsics for unsigned (zero extended) casts https://bugs.openjdk.java.net/browse/JDK-8278173 ------------- PR: https://git.openjdk.java.net/jdk/pull/6634 From kvn at openjdk.java.net Thu Dec 2 22:57:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 2 Dec 2021 22:57:14 GMT Subject: RFR: 8277893: Arraycopy stress tests [v2] In-Reply-To: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> References: <3UVCSS5k6cAvWIzxF_1egpBrr69f5Bu8AlhWCMmmINw=.9dd5b6d5-375b-481d-a4e9-cb2ef7a1629f@github.com> Message-ID: On Wed, 1 Dec 2021 09:13:36 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. 
>> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 6m37.855s >> user 56m23.004s >> sys 0m20.148s >> >> # x86_32 (TR 3970X) >> real 11m22.877s >> user 168m8.137s >> sys 5m7.037s >> >> # x86_64 (i5-11500) >> real 15m55.424s >> user 118m0.969s >> sys 0m12.039s >> >> # AArch64 (ThunderX2) >> real 4m5.177s >> user 32m7.295s >> sys 0m19.689s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: > > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice Just let you know that testing (tier1,2,3) on linux-aarch64 passed clean. All testing now finished. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From mdoerr at openjdk.java.net Thu Dec 2 23:25:39 2021 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 2 Dec 2021 23:25:39 GMT Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty Message-ID: I have written a checker which detects usage of the illegal phi function. In case of the reproducer provided in the JBS bug ("Reduced.java"), it finds the following and bails out: invalidating local 8 because of type mismatch (new_value is NULL) Bailing out because StoreIndexed (id 98) uses illegal phi (id 68) I haven't checked why that node uses the illegal phi. That still seems to be a bug. Maybe there's a better solution to the underlying problem, but I hope my checker is useful to analyze bugs and to make C1 more resilient. 
------------- Commit messages: - 8271202: C1: assert(false) failed: live_in set of first block must be empty Changes: https://git.openjdk.java.net/jdk/pull/6683/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6683&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8271202 Stats: 51 lines in 1 file changed: 46 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6683.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6683/head:pull/6683 PR: https://git.openjdk.java.net/jdk/pull/6683 From ddong at openjdk.java.net Fri Dec 3 02:40:18 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Fri, 3 Dec 2021 02:40:18 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 19:37:10 GMT, Vladimir Kozlov wrote: > How it is tested? I currently only do a simple manual test, adding a print statement in SharedRuntime::dtrace_object_alloc(Thread* thread, oopDesc* o), this method is currently only used by c2. And run with `java -XX:+DTraceAllocProbes -Xcomp -version` or `java -XX:+ExtendedDTraceProbes -Xcomp -version`, without this fix, the print statement is never hit. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From dholmes at openjdk.java.net Fri Dec 3 03:41:19 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Fri, 3 Dec 2021 03:41:19 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 00:10:39 GMT, Sandhya Viswanathan wrote: >> Currently 32-byte instructions are used for small array copy and clear. >> This can be optimized by using 64-byte instructions. >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace Marked as reviewed by dholmes (Reviewer). 
Please check the GHA test results before integrating - there is a failing compiler test. ------------- PR: https://git.openjdk.java.net/jdk/pull/6512 From dlong at openjdk.java.net Fri Dec 3 03:59:15 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 3 Dec 2021 03:59:15 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v2] In-Reply-To: References: <0GbtDPLsXoyXYNJv2ZJ6vwTSdbzJOXqWUNENCTCrZmA=.d0b24bd8-be69-435d-8db1-d291afcd7f62@github.com> <4ggmeDmX6-82KhLQC0_ijUHQqeCHrD-fHrTwGGhoneM=.36d2be35-5156-4bbd-b55c-07a15487062e@github.com> Message-ID: On Thu, 2 Dec 2021 15:38:06 GMT, Roland Westrelin wrote: > So you agree that if there's no handler then there's no need to preserve the stack? Yes, if we are blowing away the frame and there's no chance the state will be used by an uncommon trap. As far as I can tell from SCCS comments for JDK-4432078 and related duplicate bugs, the original problem was re-executing the trapping bytecode with an uncommon trap. And one way that could have happened was removed between jdk5 and jdk6 with this change: http://hg.openjdk.java.net/jdk6/jdk6/hotspot/rev/fdd57634910e If that was the only uncommon trap causing a problem, then there's a good chance that preserving the stack is no longer needed at all. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From thartmann at openjdk.java.net Fri Dec 3 07:42:22 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 3 Dec 2021 07:42:22 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:46:06 GMT, Christian Hagedorn wrote: > This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). 
The fix rewires a control input of a split load node through a phi to the same control input as the corresponding memory input into the phi node: > https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/memnode.cpp#L1629-L1637 > > This can lead to an illegal control input for the new load if `mem->in(i)` is a projection. This situation occurs in the test case: `mem->in(i)` is a projection of an `ArrayCopy` node. Having an `ArrayCopy` node as control input is unexpected and we later fail with an assertion (in product we crash with a segfault). > > My initial thought was to just not apply the improved rewiring for memory projections and fall back to the else case on L1636. However, this also does not work as we could then wrongly move a range check out of a loop and create a cyclic dependency (case described in the [review of JDK-8272574](https://github.com/openjdk/jdk/pull/5142)) and reproduced with `test2()` where we hit: > https://github.com/openjdk/jdk/blob/e002bfec8cb815b551c9b0f851a8a5b288e8360d/src/hotspot/share/opto/loopPredicate.cpp#L777-L778 > > During this analysis I came across the [fix for JDK-8146792](http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/2748d975045f) which should actually prevent loop predication if we have a data dependency on the projection into a loop. But this does not seem to work correctly as shown in the test case found for JDK-8272574. In the review of JDK-8272574, it was missed that the actual problem could have been traced back to JDK-8146792 and instead we went for an improvement for loads split through phis to not end up with cases supposed to be fixed by JDK-8146792 (which now also causes other problems). But the root problem is still there to be fixed. > > I therefore propose to completely undo the fix for JDK-8272574 for now and go for a fix on top of JDK-8146792 to fix JDK-8272574 and also prevent this bug. The assertion code added by JDK-8272574 is still good to have.
I suggest revisiting the improved rewiring for loads done with JDK-8272574 in an RFE. I think it should still be beneficial but requires some more careful checking (to avoid the problems reported in this bug). > > > Explanation of JDK-8272574 with regard to JDK-8146792 (covered by `test2-5()`): > > When deciding if we can apply loop predication to a check inside the loop, we call `IdealLoopTree::is_range_check()`. This method calls `PhaseIdealLoop::is_scaled_iv_plus_offset()` which can create new nodes for the offset: > https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopTransform.cpp#L2586-L2588 > `get_ctrl()` of these new nodes could also be the incoming projection into a loop. But these nodes are not marked as non-loop-invariant as done for the other nodes in `Invariant::Invariant()`: > https://github.com/openjdk/jdk/blob/b79554bb5cef14590d427543a40efbcc60c66548/src/hotspot/share/opto/loopPredicate.cpp#L663-L667 > > The proposed fix keeps track of when the above code was applied (`data_dependency_on` is set) and does an additional check for `get_ctrl()` accordingly if a new node was created for the offset in `PhaseIdealLoop::is_scaled_iv_plus_offset()`. This should be fine as we are not using the newly created `offset` node anymore after doing that check. > > In discussions with Roland, we think that we should revisit this fix and the fix of JDK-8146792 and clean them up to directly do the invariant checks in `compute_invariance()/visit()` without special handling in the constructor. But given the deadline of 17.0.2 and the fork soon coming up for 18, we think the safest way is to go with the proposed fix - if others agree. > > Thanks, > Christian Nice analysis and test. Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/6670 From thartmann at openjdk.java.net Fri Dec 3 08:03:16 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 3 Dec 2021 08:03:16 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> On Thu, 2 Dec 2021 15:16:24 GMT, Denghui Dong wrote: > I think that explains why ciEnv need to cache DTrace flags before compilation, and _dtrace_method_probes/_dtrace_alloc_probes is set enabled if ExtendedDTraceProbes is true, so I think ExtendedDTraceProbes may still be useful, right? Yes, it makes perfect sense that the flag values are cached but what I'm saying is that with your change, `-XX:+ExtendedDTraceProbes` degrades to an alias for `-XX:+DTraceMethodProbes -XX:+DTraceAllocProbes`. Maybe that's intended or am I missing something? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639
From chagedorn at openjdk.java.net Fri Dec 3 08:59:47 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Dec 2021 08:59:47 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2] In-Reply-To: References: Message-ID: > This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). > [...] Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: use direct bailout ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6670/files - new: https://git.openjdk.java.net/jdk/pull/6670/files/3102d7ad..6a83d231 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6670&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6670&range=00-01 Stats: 40 lines in 1 file changed: 12 ins; 12 del; 16 mod Patch: https://git.openjdk.java.net/jdk/pull/6670.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6670/head:pull/6670 PR: https://git.openjdk.java.net/jdk/pull/6670
From chagedorn at openjdk.java.net Fri Dec 3 08:59:47 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Dec 2021 08:59:47 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 12:46:06 GMT, Christian Hagedorn wrote: > This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). > [...] Thanks Tobias for your review!
------------- PR: https://git.openjdk.java.net/jdk/pull/6670 From chagedorn at openjdk.java.net Fri Dec 3 08:59:51 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Dec 2021 08:59:51 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2] In-Reply-To: References: Message-ID: On Thu, 2 Dec 2021 19:54:33 GMT, Vladimir Kozlov wrote: >> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: >> >> use direct bailout > > src/hotspot/share/opto/loopPredicate.cpp line 776: > >> 774: const uint old_unique_idx = C->unique(); >> 775: if (is_range_check_if(iff, phase, T_INT, iv, range, offset, scale)) { >> 776: if (!invar.is_invariant(range)) { > > First, the fix is reasonable to me. > > My only complaint is the original code flow. If we do `return false;` for each check, we should do the same for the `is_range_check_if()` check; by returning `false` only at the very end, you lose the logic of why it is `false`. Thanks for your review Vladimir! Good point, I changed that into a direct bailout.
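The "direct bailout" style discussed in this exchange can be sketched in Java as follows. This is only an illustration of the control-flow shape, not the HotSpot code: the class `DirectBailoutSketch` and the helpers `isRangeCheckShape` and `isInvariant` are hypothetical stand-ins for the real C++ checks.

```java
public class DirectBailoutSketch {
    // Hypothetical stand-in for a structural check such as is_range_check_if().
    static boolean isRangeCheckShape(int[] check) {
        return check != null && check.length == 2;
    }

    // Hypothetical stand-in for an invariance test on one operand.
    static boolean isInvariant(int value) {
        return value >= 0;
    }

    // Every failed precondition returns false right where it is detected,
    // so the reason for each rejection stays visible at the bailout site.
    static boolean canPredicate(int[] check) {
        if (!isRangeCheckShape(check)) {
            return false; // not a range check at all
        }
        if (!isInvariant(check[0])) {
            return false; // first operand is not invariant
        }
        if (!isInvariant(check[1])) {
            return false; // second operand is not invariant
        }
        return true;
    }
}
```

With this shape there is no trailing catch-all `return false` whose cause has to be reconstructed by reading the nesting above it.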
------------- PR: https://git.openjdk.java.net/jdk/pull/6670 From ddong at openjdk.java.net Fri Dec 3 09:12:20 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Fri, 3 Dec 2021 09:12:20 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> References: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> Message-ID: <7nQ8on8CcixFe3JKdaUOAq-EU6s3UGfNKqhwh9a9nFE=.af9b2828-bfe1-4bde-a560-69d01a5e68ee@github.com> On Fri, 3 Dec 2021 08:00:29 GMT, Tobias Hartmann wrote: > > I think that explains why ciEnv need to cache DTrace flags before compilation, and _dtrace_method_probes/_dtrace_alloc_probes is set enabled if ExtendedDTraceProbes is true, so I think ExtendedDTraceProbes may still be useful, right? > > Yes, it makes perfect sense that the flag values are cached but what I'm saying is that with your change, `-XX:+ExtendedDTraceProbes` degrades to an alias for `-XX:+DTraceMethodProbes -XX:+DTraceAllocProbes`. Maybe that's intended or am I missing something? Yes, after the change, `-XX:+ExtendedDTraceProbes` is just an alias for `-XX:+DTraceMethodProbes -XX:+DTraceAllocProbes`, I think `ExtendedDTraceProbes` was originally designed for this purpose. Or do you mean we should remove ExtendedDTraceProbes? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From chagedorn at openjdk.java.net Fri Dec 3 09:21:14 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Dec 2021 09:21:14 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: <3qWOok3Vh9nFlwI_sSTX57elq0SQzlpY32kPCqnScfk=.d29cac99-f574-4cb9-a9eb-6563884829a7@github.com> On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? 
> > Thanks, > Denghui Looks good to me. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6639 From thartmann at openjdk.java.net Fri Dec 3 09:21:15 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 3 Dec 2021 09:21:15 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: <7nQ8on8CcixFe3JKdaUOAq-EU6s3UGfNKqhwh9a9nFE=.af9b2828-bfe1-4bde-a560-69d01a5e68ee@github.com> References: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> <7nQ8on8CcixFe3JKdaUOAq-EU6s3UGfNKqhwh9a9nFE=.af9b2828-bfe1-4bde-a560-69d01a5e68ee@github.com> Message-ID: On Fri, 3 Dec 2021 09:09:23 GMT, Denghui Dong wrote: > I think ExtendedDTraceProbes was originally designed for this purpose Yes, if that's the case, I'm fine with the change. Otherwise, I would suggest to remove that flag. @vnkozlov, what do you think? ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From ddong at openjdk.java.net Fri Dec 3 09:44:14 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Fri, 3 Dec 2021 09:44:14 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> <7nQ8on8CcixFe3JKdaUOAq-EU6s3UGfNKqhwh9a9nFE=.af9b2828-bfe1-4bde-a560-69d01a5e68ee@github.com> Message-ID: <7SAeTr_b7zDE_JgPNtHE9hLtYHKU4xmDywuZECXtCbY=.070d48d6-76c3-4fb2-b4f2-8e9f45b64e59@github.com> On Fri, 3 Dec 2021 09:17:45 GMT, Tobias Hartmann wrote: > Yes, if that's the case in arguments.cpp } else if (match_option(option, "-XX:+ExtendedDTraceProbes")) { #if defined(DTRACE_ENABLED) if (FLAG_SET_CMDLINE(ExtendedDTraceProbes, true) != JVMFlag::SUCCESS) { return JNI_EINVAL; } if (FLAG_SET_CMDLINE(DTraceMethodProbes, true) != JVMFlag::SUCCESS) { return JNI_EINVAL; } if (FLAG_SET_CMDLINE(DTraceAllocProbes, true) != 
JVMFlag::SUCCESS) { return JNI_EINVAL; } if (FLAG_SET_CMDLINE(DTraceMonitorProbes, true) != JVMFlag::SUCCESS) { return JNI_EINVAL; } #else // defined(DTRACE_ENABLED) ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From thartmann at openjdk.java.net Fri Dec 3 10:06:21 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 3 Dec 2021 10:06:21 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui Marked as reviewed by thartmann (Reviewer). Yes, I've seen that but before your fix, setting `ExtendedDTraceProbes` had an "additional" (although unintended) effect. Now it does not have that anymore. Anyway, I'm fine with the fix. If we decide to remove `ExtendedDTraceProbes`, it can be done with a follow-up RFE. ------------- PR: https://git.openjdk.java.net/jdk/pull/6639 From roland at openjdk.java.net Fri Dec 3 10:22:33 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 3 Dec 2021 10:22:33 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops Message-ID: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> This is another fix that addresses a performance issue with Panama and that was brought up by Maurizio. The pattern to optimize is: if (((base + (offset << 2)) & 3) != 0) { } where base is loop independent but offset depends on a loop variable. This can be transformed to: if ((base & 3) != 0) { } That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on). This change also optimizes the pattern: (offset << 2) & 3 to return 0.
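The arithmetic behind this transformation can be checked with a small Java sketch. It only demonstrates the identities on `long` values and says nothing about how C2 implements the optimization; the class and method names (`MaskCheckSketch`, `misalignedInLoop`, `misalignedHoisted`) are made up for illustration.

```java
public class MaskCheckSketch {
    // The check as it appears inside the loop: it depends on offset.
    static boolean misalignedInLoop(long base, long offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    // The loop-independent form. offset << 2 always has its two low bits
    // clear, so adding it to base cannot change the two low bits of base.
    static boolean misalignedHoisted(long base) {
        return (base & 3) != 0;
    }

    public static void main(String[] args) {
        for (long base = -8; base <= 8; base++) {
            for (long offset = -1000; offset <= 1000; offset++) {
                if (misalignedInLoop(base, offset) != misalignedHoisted(base)) {
                    throw new AssertionError("mismatch for base=" + base + ", offset=" + offset);
                }
                if (((offset << 2) & 3) != 0) {
                    throw new AssertionError("(offset << 2) & 3 is not 0 for offset=" + offset);
                }
            }
        }
        System.out.println("mask identities hold");
    }
}
```

The identities also hold when the addition or the shift overflows, because two's complement arithmetic wraps modulo a power of two that is divisible by 4.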
------------- Depends on: https://git.openjdk.java.net/jdk/pull/6607 Commit messages: - whitespace - fix Changes: https://git.openjdk.java.net/jdk/pull/6697/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6697&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277850 Stats: 207 lines in 4 files changed: 206 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6697.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6697/head:pull/6697 PR: https://git.openjdk.java.net/jdk/pull/6697
From jbhateja at openjdk.java.net Fri Dec 3 12:36:18 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 3 Dec 2021 12:36:18 GMT Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3] In-Reply-To: References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com> Message-ID: On Wed, 1 Dec 2021 12:20:55 GMT, Nils Eliasson wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8277793: Review comments resolution > > Looks good! > Hi @neliasso , can you kindly do a final run of the patch though your regression suite. Hi @neliasso, @PaulSandoz , can you kindly run this through your test framework once before I Integrate. ------------- PR: https://git.openjdk.java.net/jdk/pull/6544
From roland at openjdk.java.net Fri Dec 3 12:54:15 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 3 Dec 2021 12:54:15 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2] In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 08:59:47 GMT, Christian Hagedorn wrote: >> This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). >> [...] > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > use direct bailout Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6670
From ddong at openjdk.java.net Fri Dec 3 13:34:21 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Fri, 3 Dec 2021 13:34:21 GMT Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui Thanks for the review :-) ------------- PR: https://git.openjdk.java.net/jdk/pull/6639
From ddong at openjdk.java.net Fri Dec 3 13:34:22 2021 From: ddong at openjdk.java.net (Denghui Dong) Date: Fri, 3 Dec 2021 13:34:22 GMT Subject: Integrated: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 15:41:07 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this small fix that makes expand_dtrace_alloc_probe take effect? > > Thanks, > Denghui This pull request has now been integrated. Changeset: f7237793 Author: Denghui Dong URL: https://git.openjdk.java.net/jdk/commit/f7237793ffa3a5a804fea49f165c8b9f1935cfac Stats: 4 lines in 1 file changed: 0 ins; 1 del; 3 mod 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp Reviewed-by: thartmann, kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/6639
From chagedorn at openjdk.java.net Fri Dec 3 13:39:13 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 3 Dec 2021 13:39:13 GMT Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2] In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 08:59:47 GMT, Christian Hagedorn wrote: >> This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574). >> [...] > > Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision: > > use direct bailout Thanks Roland for your review and the offline discussion about it.
I will file an RFE once it is integrated to clean it up in JDK 19.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6670

From chagedorn at openjdk.java.net Fri Dec 3 14:09:31 2021
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 3 Dec 2021 14:09:31 GMT
Subject: RFR: 8275610: C2: Object field load floats above its null check resulting in a segfault
Message-ID:

In the test case, a field load of an object floats above the object null check due to a `CastPP` that gets separated from its null check `If` node. C2 then schedules the field load before the object null check which results in a segfault.

The problem can be traced back to the elimination of identical back-to-back ifs [(JDK-8143542)](https://bugs.openjdk.java.net/browse/JDK-8143542). This is done in split-if where we detect identical back-to-back ifs in `PhaseIdealLoop::identical_backtoback_ifs()`. We then replace the boolean node of the dominated `If` by a phi to use the split-if optimization to nicely fold away the dominated `If` later in IGVN. This, however, does not update any data nodes that were dependent on the dominated `If` projections.

In the test case, we have the following graph right before splitting `313 If` (`NC 2`) through `190 Region`:

![Screenshot from 2021-12-03 14-08-31](https://user-images.githubusercontent.com/17833009/144608863-a1185bf8-dfa3-4bd5-a0c2-284ea2b27606.png)
`313 If` is dominated by the identical (= both share `308 Bool`) `309 If` (`NC 1`). The bool input for `313 If` is replaced by a phi and the split-if optimization is applied. However, the data nodes dependent on the out projections of the dominated `313 If` (`261 CastPP` in this case) are not processed separately and just end up at the newly inserted regions in split-if. In the test case, we get the following graph where `261 CastPP` ends up at the new `334 Region`:

![Screenshot from 2021-12-03 14-08-59](https://user-images.githubusercontent.com/17833009/144608891-48adf7a3-9b04-4e6d-bc26-310757b3d596.png)
Loopopts are applied and can remove the `325 CountedLoop` and we find that the `171 RangeCheck` (`RC 2`) is applied on a constant array index 1.

Now at IGVN, the order in which the nodes are processed is important in order to trigger the segfault:
1. `334 Region` with `332 If` and `333 If` are removed because of the special split-if setup we used to remove the identical `313 If`. The control input of `261 CastPP` is therefore updated to `172 IfTrue`.
2. Applying `RangeCheck::Ideal()` for `171 RangeCheck` finds that `305 RangeCheck` (`RC 1`) already covers it and we remove `171 RangeCheck`. In this process, we rewire `261 CastPP` to the dominating `305 RangeCheck` and we have the following graph:

![Screenshot from 2021-12-03 14-09-21](https://user-images.githubusercontent.com/17833009/144608917-f5159b21-40b2-41c8-afab-fa495433217a.png)
`261 CastPP` - and also the field `263 LoadI` - now have `306 IfFalse` as early control. GCM then schedules `263 LoadI` before the null check `309 If` and we get a segfault.

An easy fix is not straightforward. What we would actually want to do is rewire `261 CastPP` from `334 Region` to `311 IfTrue` in the second graph after split-if to not separate it from the null check. But that's not possible because we would create a bad graph: The early control `311 IfTrue` of `261 CastPP` does not dominate its late control further down after `334 Region` because of the not yet removed `334 Region`. We would need to already clean the regions up and then do the rewiring. But then the question arises why we would use the split-if optimization in the first place when we do not want to rely on IGVN to clean it up.

I therefore suggest going with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas on how to do that.

I ran some standard benchmarks and did not see any performance regressions with this fix.

Thanks,
Christian

-------------

Commit messages:
 - 8275610: C2: Object field load floats above its null check resulting in a segfault

Changes: https://git.openjdk.java.net/jdk/pull/6701/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6701&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8275610
  Stats: 112 lines in 2 files changed: 112 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6701.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6701/head:pull/6701

PR: https://git.openjdk.java.net/jdk/pull/6701

From shade at openjdk.java.net Fri Dec 3 15:46:19 2021
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Fri, 3 Dec 2021 15:46:19 GMT
Subject: RFR: 8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op
In-Reply-To:
References:
Message-ID:

On Thu, 2 Dec 2021 12:31:41 GMT, Aleksey Shipilev wrote:

> SonarCloud complains about code added in [JDK-8277417](https://bugs.openjdk.java.net/browse/JDK-8277417):
>   Field "_info" shadows a field of the same name in base class "LIR_Op"
>
>
>     class LIR_OpLoadKlass: public LIR_Op {
>       friend class LIR_OpVisitState;
>
>      private:
>       LIR_Opr _obj;
>       CodeEmitInfo* _info; <--- here
>
>
> From the look of it, it seems risky to have two inconsistent fields here. Depending on which base class we use to access it, we might have different `_info`-s referenced. @rkennke, that was not intentional, was it? I don't see any mention of this oddity in the original PR.
>
> The fix is to push `CodeEmitInfo` to super-class `LIR_Op`, and use it from there.
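
The shadowing hazard described in the quoted text can be reproduced in isolation. The following is a minimal sketch, not HotSpot code: `Info`, `Op`, `ShadowingLoadKlass` and `FixedLoadKlass` are hypothetical stand-ins for `CodeEmitInfo`, `LIR_Op` and `LIR_OpLoadKlass`, and the constructors are assumptions made only to keep the example self-contained.

```cpp
#include <cassert>

// Hypothetical stand-in for CodeEmitInfo (not the HotSpot type).
struct Info { int id; };

// Hypothetical stand-in for LIR_Op: the base class owns an _info field.
class Op {
 protected:
  Info* _info;
 public:
  explicit Op(Info* info) : _info(info) {}
  Info* info() const { return _info; }  // reads Op::_info
};

// Problematic shape: the subclass re-declares _info, hiding Op::_info.
// The two fields are independent and can disagree.
class ShadowingLoadKlass : public Op {
 private:
  Info* _info;  // shadows Op::_info
 public:
  explicit ShadowingLoadKlass(Info* info) : Op(nullptr), _info(info) {}
  Info* derived_info() const { return _info; }  // reads the subclass field
};

// Shape after the fix: the value is pushed up to the base class and the
// subclass uses the inherited field.
class FixedLoadKlass : public Op {
 public:
  explicit FixedLoadKlass(Info* info) : Op(info) {}
  Info* derived_info() const { return _info; }  // reads Op::_info
};

// Small self-check showing the difference between the two shapes.
inline void shadowing_demo() {
  Info i{1};
  ShadowingLoadKlass bad(&i);
  assert(bad.info() != bad.derived_info());    // two different fields
  FixedLoadKlass good(&i);
  assert(good.info() == good.derived_info());  // one field, one value
}
```

With the shadowing variant, code that goes through the base-class accessor sees a different pointer than code inside the subclass, which is the inconsistency the SonarCloud warning points at.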
>
> Additional testing:
>  - [x] Linux x86_64 fastdebug `tier1`
>  - [x] Linux x86_64 fastdebug `tier2`

Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/6669

From shade at openjdk.java.net Fri Dec 3 15:46:20 2021
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Fri, 3 Dec 2021 15:46:20 GMT
Subject: Integrated: 8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op
In-Reply-To:
References:
Message-ID:

On Thu, 2 Dec 2021 12:31:41 GMT, Aleksey Shipilev wrote:

> SonarCloud complains about code added in [JDK-8277417](https://bugs.openjdk.java.net/browse/JDK-8277417):
>   Field "_info" shadows a field of the same name in base class "LIR_Op"

This pull request has now been integrated.
Changeset: 0e7b6bcd
Author:    Aleksey Shipilev
URL:       https://git.openjdk.java.net/jdk/commit/0e7b6bcd8260293c3d39417f04b9b1e4409aa20a
Stats:     4 lines in 1 file changed: 0 ins; 2 del; 2 mod

8278141: LIR_OpLoadKlass::_info shadows the field of the same name from LIR_Op

Reviewed-by: thartmann, rkennke

-------------

PR: https://git.openjdk.java.net/jdk/pull/6669

From kvn at openjdk.java.net Fri Dec 3 17:01:24 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Dec 2021 17:01:24 GMT
Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 08:59:47 GMT, Christian Hagedorn wrote:

>> This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574).
>
> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision:
>
>   use direct bailout

Update looks good.

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/6670

From chagedorn at openjdk.java.net Fri Dec 3 17:17:26 2021
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 3 Dec 2021 17:17:26 GMT
Subject: RFR: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 08:59:47 GMT, Christian Hagedorn wrote:

>> This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574).
>
> Christian Hagedorn has updated the pull request incrementally with one additional commit since the last revision:
>
>   use direct bailout

Thanks Vladimir for your review!

-------------

PR: https://git.openjdk.java.net/jdk/pull/6670

From chagedorn at openjdk.java.net Fri Dec 3 17:17:28 2021
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 3 Dec 2021 17:17:28 GMT
Subject: Integrated: 8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint
In-Reply-To:
References:
Message-ID: <4CHYvuanW2z8kMK25J6jyf6MuvSeBuapgrx3_ECV-cE=.309f0667-047b-4e10-be00-87aea2e84d00@github.com>

On Thu, 2 Dec 2021 12:46:06 GMT, Christian Hagedorn wrote:

> This bug was found in internal testing for 17.0.2 after the backport of [JDK-8272574](https://bugs.openjdk.java.net/browse/JDK-8272574).

This pull request has now been integrated.
Changeset: 01cb2b98
Author:    Christian Hagedorn
URL:       https://git.openjdk.java.net/jdk/commit/01cb2b9883d7c9ecdba0ee5bd42124faed4d080c
Stats:     192 lines in 3 files changed: 163 ins; 10 del; 19 mod

8277529: SIGSEGV in C2 CompilerThread Node::rematerialize() compiling Packet::readUnsignedTrint

Reviewed-by: thartmann, roland, kvn

-------------

PR: https://git.openjdk.java.net/jdk/pull/6670

From psandoz at openjdk.java.net Fri Dec 3 17:25:22 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Fri, 3 Dec 2021 17:25:22 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v3]
In-Reply-To:
References: <2dggI8qABtygqAVieUgvvwXV9pcTWrt-fXXW3CaGHXQ=.e73cad93-fc58-44af-aa05-1ff0fdf72fb6@github.com>
Message-ID:

On Wed, 1 Dec 2021 12:20:55 GMT, Nils Eliasson wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8277793: Review comments resolution
>
> Looks good!
>
> Hi @neliasso , can you kindly do a final run of the patch through your regression suite.
>
> Hi @neliasso, @PaulSandoz , can you kindly run this through your test framework once before I integrate.

Testing.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6544

From kvn at openjdk.java.net Fri Dec 3 18:10:17 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Dec 2021 18:10:17 GMT
Subject: RFR: 8278079: C2: expand_dtrace_alloc_probe doesn't take effect in macro.cpp
In-Reply-To:
References: <20j9XG2FKxSZlg4VrqF2oMOx_Ln-luxUCAooAMPjw4w=.791599dc-dbcf-4c3e-9048-1c0812d7b486@github.com> <7nQ8on8CcixFe3JKdaUOAq-EU6s3UGfNKqhwh9a9nFE=.af9b2828-bfe1-4bde-a560-69d01a5e68ee@github.com>
Message-ID:

On Fri, 3 Dec 2021 09:17:45 GMT, Tobias Hartmann wrote:

> > I think ExtendedDTraceProbes was originally designed for this purpose
>
> Yes, if that's the case, I'm fine with the change. Otherwise, I would suggest removing that flag. @vnkozlov, what do you think?

Yes, I agree with removal of the `ExtendedDTraceProbes` flag. It was the first flag, used in all places in HotSpot during development of DTrace support. Based on the changes I see for [6346964](https://bugs.openjdk.java.net/browse/JDK-6346964) it was replaced with 3 new flags:

    ^Ad D 1.300.1.1 06/03/13 11:39:20 km88527 802 801
    ^Ac Partial 6346964: Support dynamic enabling of extended DTrace probes

        } else if (match_option(option, "-XX:+ExtendedDTraceProbes", &tail)) {
    #ifdef SOLARIS
          FLAG_SET_CMDLINE(bool, ExtendedDTraceProbes, true);
    ^AI 802
          FLAG_SET_CMDLINE(bool, DTraceMethodProbes, true);
          FLAG_SET_CMDLINE(bool, DTraceAllocProbes, true);
          FLAG_SET_CMDLINE(bool, DTraceMonitorProbes, true);
    ^AE 802

Actually the code guarded by `ExtendedDTraceProbes` in `macro.cpp` was removed in those changes but later it was restored during a branch merge (wrong merge?).

6346964_macro

-------------

PR: https://git.openjdk.java.net/jdk/pull/6639

From sviswanathan at openjdk.java.net Fri Dec 3 18:18:19 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 3 Dec 2021 18:18:19 GMT
Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v6]
In-Reply-To:
References:
Message-ID: <-KrgC60_yf9oimwB2SnNx-f7z_EBTbm4-d2OqLzd-Nc=.f7f14de3-ea27-44f1-9488-34ac2ccf6f78@github.com>

On Fri, 3 Dec 2021 03:38:32 GMT, David Holmes wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix whitespace
>
> Please check the GHA test results before integrating - there is a failing compiler test.

@dholmes-ora I looked at the results. The failure in compiler/c2/irTests/TestUnsignedComparison.java for x86_32 is unrelated to this patch (https://bugs.openjdk.java.net/browse/JDK-8277324) and was fixed recently. A merge with master should fix that.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6512

From kvn at openjdk.java.net Fri Dec 3 18:18:20 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Dec 2021 18:18:20 GMT
Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v3]
In-Reply-To:
References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com>
Message-ID:

On Wed, 1 Dec 2021 13:20:49 GMT, Ludvig Janiuk wrote:

>> Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code.
>>
>> With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts that the two lists correspond no longer need to be made.
>>
>> While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`.
>>
>> This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons).
>>
>> This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null.
>>
>> This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas.
>>
>> The commit history has been polished to attempt to guide the reader through the changes.
>>
>> hs-tier1 and hs-tier2 tests pass.
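
The single-source-of-truth arrangement described in the quoted PR text can be sketched as follows. This is a heavily simplified illustration, not the C1 implementation: the classes reuse the names `BlockBegin` and `BlockEnd` from the description, but every member and method signature here (`add_sux`, `substitute_sux`, `set_end`, and so on) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class BlockBegin;

// Simplified stand-in for C1's BlockEnd: the only owner of the successor list.
class BlockEnd {
  std::vector<BlockBegin*> _sux;
 public:
  void add_sux(BlockBegin* b) { _sux.push_back(b); }
  std::size_t number_of_sux() const { return _sux.size(); }
  BlockBegin* sux_at(std::size_t i) const { return _sux.at(i); }
  // Replace every occurrence of old_sux; there is no second list to sync.
  void substitute_sux(BlockBegin* old_sux, BlockBegin* new_sux) {
    for (auto& s : _sux) {
      if (s == old_sux) s = new_sux;
    }
  }
};

// Simplified stand-in for C1's BlockBegin: keeps no successor list of its
// own and forwards all queries to its BlockEnd.
class BlockBegin {
  BlockEnd* _end = nullptr;
 public:
  void set_end(BlockEnd* end) { _end = end; }
  BlockEnd* end() const { return _end; }
  std::size_t number_of_sux() const {
    assert(_end != nullptr && "successors are only reachable through the end");
    return _end->number_of_sux();
  }
  BlockBegin* sux_at(std::size_t i) const {
    assert(_end != nullptr && "successors are only reachable through the end");
    return _end->sux_at(i);
  }
};
```

Because `BlockBegin` only delegates, an update such as `substitute_sux` is visible through both objects immediately. The cost, as the description notes, is that a block without an end pointer has no reachable successor list.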
> > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > changed if statement style @LudwikJaniuk Please, rerun your testing. Running only `-b linux-x64` is not enough - you should not specify this option at all. Testing should be done on all platforms. Also `linux-x64` is product build which does not include asserts. So we may not catch possible issues. ------------- PR: https://git.openjdk.java.net/jdk/pull/6614 From sviswanathan at openjdk.java.net Fri Dec 3 18:24:00 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 3 Dec 2021 18:24:00 GMT Subject: RFR: 8277617: Adjust AVX3Threshold for copy/fill stubs [v7] In-Reply-To: References: Message-ID: > Currently 32-byte instructions are used for small array copy and clear. > This can be optimized by using 64-byte instructions. > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision:

 - Merge branch 'master' of https://git.openjdk.java.net/jdk into copyclearopt
 - Fix whitespace
 - Implement review comments
 - Override threshold only if flag is default
 - update comment for avx3_threshold() with more details
 - restrict to Intel core and add comment
 - 8277617: Optimize array copy and clear on x86_64

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6512/files
  - new: https://git.openjdk.java.net/jdk/pull/6512/files/190f974c..9cbfc374

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6512&range=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6512&range=05-06

  Stats: 34253 lines in 918 files changed: 20994 ins; 7898 del; 5361 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6512.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6512/head:pull/6512

PR: https://git.openjdk.java.net/jdk/pull/6512

From kvn at openjdk.java.net Fri Dec 3 18:34:12 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 3 Dec 2021 18:34:12 GMT
Subject: RFR: 8275610: C2: Object field load floats above its null check resulting in a segfault
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 14:00:39 GMT, Christian Hagedorn wrote:

> I therefore suggest to go with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas how to do that.

I agree with bailout for now and proper fix it later.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6701

From psandoz at openjdk.java.net Fri Dec 3 19:34:12 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Fri, 3 Dec 2021 19:34:12 GMT
Subject: RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v4]
In-Reply-To:
References:
Message-ID:

On Wed, 1 Dec 2021 18:06:58 GMT, Jatin Bhateja wrote:

>> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
>> - This patch adds the backend support for vector CastF2I and CaseD2L on X86 AVX512 and legacy targets.
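
For readers unfamiliar with the two casts, the scalar behaviour any vectorized F2I or D2L conversion has to preserve is Java's narrowing primitive conversion: NaN becomes 0, out-of-range values saturate, everything else truncates toward zero. The sketch below spells that out in C++. It is an illustration of the language semantics only and says nothing about how the patch implements them; the function names `java_f2i`, `java_d2l` and `convert_f2i` are hypothetical. As background, x86 truncating conversions such as `cvttps2dq` return `0x80000000` for NaN and out-of-range inputs, so their raw result differs from the Java result in exactly these special cases.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>

// Java semantics of (int) f for a float f.
inline std::int32_t java_f2i(float f) {
  if (std::isnan(f)) return 0;                            // NaN -> 0
  if (f >= 2147483648.0f)                                 // >= 2^31 saturates
    return std::numeric_limits<std::int32_t>::max();
  if (f <= -2147483648.0f)                                // <= -2^31 saturates
    return std::numeric_limits<std::int32_t>::min();
  return static_cast<std::int32_t>(f);                    // truncate toward zero
}

// Java semantics of (long) d for a double d.
inline std::int64_t java_d2l(double d) {
  if (std::isnan(d)) return 0;
  if (d >= 9223372036854775808.0)                         // >= 2^63 saturates
    return std::numeric_limits<std::int64_t>::max();
  if (d <= -9223372036854775808.0)                        // <= -2^63 saturates
    return std::numeric_limits<std::int64_t>::min();
  return static_cast<std::int64_t>(d);
}

// Element-wise loop of the shape a same-size cast vectorizer would target.
inline void convert_f2i(const float* src, std::int32_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = java_f2i(src[i]);
  }
}

// Small self-check of the special cases.
inline void cast_demo() {
  const float in[3] = {1.5f, -1.5f, std::numeric_limits<float>::quiet_NaN()};
  std::int32_t out[3];
  convert_f2i(in, out, 3);
  assert(out[0] == 1 && out[1] == -1 && out[2] == 0);
}
```

The range checks come before the `static_cast` because converting an out-of-range floating-point value to an integer type is undefined behaviour in C++.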
>> >> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java) >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt) >> -- | -- | -- | -- | -- | -- | -- | -- >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66 >> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27 >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277793: Review comments resolution. Tests pass. ------------- PR: https://git.openjdk.java.net/jdk/pull/6544 From jbhateja at openjdk.java.net Fri Dec 3 20:22:07 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 3 Dec 2021 20:22:07 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v2] In-Reply-To: References: Message-ID: > Summary of changes: > > 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. > 2) X86 backend support for AVX512 and AVX2 targets. 
> 3) New IR transformation to handle following patterns:- > a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) > b) Long2Mask + Mask2Long -> Long > 4) Following performance data is collected for new JMH micro included with the patch:- > > System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor > -- | -- | -- | -- | -- | -- | -- > MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 > MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 > MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 > MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 > MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969 > MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 > MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 > MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 > MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 > MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 > MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 > MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 
20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 > MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 > MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 > MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 > MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 > > > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8277997: Review comments resolved. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6646/files - new: https://git.openjdk.java.net/jdk/pull/6646/files/1208bb38..a1d3f019 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6646&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6646&range=00-01 Stats: 194 lines in 48 files changed: 19 ins; 8 del; 167 mod Patch: https://git.openjdk.java.net/jdk/pull/6646.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6646/head:pull/6646 PR: https://git.openjdk.java.net/jdk/pull/6646 From jbhateja at openjdk.java.net Fri Dec 3 20:22:08 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 3 Dec 2021 20:22:08 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v2] In-Reply-To: References: Message-ID: <19sfoNZ57GywwCjKfKwxBkKIOeUmPJA-THGAj2n-ywE=.b1741237-0060-4e2e-a0cd-6149f2274e23@github.com> On Wed, 1 Dec 2021 23:06:01 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8277997: Review comments resolved. 
>
> Arguably broadcasting is not the correct term to associate with conversion of a long value to a mask, but it is very convenient to reuse `VectorSupport.broadcastCoerced` and I don't have a better solution in that regard. The addition of a new intrinsic seems overly heavy.
>
> We could rename to `fromBitsCoerced`; then the `bitwise` parameter can be renamed `mode`.
>
> Can we define named constants on the Java and HotSpot side: `0` for broadcasting and `1` for mask conversion, e.g. `BITS_COERCED_BROADCAST = 0`, `BITS_COERCED_MASK_TO_LONG = 1`?
>
> This potentially allows for future modes such as broadcast only to the first lane.

@PaulSandoz, your comments have been addressed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6646

From jbhateja at openjdk.java.net Fri Dec 3 20:25:21 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Fri, 3 Dec 2021 20:25:21 GMT
Subject: Integrated: 8277793: Support vector F2I and D2L cast operations for X86
In-Reply-To:
References:
Message-ID:

On Wed, 24 Nov 2021 18:50:17 GMT, Jatin Bhateja wrote:

> - JDK-8275317 extended the auto-vectorizer to infer Vector Cast operations if the source and destination primitive types have the same size.
> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.

This pull request has now been integrated.
Changeset: 2b87c2b4
Author: Jatin Bhateja
URL: https://git.openjdk.java.net/jdk/commit/2b87c2b429f1c9f0d940795d5f74a54a20c2f5c0
Stats: 187 lines in 7 files changed: 181 ins; 1 del; 5 mod

8277793: Support vector F2I and D2L cast operations for X86

Reviewed-by: neliasso, sviswanathan

-------------

PR: https://git.openjdk.java.net/jdk/pull/6544

From psandoz at openjdk.java.net Fri Dec 3 20:48:12 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Fri, 3 Dec 2021 20:48:12 GMT
Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 20:22:07 GMT, Jatin Bhateja wrote:

> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8277997: Review comments resolved.

The term "coerced" comes from representing the value to convert (one bit, byte, float, long, etc.) to a mask or vector as a set of bits held in a long value. Thus having an extra mode `MODE_BITS_COERCED_BROADCAST` is confusing in that regard, and I think you can just reuse `MODE_BROADCAST` when broadcasting the 1 bit to a mask.
Since the class argument determines whether we are referring to a vector or not, as determined by `is_mask`, I would retain the existing `scalar2vector` boolean argument, so that the mode is localized just to the intrinsic.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6646

From sviswanathan at openjdk.java.net Fri Dec 3 21:10:17 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 3 Dec 2021 21:10:17 GMT
Subject: Integrated: 8277617: Adjust AVX3Threshold for copy/fill stubs
In-Reply-To:
References:
Message-ID:

On Tue, 23 Nov 2021 01:23:04 GMT, Sandhya Viswanathan wrote:

> Currently 32-byte instructions are used for small array copy and clear.
> This can be optimized by using 64-byte instructions.
>
> Please review.
>
> Best Regards,
> Sandhya

This pull request has now been integrated.

Changeset: 24e16ac6
Author: Sandhya Viswanathan
URL: https://git.openjdk.java.net/jdk/commit/24e16ac637095d7dee1d6fe34f996b68eedfa8bc
Stats: 29 lines in 5 files changed: 15 ins; 0 del; 14 mod

8277617: Adjust AVX3Threshold for copy/fill stubs

Reviewed-by: jbhateja, dholmes, neliasso, jiefu

-------------

PR: https://git.openjdk.java.net/jdk/pull/6512

From simonis at openjdk.java.net Fri Dec 3 21:24:22 2021
From: simonis at openjdk.java.net (Volker Simonis)
Date: Fri, 3 Dec 2021 21:24:22 GMT
Subject: RFR: JDK-8278135: Remove un-necessary null-check for get-static in c2
In-Reply-To:
References:
Message-ID:

On Thu, 2 Dec 2021 11:20:48 GMT, ?? wrote:

> When running the following test, lots of unnecessary null-check deoptimizations happen.
> > Small Test: > > public class CodeDependenciesTest { > private Object obj; > private String[] strs; > private Object[][] objs; > private static Class clzOne; > private static Class clzTwo; > public static void main(String[] args) throws Exception { > CodeDependenciesTest codeDependenciesTest = new CodeDependenciesTest(); > codeDependenciesTest.obj = new String("1"); > for (int i = 0; i < 300000; i++) { > codeDependenciesTest.foo(); > } > } > > public void foo() { > objs = new Object[10][10]; > for (int i = 0; i < 10; i++) { > for (int j = 0; j < 10; j++) { > objs[i][j] = new Object(); > } > } > clzOne = InvokeTest.class; > clzTwo = clzOne; > } > > static class InvokeTest { > public void bar(String i) { > try { > Thread.sleep(Long.valueOf(i)); > } catch (Exception e) { > e.printStackTrace(); > } > } > } > } > > > The deoptimization log generated by `-XX:+TraceDeoptimization` is: > > Uncommon trap bci=63 pc=0x00007f0eafbe1e38, relative_pc=0x00000000000005f8, method=CodeDependenciesTest.foo()V, debug_id=0 > Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=288 (@0x00007f0eafbe1e38) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 > DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec02348a0 mode 2 > {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 > Uncommon trap bci=63 pc=0x00007f0eafbe0f34, relative_pc=0x0000000000000514, method=CodeDependenciesTest.foo()V, debug_id=0 > Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=287 (@0x00007f0eafbe0f34) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 > DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec0235e40 mode 2 > {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 > > > The corresponding opto assembly is, > 
> 230 B20: # out( B56 B21 ) <- in( B58 B43 B41 B19 ) Freq: 0.999369 > 230 movl [RBX + #112 (8-bit)], narrowoop: java/lang/Class:exact * # compressed ptr ! Field: CodeDependenciesTest.clzOne > 237 movq R10, RBX # ptr -> long > 23a movq RBP, java/lang/Class:exact * # ptr > 244 movq R11, RBP # ptr -> long > 247 xorq R11, R10 # long > 24a shrq R11, #22 > 24e testq R11, R11 > 251 je B56 P=0.000001 C=-1.000000 > > 257 B21: # out( B56 B22 ) <- in( B20 ) Freq: 0.999368 > 257 shrq R10, #9 > 25b addq R14, R10 # ptr > 25e cmpb [R14], #4 > 262 je B56 P=0.000001 C=-1.000000 > > 5ed B56: # out( N780 ) <- in( B55 B21 B20 ) Freq: 3.02464e-06 > 5ed movl RSI, #-20 # int > nop # 1 bytes pad for loops and calls > 5f3 call,static wrapper for: uncommon_trap(reason='null_assert_or_unreached0' action='make_not_entrant' debug_id='0') > # CodeDependenciesTest::foo @ bci:63 (line 23) L[0]=_ L[1]=_ L[2]=_ STK[0]=RBP > # OopMap {rbp=Oop off=1528/0x5f8} > 5f8 stop # ShouldNotReachHere > > > C2 tries to generate a null-check for the get-static in `clzTwo = clzOne;`, because it thinks that ciKlass of `java.lang.Class` is not loaded. > > The ciKlass of `java.lang.Class` is generated by the following stack trace, > > (gdb) bt > #0 SystemDictionary::find_instance_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:778 > #1 0x00007ffff610769d in SystemDictionary::find_instance_or_array_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:813 > #2 0x00007ffff610a55c in SystemDictionary::find_constrained_instance_or_array_klass (current=0x7ffff020a5e0, class_name=0x800481080, class_loader=...) 
at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:1760 > #3 0x00007ffff55f1f29 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7ffff00334f8, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:519 > #4 0x00007ffff55f1d81 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7fff8c004518, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:488 > #5 0x00007ffff55f2466 in ciEnv::get_klass_by_index_impl (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:611 > #6 0x00007ffff55f2632 in ciEnv::get_klass_by_index (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:658 > #7 0x00007ffff55fb1bd in ciField::ciField (this=0x7fff8c0f9208, klass=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciField.cpp:101 > #8 0x00007ffff55f344f in ciEnv::get_field_by_index_impl (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:798 > #9 0x00007ffff55f3506 in ciEnv::get_field_by_index (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:811 > #10 0x00007ffff56284ec in ciBytecodeStream::get_field (this=0x7fffc1204850, will_link=@0x7fffc12046cf: false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciStreams.cpp:274 > #11 0x00007ffff562d10c in ciTypeFlow::StateVector::do_putstatic (this=0x7fff8c1bd078, str=0x7fffc1204850) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:798 > #12 0x00007ffff562e960 in ciTypeFlow::StateVector::apply_one_bytecode (this=0x7fff8c1bd078, str=0x7fffc1204850) at 
/data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:1457 > #13 0x00007ffff563218d in ciTypeFlow::flow_block (this=0x7fff8c0f7a50, block=0x7fff8c0f8ec0, state=0x7fff8c1bd078, jsrs=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2364 > #14 0x00007ffff563332d in ciTypeFlow::df_flow_types (this=0x7fff8c0f7a50, start=0x7fff8c0f80a8, do_flow=true, temp_vector=0x7fff8c1bd078, temp_set=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2675 > #15 0x00007ffff563361f in ciTypeFlow::flow_types (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2725 > #16 0x00007ffff5634081 in ciTypeFlow::do_flow (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2886 > #17 0x00007ffff5605687 in ciMethod::get_flow_analysis (this=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciMethod.cpp:327 > #18 0x00007ffff5499d34 in InlineTree::check_can_parse (callee=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/opto/bytecodeInfo.cpp:535 > #19 0x00007ffff55a0806 in CallGenerator::for_osr (m=0x7fff8c0f6340, osr_bci=22) at /data/openjdk/jdk_dev/src/hotspot/share/opto/callGenerator.cpp:299 > #20 0x00007ffff56aa2d9 in Compile::Compile (this=0x7fffc1205900, ci_env=0x7fffc12066c0, target=0x7fff8c0f6340, osr_bci=22, options=..., directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/compile.cpp:687 > #21 0x00007ffff559da3e in C2Compiler::compile_method (this=0x7ffff0209f40, env=0x7fffc12066c0, target=0x7fff8c0f6340, entry_bci=22, install_code=true, directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/c2compiler.cpp:108 > #22 0x00007ffff56c7f6e in CompileBroker::invoke_compiler_on_method (task=0x7ffff02322d0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:2291 > #23 0x00007ffff56c6aed in CompileBroker::compiler_thread_loop () at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:1966 > 
#24 0x00007ffff56e6f9f in CompilerThread::thread_entry (thread=0x7ffff020a5e0, __the_thread__=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compilerThread.cpp:59 > #25 0x00007ffff614a42f in JavaThread::thread_main_inner (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1297 > #26 0x00007ffff614a2c5 in JavaThread::run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1280 > #27 0x00007ffff6147b08 in Thread::call_run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:358 > #28 0x00007ffff5eafa4f in thread_native_entry (thread=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/os/linux/os_linux.cpp:705 > #29 0x00007ffff779cea5 in start_thread (arg=0x7fffc1207700) at pthread_create.c:307 > #30 0x00007ffff72c19fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > > When `CodeDependenciesTest.foo` is compiled, the classloader of the holder of this method is AppClassLoader, so it finds `java.lang.Class` in AppClassLoader, and finds nothing, so it thinks that `java.lang.Class` is not loaded, but `java.lang.Class` is a built-in class which is definitely loaded and initialized. > > The following patch is the first patch we implemented, it tries to find Klass in the parent classloader when the Klass can not be found in current classloader. But the patch has potential risk, because user-defined classloader may not follow the 'parent delegation' style of classloading. 
>
> diff --git a/src/hotspot/share/ci/ciEnv.cpp b/src/hotspot/share/ci/ciEnv.cpp
> index e29b56a..3ceafc4 100644
> --- a/src/hotspot/share/ci/ciEnv.cpp
> +++ b/src/hotspot/share/ci/ciEnv.cpp
> @@ -514,11 +514,17 @@ ciKlass* ciEnv::get_klass_by_name_impl(ciKlass* accessing_klass,
>    {
>      ttyUnlocker ttyul; // release tty lock to avoid ordering problems
>      MutexLocker ml(current, Compile_lock);
> -    Klass* kls;
> -    if (!require_local) {
> -      kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader);
> -    } else {
> -      kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain);
> +    Klass* kls = NULL;
> +    while (true) {
> +      if (!require_local) {
> +        kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader);
> +      } else {
> +        kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain);
> +      }
> +      if (kls != NULL || loader() == NULL) {
> +        break;
> +      }
> +      loader = Handle(current, java_lang_ClassLoader::parent(loader()));
>      }
>      found_klass = kls;
>    }
>
>
> When the Klass of the field is not loaded, the generated 'null check' does not help; we think removing it is the right way to avoid the deoptimization.

@casparcwang, thanks for this PR! It's really a very interesting and tricky issue :)

What you actually see is not a deoptimization because of a null check, but rather the opposite, a deoptimization because of a not-null check (i.e. `null_assert` is the opposite of `null_check` and traps if a value is not null).
Let's look at a simplified version of your example:

    public class CodeDependenciesSimple {
      private static Class clzOne;
      private static Class clzTwo;
      public static void main(String[] args) throws Exception {
        if (args.length == 1) {
          clzOne = Object.class;
        } else if (args.length == 2) {
          clzOne = Class.forName("java.lang.Object");
        }
        for (int i = 0; i < 300000; i++) {
          foo();
        }
      }
      public static void foo() {
        clzTwo = clzOne;
      }
    }

Calling this with `java CodeDependenciesSimple class` will reproduce your problem of continuous deoptimizations. The OptoAssembly looks as follows:

    000     B1: # out( B3 B2 ) <- BLOCK HEAD IS JUNK Freq: 1
    000     # stack bang (104 bytes)
            pushq   rbp # Save rbp
            subq    rsp, #16 # Create frame
    00c     movq    R10, java/lang/Class:exact * # ptr
    016     movq    R11, [R10 + #176 (32-bit)] # ptr ! Field: CodeDependenciesSimple.clzOne
            nop # 3 bytes pad for loops and calls
    020     testq   R11, R11 # ptr
    023     jne,s   B3 P=0,000001 C=-1,000000

    025     B2: # out( N1 ) <- in( B1 ) Freq: 0,999999
    025     movq    [R10 + #184 (32-bit)], NULL # ptr ! Field: CodeDependenciesSimple.clzTwo
    030     addq    rsp, 16 # Destroy frame
            popq    rbp
            cmpq    rsp, poll_offset[r15_thread]
            ja      #safepoint_stub # Safepoint: poll for GC
    042     ret

    043     B3: # out( N1 ) <- in( B1 ) Freq: 1e-06
    043     movl    RSI, #-20 # int
    048     movq    RBP, R11 # spill
    04b     call,static wrapper for: uncommon_trap(reason='null_assert_or_unreached0' action='make_not_entrant' debug_id='0')
            # CodeDependenciesSimple::foo @ bci:3 (line 15) STK[0]=RBP
            # OopMap {rbp=Oop off=80/0x50}
    050     stop # ShouldNotReachHere

As you can see, there's a check for `clzOne` being not zero at `0x020`, and if `clzOne` is not zero (which will be true if you call the program with one parameter) you'll always jump to the uncommon trap in block B3. If you call the program without arguments, the OptoAssembly of `foo()` will look exactly the same, but `clzOne` will be NULL (because it is not initialized) and you'll see no deoptimizations.
This is a kind of C2 "optimization" because for NULL assignments, C2 doesn't have to prove that the field has the same type as the object which is assigned to the field (the object is actually NULL).

I think all this weird behavior is really a corner case for class constants (i.e. `Object.class`) because they are directly translated into a `ldc java/lang/Object` bytecode and there's no need for the classloader which executes the code to load and resolve the `java.lang.Class` class. If you use `Class.forName("java.lang.Object")` to initialize `clzOne`, which is semantically the same (by executing `java CodeDependenciesSimple class forName`), the generated code changes and will look as follows:

    000     B1: # out( N1 ) <- BLOCK HEAD IS JUNK Freq: 1
    000     # stack bang (96 bytes)
            pushq   rbp # Save rbp
            subq    rsp, #16 # Create frame
    00c     movq    R10, 0x00007fbdf0ade000 # ptr
    016     movq    R11, java/lang/Class:exact * # ptr
    020     movq    R8, [R11 + #176 (32-bit)] # ptr ! Field: CodeDependenciesSimple.clzOne
    027     movq    [R11 + #184 (32-bit)], R8 # ptr ! Field: CodeDependenciesSimple.clzTwo
    02e     movq    R11, R11 # ptr -> long
    02e     shrq    R11, #9
    032     movb    [R10 + R11], #0 # byte
    037     addq    rsp, 16 # Destroy frame
            popq    rbp
            cmpq    rsp, poll_offset[r15_thread]
            ja      #safepoint_stub # Safepoint: poll for GC
    049     ret

As you see, there's no uncommon trap in the code any more. This is because the application class loader, which now executes `Class.forName("java.lang.Object")`, will trigger the loading of `java.lang.Class` and will be recorded as an initiating classloader for `java.lang.Class` (`java.lang.Class` will still be loaded, or has already been loaded to be more accurate, by the bootstrap class loader, which is recorded as the class' defining class loader).
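The defining-loader half of this can be observed from plain Java. The following is a minimal sketch under stated assumptions: `LoaderCheck` and `definingLoader` are made-up names for illustration, and the sketch only shows defining loaders, because as far as I know there is no public API to ask which loaders are recorded as initiating loaders of a class.

```java
public class LoaderCheck {
    // Returns a readable name for the defining class loader of c.
    // Class.getClassLoader() returns null for classes defined by the
    // bootstrap class loader, so null is mapped to "bootstrap".
    static String definingLoader(Class<?> c) {
        ClassLoader cl = c.getClassLoader();
        return cl == null ? "bootstrap" : cl.getClass().getName();
    }

    public static void main(String[] args) {
        // java.lang.Class and java.lang.Object are defined by the bootstrap loader.
        System.out.println("java.lang.Class  -> " + definingLoader(Class.class));
        System.out.println("java.lang.Object -> " + definingLoader(Object.class));
        // An application class is defined by whichever loader loaded it.
        System.out.println("LoaderCheck      -> " + definingLoader(LoaderCheck.class));
    }
}
```

Note that the `Class.class` and `Object.class` literals used here are themselves compiled to `ldc`, so running this does not by itself make the application class loader an initiating loader of `java.lang.Class`.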
Because the application class loader is an initiating classloader of `java.lang.Class`, the type flow analysis will correctly detect the type of the `clzOne` field and prove that it's the same as the type of the assigned object, so there's no need for an uncommon trap any more.

I think this situation can only occur for fields of type `Class` because they can be assigned from the result of an `ldc` instruction, which prevents the executing class loader from triggering the loading of the corresponding class. By the way, `String` constants are also loaded with `ldc`, but the application class loader already initiates the loading of `java.lang.String` because it's in the argument list of the `main()` function, so the problem can't be easily reproduced with `String` fields.

So to cut a long story short, I don't think that your proposed fix is correct. I have to think more about it, but maybe one approach for a fix could be to handle `java.lang.Class` specially because it is always loaded in the bootstrap classloader during VM initialization.

I'll take another look at the problem next week :)

Have a nice weekend,
Volker

-------------

PR: https://git.openjdk.java.net/jdk/pull/6667

From kvn at openjdk.java.net Sat Dec 4 00:08:15 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 4 Dec 2021 00:08:15 GMT
Subject: RFR: 8277850: C2: optimize mask checks in counted loops
In-Reply-To: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com>
References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com>
Message-ID:

On Fri, 3 Dec 2021 10:04:58 GMT, Roland Westrelin wrote:

> This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is:
> if (((base + (offset << 2)) & 3) != 0) {
> }
>
> where base is loop independent but offset depends on a loop variable.
> This can be transformed to:
>
> if ((base & 3) != 0) {
>
> That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on).
>
> This change also optimizes the pattern:
>
> (offset << 2) & 3
>
> to return 0.

@rwestrel there is an issue with the build on Windows:

d:\a\jdk\jdk\jdk\src\hotspot\share\opto\mulnode.cpp(1746): error C2220: the following warning is treated as an error
d:\a\jdk\jdk\jdk\src\hotspot\share\opto\mulnode.cpp(1746): warning C4334: '<<': result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?)

-------------

PR: https://git.openjdk.java.net/jdk/pull/6697

From kvn at openjdk.java.net Sat Dec 4 00:20:14 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 4 Dec 2021 00:20:14 GMT
Subject: RFR: JDK-8277496 Remove duplication in c1 Block successor lists [v3]
In-Reply-To:
References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com>
Message-ID: <5e9m3nHCy9ovR_zhFTnwa0YGByihBzFyoKUuanPPCuw=.6d9527d5-dd2e-4af7-adc7-6ae17c928a32@github.com>

On Wed, 1 Dec 2021 13:20:49 GMT, Ludvig Janiuk wrote:

>> Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code.
>>
>> With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts that the two lists correspond no longer need to be made.
>>
>> While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet.
To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. >> >> This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). >> >> This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. >> >> This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. >> >> The commit history has been polished to attempt to guide the reader through the changes. >> >> hs-tier1 and hs-tier2 tests pass. > > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > changed if statement style Good. Testing results look good too. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6614 From kvn at openjdk.java.net Sat Dec 4 00:57:11 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 4 Dec 2021 00:57:11 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops In-Reply-To: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: On Fri, 3 Dec 2021 10:04:58 GMT, Roland Westrelin wrote: > This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is: > if (((base + (offset << 2)) & 3) != 0) { > } > > where base is loop independent but offset depends on a loop variable. This can be transformed to: > > if ((base & 3) != 0) { > > That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on). 
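A minimal sketch of why the two forms of the mask check quoted above agree, written in Java purely for illustration. The class and method names are hypothetical and are not part of the C2 change. The assumption being demonstrated is only the arithmetic: `offset << 2` always has its two low bits clear, so adding it to `base` cannot change `base & 3`, even when the addition wraps around.

```java
// Illustrative sketch only: hypothetical names, not HotSpot code.
final class MaskCheckSketch {
    // Original shape: the check depends on the loop variable through offset.
    static boolean misalignedInLoop(long base, long offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    // Transformed shape: only the loop-independent base is inspected.
    static boolean misalignedHoisted(long base) {
        return (base & 3) != 0;
    }

    public static void main(String[] args) {
        long[] bases = {0, 1, 2, 3, 4, 1023, -1, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long base : bases) {
            for (long offset = -1000; offset <= 1000; offset++) {
                // The shifted offset contributes nothing to the two low bits.
                if (((offset << 2) & 3) != 0) {
                    throw new AssertionError("low bits set for offset " + offset);
                }
                // Hence both shapes of the check give the same answer.
                if (misalignedInLoop(base, offset) != misalignedHoisted(base)) {
                    throw new AssertionError("mismatch for base " + base);
                }
            }
        }
        System.out.println("both shapes agree");
    }
}
```

Because the hoisted form no longer mentions `offset`, it has the same value on every iteration, which is what makes it a candidate for loop predication as described above.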
> > This change also optimizes the pattern: > > (offset << 2) & 3 > > to return 0. src/hotspot/share/opto/mulnode.cpp line 514: > 512: } > 513: > 514: return MulINode::Value(phase); There is no `MulINode::value()` or `MulLNode::value()`, only based class `MulNode` have it. You missing initial check done by `MulNode::value()`. I suggest to move this code into `AndINode::mul_ring()` and `AndLNode::mul_ring()`. Also in `mul_ring()` you can pass `r1->get_con()` as `mask` value to `AndIL_shift_and_mask()`. src/hotspot/share/opto/mulnode.cpp line 1716: > 1714: } > 1715: } > 1716: Add comment here too which pattern it is looking for. src/hotspot/share/opto/mulnode.cpp line 1717: > 1715: } > 1716: > 1717: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* mask, Node* shift, BasicType bt) const { Pass `mask` as value here. src/hotspot/share/opto/mulnode.cpp line 1719: > 1717: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* mask, Node* shift, BasicType bt) const { > 1718: if (mask == NULL || shift == NULL) { > 1719: return false; You need to check `shift` for `TOP`. src/hotspot/share/opto/mulnode.cpp line 1755: > 1753: Node* MulNode::AndIL_add_shift_and_mask(PhaseGVN* phase, BasicType bt) { > 1754: Node* in1 = in(1); > 1755: Node* in2 = in(2); Caller already determine that `in2` is `const int mask = t2->get_con()`. You can pass it here. ------------- PR: https://git.openjdk.java.net/jdk/pull/6697 From chagedorn at openjdk.java.net Sat Dec 4 08:31:11 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Sat, 4 Dec 2021 08:31:11 GMT Subject: RFR: 8275610: C2: Object field load floats above its null check resulting in a segfault In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 14:00:39 GMT, Christian Hagedorn wrote: > In the test case, a field load of an object floats above the object null check due to a `CastPP` that gets separated from its null check `If` node. 
C2 then schedules the field load before the object null check which results in a segfault. > > The problem can be traced back to the elimination of identical back-to-back ifs [(JDK-8143542)](https://bugs.openjdk.java.net/browse/JDK-8143542). This is done in split-if where we detect identical back-to-back ifs in `PhaseIdealLoop::identical_backtoback_ifs()`. We then replace the boolean node of the dominated `If` by a phi to use the split-if optimization to nicely fold away the dominated `If` later in IGVN. This, however, does not update any data nodes that were dependent on the dominated `If` projections. > > In the test case, we have the following graph right before splitting `313 If` (`NC 2`) through `190 Region`: > > ![Screenshot from 2021-12-03 14-08-31](https://user-images.githubusercontent.com/17833009/144608863-a1185bf8-dfa3-4bd5-a0c2-284ea2b27606.png) > `313 If` is dominated by the identical (= both share `308 Bool`) `309 If` (`NC 1`). The bool input for `313 If` is replaced by a phi and the split-if optimization is applied. However, the data nodes dependent on the out projections of the dominated `313 If` (`261 CastPP` in this case) are not processed separately and just end up at the newly inserted regions in split-if. In the test case, we get the following graph where `261 CastPP` ends up at the new `334 Region`: > > ![Screenshot from 2021-12-03 14-08-59](https://user-images.githubusercontent.com/17833009/144608891-48adf7a3-9b04-4e6d-bc26-310757b3d596.png) > Loopopts are applied and can remove the `325 CountedLoop` and we find that the `171 RangeCheck` (`RC 2`) is applied on a constant array index 1. > > Now at IGVN, the order in which the nodes are processed is important in order to trigger the segfault: > 1. `334 Region` with `332 If` and `333 If` are removed because of the special split-if setup we used to remove the identical `313 If`. The control input of `261 CastPP` is therefore updated to `172 IfTrue`. > 2. 
Applying `RangeCheck::Ideal()` for `171 RangeCheck` finds that `305 RangeCheck` (`RC 1`) already covers it and we remove `171 RangeCheck`. In this process, we rewire `261 CastPP` to the dominating `305 RangeCheck` and we have the following graph: > > ![Screenshot from 2021-12-03 14-09-21](https://user-images.githubusercontent.com/17833009/144608917-f5159b21-40b2-41c8-afab-fa495433217a.png) > `261 CastPP` - and also the field `263 LoadI` - have now `306 IfFalse` as early control. GCM is then scheduling `263 LoadI` before the null check `309 If` and we get a segfault. > > An easy fix is not straight forward. What we actually would want to do is rewiring `261 CastPP` from `334 Region` to `311 IfTrue` in the second graph after split-if to not separate it from the null check. But that's not possible because we would create a bad graph: The early control `311 IfTrue` of `261 CastPP` does not dominate its late control further down after `334 Region` because of the not yet removed `334 Region`. We would need to already clean the regions up and then do the rewiring. But then the question arises why to use the split-if optimization in the first place when we do not want to rely on IGVN to clean it up. > > I therefore suggest to go with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas how to do that. > > I ran some standard benchmarks and did not see any performance regressions with this fix. > > Thanks, > Christian Thanks Vladimir for your review! 
------------- PR: https://git.openjdk.java.net/jdk/pull/6701 From shade at openjdk.java.net Sun Dec 5 21:40:10 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Sun, 5 Dec 2021 21:40:10 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: <_tpNEJ7RnKYConrel3BdMSsn7DRHOZX1i1lqX3yVZD8=.82735e32-7713-4d6f-8142-9bba01cbee48@github.com> References: <_tpNEJ7RnKYConrel3BdMSsn7DRHOZX1i1lqX3yVZD8=.82735e32-7713-4d6f-8142-9bba01cbee48@github.com> Message-ID: On Thu, 2 Dec 2021 07:29:31 GMT, Vladimir Kozlov wrote: > I don't see issues with these changes in my testing. I submitted our tier1,2,3 testing in internal infra. Vladimir, are internal infra results good? ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From kvn at openjdk.java.net Sun Dec 5 22:23:11 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 5 Dec 2021 22:23:11 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: Message-ID: <58oAR2iULxhHhO3as1oPSIHAQKL3an6BOgafpjbvOPQ=.25df644d-a7c1-43ea-a18c-e838cb245171@github.com> On Tue, 30 Nov 2021 20:44:43 GMT, Aleksey Shipilev wrote: >> I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there. >> >> Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors. But these tests are rather fast themselves and cover important compiler features. >> >> We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. 
>> >> Sample times for new subgroups (think about this as "How much time they add to existing tiers"): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 >> ============================== >> >> real 2m16.518s >> user 35m40.839s >> sys 1m35.334s >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 >> ============================== >> >> real 4m31.935s >> user 71m54.617s >> sys 2m13.073s > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Filter out tier1/2 groups too Yes, all passed. You can integrate. ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From eliu at openjdk.java.net Mon Dec 6 01:56:09 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 6 Dec 2021 01:56:09 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. In-Reply-To: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: On Tue, 30 Nov 2021 08:39:52 GMT, Roland Westrelin wrote: > AddINode::Ideal() and AddlNode::Ideal() are almost identical but the > same logic had to be duplicated because AddINode::Ideal() tests its > inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for > Op_AddL, Op_SubL etc. This patch refactors the code so the common > logic is in a single method parameterized by a BasicType argument. > > The way I've done this before in the context of int/long counted loops > was to use and extra virtual method operates_on(). 
So: > > n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) > > Working on this change made me realize that pattern doesn't work that well: > > - it's quite a bit more verbose and converting existing code is not as > mechanical as we would like to avoid conversion errors. > > - it breaks when a class has a subclass. For instance AddNode has > OrINode and OrLNode as subclasses so testing for n->is_Add() returns > true with an OrI node. > > Instead, this change introduces new functions. For instance of > AddI/AddL: > > int Op_Add(BasicType bt) > > that returns either Op_AddI or Op_AddL depending on bt. This made > refactoring the AddINode::Ideal() logic straightforward. I removed all > use of operates_on() as well and converted existing code to the new > Op_XXX() functions. src/hotspot/share/opto/addnode.cpp line 280: > 278: } > 279: if (op1 == Op_Sub(bt)) { > 280: const Type *t_sub1 = phase->type(in1->in(1)); I'm not very clear about the current code style which we should follow. Shall we need to align the style in the changed code? ------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From shade at openjdk.java.net Mon Dec 6 06:30:18 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 06:30:18 GMT Subject: RFR: 8278016: Add compiler tests to tier{2,3} [v2] In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 20:44:43 GMT, Aleksey Shipilev wrote: >> I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there. >> >> Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors. But these tests are rather fast themselves and cover important compiler features. >> >> We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. 
The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. >> >> Sample times for new subgroups (think about this as "How much time they add to existing tiers"): >> >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 >> ============================== >> >> real 2m16.518s >> user 35m40.839s >> sys 1m35.334s >> >> ============================== >> Test summary >> ============================== >> TEST TOTAL PASS FAIL ERROR >> jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 >> ============================== >> >> real 4m31.935s >> user 71m54.617s >> sys 2m13.073s > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Filter out tier1/2 groups too Thank you! ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From shade at openjdk.java.net Mon Dec 6 06:30:19 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 06:30:19 GMT Subject: Integrated: 8278016: Add compiler tests to tier{2,3} In-Reply-To: References: Message-ID: On Tue, 30 Nov 2021 19:29:36 GMT, Aleksey Shipilev wrote: > I have been looking at `hotspot:tier4` (catch-all not in lower tiers) run logs, and realized the whole bunch of compiler tests are running there. > > Since `hotspot:tier4` runs a lot of `vmTestbase` tests, contributors seldom run it, as it takes many hours. Which means that many compiler tests are not running regularly for many contributors. But these tests are rather fast themselves and cover important compiler features. > > We can properly add compiler tests to `tier{2,3}` to expose them on earlier tiers. The split logic between tiers is roughly: fast feature tests go into tier2, slower feature tests and debugging/printing stuff goes to tier3. 
> > Sample times for new subgroups (think about this as "How much time they add to existing tiers"): > > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier2_compiler 243 243 0 0 > ============================== > > real 2m16.518s > user 35m40.839s > sys 1m35.334s > > ============================== > Test summary > ============================== > TEST TOTAL PASS FAIL ERROR > jtreg:test/hotspot/jtreg:tier3_compiler 132 132 0 0 > ============================== > > real 4m31.935s > user 71m54.617s > sys 2m13.073s This pull request has now been integrated. Changeset: f180a459 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/f180a4591f52d0af0c030aa85be33c51b06c90ee Stats: 46 lines in 1 file changed: 46 ins; 0 del; 0 mod 8278016: Add compiler tests to tier{2,3} Reviewed-by: kvn, dholmes ------------- PR: https://git.openjdk.java.net/jdk/pull/6622 From duke at openjdk.java.net Mon Dec 6 07:35:57 2021 From: duke at openjdk.java.net (=?UTF-8?B?546L6LaF?=) Date: Mon, 6 Dec 2021 07:35:57 GMT Subject: RFR: JDK-8278135: Remove un-necessary null-check for get-static in c2 [v2] In-Reply-To: References: Message-ID: > When run the following test, lots of un-necessary null check deoptimization happen. 
> > Small Test: > > public class CodeDependenciesTest { > private Object obj; > private String[] strs; > private Object[][] objs; > private static Class clzOne; > private static Class clzTwo; > public static void main(String[] args) throws Exception { > CodeDependenciesTest codeDependenciesTest = new CodeDependenciesTest(); > codeDependenciesTest.obj = new String("1"); > for (int i = 0; i < 300000; i++) { > codeDependenciesTest.foo(); > } > } > > public void foo() { > objs = new Object[10][10]; > for (int i = 0; i < 10; i++) { > for (int j = 0; j < 10; j++) { > objs[i][j] = new Object(); > } > } > clzOne = InvokeTest.class; > clzTwo = clzOne; > } > > static class InvokeTest { > public void bar(String i) { > try { > Thread.sleep(Long.valueOf(i)); > } catch (Exception e) { > e.printStackTrace(); > } > } > } > } > > > The deoptimization log generated by `-XX:+TraceDeoptimization` is: > > Uncommon trap bci=63 pc=0x00007f0eafbe1e38, relative_pc=0x00000000000005f8, method=CodeDependenciesTest.foo()V, debug_id=0 > Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=288 (@0x00007f0eafbe1e38) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 > DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec02348a0 mode 2 > {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 > Uncommon trap bci=63 pc=0x00007f0eafbe0f34, relative_pc=0x0000000000000514, method=CodeDependenciesTest.foo()V, debug_id=0 > Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=287 (@0x00007f0eafbe0f34) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 > DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec0235e40 mode 2 > {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 > > > The corresponding opto assembly is, > 
> 230 B20: # out( B56 B21 ) <- in( B58 B43 B41 B19 ) Freq: 0.999369 > 230 movl [RBX + #112 (8-bit)], narrowoop: java/lang/Class:exact * # compressed ptr ! Field: CodeDependenciesTest.clzOne > 237 movq R10, RBX # ptr -> long > 23a movq RBP, java/lang/Class:exact * # ptr > 244 movq R11, RBP # ptr -> long > 247 xorq R11, R10 # long > 24a shrq R11, #22 > 24e testq R11, R11 > 251 je B56 P=0.000001 C=-1.000000 > > 257 B21: # out( B56 B22 ) <- in( B20 ) Freq: 0.999368 > 257 shrq R10, #9 > 25b addq R14, R10 # ptr > 25e cmpb [R14], #4 > 262 je B56 P=0.000001 C=-1.000000 > > 5ed B56: # out( N780 ) <- in( B55 B21 B20 ) Freq: 3.02464e-06 > 5ed movl RSI, #-20 # int > nop # 1 bytes pad for loops and calls > 5f3 call,static wrapper for: uncommon_trap(reason='null_assert_or_unreached0' action='make_not_entrant' debug_id='0') > # CodeDependenciesTest::foo @ bci:63 (line 23) L[0]=_ L[1]=_ L[2]=_ STK[0]=RBP > # OopMap {rbp=Oop off=1528/0x5f8} > 5f8 stop # ShouldNotReachHere > > > C2 tries to generate a null-check for the get-static in `clzTwo = clzOne;`, because it thinks that ciKlass of `java.lang.Class` is not loaded. > > The ciKlass of `java.lang.Class` is generated by the following stack trace, > > (gdb) bt > #0 SystemDictionary::find_instance_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:778 > #1 0x00007ffff610769d in SystemDictionary::find_instance_or_array_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:813 > #2 0x00007ffff610a55c in SystemDictionary::find_constrained_instance_or_array_klass (current=0x7ffff020a5e0, class_name=0x800481080, class_loader=...) 
at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:1760 > #3 0x00007ffff55f1f29 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7ffff00334f8, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:519 > #4 0x00007ffff55f1d81 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7fff8c004518, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:488 > #5 0x00007ffff55f2466 in ciEnv::get_klass_by_index_impl (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:611 > #6 0x00007ffff55f2632 in ciEnv::get_klass_by_index (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:658 > #7 0x00007ffff55fb1bd in ciField::ciField (this=0x7fff8c0f9208, klass=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciField.cpp:101 > #8 0x00007ffff55f344f in ciEnv::get_field_by_index_impl (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:798 > #9 0x00007ffff55f3506 in ciEnv::get_field_by_index (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:811 > #10 0x00007ffff56284ec in ciBytecodeStream::get_field (this=0x7fffc1204850, will_link=@0x7fffc12046cf: false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciStreams.cpp:274 > #11 0x00007ffff562d10c in ciTypeFlow::StateVector::do_putstatic (this=0x7fff8c1bd078, str=0x7fffc1204850) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:798 > #12 0x00007ffff562e960 in ciTypeFlow::StateVector::apply_one_bytecode (this=0x7fff8c1bd078, str=0x7fffc1204850) at 
/data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:1457 > #13 0x00007ffff563218d in ciTypeFlow::flow_block (this=0x7fff8c0f7a50, block=0x7fff8c0f8ec0, state=0x7fff8c1bd078, jsrs=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2364 > #14 0x00007ffff563332d in ciTypeFlow::df_flow_types (this=0x7fff8c0f7a50, start=0x7fff8c0f80a8, do_flow=true, temp_vector=0x7fff8c1bd078, temp_set=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2675 > #15 0x00007ffff563361f in ciTypeFlow::flow_types (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2725 > #16 0x00007ffff5634081 in ciTypeFlow::do_flow (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2886 > #17 0x00007ffff5605687 in ciMethod::get_flow_analysis (this=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciMethod.cpp:327 > #18 0x00007ffff5499d34 in InlineTree::check_can_parse (callee=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/opto/bytecodeInfo.cpp:535 > #19 0x00007ffff55a0806 in CallGenerator::for_osr (m=0x7fff8c0f6340, osr_bci=22) at /data/openjdk/jdk_dev/src/hotspot/share/opto/callGenerator.cpp:299 > #20 0x00007ffff56aa2d9 in Compile::Compile (this=0x7fffc1205900, ci_env=0x7fffc12066c0, target=0x7fff8c0f6340, osr_bci=22, options=..., directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/compile.cpp:687 > #21 0x00007ffff559da3e in C2Compiler::compile_method (this=0x7ffff0209f40, env=0x7fffc12066c0, target=0x7fff8c0f6340, entry_bci=22, install_code=true, directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/c2compiler.cpp:108 > #22 0x00007ffff56c7f6e in CompileBroker::invoke_compiler_on_method (task=0x7ffff02322d0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:2291 > #23 0x00007ffff56c6aed in CompileBroker::compiler_thread_loop () at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:1966 > 
#24 0x00007ffff56e6f9f in CompilerThread::thread_entry (thread=0x7ffff020a5e0, __the_thread__=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compilerThread.cpp:59 > #25 0x00007ffff614a42f in JavaThread::thread_main_inner (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1297 > #26 0x00007ffff614a2c5 in JavaThread::run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1280 > #27 0x00007ffff6147b08 in Thread::call_run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:358 > #28 0x00007ffff5eafa4f in thread_native_entry (thread=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/os/linux/os_linux.cpp:705 > #29 0x00007ffff779cea5 in start_thread (arg=0x7fffc1207700) at pthread_create.c:307 > #30 0x00007ffff72c19fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > > When `CodeDependenciesTest.foo` is compiled, the classloader of the holder of this method is AppClassLoader, so it finds `java.lang.Class` in AppClassLoader, and finds nothing, so it thinks that `java.lang.Class` is not loaded, but `java.lang.Class` is a built-in class which is definitely loaded and initialized. > > The following patch is the first patch we implemented, it tries to find Klass in the parent classloader when the Klass can not be found in current classloader. But the patch has potential risk, because user-defined classloader may not follow the 'parent delegation' style of classloading. 
> > diff --git a/src/hotspot/share/ci/ciEnv.cpp b/src/hotspot/share/ci/ciEnv.cpp > index e29b56a..3ceafc4 100644 > --- a/src/hotspot/share/ci/ciEnv.cpp > +++ b/src/hotspot/share/ci/ciEnv.cpp > @@ -514,11 +514,17 @@ ciKlass* ciEnv::get_klass_by_name_impl(ciKlass* accessing_klass, > { > ttyUnlocker ttyul; // release tty lock to avoid ordering problems > MutexLocker ml(current, Compile_lock); > - Klass* kls; > - if (!require_local) { > - kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader); > - } else { > - kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain); > + Klass* kls = NULL; > + while (true) { > + if (!require_local) { > + kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader); > + } else { > + kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain); > + } > + if (kls != NULL || loader() == NULL) { > + break; > + } > + loader = Handle(current, java_lang_ClassLoader::parent(loader())); > } > found_klass = kls; > } > > > When the Klass of the field is not loaded, the generated 'null check' does not help, so we think removing it is the right way to avoid the deoptimization. 王超 has updated the pull request incrementally with two additional commits since the last revision: - get the correct ciklass if java.lang.Class is not resolved before - Revert "Remove un-necessary null check" This reverts commit 12ae4bb441129c582e9f241e7b0bbde5de783533.
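As a Java-level illustration of the point that `java.lang.Class` is defined by the bootstrap loader, here is a small sketch. It assumes a standard JDK, uses hypothetical class and method names, and does not reproduce the compiler-interface lookup in `ciEnv::get_klass_by_name_impl` that the patch in this thread touches. It only shows what ordinary Java code can observe: the defining loader of `Class` is reported as `null` (the bootstrap loader), and resolving the name through the application class loader yields the very same `Class` object.

```java
// Illustrative sketch only: hypothetical names, Java-level view of class loading.
final class ClassOfClassLoaderSketch {
    // Bootstrap-defined classes report a null defining loader.
    static boolean definedByBootstrap(Class<?> c) {
        return c.getClassLoader() == null;
    }

    // Resolve a class by name through the given loader without initializing it.
    static Class<?> resolveVia(ClassLoader loader, String name) {
        try {
            return Class.forName(name, false, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("cannot resolve " + name, e);
        }
    }

    public static void main(String[] args) {
        ClassLoader app = ClassLoader.getSystemClassLoader();
        Class<?> viaApp = resolveVia(app, "java.lang.Class");
        System.out.println("defined by bootstrap: " + definedByBootstrap(viaApp));
        System.out.println("same Class object: " + (viaApp == Class.class));
    }
}
```

A class loader that does not delegate to its parent would not be covered by this sketch, which is the risk noted above for the parent-walking version of the patch.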
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6667/files - new: https://git.openjdk.java.net/jdk/pull/6667/files/12ae4bb4..b6191ded Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6667&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6667&range=00-01 Stats: 32 lines in 2 files changed: 32 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6667.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6667/head:pull/6667 PR: https://git.openjdk.java.net/jdk/pull/6667 From duke at openjdk.java.net Mon Dec 6 07:41:16 2021 From: duke at openjdk.java.net (=?UTF-8?B?546L6LaF?=) Date: Mon, 6 Dec 2021 07:41:16 GMT Subject: RFR: JDK-8278135: Remove un-necessary null-check for get-static in c2 In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 21:21:02 GMT, Volker Simonis wrote: > So to cut a long story short, I don't think that your proposed fix is correct. I have to think more about it, but maybe one approach for a fix could be to handle `java.lang.Class` special because it is always loaded in the bootstrap classloader during VM initialization. > > I'll take another look at the problem next week :) Have a nice weekend, Volker Got it: the `null-check` is used to unload un-optimized version of jit code, it can not be removed. Thank you for the detailed explanation. I have updated the patch to handle `java.lang.Class` special. ------------- PR: https://git.openjdk.java.net/jdk/pull/6667 From thartmann at openjdk.java.net Mon Dec 6 07:59:10 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 6 Dec 2021 07:59:10 GMT Subject: RFR: JDK-8154011: Make TraceDeoptimization a diagnostic flag In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote: > Make TraceDeoptimization available in a product build. > > I checked that performance is not affected on Aurora. Looks good to me and as discussed in JIRA, this can't be easily converted to UL. 
------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6703 From roland at openjdk.java.net Mon Dec 6 08:09:15 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 08:09:15 GMT Subject: RFR: 8275610: C2: Object field load floats above its null check resulting in a segfault In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 14:00:39 GMT, Christian Hagedorn wrote: > In the test case, a field load of an object floats above the object null check due to a `CastPP` that gets separated from its null check `If` node. C2 then schedules the field load before the object null check which results in a segfault. > > The problem can be traced back to the elimination of identical back-to-back ifs [(JDK-8143542)](https://bugs.openjdk.java.net/browse/JDK-8143542). This is done in split-if where we detect identical back-to-back ifs in `PhaseIdealLoop::identical_backtoback_ifs()`. We then replace the boolean node of the dominated `If` by a phi to use the split-if optimization to nicely fold away the dominated `If` later in IGVN. This, however, does not update any data nodes that were dependent on the dominated `If` projections. > > In the test case, we have the following graph right before splitting `313 If` (`NC 2`) through `190 Region`: > > ![Screenshot from 2021-12-03 14-08-31](https://user-images.githubusercontent.com/17833009/144608863-a1185bf8-dfa3-4bd5-a0c2-284ea2b27606.png) > `313 If` is dominated by the identical (= both share `308 Bool`) `309 If` (`NC 1`). The bool input for `313 If` is replaced by a phi and the split-if optimization is applied. However, the data nodes dependent on the out projections of the dominated `313 If` (`261 CastPP` in this case) are not processed separately and just end up at the newly inserted regions in split-if. 
In the test case, we get the following graph where `261 CastPP` ends up at the new `334 Region`: > > ![Screenshot from 2021-12-03 14-08-59](https://user-images.githubusercontent.com/17833009/144608891-48adf7a3-9b04-4e6d-bc26-310757b3d596.png) > Loopopts are applied and can remove the `325 CountedLoop` and we find that the `171 RangeCheck` (`RC 2`) is applied on a constant array index 1. > > Now at IGVN, the order in which the nodes are processed is important in order to trigger the segfault: > 1. `334 Region` with `332 If` and `333 If` are removed because of the special split-if setup we used to remove the identical `313 If`. The control input of `261 CastPP` is therefore updated to `172 IfTrue`. > 2. Applying `RangeCheck::Ideal()` for `171 RangeCheck` finds that `305 RangeCheck` (`RC 1`) already covers it and we remove `171 RangeCheck`. In this process, we rewire `261 CastPP` to the dominating `305 RangeCheck` and we have the following graph: > > ![Screenshot from 2021-12-03 14-09-21](https://user-images.githubusercontent.com/17833009/144608917-f5159b21-40b2-41c8-afab-fa495433217a.png) > `261 CastPP` - and also the field `263 LoadI` - have now `306 IfFalse` as early control. GCM is then scheduling `263 LoadI` before the null check `309 If` and we get a segfault. > > An easy fix is not straight forward. What we actually would want to do is rewiring `261 CastPP` from `334 Region` to `311 IfTrue` in the second graph after split-if to not separate it from the null check. But that's not possible because we would create a bad graph: The early control `311 IfTrue` of `261 CastPP` does not dominate its late control further down after `334 Region` because of the not yet removed `334 Region`. We would need to already clean the regions up and then do the rewiring. But then the question arises why to use the split-if optimization in the first place when we do not want to rely on IGVN to clean it up. 
> > I therefore suggest to go with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas how to do that. > > I ran some standard benchmarks and did not see any performance regressions with this fix. > > Thanks, > Christian Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6701 From duke at openjdk.java.net Mon Dec 6 08:27:14 2021 From: duke at openjdk.java.net (Tobias Holenstein) Date: Mon, 6 Dec 2021 08:27:14 GMT Subject: Integrated: JDK-8154011: Make TraceDeoptimization a diagnostic flag In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote: > Make TraceDeoptimization available in a product build. > > I checked that performance is not affected on Aurora. This pull request has now been integrated. Changeset: f39fe5b3 Author: Tobias Holenstein Committer: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/f39fe5b3d629c6d557eb7bab8d1ff81350c616cc Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8154011: Make TraceDeoptimization a diagnostic flag Reviewed-by: kvn, dholmes, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/6703 From thartmann at openjdk.java.net Mon Dec 6 08:44:15 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 6 Dec 2021 08:44:15 GMT Subject: RFR: 8276116: C2: optimize long range checks in int counted loops [v3] In-Reply-To: References: <8bvd-Dtu9tKQVPWEV5lo0Xa7H2X76uVsgf1l6vKm7CM=.836388f2-a117-4f5d-9385-1690c7d0fd74@github.com> Message-ID: On Mon, 29 Nov 2021 10:26:40 GMT, Roland Westrelin wrote: >> Maurizio noticed that some of his panama micro benchmarks don't >> perform better with 8259609 (C2: optimize long range checks in long >> counted loops).
The reason is that 8259609 optimizes long range checks >> in long counted loops but some of his benchmarks include long range >> checks in int counted loops: >> >> for (int i = start; i < stop; i += inc) { >> Objects.checkIndex(scale * ((long)i) + offset, length); >> } >> >> This change applies the transformation from 8259609 for long counted >> loop/long range checks to int counted loop/long range checks. That >> includes creating a loop nest and transforming the long range check to >> an int range check that's subject to range elimination in the inner >> loop. >> >> The reason it's required to create a loop nest is that the long range >> check transformation logic depends on no overflow of scale * i for the >> range of values that the transformed range check is applied to. >> >> As a consequence, this change is mostly refactoring to make the loop >> nest creation and range check transformation parameterized by the type >> of the transformed loop. >> >> I think this transformation needs to be applied as late as possible >> but, in the case of an int counted loop, before pre/main/post loops >> are created. I had to move it to IdealLoopTree::iteration_split_impl() >> because of that. >> >> There's an alternate shape for a long range check in an int counted >> loop that Maurizio insisted needs to be supported: >> >> for (int i = start; i < stop; i += inc) { >> Objects.checkIndex(((long)(scale * i)) + offset, length); >> } >> >> scale * i can overflow in that case. This is also supported but as a >> corner case of the previous one. The code in >> PhaseIdealLoop::transform_long_range_checks() has a comment about >> that. >> >> Note also that this transformation works best if loop strip mining is >> enabled (that is for G1, ZGC, Shenandoah by default). The reason is >> that it needs a safepoint and when loop strip mining is enabled, the >> outer loop contains one that's always available. 
A way to have this >> work as well for all GCs would be to always construct the loop strip >> mining loop nest (whether loop strip mining is enabled or not) and >> then only once loop opts are over remove the outer loop when loop >> strip mining is disabled. I'm looking for feedback on this. >> >> BTW, something doesn't seem right in IdealLoopTree::iteration_split_impl(): >> >> https://github.com/rwestrel/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L3475 >> >> should_peel causes transformations to be skipped but peeling is never >> applied AFAICT. Does it make sense to anyone? > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test fix New round of testing all passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/6576 From roland at openjdk.java.net Mon Dec 6 08:45:16 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 08:45:16 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops In-Reply-To: References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: On Sat, 4 Dec 2021 00:33:04 GMT, Vladimir Kozlov wrote: >> This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is: >> if (((base + (offset << 2)) & 3) != 0) { >> } >> >> where base is loop independent but offset depends on a loop variable. This can be transformed to: >> >> if ((base & 3) != 0) { >> >> That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on). >> >> This change also optimizes the pattern: >> >> (offset << 2) & 3 >> >> to return 0. > > src/hotspot/share/opto/mulnode.cpp line 514: > >> 512: } >> 513: >> 514: return MulINode::Value(phase); > > There is no `MulINode::value()` or `MulLNode::value()`, only the base class `MulNode` has it.
> You are missing the initial check done by `MulNode::value()`. > I suggest to move this code into `AndINode::mul_ring()` and `AndLNode::mul_ring()`. > Also in `mul_ring()` you can pass `r1->get_con()` as `mask` value to `AndIL_shift_and_mask()`. Thanks for reviewing this. MulNode::AndIL_shift_and_mask() needs the shift node so it can test for shift->Opcode() == Op_ConvI2L. mul_ring() only gets the Type* as input. Are you suggesting to change mul_ring's signature? > src/hotspot/share/opto/mulnode.cpp line 1719: > >> 1717: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* mask, Node* shift, BasicType bt) const { >> 1718: if (mask == NULL || shift == NULL) { >> 1719: return false; > > You need to check `shift` for `TOP`. Code below: const TypeInteger* shift_t = phase->type(shift)->isa_integer(bt); if (mask_t == NULL || shift_t == NULL) { catches the case where shift is top, I think. ------------- PR: https://git.openjdk.java.net/jdk/pull/6697 From roland at openjdk.java.net Mon Dec 6 08:48:21 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 08:48:21 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. In-Reply-To: References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: On Mon, 6 Dec 2021 01:51:05 GMT, Eric Liu wrote: >> AddINode::Ideal() and AddLNode::Ideal() are almost identical but the >> same logic had to be duplicated because AddINode::Ideal() tests its >> inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for >> Op_AddL, Op_SubL etc. This patch refactors the code so the common >> logic is in a single method parameterized by a BasicType argument. >> >> The way I've done this before in the context of int/long counted loops >> was to use an extra virtual method operates_on().
So: >> >> n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) >> >> Working on this change made me realize that pattern doesn't work that well: >> >> - it's quite a bit more verbose and converting existing code is not as >> mechanical as we would like to avoid conversion errors. >> >> - it breaks when a class has a subclass. For instance AddNode has >> OrINode and OrLNode as subclasses so testing for n->is_Add() returns >> true with an OrI node. >> >> Instead, this change introduces new functions. For instance of >> AddI/AddL: >> >> int Op_Add(BasicType bt) >> >> that returns either Op_AddI or Op_AddL depending on bt. This made >> refactoring the AddINode::Ideal() logic straightforward. I removed all >> use of operates_on() as well and converted existing code to the new >> Op_XXX() functions. > > src/hotspot/share/opto/addnode.cpp line 280: > >> 278: } >> 279: if (op1 == Op_Sub(bt)) { >> 280: const Type *t_sub1 = phase->type(in1->in(1)); > > I'm not very clear about the current code style which we should follow. Shall we need to align the style in the changed code? Are you commenting about the change from: if( op1 to if (op1 ? The first style is used in a some places and the guideline is to switch to the second one. ------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From duke at openjdk.java.net Mon Dec 6 09:02:18 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Mon, 6 Dec 2021 09:02:18 GMT Subject: Integrated: JDK-8277496 Remove duplication in c1 Block successor lists In-Reply-To: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> References: <4wdDk_HKbqPRXF47C895FeMhJpmrT6rt0aVT8q0miH0=.7b4bcdd0-671a-409b-ad6f-e433a3507e24@github.com> Message-ID: On Tue, 30 Nov 2021 14:14:36 GMT, Ludvig Janiuk wrote: > Remove `BlockBegin::_successors`, leaving `BlockEnd::_sux` as the SSOT for the successors of a block. 
Prior to this PR, these two lists were both tracking the same list of successors of the same block. This necessitated a lot of syncing and verification code. > > With this PR, as long as a block has its end pointer assigned, its successors can always be reached by querying the `BlockEnd`. `BlockEnd::_sux` becomes the single place where the list of successors is maintained. When modified, the successor list no longer needs to be synchronized in two places, reducing complexity and confusion. Asserts on the two lists corresponding no longer need to be made. > > While being created in `GraphBuilder`, `BlockBegin`s don't have a `BlockEnd` assigned yet. To temporarily track block successors in this small interval, add a lookup structure `BlockListBuilder::_bci2block_successors`. > > This PR affects debug printing code. If the end pointer of a `BlockBegin` is NULL for some reason, then the successor list can no longer be printed (for obvious reasons). > > This PR introduces an additional check to IR::verify to check that `BlockBegin::_end` is not set to null. > > This PR also performs some minor refactoring, polishing, inlining, and removing of dead code around the affected areas. > > The commit history has been polished to attempt to guide the reader through the changes. > > hs-tier1 and hs-tier2 tests pass. This pull request has now been integrated.
Changeset: 8d190dd0 Author: Ludvig Janiuk Committer: Nils Eliasson URL: https://git.openjdk.java.net/jdk/commit/8d190dd003c58aa9ebb403e95a73a128af7e8941 Stats: 194 lines in 8 files changed: 89 ins; 73 del; 32 mod 8277496: Remove duplication in c1 Block successor lists Reviewed-by: neliasso, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/6614 From pli at openjdk.java.net Mon Dec 6 09:09:16 2021 From: pli at openjdk.java.net (Pengfei Li) Date: Mon, 6 Dec 2021 09:09:16 GMT Subject: RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE In-Reply-To: References: Message-ID: On Thu, 18 Nov 2021 17:24:18 GMT, Andrew Haley wrote: >> Arraycopy partial inlining is a C2 compiler technique that avoids stub >> call overhead in small-sized arraycopy operations by generating masked >> vector instructions. So far it works on x86 AVX512 only and this patch >> enables it on AArch64 with SVE. >> >> We add AArch64 matching rule for VectorMaskGenNode and refactor that >> node a little bit. The major change is moving the element type field >> into its TypeVectMask bottom type. The reason is that AArch64 vector >> masks are different for different vector element types. >> >> E.g., an x86 AVX512 vector mask value masking 3 least significant vector >> lanes (of any type) is like >> >> `0000 0000 ... 0000 0000 0000 0000 0111` >> >> On AArch64 SVE, this mask value can only be used for masking the 3 least >> significant lanes of bytes. But for 3 lanes of ints, the value should be >> >> `0000 0000 ... 0000 0000 0001 0001 0001` >> >> where the least significant bit of each lane matters. So AArch64 matcher >> needs to know the vector element type to generate right masks. 
>> >> After this patch, the C2 generated code for copying a 50-byte array on >> AArch64 SVE looks like >> >> mov x12, #0x32 >> whilelo p0.b, xzr, x12 >> add x11, x11, #0x10 >> ld1b {z16.b}, p0/z, [x11] >> add x10, x10, #0x10 >> st1b {z16.b}, p0, [x10] >> >> We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on >> both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested >> JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array >> size arguments on a 512-bit SVE-featured CPU. We got below performance >> data changes. >> >> Benchmark (length) (Performance) >> ArrayCopyAligned.testByte 10 -2.6% >> ArrayCopyAligned.testByte 20 +4.7% >> ArrayCopyAligned.testByte 30 +4.8% >> ArrayCopyAligned.testByte 40 +21.7% >> ArrayCopyAligned.testByte 50 +22.5% >> ArrayCopyAligned.testByte 60 +28.4% >> >> The test machine has SVE vector size of 512 bits, so we see performance >> gain for most array sizes less than 64 bytes. For very small arrays we >> see a bit regression because a vector load/store may be a bit slower >> than 1 or 2 scalar loads/stores. > > Hurrah! I have managed to duplicate your results. > > Old: > > Benchmark (length) Mode Cnt Score Error Units > ArrayCopyAligned.testByte 40 avgt 5 23.332 ± 0.016 ns/op > > > New: > > ArrayCopyAligned.testByte 40 avgt 5 18.092 ± 0.093 ns/op > > > ... and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now. Hi @theRealAph , are you still looking at this? I have another big fix which depends on the vector mask change inside this patch. So I hope this can be integrated soon. ------------- PR: https://git.openjdk.java.net/jdk/pull/6444 From eliu at openjdk.java.net Mon Dec 6 09:09:20 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 6 Dec 2021 09:09:20 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode.
In-Reply-To: References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: <_MIPghQgL_eI5AMZArQ_15OPvhqmxaVb5Uee7zWDYng=.5d799d9c-d731-44a5-8a17-80a9ba471071@github.com> On Mon, 6 Dec 2021 08:45:02 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/addnode.cpp line 280: >> >>> 278: } >>> 279: if (op1 == Op_Sub(bt)) { >>> 280: const Type *t_sub1 = phase->type(in1->in(1)); >> >> I'm not very clear about the current code style which we should follow. Shall we need to align the style in the changed code? > > Are you commenting about the change from: > if( op1 to if (op1 > ? > The first style is used in a some places and the guideline is to switch to the second one. How about the style like `const Type *t_sub1`? Whether it should be `const Type* t_sub1`. ------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From shade at openjdk.java.net Mon Dec 6 10:17:52 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 10:17:52 GMT Subject: RFR: 8277893: Arraycopy stress tests [v3] In-Reply-To: References: Message-ID: > I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
> > A brief tour of these tests: > > - Tests all data types; > - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; > - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; > - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; > - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; > > My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. > > Test times: > > > # x86_64 (TR 3970X) > real 6m37.855s > user 56m23.004s > sys 0m20.148s > > # x86_32 (TR 3970X) > real 11m22.877s > user 168m8.137s > sys 5m7.037s > > # x86_64 (i5-11500) > real 15m55.424s > user 118m0.969s > sys 0m12.039s > > # AArch64 (ThunderX2) > real 4m5.177s > user 32m7.295s > sys 0m19.689s > > > Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. > > Additional testing: > - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` > - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` > - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: - Package declarations - Add safety check for small systems - Renames - Single driver for all the tests - Safer timeout settings - Post-merge TEST.groups cleanup - Merge branch 'master' into JDK-8277893-arraycopy-tests - Merge branch 'master' into JDK-8277893-arraycopy-tests - Separate test group and hooks into hotspot_slow_compiler - Trim down MAX_SIZE and explain the choice - ... and 1 more: https://git.openjdk.java.net/jdk/compare/228a50d8...118a3eb2 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6594/files - new: https://git.openjdk.java.net/jdk/pull/6594/files/da7ed51e..118a3eb2 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=01-02 Stats: 10957 lines in 493 files changed: 6701 ins; 2473 del; 1783 mod Patch: https://git.openjdk.java.net/jdk/pull/6594.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6594/head:pull/6594 PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Mon Dec 6 10:56:21 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 10:56:21 GMT Subject: RFR: 8277893: Arraycopy stress tests [v3] In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 10:17:52 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
>> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: > > - Package declarations > - Add safety check for small systems > - Renames > - Single driver for all the tests > - Safer timeout settings > - Post-merge TEST.groups cleanup > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/3fffcc8b...118a3eb2 Yes, definitely GC arraycopy barriers in object array copy cases. Shenandoah cuts the overhead in half by emitting the runtime check on GC state, so it can skip calling to runtime in most cases. ZGC calls to runtime for all object copies. This adds up quite a bit for small object arrays. I looked whether to trim down the array sizes we feed into these tests, but that does not look compelling to me, as these tests were useful with current settings in arraycopy improvements work. The GC specific overheads also make object array tests take a disproportionate amount of time, wrecking parallelism that might have helped to subsume these penalties. So, new work: a) Sets up larger timeouts to cater for slow machines; b) Rewrites the per-type tests to use a single driver, so it can balance over single-type jobs; c) Defaults the stress parallelism to `N_CPU / 4` New test times are updated in PR body. @vnkozlov, would you like to try this in Oracle infra again? 
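For reference, the "small arrays exhaustively" item in the quoted description can be pictured with a minimal Java sketch. This is not the code from the PR: the class name `SmallArrayCopyCheck`, the `MAX_SIZE` value and the helper methods are assumptions made purely for illustration. It checks every (size, srcPos, dstPos, length) combination against a simple reference copy, for both the conjoint case (same array, ranges may overlap) and the disjoint case (separate arrays).

```java
import java.util.Arrays;

// Illustrative sketch only; names and sizes are hypothetical, not taken from the PR.
class SmallArrayCopyCheck {
    // Hypothetical upper bound on the array sizes that are checked exhaustively.
    static final int MAX_SIZE = 16;

    // Builds an array with distinct, non-zero contents so misplaced bytes are visible.
    static byte[] fresh(int size) {
        byte[] a = new byte[size];
        for (int i = 0; i < size; i++) {
            a[i] = (byte) (i + 1);
        }
        return a;
    }

    // Reference copy that goes through a temporary buffer, so it is correct
    // even when src and dst are the same array and the ranges overlap.
    static void referenceCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        byte[] tmp = new byte[len];
        for (int i = 0; i < len; i++) {
            tmp[i] = src[srcPos + i];
        }
        for (int i = 0; i < len; i++) {
            dst[dstPos + i] = tmp[i];
        }
    }

    static void check(byte[] actual, byte[] expected, String what) {
        if (!Arrays.equals(actual, expected)) {
            throw new AssertionError(what + ": expected " + Arrays.toString(expected)
                    + " but got " + Arrays.toString(actual));
        }
    }

    // Returns the number of (size, srcPos, dstPos, len) combinations checked.
    static int run() {
        int checked = 0;
        for (int size = 0; size <= MAX_SIZE; size++) {
            for (int srcPos = 0; srcPos <= size; srcPos++) {
                for (int dstPos = 0; dstPos <= size; dstPos++) {
                    int maxLen = size - Math.max(srcPos, dstPos);
                    for (int len = 0; len <= maxLen; len++) {
                        // Conjoint: source and destination are the same array.
                        byte[] actual = fresh(size);
                        byte[] expected = fresh(size);
                        System.arraycopy(actual, srcPos, actual, dstPos, len);
                        referenceCopy(expected, srcPos, expected, dstPos, len);
                        check(actual, expected, "conjoint");

                        // Disjoint: source and destination are different arrays.
                        byte[] src = fresh(size);
                        byte[] actualDst = new byte[size];
                        byte[] expectedDst = new byte[size];
                        System.arraycopy(src, srcPos, actualDst, dstPos, len);
                        referenceCopy(src, srcPos, expectedDst, dstPos, len);
                        check(actualDst, expectedDst, "disjoint");

                        checked++;
                    }
                }
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("checked " + run() + " combinations");
    }
}
```

The sketch covers only `byte[]`; the description above says the real tests cover all data types and also vary the stub compilation modes and compressed oops settings through VM flags, which a single-process example like this cannot show.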
------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From jiefu at openjdk.java.net Mon Dec 6 11:33:34 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 6 Dec 2021 11:33:34 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 Message-ID: Hi all, compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. Thanks. Best regards, Jie ------------- Commit messages: - 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails after JDK-8154011 Changes: https://git.openjdk.java.net/jdk/pull/6721/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6721&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278291 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6721.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6721/head:pull/6721 PR: https://git.openjdk.java.net/jdk/pull/6721 From aph at openjdk.java.net Mon Dec 6 11:39:11 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 6 Dec 2021 11:39:11 GMT Subject: RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE In-Reply-To: References: Message-ID: <3FSZTQE51oLz9b3VnL8ydkDT8FgT7VcIQhFM6-BVUTQ=.5671f668-df59-48ed-9cef-47de3a5b3c20@github.com> On Thu, 18 Nov 2021 17:24:18 GMT, Andrew Haley wrote: >> Arraycopy partial inlining is a C2 compiler technique that avoids stub >> call overhead in small-sized arraycopy operations by generating masked >> vector instructions. So far it works on x86 AVX512 only and this patch >> enables it on AArch64 with SVE. >> >> We add AArch64 matching rule for VectorMaskGenNode and refactor that >> node a little bit. The major change is moving the element type field >> into its TypeVectMask bottom type. 
The reason is that AArch64 vector >> masks are different for different vector element types. >> >> E.g., an x86 AVX512 vector mask value masking 3 least significant vector >> lanes (of any type) is like >> >> `0000 0000 ... 0000 0000 0000 0000 0111` >> >> On AArch64 SVE, this mask value can only be used for masking the 3 least >> significant lanes of bytes. But for 3 lanes of ints, the value should be >> >> `0000 0000 ... 0000 0000 0001 0001 0001` >> >> where the least significant bit of each lane matters. So AArch64 matcher >> needs to know the vector element type to generate right masks. >> >> After this patch, the C2 generated code for copying a 50-byte array on >> AArch64 SVE looks like >> >> mov x12, #0x32 >> whilelo p0.b, xzr, x12 >> add x11, x11, #0x10 >> ld1b {z16.b}, p0/z, [x11] >> add x10, x10, #0x10 >> st1b {z16.b}, p0, [x10] >> >> We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on >> both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested >> JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array >> size arguments on a 512-bit SVE-featured CPU. We got below performance >> data changes. >> >> Benchmark (length) (Performance) >> ArrayCopyAligned.testByte 10 -2.6% >> ArrayCopyAligned.testByte 20 +4.7% >> ArrayCopyAligned.testByte 30 +4.8% >> ArrayCopyAligned.testByte 40 +21.7% >> ArrayCopyAligned.testByte 50 +22.5% >> ArrayCopyAligned.testByte 60 +28.4% >> >> The test machine has SVE vector size of 512 bits, so we see performance >> gain for most array sizes less than 64 bytes. For very small arrays we >> see a bit regression because a vector load/store may be a bit slower >> than 1 or 2 scalar loads/stores. > > Hurrah! I have managed to duplicate your results. > > Old: > > Benchmark (length) Mode Cnt Score Error Units > ArrayCopyAligned.testByte 40 avgt 5 23.332 ± 0.016 ns/op > > > New: > > ArrayCopyAligned.testByte 40 avgt 5 18.092 ± 0.093 ns/op > > > ...
and in fact your result is much better than this suggests, because the bulk of the test is fetching all of the arguments to arraycopy, not actually copying the bytes. I get it now. > Hi @theRealAph , are you still looking at this? I have another big fix which depends on the vector mask change inside this patch. So I hope this can be integrated soon. I'm quite happy with the AArch64 parts, but I'm not familiar with that part of the C2 compiler. I think you need an additional reviewer, perhaps @rwestrel . ------------- PR: https://git.openjdk.java.net/jdk/pull/6444 From shade at openjdk.java.net Mon Dec 6 11:54:16 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 11:54:16 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 11:27:19 GMT, Jie Fu wrote: > Hi all, > > compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. > Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. > > Thanks. > Best regards, > Jie Seems to me that `IgnoreUnrecognizedVMOptions` should be replaced with `UnlockDiagnosticVMOptions`. The first one made sense when option was `develop` and not recognized in release builds. ------------- PR: https://git.openjdk.java.net/jdk/pull/6721 From roland at openjdk.java.net Mon Dec 6 11:55:34 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 11:55:34 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses Message-ID: Outside the type system code itself, c2 usually assumes that a TypeOopPtr or a TypeKlassPtr's java type is fully represented by its klass(). To have proper support for interfaces, that can't be true as a type needs to be represented by an instance class and a set of interfaces. 
This patch hides the klass() accessor of TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way that makes that code suitable for proper interface support in a subsequent change. This patch doesn't add proper interface support yet and is mostly refactoring. "Mostly" because there are cases where the previous logic would use a ciKlass but the new one works with a TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the klass is exact or not. That extra bit of information can sometimes help and so could result in slightly different decisions. To remove the klass() accessors, the new logic either relies on: - new methods of TypeKlassPtr/TypeInstPtr. For instance, instead of: toop->klass()->is_subtype_of(other_toop->klass()) the new code is: toop->is_java_subtype_of(other_toop) - variants of the klass() accessors for narrower cases like TypeInstPtr::instance_klass() (returns _klass except if _klass is an interface in which case it returns Object), TypeOopPtr::unloaded_klass() (returns _klass but only when the klass is unloaded), TypeOopPtr::exact_klass() (returns _klass but only when the type is exact). When I tested this patch, for most changes in this patch, I had the previous logic, the new logic and a check that verified that they return the same result. I ran as much testing as I could that way.
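To picture why a class-only comparison stops being sufficient once a type is "an instance class and a set of interfaces", here is a small model in Java. It is an assumption-laden illustration, not HotSpot code: the class `InterfaceAwareType` and its methods are hypothetical, and the subtype rule shown is only one plausible reading of what a check such as `is_java_subtype_of()` would need to consider once interfaces are tracked.

```java
import java.util.Set;

// Hypothetical model, not HotSpot's C++: a type is an instance class plus a set of interfaces.
class InterfaceAwareType {
    final Class<?> instanceKlass;
    final Set<Class<?>> interfaces;

    InterfaceAwareType(Class<?> instanceKlass, Set<Class<?>> interfaces) {
        this.instanceKlass = instanceKlass;
        this.interfaces = interfaces;
    }

    // Class-only comparison, in the spirit of comparing two klass() results.
    boolean klassOnlySubtypeOf(InterfaceAwareType other) {
        return other.instanceKlass.isAssignableFrom(instanceKlass);
    }

    // Comparison that also requires every interface of 'other' to be provided,
    // either by this type's instance class or by one of its tracked interfaces.
    boolean isJavaSubtypeOf(InterfaceAwareType other) {
        if (!other.instanceKlass.isAssignableFrom(instanceKlass)) {
            return false;
        }
        for (Class<?> required : other.interfaces) {
            boolean satisfied = required.isAssignableFrom(instanceKlass);
            for (Class<?> have : interfaces) {
                satisfied |= required.isAssignableFrom(have);
            }
            if (!satisfied) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        InterfaceAwareType plainObject = new InterfaceAwareType(Object.class, Set.of());
        InterfaceAwareType runnable = new InterfaceAwareType(Object.class, Set.of(Runnable.class));
        // Both types have Object as instance class, so the class-only check cannot tell them apart.
        System.out.println(plainObject.klassOnlySubtypeOf(runnable)); // true
        System.out.println(plainObject.isJavaSubtypeOf(runnable));    // false
    }
}
```

In this model an interface-typed value and a plain `Object` share the same instance class, which mirrors the description of `instance_klass()` returning Object for an interface, and is why callers would be steered toward a type-level query rather than a raw class comparison.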
------------- Commit messages: - whitespaces - remove klass accessor Changes: https://git.openjdk.java.net/jdk/pull/6717/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6717&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8275201 Stats: 1216 lines in 33 files changed: 556 ins; 175 del; 485 mod Patch: https://git.openjdk.java.net/jdk/pull/6717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6717/head:pull/6717 PR: https://git.openjdk.java.net/jdk/pull/6717 From roland at openjdk.java.net Mon Dec 6 12:00:15 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 12:00:15 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. In-Reply-To: <_MIPghQgL_eI5AMZArQ_15OPvhqmxaVb5Uee7zWDYng=.5d799d9c-d731-44a5-8a17-80a9ba471071@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> <_MIPghQgL_eI5AMZArQ_15OPvhqmxaVb5Uee7zWDYng=.5d799d9c-d731-44a5-8a17-80a9ba471071@github.com> Message-ID: <0cP3JuZIqQMFgv3hh8LBlVJAU53Hqr1BRtBLYXy5H-c=.604e5289-3f82-451c-98df-259ceaec5305@github.com> On Mon, 6 Dec 2021 09:05:23 GMT, Eric Liu wrote: >> Are you commenting about the change from: >> if( op1 to if (op1 >> ? >> The first style is used in a some places and the guideline is to switch to the second one. > > How about the style like `const Type *t_sub1`? Whether it should be `const Type* t_sub1`. It should be const Type* t_sub1. I will fix it. Thanks. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From jiefu at openjdk.java.net Mon Dec 6 12:05:46 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 6 Dec 2021 12:05:46 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 [v2] In-Reply-To: References: Message-ID: > Hi all, > > compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. > Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Remove unnecessary IgnoreUnrecognizedVMOptions ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6721/files - new: https://git.openjdk.java.net/jdk/pull/6721/files/3abab536..e5cf2812 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6721&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6721&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6721.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6721/head:pull/6721 PR: https://git.openjdk.java.net/jdk/pull/6721 From jiefu at openjdk.java.net Mon Dec 6 12:05:48 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 6 Dec 2021 12:05:48 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 11:51:03 GMT, Aleksey Shipilev wrote: > Seems to me that `IgnoreUnrecognizedVMOptions` should be replaced with `UnlockDiagnosticVMOptions`. The first one made sense when option was `develop` and not recognized in release builds. Good catch! Updated. Thanks @shipilev . 
------------- PR: https://git.openjdk.java.net/jdk/pull/6721 From thartmann at openjdk.java.net Mon Dec 6 12:09:18 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 6 Dec 2021 12:09:18 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 12:05:46 GMT, Jie Fu wrote: >> Hi all, >> >> compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. >> Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary IgnoreUnrecognizedVMOptions Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6721 From shade at openjdk.java.net Mon Dec 6 12:13:12 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 6 Dec 2021 12:13:12 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 12:05:46 GMT, Jie Fu wrote: >> Hi all, >> >> compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. >> Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary IgnoreUnrecognizedVMOptions Looks fine. Please test both `release` and `fastdebug` builds before pushing. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6721 From pli at openjdk.java.net Mon Dec 6 12:14:08 2021 From: pli at openjdk.java.net (Pengfei Li) Date: Mon, 6 Dec 2021 12:14:08 GMT Subject: RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE In-Reply-To: References: Message-ID: On Thu, 18 Nov 2021 03:50:45 GMT, Pengfei Li wrote: > Arraycopy partial inlining is a C2 compiler technique that avoids stub > call overhead in small-sized arraycopy operations by generating masked > vector instructions. So far it works on x86 AVX512 only and this patch > enables it on AArch64 with SVE. > > We add AArch64 matching rule for VectorMaskGenNode and refactor that > node a little bit. The major change is moving the element type field > into its TypeVectMask bottom type. The reason is that AArch64 vector > masks are different for different vector element types. > > E.g., an x86 AVX512 vector mask value masking 3 least significant vector > lanes (of any type) is like > > `0000 0000 ... 0000 0000 0000 0000 0111` > > On AArch64 SVE, this mask value can only be used for masking the 3 least > significant lanes of bytes. But for 3 lanes of ints, the value should be > > `0000 0000 ... 0000 0000 0001 0001 0001` > > where the least significant bit of each lane matters. So AArch64 matcher > needs to know the vector element type to generate right masks. > > After this patch, the C2 generated code for copying a 50-byte array on > AArch64 SVE looks like > > mov x12, #0x32 > whilelo p0.b, xzr, x12 > add x11, x11, #0x10 > ld1b {z16.b}, p0/z, [x11] > add x10, x10, #0x10 > st1b {z16.b}, p0, [x10] > > We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on > both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested > JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array > size arguments on a 512-bit SVE-featured CPU. We got below performance > data changes. 
> > Benchmark (length) (Performance) > ArrayCopyAligned.testByte 10 -2.6% > ArrayCopyAligned.testByte 20 +4.7% > ArrayCopyAligned.testByte 30 +4.8% > ArrayCopyAligned.testByte 40 +21.7% > ArrayCopyAligned.testByte 50 +22.5% > ArrayCopyAligned.testByte 60 +28.4% > > The test machine has SVE vector size of 512 bits, so we see performance > gain for most array sizes less than 64 bytes. For very small arrays we > see a bit regression because a vector load/store may be a bit slower > than 1 or 2 scalar loads/stores. Thanks Andrew! Can any reviewer help look at the C2 mid-end part? ------------- PR: https://git.openjdk.java.net/jdk/pull/6444 From rkennke at openjdk.java.net Mon Dec 6 12:14:15 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Mon, 6 Dec 2021 12:14:15 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> Message-ID: On Thu, 2 Dec 2021 14:41:53 GMT, Roman Kennke wrote: >> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. >> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. >> >> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). 
I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work. >> >> Testing: >> - [x] tier1 >> - [x] tier2 >> - [x] tier3 >> - [ ] tier4 > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > PPC port by @TheRealMDoerr Can I get more reviews for this PR before the JDK18 window closes? I suggest not waiting for arm or s390 ports for now. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From roland at openjdk.java.net Mon Dec 6 13:01:46 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 6 Dec 2021 13:01:46 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. [v2] In-Reply-To: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com> > AddINode::Ideal() and AddLNode::Ideal() are almost identical but the > same logic had to be duplicated because AddINode::Ideal() tests its > inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for > Op_AddL, Op_SubL etc. This patch refactors the code so the common > logic is in a single method parameterized by a BasicType argument. > > The way I've done this before in the context of int/long counted loops > was to use an extra virtual method operates_on(). So: > > n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) > > Working on this change made me realize that pattern doesn't work that well: > > - it's quite a bit more verbose and converting existing code is not as > mechanical as we would like to avoid conversion errors. > > - it breaks when a class has a subclass.
For instance AddNode has > OrINode and OrLNode as subclasses so testing for n->is_Add() returns > true with an OrI node. > > Instead, this change introduces new functions. For instance of > AddI/AddL: > > int Op_Add(BasicType bt) > > that returns either Op_AddI or Op_AddL depending on bt. This made > refactoring the AddINode::Ideal() logic straightforward. I removed all > use of operates_on() as well and converted existing code to the new > Op_XXX() functions. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - T *p to -> T* p - Merge branch 'master' into JDK-8262341 - fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6607/files - new: https://git.openjdk.java.net/jdk/pull/6607/files/5160efe2..22292277 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6607&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6607&range=00-01 Stats: 11456 lines in 535 files changed: 7400 ins; 1974 del; 2082 mod Patch: https://git.openjdk.java.net/jdk/pull/6607.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6607/head:pull/6607 PR: https://git.openjdk.java.net/jdk/pull/6607 From mdoerr at openjdk.java.net Mon Dec 6 13:37:12 2021 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 6 Dec 2021 13:37:12 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> Message-ID: On Thu, 2 Dec 2021 14:41:53 GMT, Roman Kennke wrote: >> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. 
However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. >> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. >> >> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work. >> >> Testing: >> - [x] tier1 >> - [x] tier2 >> - [x] tier3 >> - [ ] tier4 > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > PPC port by @TheRealMDoerr This version LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6320 From jiefu at openjdk.java.net Mon Dec 6 13:50:17 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 6 Dec 2021 13:50:17 GMT Subject: RFR: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 [v2] In-Reply-To: References: Message-ID: <3BmYufvXM4iuyeWpog5ruvrFIllyOJdL7ciVq1vjNoY=.96dce167-984f-4b2b-b2c8-ff7e78682830@github.com> On Mon, 6 Dec 2021 12:09:46 GMT, Aleksey Shipilev wrote: > Looks fine. Please test both `release` and `fastdebug` builds before pushing. Thanks @TobiHartmann and @shipilev . Both passed on Linux/x64. So integrate it.
------------- PR: https://git.openjdk.java.net/jdk/pull/6721 From jiefu at openjdk.java.net Mon Dec 6 13:50:17 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 6 Dec 2021 13:50:17 GMT Subject: Integrated: 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 11:27:19 GMT, Jie Fu wrote: > Hi all, > > compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 due 'TraceDeoptimization' is diagnostic and must be enabled via -XX:+UnlockDiagnosticVMOptions. > Let's add `-XX:+UnlockDiagnosticVMOptions` to fix it. > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 6994d809 Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/6994d809371e80c1e24cd296c48c7f75886577b7 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8278291: compiler/uncommontrap/TraceDeoptimizationNoRealloc.java fails with release VMs after JDK-8154011 Reviewed-by: shade, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/6721 From duke at openjdk.java.net Mon Dec 6 14:27:34 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Mon, 6 Dec 2021 14:27:34 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" Message-ID: Hi, This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. While working on this patch, I spot some regressions regarding compilation on AVX1. Thank you very much.
------------- Commit messages: - grammar in comment - missing copyright - add comments - vector reshape compiler tests - Merge branch 'master' into vectorReshapeTests - initial commit Changes: https://git.openjdk.java.net/jdk/pull/6724/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8259610 Stats: 4097 lines in 15 files changed: 4097 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6724.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6724/head:pull/6724 PR: https://git.openjdk.java.net/jdk/pull/6724 From chagedorn at openjdk.java.net Mon Dec 6 14:51:24 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 6 Dec 2021 14:51:24 GMT Subject: RFR: 8275610: C2: Object field load floats above its null check resulting in a segfault In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 14:00:39 GMT, Christian Hagedorn wrote: > In the test case, a field load of an object floats above the object null check due to a `CastPP` that gets separated from its null check `If` node. C2 then schedules the field load before the object null check which results in a segfault. > > The problem can be traced back to the elimination of identical back-to-back ifs [(JDK-8143542)](https://bugs.openjdk.java.net/browse/JDK-8143542). This is done in split-if where we detect identical back-to-back ifs in `PhaseIdealLoop::identical_backtoback_ifs()`. We then replace the boolean node of the dominated `If` by a phi to use the split-if optimization to nicely fold away the dominated `If` later in IGVN. This, however, does not update any data nodes that were dependent on the dominated `If` projections. 
> > In the test case, we have the following graph right before splitting `313 If` (`NC 2`) through `190 Region`: > > ![Screenshot from 2021-12-03 14-08-31](https://user-images.githubusercontent.com/17833009/144608863-a1185bf8-dfa3-4bd5-a0c2-284ea2b27606.png) > `313 If` is dominated by the identical (= both share `308 Bool`) `309 If` (`NC 1`). The bool input for `313 If` is replaced by a phi and the split-if optimization is applied. However, the data nodes dependent on the out projections of the dominated `313 If` (`261 CastPP` in this case) are not processed separately and just end up at the newly inserted regions in split-if. In the test case, we get the following graph where `261 CastPP` ends up at the new `334 Region`: > > ![Screenshot from 2021-12-03 14-08-59](https://user-images.githubusercontent.com/17833009/144608891-48adf7a3-9b04-4e6d-bc26-310757b3d596.png) > Loopopts are applied and can remove the `325 CountedLoop` and we find that the `171 RangeCheck` (`RC 2`) is applied on a constant array index 1. > > Now at IGVN, the order in which the nodes are processed is important in order to trigger the segfault: > 1. `334 Region` with `332 If` and `333 If` are removed because of the special split-if setup we used to remove the identical `313 If`. The control input of `261 CastPP` is therefore updated to `172 IfTrue`. > 2. Applying `RangeCheck::Ideal()` for `171 RangeCheck` finds that `305 RangeCheck` (`RC 1`) already covers it and we remove `171 RangeCheck`. In this process, we rewire `261 CastPP` to the dominating `305 RangeCheck` and we have the following graph: > > ![Screenshot from 2021-12-03 14-09-21](https://user-images.githubusercontent.com/17833009/144608917-f5159b21-40b2-41c8-afab-fa495433217a.png) > `261 CastPP` - and also the field `263 LoadI` - have now `306 IfFalse` as early control. GCM is then scheduling `263 LoadI` before the null check `309 If` and we get a segfault. > > An easy fix is not straight forward. 
What we actually would want to do is rewiring `261 CastPP` from `334 Region` to `311 IfTrue` in the second graph after split-if to not separate it from the null check. But that's not possible because we would create a bad graph: The early control `311 IfTrue` of `261 CastPP` does not dominate its late control further down after `334 Region` because of the not yet removed `334 Region`. We would need to already clean the regions up and then do the rewiring. But then the question arises why to use the split-if optimization in the first place when we do not want to rely on IGVN to clean it up. > > I therefore suggest to go with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas how to do that. > > I ran some standard benchmarks and did not see any performance regressions with this fix. > > Thanks, > Christian Thanks Roland for your review! ------------- PR: https://git.openjdk.java.net/jdk/pull/6701 From chagedorn at openjdk.java.net Mon Dec 6 14:51:26 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 6 Dec 2021 14:51:26 GMT Subject: Integrated: 8275610: C2: Object field load floats above its null check resulting in a segfault In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 14:00:39 GMT, Christian Hagedorn wrote: > In the test case, a field load of an object floats above the object null check due to a `CastPP` that gets separated from its null check `If` node. C2 then schedules the field load before the object null check which results in a segfault. > > The problem can be traced back to the elimination of identical back-to-back ifs [(JDK-8143542)](https://bugs.openjdk.java.net/browse/JDK-8143542). This is done in split-if where we detect identical back-to-back ifs in `PhaseIdealLoop::identical_backtoback_ifs()`. 
We then replace the boolean node of the dominated `If` by a phi to use the split-if optimization to nicely fold away the dominated `If` later in IGVN. This, however, does not update any data nodes that were dependent on the dominated `If` projections. > > In the test case, we have the following graph right before splitting `313 If` (`NC 2`) through `190 Region`: > > ![Screenshot from 2021-12-03 14-08-31](https://user-images.githubusercontent.com/17833009/144608863-a1185bf8-dfa3-4bd5-a0c2-284ea2b27606.png) > `313 If` is dominated by the identical (= both share `308 Bool`) `309 If` (`NC 1`). The bool input for `313 If` is replaced by a phi and the split-if optimization is applied. However, the data nodes dependent on the out projections of the dominated `313 If` (`261 CastPP` in this case) are not processed separately and just end up at the newly inserted regions in split-if. In the test case, we get the following graph where `261 CastPP` ends up at the new `334 Region`: > > ![Screenshot from 2021-12-03 14-08-59](https://user-images.githubusercontent.com/17833009/144608891-48adf7a3-9b04-4e6d-bc26-310757b3d596.png) > Loopopts are applied and can remove the `325 CountedLoop` and we find that the `171 RangeCheck` (`RC 2`) is applied on a constant array index 1. > > Now at IGVN, the order in which the nodes are processed is important in order to trigger the segfault: > 1. `334 Region` with `332 If` and `333 If` are removed because of the special split-if setup we used to remove the identical `313 If`. The control input of `261 CastPP` is therefore updated to `172 IfTrue`. > 2. Applying `RangeCheck::Ideal()` for `171 RangeCheck` finds that `305 RangeCheck` (`RC 1`) already covers it and we remove `171 RangeCheck`. 
In this process, we rewire `261 CastPP` to the dominating `305 RangeCheck` and we have the following graph: > > ![Screenshot from 2021-12-03 14-09-21](https://user-images.githubusercontent.com/17833009/144608917-f5159b21-40b2-41c8-afab-fa495433217a.png) > `261 CastPP` - and also the field `263 LoadI` - have now `306 IfFalse` as early control. GCM is then scheduling `263 LoadI` before the null check `309 If` and we get a segfault. > > An easy fix is not straight forward. What we actually would want to do is rewiring `261 CastPP` from `334 Region` to `311 IfTrue` in the second graph after split-if to not separate it from the null check. But that's not possible because we would create a bad graph: The early control `311 IfTrue` of `261 CastPP` does not dominate its late control further down after `334 Region` because of the not yet removed `334 Region`. We would need to already clean the regions up and then do the rewiring. But then the question arises why to use the split-if optimization in the first place when we do not want to rely on IGVN to clean it up. > > I therefore suggest to go with an easy bailout fix for JDK 18 where we do not apply this identical back-to-back if removal optimization if there are data dependencies and rework this in an RFE for JDK 19. Roland already has some ideas how to do that. > > I ran some standard benchmarks and did not see any performance regressions with this fix. > > Thanks, > Christian This pull request has now been integrated. 
Changeset: 7c6f57fc Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/7c6f57fcb1f1fcecf26f7b8046a5a41ca6d9c315 Stats: 112 lines in 2 files changed: 112 ins; 0 del; 0 mod 8275610: C2: Object field load floats above its null check resulting in a segfault Reviewed-by: kvn, roland ------------- PR: https://git.openjdk.java.net/jdk/pull/6701 From phedlin at openjdk.java.net Mon Dec 6 16:16:32 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Mon, 6 Dec 2021 16:16:32 GMT Subject: RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 Message-ID: Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. Implementation focusing on balance between small footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check. Testing: tier1-6 Benchmarks (ran on Aurora/Ampere Altra): openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII..........72.23% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5...........70.38% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15....67.81% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16......... 3.72% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8..........68.50% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ASCII...........65.59% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:BIG5............60.59% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ISO_8859_15.....63.79% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_16.......... 
1.04% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_8...........63.33% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ASCII............57.25% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:BIG5.............49.33% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ISO_8859_15......61.37% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_16........... 0.02% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_8............54.75% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII............54.52% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5.............40.41% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15......58.46% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16...........-0.55% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8............55.98% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII............47.37% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5.............36.41% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15......50.83% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16........... 8.63% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8............48.95% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII.............17.55% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5..............18.58% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15.......20.82% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16............ 
4.16% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8.............18.44% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII.............21.96% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5..............22.42% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15.......30.27% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16............-1.17% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8.............35.99% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII............. 6.19% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5.............. 7.34% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15....... 8.34% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16............-0.46% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8............. 6.80% ------------- Commit messages: - Removing old implementation of encode_iso_array(). - Interleaved ISO and ASCII check code. Using post inc in main loop. 
- 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 Changes: https://git.openjdk.java.net/jdk/pull/6723/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6723&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8274243 Stats: 218 lines in 5 files changed: 89 ins; 92 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/6723.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6723/head:pull/6723 PR: https://git.openjdk.java.net/jdk/pull/6723 From phedlin at openjdk.java.net Mon Dec 6 16:16:33 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Mon, 6 Dec 2021 16:16:33 GMT Subject: RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:09:07 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > Implementation focusing on balance between small footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check. > > Testing: tier1-6 > > Benchmarks (ran on Aurora/Ampere Altra): > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII..........72.23% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5...........70.38% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15....67.81% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16......... 3.72% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8..........68.50% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ASCII...........65.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:BIG5............60.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ISO_8859_15.....63.79% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_16.......... 
1.04% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_8...........63.33% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ASCII............57.25% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:BIG5.............49.33% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ISO_8859_15......61.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_16........... 0.02% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_8............54.75% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII............54.52% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5.............40.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15......58.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16...........-0.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8............55.98% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII............47.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5.............36.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15......50.83% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16........... 8.63% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8............48.95% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII.............17.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5..............18.58% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15.......20.82% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16............ 
4.16%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8.............18.44%
>
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII.............21.96%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5..............22.42%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15.......30.27%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16............-1.17%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8.............35.99%
>
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII............. 6.19%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5.............. 7.34%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15....... 8.34%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16............-0.46%
> openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8............. 6.80%

Current implementation (including prefetch hint).

Benchmark (charsetName) (message) (timesToAppend) Mode Cnt Score Error Units
EncoderBenchmarks.charsetEncoder UTF-8 This is a simple ASCII message 3 avgt 4 144.807 ± 9.557 ns/op
EncoderBenchmarks.charsetEncoder UTF-8 This is a message with unicode ?? 3 avgt 4 378.458 ± 206.193 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a simple ASCII message 3 avgt 4 200.844 ± 14.998 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a message with unicode ?? 3 avgt 4 356.589 ± 8.588 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a simple ASCII message 3 avgt 4 698.518 ± 17.269 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a message with unicode ?? 3 avgt 4 862.678 ± 30.872 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a simple ASCII message 3 avgt 4 109.413 ± 2.780 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a message with unicode ?? 3 avgt 4 522.050 ± 34.763 ns/op

**-XX:SoftwarePrefetchHintDistance=128**

Benchmark (charsetName) (message) (timesToAppend) Mode Cnt Score Error Units
EncoderBenchmarks.charsetEncoder UTF-8 This is a simple ASCII message 3 avgt 4 144.519 ± 12.731 ns/op
EncoderBenchmarks.charsetEncoder UTF-8 This is a message with unicode ?? 3 avgt 4 302.409 ± 51.020 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a simple ASCII message 3 avgt 4 201.144 ± 14.624 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a message with unicode ?? 3 avgt 4 469.724 ± 4.871 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a simple ASCII message 3 avgt 4 695.666 ± 22.061 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a message with unicode ?? 3 avgt 4 858.812 ± 22.913 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a simple ASCII message 3 avgt 4 109.598 ± 1.921 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a message with unicode ?? 3 avgt 4 511.589 ± 34.407 ns/op

New implementation (disregards prefetch hint).

Benchmark (charsetName) (message) (timesToAppend) Mode Cnt Score Error Units
EncoderBenchmarks.charsetEncoder UTF-8 This is a simple ASCII message 3 avgt 4 116.916 ± 14.340 ns/op
EncoderBenchmarks.charsetEncoder UTF-8 This is a message with unicode ?? 3 avgt 4 292.334 ± 10.038 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a simple ASCII message 3 avgt 4 178.490 ± 11.258 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a message with unicode ?? 3 avgt 4 363.741 ± 13.080 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a simple ASCII message 3 avgt 4 695.520 ± 20.217 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a message with unicode ?? 3 avgt 4 862.785 ± 13.673 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a simple ASCII message 3 avgt 4 108.263 ± 6.123 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a message with unicode ?? 3 avgt 4 516.571 ± 22.347 ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/6723

From psandoz at openjdk.java.net Mon Dec 6 17:07:16 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Mon, 6 Dec 2021 17:07:16 GMT
Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert"
In-Reply-To:
References:
Message-ID:

On Mon, 6 Dec 2021 14:21:24 GMT, Mai Đặng Quân Anh wrote:

> Hi,
>
> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes.
>
> While working on this patch, I spot some regressions regarding compilation on AVX1.
>
> Thank you very much.

Nice work. I need to look at this more closely, but this is the kind of thing i was hoping might be possible with the C2 IR test framework.

May i suggest you request to become an [author](https://openjdk.java.net/projects/#project-author)? Then you will have access to the [issue tracker](https://bugs.openjdk.java.net), and further it's one step in the process to becoming a committer, which is definitely achievable if you keep up the current rate of contributions.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6724

From eliu at openjdk.java.net Mon Dec 6 17:14:13 2021
From: eliu at openjdk.java.net (Eric Liu)
Date: Mon, 6 Dec 2021 17:14:13 GMT
Subject: RFR: 8262341: Refine identical code in AddI/LNode.
[v2]
In-Reply-To: <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com>
References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com>
 <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com>
Message-ID:

On Mon, 6 Dec 2021 13:01:46 GMT, Roland Westrelin wrote:

>> AddINode::Ideal() and AddLNode::Ideal() are almost identical but the
>> same logic had to be duplicated because AddINode::Ideal() tests its
>> inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for
>> Op_AddL, Op_SubL etc. This patch refactors the code so the common
>> logic is in a single method parameterized by a BasicType argument.
>>
>> The way I've done this before in the context of int/long counted loops
>> was to use an extra virtual method operates_on(). So:
>>
>> n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT)
>>
>> Working on this change made me realize that pattern doesn't work that well:
>>
>> - it's quite a bit more verbose and converting existing code is not as
>> mechanical as we would like to avoid conversion errors.
>>
>> - it breaks when a class has a subclass. For instance AddNode has
>> OrINode and OrLNode as subclasses so testing for n->is_Add() returns
>> true with an OrI node.
>>
>> Instead, this change introduces new functions. For instance, for
>> AddI/AddL:
>>
>> int Op_Add(BasicType bt)
>>
>> that returns either Op_AddI or Op_AddL depending on bt. This made
>> refactoring the AddINode::Ideal() logic straightforward. I removed all
>> use of operates_on() as well and converted existing code to the new
>> Op_XXX() functions.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - T *p to -> T* p > - Merge branch 'master' into JDK-8262341 > - fix LGTM. ------------- Marked as reviewed by eliu (Author). PR: https://git.openjdk.java.net/jdk/pull/6607 From redestad at openjdk.java.net Mon Dec 6 17:15:20 2021 From: redestad at openjdk.java.net (Claes Redestad) Date: Mon, 6 Dec 2021 17:15:20 GMT Subject: RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:09:07 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > Implementation focusing on balance between small footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check. > > Testing: tier1-6 > > Benchmarks (ran on Aurora/Ampere Altra): > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII..........72.23% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5...........70.38% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15....67.81% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16......... 3.72% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8..........68.50% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ASCII...........65.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:BIG5............60.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ISO_8859_15.....63.79% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_16.......... 
1.04% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_8...........63.33% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ASCII............57.25% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:BIG5.............49.33% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ISO_8859_15......61.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_16........... 0.02% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_8............54.75% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII............54.52% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5.............40.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15......58.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16...........-0.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8............55.98% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII............47.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5.............36.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15......50.83% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16........... 8.63% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8............48.95% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII.............17.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5..............18.58% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15.......20.82% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16............ 
4.16% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8.............18.44% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII.............21.96% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5..............22.42% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15.......30.27% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16............-1.17% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8.............35.99% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII............. 6.19% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5.............. 7.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15....... 8.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16............-0.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8............. 6.80% Great to see this come along, thanks! I can't review the code, but I think it'd be good to collect benchmark scores for `ISO-8859-1` before and after, along with absolute numbers so we see how things stack up against the existing ISO-8859-1-only intrinsic. ------------- PR: https://git.openjdk.java.net/jdk/pull/6723 From kvn at openjdk.java.net Mon Dec 6 17:20:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 6 Dec 2021 17:20:16 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops In-Reply-To: References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: On Mon, 6 Dec 2021 08:42:09 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/mulnode.cpp line 1719: >> >>> 1717: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* mask, Node* shift, BasicType bt) const { >>> 1718: if (mask == NULL || shift == NULL) { >>> 1719: return false; >> >> You need to check `shift` for `TOP`. 
> > Code below: > > const TypeInteger* shift_t = phase->type(shift)->isa_integer(bt); > if (mask_t == NULL || shift_t == NULL) { > > catches the case where shift is top, I think. You are right. It is type check. ------------- PR: https://git.openjdk.java.net/jdk/pull/6697 From jbhateja at openjdk.java.net Mon Dec 6 17:44:01 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 6 Dec 2021 17:44:01 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v3] In-Reply-To: References: Message-ID: <4eJ7dJ21N_lmAxD-2q8yEl4S-aeKhZ3Z1Dx7huOihxg=.655f9b0d-b681-49b9-9d7c-5399ecbf7e09@github.com> > Summary of changes: > > 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. > 2) X86 backend support for AVX512 and AVX2 targets. > 3) New IR transformation to handle following patterns:- > a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) > b) Long2Mask + Mask2Long -> Long > 4) Following performance data is collected for new JMH micro included with the patch:- > > System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor > -- | -- | -- | -- | -- | -- | -- > MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 > MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 > MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 > MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 > MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969 > 
MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 > MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 > MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 > MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 > MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 > MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 > MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 > MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 > MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 > MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 > MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 > > > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8277997: Review comments resolved. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6646/files - new: https://git.openjdk.java.net/jdk/pull/6646/files/a1d3f019..1826c9e9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6646&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6646&range=01-02 Stats: 40 lines in 36 files changed: 0 ins; 3 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/6646.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6646/head:pull/6646 PR: https://git.openjdk.java.net/jdk/pull/6646 From jbhateja at openjdk.java.net Mon Dec 6 17:44:03 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 6 Dec 2021 17:44:03 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v2] In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 20:44:49 GMT, Paul Sandoz wrote: > The coerced terms comes from representing the value to convert (one bit, byte, float, long, etc) to a mask or vector as a set of bits held in a long value. > > Thus having an extra mode `MODE_BITS_COERCED_BROADCAST` is confusing in that regard and i think you can just reuse `MODE_BROADCAST` when broadcast the 1 bit to a mask. Since the class argument determines whether we are referring to a vector or not, as determined by `is_mask`. > > Thus i would retain the existing `scalar2vector` boolean argument, thereby the mode is localized just to the intrinsic. Comments addressed. 
-------------

PR: https://git.openjdk.java.net/jdk/pull/6646

From kvn at openjdk.java.net Mon Dec 6 17:46:14 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Dec 2021 17:46:14 GMT
Subject: RFR: 8277850: C2: optimize mask checks in counted loops
In-Reply-To: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com>
References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com>
Message-ID: <-kSV6-zs3gp7oFCTaTvkJlKOyjAMYVgfeGYXrn5aDfg=.1ada96a3-cacd-4512-9c8b-d069c41afb43@github.com>

On Fri, 3 Dec 2021 10:04:58 GMT, Roland Westrelin wrote:

> This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is:
> if (((base + (offset << 2)) & 3) != 0) {
> }
>
> where base is loop independent but offset depends on a loop variable. This can be transformed to:
>
> if ((base & 3) != 0) {
>
> That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on).
>
> This change also optimizes the pattern:
>
> (offset << 2) & 3
>
> to return 0.

src/hotspot/share/opto/mulnode.hpp line 86:

> 84: static MulNode* make(Node* in1, Node* in2, BasicType bt);
> 85:
> 86: bool AndIL_shift_and_mask(PhaseGVN* phase, Node* mask, Node* shift, BasicType bt) const;

This one could be a static method.
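
The identity behind the transformation quoted above can be checked outside the JIT. The following is a minimal Java sketch, not the C2 implementation: the class name `MaskCheckIdentity` and its methods are hypothetical and exist only for illustration. It relies on the arithmetic fact that `offset << 2` always has its two low bits clear, so adding it to `base` cannot change `base & 3`, even when the addition wraps around.

```java
import java.util.Random;

public class MaskCheckIdentity {
    // Shape of the check as written in the loop body: it depends on the
    // loop variable through offset.
    static boolean misalignedOriginal(long base, long offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    // Loop-independent shape: only base is inspected.
    static boolean misalignedSimplified(long base) {
        return (base & 3) != 0;
    }

    // The second pattern from the description: always zero.
    static long shiftedAndMasked(long offset) {
        return (offset << 2) & 3;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long base = random.nextLong();
            long offset = random.nextLong();
            if (misalignedOriginal(base, offset) != misalignedSimplified(base)) {
                throw new AssertionError("mismatch for base=" + base + ", offset=" + offset);
            }
            if (shiftedAndMasked(offset) != 0) {
                throw new AssertionError("non-zero low bits for offset=" + offset);
            }
        }
        System.out.println("identity holds for all samples");
    }
}
```

The sketch says nothing about when C2 may legally apply the rewrite (for example with a `ConvI2L` between the shift and the add); it only shows why the two checks agree on plain `long` arithmetic.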
-------------

PR: https://git.openjdk.java.net/jdk/pull/6697

From kvn at openjdk.java.net Mon Dec 6 17:46:16 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Dec 2021 17:46:16 GMT
Subject: RFR: 8277850: C2: optimize mask checks in counted loops
In-Reply-To:
References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com>
Message-ID:

On Mon, 6 Dec 2021 08:39:23 GMT, Roland Westrelin wrote:

>> src/hotspot/share/opto/mulnode.cpp line 514:
>>
>>> 512: }
>>> 513:
>>> 514: return MulINode::Value(phase);
>>
>> There is no `MulINode::Value()` or `MulLNode::Value()`, only the base class `MulNode` has it.
>> You are missing the initial check done by `MulNode::Value()`.
>> I suggest moving this code into `AndINode::mul_ring()` and `AndLNode::mul_ring()`.
>> Also in `mul_ring()` you can pass `r1->get_con()` as the `mask` value to `AndIL_shift_and_mask()`.
>
> Thanks for reviewing this.
> MulNode::AndIL_shift_and_mask() needs the shift node so it can test for shift->Opcode() == Op_ConvI2L. mul_ring() only gets the Type* as input. Are you suggesting to change mul_ring's signature?

You are right, I missed that. It is unfortunate - we will have many duplicated checks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6697

From dcubed at openjdk.java.net Mon Dec 6 16:51:13 2021
From: dcubed at openjdk.java.net (Daniel D.Daugherty)
Date: Mon, 6 Dec 2021 16:51:13 GMT
Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11]
In-Reply-To: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com>
References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com>
Message-ID: <4D7syupp5pCK0i_VGPAzbt8DFJOz1TOe2jn1Kjz6O0c=.0ed2ca51-0a95-45e9-a279-68952683ba2d@github.com>

On Thu, 2 Dec 2021 14:41:53 GMT, Roman Kennke wrote:

>> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks.
However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all.
>> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful.
>>
>> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work.
>>
>> Testing:
>> - [x] tier1
>> - [x] tier2
>> - [x] tier3
>> - [ ] tier4
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
> PPC port by @TheRealMDoerr

Thumbs up. I'm going to kick off some Mach5 testing for this change or at least try to...

-------------

Marked as reviewed by dcubed (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6320

From rkennke at openjdk.java.net Mon Dec 6 18:43:23 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Mon, 6 Dec 2021 18:43:23 GMT
Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11]
In-Reply-To: <4D7syupp5pCK0i_VGPAzbt8DFJOz1TOe2jn1Kjz6O0c=.0ed2ca51-0a95-45e9-a279-68952683ba2d@github.com>
References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com>
 <4D7syupp5pCK0i_VGPAzbt8DFJOz1TOe2jn1Kjz6O0c=.0ed2ca51-0a95-45e9-a279-68952683ba2d@github.com>
Message-ID:

On Mon, 6 Dec 2021 16:47:39 GMT, Daniel D. Daugherty wrote:

> Thumbs up.
I'm going to kick off some Mach5 testing for this change or at least try to... Alright, I will wait for it! Thanks for testing! ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From psandoz at openjdk.java.net Mon Dec 6 19:40:13 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 6 Dec 2021 19:40:13 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v3] In-Reply-To: <4eJ7dJ21N_lmAxD-2q8yEl4S-aeKhZ3Z1Dx7huOihxg=.655f9b0d-b681-49b9-9d7c-5399ecbf7e09@github.com> References: <4eJ7dJ21N_lmAxD-2q8yEl4S-aeKhZ3Z1Dx7huOihxg=.655f9b0d-b681-49b9-9d7c-5399ecbf7e09@github.com> Message-ID: On Mon, 6 Dec 2021 17:44:01 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. >> 2) X86 backend support for AVX512 and AVX2 targets. >> 3) New IR transformation to handle following patterns:- >> a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) >> b) Long2Mask + Mask2Long -> Long >> 4) Following performance data is collected for new JMH micro included with the patch:- >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor >> -- | -- | -- | -- | -- | -- | -- >> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 >> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 >> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 >> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 >> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 
36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969 >> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 >> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 >> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 >> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 >> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 >> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 >> MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 >> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 >> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 >> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 >> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 >> >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277997: Review comments resolved. Much better, more clarity i think. Needs HotSpot reviewer. src/hotspot/share/opto/vectorIntrinsics.cpp line 801: > 799: // Mode argument determines the mode of operation it can take following values:- > 800: // MODE_BROADCAST for vector Vector.boradcast operation. 
> 801: // MODE_BITS_COERCED_BROADCAST for VectorMask.maskAll operation.

This line can now be removed and the comment merged into the line above.

-------------

Marked as reviewed by psandoz (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6646

From dnsimon at openjdk.java.net Mon Dec 6 21:03:17 2021
From: dnsimon at openjdk.java.net (Doug Simon)
Date: Mon, 6 Dec 2021 21:03:17 GMT
Subject: RFR: JDK-8154011: Make TraceDeoptimization a diagnostic flag
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote:

> Make TraceDeoptimization available in a product build.
>
> I checked that performance is not affected on Aurora.

Sorry for only bringing it up now, but @tkrodriguez just observed that several of the places that test `TraceDeoptimization` aren't included in a `PRODUCT` build. In particular, [here](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/vframeArray.cpp#L443), [here](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/deoptimization.cpp#L265), and [here](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/deoptimization.cpp#L731). Without these bits of code included, the functionality of `TraceDeoptimization` is somewhat compromised.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6703

From kvn at openjdk.java.net Mon Dec 6 21:12:20 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Dec 2021 21:12:20 GMT
Subject: RFR: JDK-8154011: Make TraceDeoptimization a diagnostic flag
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote:

> Make TraceDeoptimization available in a product build.
>
> I checked that performance is not affected on Aurora.

@tobiasholenstein please file a follow-up bug and fix the places pointed out by Doug. Also make sure to run testing to find and fix affected tests.
-------------

PR: https://git.openjdk.java.net/jdk/pull/6703

From never at openjdk.java.net Mon Dec 6 21:12:20 2021
From: never at openjdk.java.net (Tom Rodriguez)
Date: Mon, 6 Dec 2021 21:12:20 GMT
Subject: RFR: JDK-8154011: Make TraceDeoptimization a diagnostic flag
In-Reply-To:
References:
Message-ID:

On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote:

> Make TraceDeoptimization available in a product build.
>
> I checked that performance is not affected on Aurora.

Some of those might be benign, though they are inconsistent. The DEOPT PACKING messages are controlled by PrintDeoptimizationDetails, but DEOPT UNPACKING is controlled by TraceDeoptimization. The REALLOC messages could likely be under PrintDeoptimizationDetails. Maybe it would make sense to rationalize them a bit and make PrintDeoptimizationDetails diagnostic as well.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6703

From kvn at openjdk.java.net Mon Dec 6 21:25:13 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Dec 2021 21:25:13 GMT
Subject: RFR: 8276116: C2: optimize long range checks in int counted loops [v3]
In-Reply-To:
References: <8bvd-Dtu9tKQVPWEV5lo0Xa7H2X76uVsgf1l6vKm7CM=.836388f2-a117-4f5d-9385-1690c7d0fd74@github.com>
Message-ID:

On Mon, 29 Nov 2021 10:26:40 GMT, Roland Westrelin wrote:

>> Maurizio noticed that some of his panama micro benchmarks don't
>> perform better with 8259609 (C2: optimize long range checks in long
>> counted loops). The reason is that 8259609 optimizes long range checks
>> in long counted loops but some of his benchmarks include long range
>> checks in int counted loops:
>>
>> for (int i = start; i < stop; i += inc) {
>> Objects.checkIndex(scale * ((long)i) + offset, length);
>> }
>>
>> This change applies the transformation from 8259609 for long counted
>> loop/long range checks to int counted loop/long range checks.
That >> includes creating a loop nest and transforming the long range check to >> an int range check that's subject to range elimination in the inner >> loop. >> >> The reason it's required to create a loop nest is that the long range >> check transformation logic depends on no overflow of scale * i for the >> range of values that the transformed range check is applied to. >> >> As a consequence, this change is mostly refactoring to make the loop >> nest creation and range check transformation parameterized by the type >> of the transformed loop. >> >> I think this transformation needs to be applied as late as possible >> but, in the case of an int counted loop, before pre/main/post loops >> are created. I had to move it to IdealLoopTree::iteration_split_impl() >> because of that. >> >> There's an alternate shape for a long range check in an int counted >> loop that Maurizio insisted needs to be supported: >> >> for (int i = start; i < stop; i += inc) { >> Objects.checkIndex(((long)(scale * i)) + offset, length); >> } >> >> scale * i can overflow in that case. This is also supported but as a >> corner case of the previous one. The code in >> PhaseIdealLoop::transform_long_range_checks() has a comment about >> that. >> >> Note also that this transformation works best if loop strip mining is >> enabled (that is for G1, ZGC, Shenandoah by default). The reason is >> that it needs a safepoint and when loop strip mining is enabled, the >> outer loop contains one that's always available. A way to have this >> work as well for all GCs would be to always construct the loop strip >> mining loop nest (whether loop strip mining is enabled or not) and >> then only once loop opts are over remove the outer loop when loop >> strip mining is disabled. I'm looking for feedback on this. 
>>
>> BTW, something doesn't seem right in IdealLoopTree::iteration_split_impl():
>>
>> https://github.com/rwestrel/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L3475
>>
>> should_peel causes transformations to be skipped but peeling is never
>> applied AFAICT. Does it make sense to anyone?
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>
> test fix

In general looks good to me.

src/hotspot/cpu/x86/x86_32.ad line 13132:

> 13130: %}
> 13131:
> 13132: instruct cmovLL_reg_LTGE_U(cmpOpU cmp, flagsReg_ulong_LTGE flags, eRegL dst, eRegL src) %{

How is it related to these changes? It seems like an addition to the [8277324](https://github.com/openjdk/jdk/pull/6427) changes. It could be pushed separately.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6576

From kvn at openjdk.java.net Mon Dec 6 22:01:15 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 6 Dec 2021 22:01:15 GMT
Subject: RFR: 8277893: Arraycopy stress tests [v3]
In-Reply-To:
References:
Message-ID: <8gzkgbvL-ist_ZhNekJ5V7MX5hqjrjtfgKdF84LZT5E=.9f743cfa-8f2b-4497-9f58-c9cf5360b1bc@github.com>

On Mon, 6 Dec 2021 10:17:52 GMT, Aleksey Shipilev wrote:

>> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable.
>> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: > > - Package declarations > - Add safety check for small systems > - Renames > - Single driver for all the tests > - Safer timeout settings > - Post-merge TEST.groups cleanup > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Separate test group and hooks into hotspot_slow_compiler > - Trim down MAX_SIZE and explain the choice > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/b83c9348...118a3eb2 I started testing. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From duke at openjdk.java.net Tue Dec 7 00:29:13 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Tue, 7 Dec 2021 00:29:13 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:21:24 GMT, Mai Đặng Quân Anh wrote: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of the compiled code. > > While working on this patch, I spotted some regressions regarding compilation on AVX1. > > Thank you very much. Hi, I'd love to. What do I need to do now? Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From eliu at openjdk.java.net Tue Dec 7 01:18:15 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 7 Dec 2021 01:18:15 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:21:24 GMT, Mai Đặng Quân Anh wrote: > Hi, > > This patch adds several c2 tests for vector reshape operations. 
The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of the compiled code. > > While working on this patch, I spotted some regressions regarding compilation on AVX1. > > Thank you very much. Please refer to https://openjdk.java.net/projects/#project-author ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From eliu at openjdk.java.net Tue Dec 7 01:26:10 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 7 Dec 2021 01:26:10 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:21:24 GMT, Mai Đặng Quân Anh wrote: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of the compiled code. > > While working on this patch, I spotted some regressions regarding compilation on AVX1. > > Thank you very much. test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastSVE.java line 39: > 37: * @modules java.base/jdk.internal.misc > 38: * @summary Test that vector cast intrinsics work as intended on sve. > 39: * @requires vm.cpu.features ~= ".*sve.*" May I ask whether you have a physical machine to verify SVE? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Tue Dec 7 01:48:24 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Tue, 7 Dec 2021 01:48:24 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" In-Reply-To: References: Message-ID: <0VdBPKXwHVof5uv2FQ7nMFItN4Do9cg0_2nRCTC5hPk=.8e98fd30-d561-4577-aed4-8c4c72549730@github.com> On Tue, 7 Dec 2021 01:23:21 GMT, Eric Liu wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of the compiled code. >> >> While working on this patch, I spotted some regressions regarding compilation on AVX1. >> >> Thank you very much. > > test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastSVE.java line 39: > >> 37: * @modules java.base/jdk.internal.misc >> 38: * @summary Test that vector cast intrinsics work as intended on sve. >> 39: * @requires vm.cpu.features ~= ".*sve.*" > > May I ask whether you have a physical machine to verify SVE? I only have physical machines to verify AVX1 and AVX2, so verification on other machines would be necessary, too. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From fgao at openjdk.java.net Tue Dec 7 01:50:21 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 7 Dec 2021 01:50:21 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. 
[v2] In-Reply-To: <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com> Message-ID: On Mon, 6 Dec 2021 13:01:46 GMT, Roland Westrelin wrote: >> AddINode::Ideal() and AddLNode::Ideal() are almost identical but the >> same logic had to be duplicated because AddINode::Ideal() tests its >> inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for >> Op_AddL, Op_SubL etc. This patch refactors the code so the common >> logic is in a single method parameterized by a BasicType argument. >> >> The way I've done this before in the context of int/long counted loops >> was to use an extra virtual method operates_on(). So: >> >> n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) >> >> Working on this change made me realize that pattern doesn't work that well: >> >> - it's quite a bit more verbose and converting existing code is not as >> mechanical as we would like to avoid conversion errors. >> >> - it breaks when a class has a subclass. For instance AddNode has >> OrINode and OrLNode as subclasses so testing for n->is_Add() returns >> true with an OrI node. >> >> Instead, this change introduces new functions. For instance of >> AddI/AddL: >> >> int Op_Add(BasicType bt) >> >> that returns either Op_AddI or Op_AddL depending on bt. This made >> refactoring the AddINode::Ideal() logic straightforward. I removed all >> use of operates_on() as well and converted existing code to the new >> Op_XXX() functions. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - T *p to -> T* p > - Merge branch 'master' into JDK-8262341 > - fix src/hotspot/share/opto/addnode.hpp line 54: > 52: // We also canonicalize the Node, moving constants to the right input, > 53: // and flatten expressions (so that 1+x+2 becomes x+3). > 54: virtual Node* Ideal(PhaseGVN *phase, bool can_reshape); Should be `PhaseGVN* phase` here? ------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From eliu at openjdk.java.net Tue Dec 7 02:14:09 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 7 Dec 2021 02:14:09 GMT Subject: RFR: 8276985: AArch64: [vectorapi] Backend support of VectorMaskToLongNode In-Reply-To: References: Message-ID: On Mon, 29 Nov 2021 09:40:23 GMT, Eric Liu wrote: > The lack of codegen for VectorMaskToLong results in a regression on > AArch64 for VectorMask.laneIsSet, which relies on the intrinsification > of VectorMask.toLong after JDK-8273949. > > This patch implements bitmask extraction on AArch64 for NEON and SVE by > using scalar instructions, which is equivalent to the PMOVMSK > instructions on X86. The performance of VectorMask.laneIsSet improves > about 10x for NEON and 2x~4x for SVE on my test machines. Could anyone help to review this patch? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6585 From sviswanathan at openjdk.java.net Tue Dec 7 02:22:14 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 7 Dec 2021 02:22:14 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v3] In-Reply-To: <4eJ7dJ21N_lmAxD-2q8yEl4S-aeKhZ3Z1Dx7huOihxg=.655f9b0d-b681-49b9-9d7c-5399ecbf7e09@github.com> References: <4eJ7dJ21N_lmAxD-2q8yEl4S-aeKhZ3Z1Dx7huOihxg=.655f9b0d-b681-49b9-9d7c-5399ecbf7e09@github.com> Message-ID: <-0DjYlkLVjPjBEGF7wzC654OvIe19dLuZNANYd4G3QU=.7393c4f1-6e43-4757-946e-7c56d56700a1@github.com> On Mon, 6 Dec 2021 17:44:01 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. >> 2) X86 backend support for AVX512 and AVX2 targets. >> 3) New IR transformation to handle following patterns:- >> a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) >> b) Long2Mask + Mask2Long -> Long >> 4) Following performance data is collected for new JMH micro included with the patch:- >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor >> -- | -- | -- | -- | -- | -- | -- >> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 >> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 >> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 >> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 >> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 
1.482080659 | 24620.619 | 36397.005 | 1.478313969 >> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 >> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 >> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 >> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 >> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 >> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 >> MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 >> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 >> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 >> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 >> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 >> >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277997: Review comments resolved. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4070: > 4068: movq(rtmp2, src); > 4069: mov64(rtmp1, 0x0101010101010101L); > 4070: pdep(rtmp1, rtmp2, rtmp1); For masklen < 8, we could directly generate pdep(rtmp1, src, rtmp1); rtmp2 is not required in that case. 
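For readers unfamiliar with the `pdep` step quoted above, here is a minimal Java sketch of what depositing a long's low bits into the constant `0x0101010101010101L` produces: one 0/1 byte per mask lane. This is a software model for illustration only. The class and method names (`LongToMaskBytesSketch`, `pdep`, `expandToBytes`) are hypothetical and are not part of the patch, which emits the x86 instruction directly.

```java
public class LongToMaskBytesSketch {
    // Software model of the x86 BMI2 PDEP instruction: the low-order bits of
    // src are deposited, in order, into the positions of the set bits of mask.
    static long pdep(long src, long mask) {
        long result = 0L;
        while (mask != 0L) {
            long lowest = mask & -mask;   // isolate the lowest set bit of mask
            if ((src & 1L) != 0L) {
                result |= lowest;         // deposit the current src bit there
            }
            src >>>= 1;                   // move on to the next src bit
            mask &= mask - 1;             // clear the mask bit just handled
        }
        return result;
    }

    // Spread the low 8 bits of a long into 8 bytes, each holding 0 or 1.
    static long expandToBytes(long bits) {
        return pdep(bits, 0x0101010101010101L);
    }

    public static void main(String[] args) {
        // Bits 0 and 2 set: bytes 0 and 2 of the result become 1.
        System.out.println(Long.toHexString(expandToBytes(0b101L))); // prints 10001
    }
}
```

In this model the source value is only read, which is consistent with the review remark that a copy into a second temporary may be avoidable in some cases; whether that holds for the real register constraints is for the patch author to confirm.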
src/hotspot/cpu/x86/x86.ad line 9516: > 9514: int vec_enc = vector_length_encoding(mask_len*8); > 9515: __ vector_long_to_maskvec($dst$$XMMRegister, $src$$Register, $rtmp1$$Register, > 9516: $rtmp2$$Register, $xtmp1$$XMMRegister, mask_len, vec_enc); xtmp2 is not being used here? src/hotspot/cpu/x86/x86.ad line 9529: > 9527: int mask_len = Matcher::vector_length(this); > 9528: __ movq($rtmp$$Register, $src$$Register); > 9529: __ kmov($dst$$KRegister, $rtmp$$Register); why extra move to rtmp here? Cannot we generate directly kmov(dst, src)? src/hotspot/share/opto/vectorIntrinsics.cpp line 803: > 801: // MODE_BITS_COERCED_BROADCAST for VectorMask.maskAll operation. > 802: // MODE_BITS_COERCED_LONG_TO_MASK for VectorMask.fromLong operation. > 803: const TypeInt* mode = gvn().type(argument(5))->isa_int(); Isn't mode argument(4)? src/hotspot/share/opto/vectornode.cpp line 1506: > 1504: if (src->Opcode() == Op_VectorStoreMask) { > 1505: src = src->in(1); > 1506: } What if src happened to be a phi node here? ------------- PR: https://git.openjdk.java.net/jdk/pull/6646 From njian at openjdk.java.net Tue Dec 7 02:36:11 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 7 Dec 2021 02:36:11 GMT Subject: RFR: 8276985: AArch64: [vectorapi] Backend support of VectorMaskToLongNode In-Reply-To: References: Message-ID: On Mon, 29 Nov 2021 09:40:23 GMT, Eric Liu wrote: > The lack of codegen for VectorMaskToLong results in a regression on > AArch64 for VectorMask.laneIsSet, which relies on the intrinsification > of VectorMask.toLong after JDK-8273949. > > This patch implements bitmask extraction on AArch64 for NEON and SVE by > using scalar instructions, which is equivalent to the PMOVMSK > instructions on X86. The performance of VectorMask.laneIsSet improves > about 10x for NEON and 2x~4x for SVE on my test machines. 
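As a reference point for the bitmask extraction described above, the following scalar Java sketch models what packing mask lanes into a long means: bit i of the result is set exactly when lane i is set. It is an illustration under that assumption, not the NEON/SVE code sequence from the patch. The names `MaskToLongSketch`, `toLong` and `laneIsSet` here are hypothetical stand-ins operating on a plain `boolean[]`.

```java
public class MaskToLongSketch {
    // Pack up to 64 boolean lanes into a long, lane i going to bit i.
    static long toLong(boolean[] lanes) {
        if (lanes.length > 64) {
            throw new IllegalArgumentException("at most 64 lanes fit in a long");
        }
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Query a single lane from the packed representation.
    static boolean laneIsSet(long bits, int lane) {
        return ((bits >>> lane) & 1L) != 0L;
    }

    public static void main(String[] args) {
        boolean[] lanes = {true, false, true, true};
        long bits = toLong(lanes);
        System.out.println(Long.toBinaryString(bits)); // prints 1101
        System.out.println(laneIsSet(bits, 1));        // prints false
    }
}
```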
src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 2499: > 2497: %} > 2498: > 2499: instruct vmask_tolong16B(iRegLNoSp dst, vecX src, iRegL tmp) %{ "iRegL tmp" should be "iRegLNoSp tmp" or use rscratch directly? src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 3179: > 3177: %} > 3178: > 3179: instruct vmask_tolong(iRegLNoSp dst, pReg src, vReg vtmp1, vReg vtmp2, pRegGov pgtmp, iRegL tmp, rFlagsReg cr) %{ And this iRegL ------------- PR: https://git.openjdk.java.net/jdk/pull/6585 From roland at openjdk.java.net Tue Dec 7 08:25:50 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 08:25:50 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: > Root cause is identical to 8273165 AFAIU: late inline of a virtual call > can throw from 2 different paths (null check and the call > itself). That breaks because the logic for exceptions expects the > stack for all paths that throw exceptions to have the same stack size. > > AFAIU, the stack doesn't matter for exception handling: either the > exception is caught by an exception handler and then the stack is > popped and the exception is pushed or, the exception is rethrown to > the caller in which case the current stack is also popped (that is the > jvm state for the current method). As a consequence the fix I propose > is to ignore the stack in GraphKit::combine_exception_states(). > > AFAIU, the same fix would work for 8273165 but I left the current work > around as is: not sure if we want to be conservative for now or not Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - comment - Merge branch 'master' into JDK-8275638 - alternate fix - make test runnable with release build - more - fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6572/files - new: https://git.openjdk.java.net/jdk/pull/6572/files/e3a04acd..5469d079 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6572&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6572&range=01-02 Stats: 11463 lines in 534 files changed: 7414 ins; 1974 del; 2075 mod Patch: https://git.openjdk.java.net/jdk/pull/6572.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6572/head:pull/6572 PR: https://git.openjdk.java.net/jdk/pull/6572 From thartmann at openjdk.java.net Tue Dec 7 08:45:18 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Dec 2021 08:45:18 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops In-Reply-To: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: On Fri, 3 Dec 2021 10:04:58 GMT, Roland Westrelin wrote: > This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is: > if (((base + (offset << 2)) & 3) != 0) { > } > > where base is loop independent but offset depends on a loop variable. This can be transformed to: > > if ((base & 3) != 0) { > > That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on). > > This change also optimizes the pattern: > > (offset << 2) & 3 > > to return 0. Looks good to me, I submitted testing. 
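The identity behind the mask-check transformation quoted above can be checked with a small Java sketch: shifting `offset` left by 2 clears its two low bits, so adding it to `base` cannot change `base & 3`. The class and method names below (`ShiftAndMaskSketch`, `originalCheck`, `transformedCheck`) are illustrative and do not come from the patch or its tests.

```java
public class ShiftAndMaskSketch {
    // The check as written in the loop body: depends on offset.
    static boolean originalCheck(long base, long offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    // The loop-independent form: offset no longer appears.
    static boolean transformedCheck(long base) {
        return (base & 3) != 0;
    }

    // The second pattern from the description, which always folds to 0.
    static long shiftThenMask(long offset) {
        return (offset << 2) & 3;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 2L, 3L, 4L, 7L, -1L, -5L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long base : samples) {
            for (long offset : samples) {
                if (originalCheck(base, offset) != transformedCheck(base)) {
                    throw new AssertionError("mismatch for base=" + base + " offset=" + offset);
                }
                if (shiftThenMask(offset) != 0L) {
                    throw new AssertionError("non-zero low bits for offset=" + offset);
                }
            }
        }
        System.out.println("both forms agree on all samples");
    }
}
```

The equality also holds when the addition wraps around, since overflow only affects the high-order bits.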
src/hotspot/share/opto/mulnode.cpp line 702: > 700: > 701: Node* in1 = in(1); > 702: int op = in1->Opcode(); I got curious and checked the code, we have similar patterns everywhere. Filed [JDK-8278328](https://bugs.openjdk.java.net/browse/JDK-8278328) to clean up this mess. src/hotspot/share/opto/mulnode.cpp line 1753: > 1751: } > 1752: > 1753: Node* MulNode::AndIL_add_shift_and_mask(PhaseGVN* phase, BasicType bt) { Please add a comment describing the pattern. test/hotspot/jtreg/compiler/c2/irTests/TestShiftAndMask.java line 30: > 28: /* > 29: * @test > 30: * @bug JDK-8277850 Should be `@bug 8277850` (not sure if that is a requirement but all other tests use that format). test/hotspot/jtreg/compiler/c2/irTests/TestShiftAndMask.java line 36: > 34: */ > 35: > 36: public class TestShiftAndMask { Nice tests! ------------- PR: https://git.openjdk.java.net/jdk/pull/6697 From thartmann at openjdk.java.net Tue Dec 7 09:22:09 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 7 Dec 2021 09:22:09 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops In-Reply-To: References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: <5PHaNETbSKHOBQBnMUV_BpGv53iMFP8l4QnPvT809oQ=.28e5aa27-a1f9-4c49-ad9f-200fe03acc38@github.com> On Tue, 7 Dec 2021 08:42:22 GMT, Tobias Hartmann wrote: > I submitted testing. Build on Windows failed: [2021-12-07T09:08:24,657Z] t:\workspace\open\src\hotspot\share\opto\mulnode.cpp(1746): error C2220: the following warning is treated as an error [2021-12-07T09:08:24,657Z] t:\workspace\open\src\hotspot\share\opto\mulnode.cpp(1746): warning C4334: '<<': result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?) 
[2021-12-07T09:08:24,703Z] lib/CompileJvm.gmk:141: recipe for target '/cygdrive/t/workspace/build/windows-x64/hotspot/variant-server/libjvm/objs/mulnode.obj' failed ------------- PR: https://git.openjdk.java.net/jdk/pull/6697 From duke at openjdk.java.net Tue Dec 7 09:45:17 2021 From: duke at openjdk.java.net (Tobias Holenstein) Date: Tue, 7 Dec 2021 09:45:17 GMT Subject: RFR: JDK-8154011: Make TraceDeoptimization a diagnostic flag In-Reply-To: References: Message-ID: On Fri, 3 Dec 2021 16:14:50 GMT, Tobias Holenstein wrote: > Make TraceDeoptimization available in a product build. > > I checked that performance is not affected on Aurora. Thanks for catching that! I filed a bug in JIRA (JDK-8278329) ------------- PR: https://git.openjdk.java.net/jdk/pull/6703 From roland at openjdk.java.net Tue Dec 7 10:13:18 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 10:13:18 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 08:25:50 GMT, Roland Westrelin wrote: >> Root cause is identical to 8273165 AFAIU: late inline of a virtual call >> can throw from 2 different paths (null check and the call >> itself). That breaks because the logic for exceptions expects the >> stack for all paths that throw exceptions to have the same stack size. >> >> AFAIU, the stack doesn't matter for exception handling: either the >> exception is caught by an exception handler and then the stack is >> popped and the exception is pushed or, the exception is rethrown to >> the caller in which case the current stack is also popped (that is the >> jvm state for the current method). As a consequence the fix I propose >> is to ignore the stack in GraphKit::combine_exception_states(). 
>> >> AFAIU, the same fix would work for 8273165 but I left the current work >> around as is: not sure if we want to be conservative for now or not > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - comment > - Merge branch 'master' into JDK-8275638 > - alternate fix > - make test runnable with release build > - more > - fix I pushed a new change that pops the stack when there's no handler in the method. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From aph at openjdk.java.net Tue Dec 7 10:15:17 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 7 Dec 2021 10:15:17 GMT Subject: RFR: 8276985: AArch64: [vectorapi] Backend support of VectorMaskToLongNode In-Reply-To: References: Message-ID: <5DixrjuVdNkZOemuH5B72x1PTSeoqMEVh3Bm_j5qRiY=.7874fa97-2669-4f18-84ef-cbe143e4807c@github.com> On Tue, 7 Dec 2021 02:30:32 GMT, Ningsheng Jian wrote: >> The lack of codegen for VectorMaskToLong results in a regression on >> AArch64 for VectorMask.laneIsSet, which relies on the intrinsification >> of VectorMask.toLong after JDK-8273949. >> >> This patch implements bitmask extraction on AArch64 for NEON and SVE by >> using scalar instructions, which is equivalent to the PMOVMSK >> instructions on X86. The performance of VectorMask.laneIsSet improves >> about 10x for NEON and 2x~4x for SVE on my test machines. > > src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 2499: > >> 2497: %} >> 2498: >> 2499: instruct vmask_tolong16B(iRegLNoSp dst, vecX src, iRegL tmp) %{ > > "iRegL tmp" should be "iRegLNoSp tmp" or use rscratch directly? Yes, good catch. This kind of bug is horrible to detect. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6585 From roland at openjdk.java.net Tue Dec 7 10:30:14 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 10:30:14 GMT Subject: RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE In-Reply-To: References: Message-ID: On Thu, 18 Nov 2021 03:50:45 GMT, Pengfei Li wrote: > Arraycopy partial inlining is a C2 compiler technique that avoids stub > call overhead in small-sized arraycopy operations by generating masked > vector instructions. So far it works on x86 AVX512 only and this patch > enables it on AArch64 with SVE. > > We add AArch64 matching rule for VectorMaskGenNode and refactor that > node a little bit. The major change is moving the element type field > into its TypeVectMask bottom type. The reason is that AArch64 vector > masks are different for different vector element types. > > E.g., an x86 AVX512 vector mask value masking 3 least significant vector > lanes (of any type) is like > > `0000 0000 ... 0000 0000 0000 0000 0111` > > On AArch64 SVE, this mask value can only be used for masking the 3 least > significant lanes of bytes. But for 3 lanes of ints, the value should be > > `0000 0000 ... 0000 0000 0001 0001 0001` > > where the least significant bit of each lane matters. So AArch64 matcher > needs to know the vector element type to generate right masks. > > After this patch, the C2 generated code for copying a 50-byte array on > AArch64 SVE looks like > > mov x12, #0x32 > whilelo p0.b, xzr, x12 > add x11, x11, #0x10 > ld1b {z16.b}, p0/z, [x11] > add x10, x10, #0x10 > st1b {z16.b}, p0, [x10] > > We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on > both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested > JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array > size arguments on a 512-bit SVE-featured CPU. We got below performance > data changes. 
> > Benchmark (length) (Performance) > ArrayCopyAligned.testByte 10 -2.6% > ArrayCopyAligned.testByte 20 +4.7% > ArrayCopyAligned.testByte 30 +4.8% > ArrayCopyAligned.testByte 40 +21.7% > ArrayCopyAligned.testByte 50 +22.5% > ArrayCopyAligned.testByte 60 +28.4% > > The test machine has SVE vector size of 512 bits, so we see performance > gain for most array sizes less than 64 bytes. For very small arrays we > see a slight regression because a vector load/store may be a bit slower > than 1 or 2 scalar loads/stores. C2 platform independent code looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6444 From roland at openjdk.java.net Tue Dec 7 10:36:59 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 10:36:59 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. [v3] In-Reply-To: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> Message-ID: > AddINode::Ideal() and AddLNode::Ideal() are almost identical but the > same logic had to be duplicated because AddINode::Ideal() tests its > inputs for Op_AddI, Op_SubI etc. while AddLNode::Ideal() tests for > Op_AddL, Op_SubL etc. This patch refactors the code so the common > logic is in a single method parameterized by a BasicType argument. > > The way I've done this before in the context of int/long counted loops > was to use an extra virtual method operates_on(). So: > > n->Opcode() == Op_AddI becomes n->is_Add() && n->operates_on(T_INT) > > Working on this change made me realize that pattern doesn't work that well: > > - it's quite a bit more verbose and converting existing code is not as > mechanical as we would like to avoid conversion errors. > > - it breaks when a class has a subclass. 
For instance AddNode has > OrINode and OrLNode as subclasses so testing for n->is_Add() returns > true with an OrI node. > > Instead, this change introduces new functions. For instance of > AddI/AddL: > > int Op_Add(BasicType bt) > > that returns either Op_AddI or Op_AddL depending on bt. This made > refactoring the AddINode::Ideal() logic straightforward. I removed all > use of operates_on() as well and converted existing code to the new > Op_XXX() functions. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: more T *p to -> T* p ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6607/files - new: https://git.openjdk.java.net/jdk/pull/6607/files/22292277..12d738ca Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6607&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6607&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6607.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6607/head:pull/6607 PR: https://git.openjdk.java.net/jdk/pull/6607 From roland at openjdk.java.net Tue Dec 7 10:37:04 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 10:37:04 GMT Subject: RFR: 8262341: Refine identical code in AddI/LNode. [v2] In-Reply-To: References: <5q7zq7qSEn7iTQdlYBh3d92Qeg7oxO_b5T36iUaFSGs=.326c2ded-2a41-4589-8951-6a31a6d44257@github.com> <6UOcHiZGDoyGVimh7ooCJxL82KZK1yaOvPwNbFs7Y3g=.a6da7fe3-625b-4fb9-af19-2aec9b536a14@github.com> Message-ID: On Tue, 7 Dec 2021 01:47:23 GMT, Fei Gao wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - T *p to -> T* p >> - Merge branch 'master' into JDK-8262341 >> - fix > > src/hotspot/share/opto/addnode.hpp line 54: > >> 52: // We also canonicalize the Node, moving constants to the right input, >> 53: // and flatten expressions (so that 1+x+2 becomes x+3). >> 54: virtual Node* Ideal(PhaseGVN *phase, bool can_reshape); > > Should be `PhaseGVN* phase` here? Thanks for looking at this. You're right. Fixed. ------------- PR: https://git.openjdk.java.net/jdk/pull/6607 From roland at openjdk.java.net Tue Dec 7 11:06:53 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 11:06:53 GMT Subject: RFR: 8277850: C2: optimize mask checks in counted loops [v2] In-Reply-To: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> References: <_gfsjS2vHwM9TTZtfGNeBOBu9ISOFAxfSdGl391M4MM=.4b05410a-5583-4bf2-99af-eab56a359e3c@github.com> Message-ID: <6lfZqBTpkTDRif35AoaWvEGwUQY8HXH697SMVzr4_Fk=.3fb1e570-345e-4a56-be7a-5006ab1fe739@github.com> > This is another fix that addresses a performance issue with panama and that was brought up by Maurizio. The pattern to optimize is: > if (((base + (offset << 2)) & 3) != 0) { > } > > where base is loop independent but offset depends on a loop variable. This can be transformed to: > > if ((base & 3) != 0) { > > That check becomes loop independent and can be optimized by loop predication (or I suppose loop unswitching but that wasn't the case of the micro benchmark I worked on). > > This change also optimizes the pattern: > > (offset << 2) & 3 > > to return 0. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: - reviews - Merge branch 'master' into JDK-8277850 - whitespace - fix - fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6697/files - new: https://git.openjdk.java.net/jdk/pull/6697/files/0d0a7c9c..f7165648 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6697&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6697&range=00-01 Stats: 11458 lines in 536 files changed: 7406 ins; 1974 del; 2078 mod Patch: https://git.openjdk.java.net/jdk/pull/6697.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6697/head:pull/6697 PR: https://git.openjdk.java.net/jdk/pull/6697 From roland at openjdk.java.net Tue Dec 7 11:10:20 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 11:10:20 GMT Subject: RFR: 8276116: C2: optimize long range checks in int counted loops [v3] In-Reply-To: References: <8bvd-Dtu9tKQVPWEV5lo0Xa7H2X76uVsgf1l6vKm7CM=.836388f2-a117-4f5d-9385-1690c7d0fd74@github.com> Message-ID: On Mon, 6 Dec 2021 18:08:52 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test fix > > src/hotspot/cpu/x86/x86_32.ad line 13132: > >> 13130: %} >> 13131: >> 13132: instruct cmovLL_reg_LTGE_U(cmpOpU cmp, flagsReg_ulong_LTGE flags, eRegL dst, eRegL src) %{ > > How it is related to these changes? Seems like addition to [8277324](https://github.com/openjdk/jdk/pull/6427) changes. Could be pushed separately. > In general looks good to me. Thanks for reviewing this change. 
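
For reference, the mask-check rewrite from the 8277850 description quoted earlier in this digest rests on a simple arithmetic fact: `offset << 2` always has its two low bits clear, so adding it to `base` cannot change `base & 3`. The following is only a Java-level sketch of that identity, not the C2 transformation itself; the class and method names are made up for illustration.

```java
public class MaskCheckSketch {
    // The check as written in the loop body: depends on 'offset'.
    static boolean original(int base, int offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    // The loop-independent form: 'offset' has dropped out.
    static boolean simplified(int base) {
        return (base & 3) != 0;
    }

    // The second pattern from the description: the low two bits of a
    // value shifted left by 2 are always zero.
    static int shiftedMask(int offset) {
        return (offset << 2) & 3;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -5, -1, 0, 1, 2, 3, 4, 7, Integer.MAX_VALUE};
        for (int base : samples) {
            for (int offset : samples) {
                if (original(base, offset) != simplified(base)) {
                    throw new AssertionError("mismatch for base=" + base + " offset=" + offset);
                }
                if (shiftedMask(offset) != 0) {
                    throw new AssertionError("non-zero low bits for offset=" + offset);
                }
            }
        }
        System.out.println("identity holds for all sampled values");
    }
}
```

The identity holds under int overflow as well, because carries in two's-complement addition only propagate towards higher bits.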
------------- PR: https://git.openjdk.java.net/jdk/pull/6576 From roland at openjdk.java.net Tue Dec 7 11:13:12 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 7 Dec 2021 11:13:12 GMT Subject: RFR: 8276116: C2: optimize long range checks in int counted loops [v3] In-Reply-To: References: <8bvd-Dtu9tKQVPWEV5lo0Xa7H2X76uVsgf1l6vKm7CM=.836388f2-a117-4f5d-9385-1690c7d0fd74@github.com> Message-ID: <1F1oVVUoBk3KfUT_9e0pqabNYg8NeK11P_E89swLzqE=.2b28b9db-6e45-40b9-bc6c-0361a78d52c3@github.com> On Tue, 7 Dec 2021 11:07:13 GMT, Roland Westrelin wrote: >> src/hotspot/cpu/x86/x86_32.ad line 13132: >> >>> 13130: %} >>> 13131: >>> 13132: instruct cmovLL_reg_LTGE_U(cmpOpU cmp, flagsReg_ulong_LTGE flags, eRegL dst, eRegL src) %{ >> >> How it is related to these changes? Seems like addition to [8277324](https://github.com/openjdk/jdk/pull/6427) changes. Could be pushed separately. > >> In general looks good to me. > > Thanks for reviewing this change. > How it is related to these changes? Seems like addition to 8277324 changes. Could be pushed separately. That showed on github testing because of the new unsigned_min I think. So not including it would break x86_32. ------------- PR: https://git.openjdk.java.net/jdk/pull/6576 From aph at openjdk.java.net Tue Dec 7 11:17:11 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 7 Dec 2021 11:17:11 GMT Subject: RFR: 8277168: AArch64: Enable arraycopy partial inlining with SVE In-Reply-To: References: Message-ID: On Thu, 18 Nov 2021 03:50:45 GMT, Pengfei Li wrote: > Arraycopy partial inlining is a C2 compiler technique that avoids stub > call overhead in small-sized arraycopy operations by generating masked > vector instructions. So far it works on x86 AVX512 only and this patch > enables it on AArch64 with SVE. > > We add AArch64 matching rule for VectorMaskGenNode and refactor that > node a little bit. The major change is moving the element type field > into its TypeVectMask bottom type. 
The reason is that AArch64 vector > masks are different for different vector element types. > > E.g., an x86 AVX512 vector mask value masking 3 least significant vector > lanes (of any type) is like > > `0000 0000 ... 0000 0000 0000 0000 0111` > > On AArch64 SVE, this mask value can only be used for masking the 3 least > significant lanes of bytes. But for 3 lanes of ints, the value should be > > `0000 0000 ... 0000 0000 0001 0001 0001` > > where the least significant bit of each lane matters. So AArch64 matcher > needs to know the vector element type to generate right masks. > > After this patch, the C2 generated code for copying a 50-byte array on > AArch64 SVE looks like > > mov x12, #0x32 > whilelo p0.b, xzr, x12 > add x11, x11, #0x10 > ld1b {z16.b}, p0/z, [x11] > add x10, x10, #0x10 > st1b {z16.b}, p0, [x10] > > We ran jtreg hotspot::hotspot_all, jdk::tier1~3 and langtools::tier1 on > both x86 AVX512 and AArch64 SVE machines, no issue is found. We tested > JMH org/openjdk/bench/java/lang/ArrayCopyAligned.java with small array > size arguments on a 512-bit SVE-featured CPU. We got below performance > data changes. > > Benchmark (length) (Performance) > ArrayCopyAligned.testByte 10 -2.6% > ArrayCopyAligned.testByte 20 +4.7% > ArrayCopyAligned.testByte 30 +4.8% > ArrayCopyAligned.testByte 40 +21.7% > ArrayCopyAligned.testByte 50 +22.5% > ArrayCopyAligned.testByte 60 +28.4% > > The test machine has SVE vector size of 512 bits, so we see performance > gain for most array sizes less than 64 bytes. For very small arrays we > see a bit regression because a vector load/store may be a bit slower > than 1 or 2 scalar loads/stores. Marked as reviewed by aph (Reviewer). 
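
To make the mask-layout difference in the quoted description concrete, here is a small Java model of the two bit patterns. It is a sketch under stated assumptions: it treats a mask as a plain 64-bit value (enough for a 512-bit vector, where SVE has one predicate bit per vector byte), and the class and method names are hypothetical, not HotSpot code.

```java
public class MaskLayoutSketch {
    // AVX-512 style: one mask bit per lane, whatever the element type.
    static long laneMask(int activeLanes) {
        if (activeLanes < 0 || activeLanes > 64) {
            throw new IllegalArgumentException("activeLanes out of range");
        }
        return activeLanes == 64 ? -1L : (1L << activeLanes) - 1;
    }

    // SVE style: one predicate bit per vector byte; for wider elements only
    // the least significant bit of each element's group of bits matters.
    static long svePredicate(int activeLanes, int elementBytes) {
        if (activeLanes < 0 || elementBytes <= 0 || (long) activeLanes * elementBytes > 64) {
            throw new IllegalArgumentException("does not fit in a 64-bit predicate model");
        }
        long bits = 0;
        for (int lane = 0; lane < activeLanes; lane++) {
            bits |= 1L << (lane * elementBytes);
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(laneMask(3)));         // per-lane mask, 3 lanes
        System.out.println(Long.toBinaryString(svePredicate(3, 1)));  // 3 byte lanes
        System.out.println(Long.toBinaryString(svePredicate(3, 4)));  // 3 int lanes
    }
}
```

With three active lanes, the byte case gives the same bits as the per-lane mask, while the int case spreads the set bits four positions apart, which is why the matcher needs the element type.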

-------------

PR: https://git.openjdk.java.net/jdk/pull/6444

From lucy at openjdk.java.net Tue Dec 7 11:26:21 2021
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Tue, 7 Dec 2021 11:26:21 GMT
Subject: RFR: 8278302: [s390] Implement fast-path for ASCII-compatible CharsetEncoders
Message-ID: <1AFDoyJM3APH5zUG5ErwRIosgRDaWO919QOwlMTejTw=.dea4ed48-a0ed-442c-a7b1-cc62d16c3d93@github.com>

This pull request contains the s390x variant of the fast-path for ASCII-compatible CharsetEncoders. The intrinsic implementation is very similar to that of PPC64. Compared to the previously existing intrinsic implementation, changes to program logic are minimal. There is a new match rule for ASCII strings, and the bit mask used to detect invalid bits in the string characters is set depending on the target charset: 0xff00 for ISO, 0xff80 for ASCII.

**Request:** SAP does not maintain a full build and test infrastructure for s390x anymore. Could somebody please test this PR on s390x? At least build and tier1 tests, the more the better. Thank you!

With my private testing (some microbenchmarks, some SPEC benchmarks) I could not find any issues. With the fast-path active I see a 2x, sometimes 2.5x, improvement of CharsetEncodeDecode.encode.

**Without the fast-path patch:**
CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 16.350 ± 0.067 us/op
CharsetEncodeDecode.encode 16384 BIG5 avgt 30 15.972 ± 0.136 us/op
CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 14.372 ± 0.110 us/op
CharsetEncodeDecode.encode 16384 ASCII avgt 30 14.287 ± 0.051 us/op

**With the fast-path patch:**
CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 6.833 ± 0.064 us/op
CharsetEncodeDecode.encode 16384 BIG5 avgt 30 8.195 ± 0.059 us/op
CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 6.387 ± 0.149 us/op
CharsetEncodeDecode.encode 16384 ASCII avgt 30 6.321 ±
0.049 us/op ------------- Commit messages: - 8278302: [s390] Implement fast-path for ASCII-compatible CharsetEncoders Changes: https://git.openjdk.java.net/jdk/pull/6738/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6738&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278302 Stats: 45 lines in 4 files changed: 27 ins; 0 del; 18 mod Patch: https://git.openjdk.java.net/jdk/pull/6738.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6738/head:pull/6738 PR: https://git.openjdk.java.net/jdk/pull/6738 From simonis at openjdk.java.net Tue Dec 7 13:03:18 2021 From: simonis at openjdk.java.net (Volker Simonis) Date: Tue, 7 Dec 2021 13:03:18 GMT Subject: RFR: JDK-8278135: Remove un-necessary null-check for get-static in c2 [v2] In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 07:35:57 GMT, ?? wrote: >> When run the following test, lots of un-necessary null check deoptimization happen. >> >> Small Test: >> >> public class CodeDependenciesTest { >> private Object obj; >> private String[] strs; >> private Object[][] objs; >> private static Class clzOne; >> private static Class clzTwo; >> public static void main(String[] args) throws Exception { >> CodeDependenciesTest codeDependenciesTest = new CodeDependenciesTest(); >> codeDependenciesTest.obj = new String("1"); >> for (int i = 0; i < 300000; i++) { >> codeDependenciesTest.foo(); >> } >> } >> >> public void foo() { >> objs = new Object[10][10]; >> for (int i = 0; i < 10; i++) { >> for (int j = 0; j < 10; j++) { >> objs[i][j] = new Object(); >> } >> } >> clzOne = InvokeTest.class; >> clzTwo = clzOne; >> } >> >> static class InvokeTest { >> public void bar(String i) { >> try { >> Thread.sleep(Long.valueOf(i)); >> } catch (Exception e) { >> e.printStackTrace(); >> } >> } >> } >> } >> >> >> The deoptimization log generated by `-XX:+TraceDeoptimization` is: >> >> Uncommon trap bci=63 pc=0x00007f0eafbe1e38, relative_pc=0x00000000000005f8, method=CodeDependenciesTest.foo()V, debug_id=0 >> 
Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=288 (@0x00007f0eafbe1e38) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 >> DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec02348a0 mode 2 >> {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 >> Uncommon trap bci=63 pc=0x00007f0eafbe0f34, relative_pc=0x0000000000000514, method=CodeDependenciesTest.foo()V, debug_id=0 >> Uncommon trap occurred in CodeDependenciesTest::foo compiler=c2 compile_id=287 (@0x00007f0eafbe0f34) thread=20285 reason=null_assert_or_unreached0 action=make_not_entrant unloaded_class_index=-1 debug_id=0 >> DEOPT UNPACKING thread 0x00007f0ec0028190 vframeArray 0x00007f0ec0235e40 mode 2 >> {method} {0x00007f0e7c4004f0} 'foo' '()V' in 'CodeDependenciesTest' - putstatic @ bci 63 sp = 0x00007f0ec8a317c8 >> >> >> The corresponding opto assembly is, >> >> 230 B20: # out( B56 B21 ) <- in( B58 B43 B41 B19 ) Freq: 0.999369 >> 230 movl [RBX + #112 (8-bit)], narrowoop: java/lang/Class:exact * # compressed ptr ! 
Field: CodeDependenciesTest.clzOne >> 237 movq R10, RBX # ptr -> long >> 23a movq RBP, java/lang/Class:exact * # ptr >> 244 movq R11, RBP # ptr -> long >> 247 xorq R11, R10 # long >> 24a shrq R11, #22 >> 24e testq R11, R11 >> 251 je B56 P=0.000001 C=-1.000000 >> >> 257 B21: # out( B56 B22 ) <- in( B20 ) Freq: 0.999368 >> 257 shrq R10, #9 >> 25b addq R14, R10 # ptr >> 25e cmpb [R14], #4 >> 262 je B56 P=0.000001 C=-1.000000 >> >> 5ed B56: # out( N780 ) <- in( B55 B21 B20 ) Freq: 3.02464e-06 >> 5ed movl RSI, #-20 # int >> nop # 1 bytes pad for loops and calls >> 5f3 call,static wrapper for: uncommon_trap(reason='null_assert_or_unreached0' action='make_not_entrant' debug_id='0') >> # CodeDependenciesTest::foo @ bci:63 (line 23) L[0]=_ L[1]=_ L[2]=_ STK[0]=RBP >> # OopMap {rbp=Oop off=1528/0x5f8} >> 5f8 stop # ShouldNotReachHere >> >> >> C2 tries to generate a null-check for the get-static in `clzTwo = clzOne;`, because it thinks that ciKlass of `java.lang.Class` is not loaded. >> >> The ciKlass of `java.lang.Class` is generated by the following stack trace, >> >> (gdb) bt >> #0 SystemDictionary::find_instance_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:778 >> #1 0x00007ffff610769d in SystemDictionary::find_instance_or_array_klass (class_name=0x800481080, class_loader=..., protection_domain=...) at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:813 >> #2 0x00007ffff610a55c in SystemDictionary::find_constrained_instance_or_array_klass (current=0x7ffff020a5e0, class_name=0x800481080, class_loader=...) 
at /data/openjdk/jdk_dev/src/hotspot/share/classfile/systemDictionary.cpp:1760 >> #3 0x00007ffff55f1f29 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7ffff00334f8, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:519 >> #4 0x00007ffff55f1d81 in ciEnv::get_klass_by_name_impl (this=0x7fffc12066c0, accessing_klass=0x7fff8c0f6278, cpool=..., name=0x7fff8c004518, require_local=false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:488 >> #5 0x00007ffff55f2466 in ciEnv::get_klass_by_index_impl (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:611 >> #6 0x00007ffff55f2632 in ciEnv::get_klass_by_index (this=0x7fffc12066c0, cpool=..., index=34, is_accessible=@0x7fffc1204540: 112, accessor=0x7fff8c0f6278) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:658 >> #7 0x00007ffff55fb1bd in ciField::ciField (this=0x7fff8c0f9208, klass=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciField.cpp:101 >> #8 0x00007ffff55f344f in ciEnv::get_field_by_index_impl (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:798 >> #9 0x00007ffff55f3506 in ciEnv::get_field_by_index (this=0x7fffc12066c0, accessor=0x7fff8c0f6278, index=65542) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciEnv.cpp:811 >> #10 0x00007ffff56284ec in ciBytecodeStream::get_field (this=0x7fffc1204850, will_link=@0x7fffc12046cf: false) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciStreams.cpp:274 >> #11 0x00007ffff562d10c in ciTypeFlow::StateVector::do_putstatic (this=0x7fff8c1bd078, str=0x7fffc1204850) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:798 >> #12 0x00007ffff562e960 in ciTypeFlow::StateVector::apply_one_bytecode (this=0x7fff8c1bd078, str=0x7fffc1204850) at 
/data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:1457 >> #13 0x00007ffff563218d in ciTypeFlow::flow_block (this=0x7fff8c0f7a50, block=0x7fff8c0f8ec0, state=0x7fff8c1bd078, jsrs=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2364 >> #14 0x00007ffff563332d in ciTypeFlow::df_flow_types (this=0x7fff8c0f7a50, start=0x7fff8c0f80a8, do_flow=true, temp_vector=0x7fff8c1bd078, temp_set=0x7fff8c1bd0b8) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2675 >> #15 0x00007ffff563361f in ciTypeFlow::flow_types (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2725 >> #16 0x00007ffff5634081 in ciTypeFlow::do_flow (this=0x7fff8c0f7a50) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciTypeFlow.cpp:2886 >> #17 0x00007ffff5605687 in ciMethod::get_flow_analysis (this=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/ci/ciMethod.cpp:327 >> #18 0x00007ffff5499d34 in InlineTree::check_can_parse (callee=0x7fff8c0f6340) at /data/openjdk/jdk_dev/src/hotspot/share/opto/bytecodeInfo.cpp:535 >> #19 0x00007ffff55a0806 in CallGenerator::for_osr (m=0x7fff8c0f6340, osr_bci=22) at /data/openjdk/jdk_dev/src/hotspot/share/opto/callGenerator.cpp:299 >> #20 0x00007ffff56aa2d9 in Compile::Compile (this=0x7fffc1205900, ci_env=0x7fffc12066c0, target=0x7fff8c0f6340, osr_bci=22, options=..., directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/compile.cpp:687 >> #21 0x00007ffff559da3e in C2Compiler::compile_method (this=0x7ffff0209f40, env=0x7fffc12066c0, target=0x7fff8c0f6340, entry_bci=22, install_code=true, directive=0x7ffff01588a0) at /data/openjdk/jdk_dev/src/hotspot/share/opto/c2compiler.cpp:108 >> #22 0x00007ffff56c7f6e in CompileBroker::invoke_compiler_on_method (task=0x7ffff02322d0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:2291 >> #23 0x00007ffff56c6aed in CompileBroker::compiler_thread_loop () at 
/data/openjdk/jdk_dev/src/hotspot/share/compiler/compileBroker.cpp:1966 >> #24 0x00007ffff56e6f9f in CompilerThread::thread_entry (thread=0x7ffff020a5e0, __the_thread__=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/compiler/compilerThread.cpp:59 >> #25 0x00007ffff614a42f in JavaThread::thread_main_inner (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1297 >> #26 0x00007ffff614a2c5 in JavaThread::run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:1280 >> #27 0x00007ffff6147b08 in Thread::call_run (this=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/share/runtime/thread.cpp:358 >> #28 0x00007ffff5eafa4f in thread_native_entry (thread=0x7ffff020a5e0) at /data/openjdk/jdk_dev/src/hotspot/os/linux/os_linux.cpp:705 >> #29 0x00007ffff779cea5 in start_thread (arg=0x7fffc1207700) at pthread_create.c:307 >> #30 0x00007ffff72c19fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 >> >> >> When `CodeDependenciesTest.foo` is compiled, the classloader of the holder of this method is AppClassLoader, so it finds `java.lang.Class` in AppClassLoader, and finds nothing, so it thinks that `java.lang.Class` is not loaded, but `java.lang.Class` is a built-in class which is definitely loaded and initialized. >> >> The following patch is the first patch we implemented, it tries to find Klass in the parent classloader when the Klass can not be found in current classloader. But the patch has potential risk, because user-defined classloader may not follow the 'parent delegation' style of classloading. 
>>
>> diff --git a/src/hotspot/share/ci/ciEnv.cpp b/src/hotspot/share/ci/ciEnv.cpp
>> index e29b56a..3ceafc4 100644
>> --- a/src/hotspot/share/ci/ciEnv.cpp
>> +++ b/src/hotspot/share/ci/ciEnv.cpp
>> @@ -514,11 +514,17 @@ ciKlass* ciEnv::get_klass_by_name_impl(ciKlass* accessing_klass,
>>    {
>>      ttyUnlocker ttyul;  // release tty lock to avoid ordering problems
>>      MutexLocker ml(current, Compile_lock);
>> -    Klass* kls;
>> -    if (!require_local) {
>> -      kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader);
>> -    } else {
>> -      kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain);
>> +    Klass* kls = NULL;
>> +    while (true) {
>> +      if (!require_local) {
>> +        kls = SystemDictionary::find_constrained_instance_or_array_klass(current, sym, loader);
>> +      } else {
>> +        kls = SystemDictionary::find_instance_or_array_klass(sym, loader, domain);
>> +      }
>> +      if (kls != NULL || loader() == NULL) {
>> +        break;
>> +      }
>> +      loader = Handle(current, java_lang_ClassLoader::parent(loader()));
>>      }
>>      found_klass = kls;
>>    }
>>
>>
>> When the Klass of the field is not loaded, the generated 'null check' helps nothing, so we think removing it is the right way to avoid the deoptimization.
>
> ?? has updated the pull request incrementally with two additional commits since the last revision:
>
>  - get the correct ciklass if java.lang.Class is not resolved before
>  - Revert "Remove un-necessary null check"
>
>    This reverts commit 12ae4bb441129c582e9f241e7b0bbde5de783533.

I think this will work but I'm a little uncomfortable with "hacking" the classes which can be seen. I think if `java.lang.Class` appears as loaded by a class loader, it should also be registered in its dictionary.
One possibility to achieve that would be to load `java.lang.Class` when the constant pool entry for a `ldc` bytecode is resolved, because in that case the code is implicitly creating a `java.lang.Class` instance anyway:

    diff --git a/src/hotspot/share/interpreter/interpreterRuntime.cpp b/src/hotspot/share/interpreter/interpreterRuntime.cpp
    index 5e61410ddc2..c4bda692f4e 100644
    --- a/src/hotspot/share/interpreter/interpreterRuntime.cpp
    +++ b/src/hotspot/share/interpreter/interpreterRuntime.cpp
    @@ -26,6 +26,7 @@
     #include "jvm_io.h"
     #include "classfile/javaClasses.inline.hpp"
     #include "classfile/symbolTable.hpp"
    +#include "classfile/systemDictionary.hpp"
     #include "classfile/vmClasses.hpp"
     #include "classfile/vmSymbols.hpp"
     #include "code/codeCache.hpp"
    @@ -156,6 +157,11 @@ JRT_ENTRY(void, InterpreterRuntime::ldc(JavaThread* current, bool wide))
       Klass* klass = pool->klass_at(index, CHECK);
       oop java_class = klass->java_mirror();
       current->set_vm_result(java_class);
    +  // Also load java.lang.Class from the class loader which loaded 'java_class'. This
    +  // might be required later in the compiler to avoid deoptimizations (see JDK-8278135).
    +  Handle loader (THREAD, pool->pool_holder()->class_loader());
    +  Handle protection_domain (THREAD, pool->pool_holder()->protection_domain());
    +  SystemDictionary::resolve_or_null(vmSymbols::java_lang_Class(), loader, protection_domain, THREAD);
     JRT_END

     JRT_ENTRY(void, InterpreterRuntime::resolve_ldc(JavaThread* current, Bytecodes::Code bytecode)) {

However, I'm really not sure what's the best solution here and I'd like to hear another opinion on this topic.

In the meantime, could you please also add a regression test to your PR? You can take e.g. https://git.openjdk.java.net/jdk/pull/6541 as an example of how that could be done.

src/hotspot/share/ci/ciEnv.cpp line 563:

> 561:   }
> 562:
> 563:   // java.lang.Class may not be loaded by AppClassLoader

This can happen for any user-defined class loader.
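
As a Java-level picture of the parent walk that the first quoted ciEnv.cpp patch performs, the sketch below follows `ClassLoader.getParent()` from a starting loader until it reaches `null`, which HotSpot uses to represent the bootstrap loader. This is only an analogue for illustration: it says nothing about the VM's SystemDictionary, the class and method names are made up, and a user-defined loader that overrides `loadClass` need not delegate along this chain at all, which is the risk mentioned in the quoted text.

```java
import java.util.ArrayList;
import java.util.List;

public class LoaderChainSketch {
    // Collect the class names of the loaders from 'start' up to the bootstrap
    // loader, which is represented by a null parent.
    static List<String> chain(ClassLoader start) {
        List<String> names = new ArrayList<>();
        for (ClassLoader l = start; l != null; l = l.getParent()) {
            names.add(l.getClass().getName());
        }
        names.add("<bootstrap>");
        return names;
    }

    // True if the class was defined by the bootstrap loader.
    static boolean definedByBootstrap(Class<?> c) {
        return c.getClassLoader() == null;
    }

    public static void main(String[] args) {
        // An application class is defined by the application loader, while
        // java.lang.Class itself is defined by the bootstrap loader.
        System.out.println(chain(LoaderChainSketch.class.getClassLoader()));
        System.out.println(definedByBootstrap(Class.class));
    }
}
```

The defining loader of `java.lang.Class` sits at the end of the chain, not at the application loader where the compiler's lookup in the quoted stack trace started.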
------------- PR: https://git.openjdk.java.net/jdk/pull/6667 From phedlin at openjdk.java.net Tue Dec 7 13:44:21 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Tue, 7 Dec 2021 13:44:21 GMT Subject: Withdrawn: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:09:07 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > Implementation focusing on balance between small footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check. > > Testing: tier1-6 > > Benchmarks, 18-b26 vs. update (ran on Aurora/Ampere Altra): > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII..........72.23% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5...........70.38% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15....67.81% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16......... 3.72% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8..........68.50% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ASCII...........65.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:BIG5............60.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ISO_8859_15.....63.79% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_16.......... 
1.04% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_8...........63.33% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ASCII............57.25% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:BIG5.............49.33% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ISO_8859_15......61.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_16........... 0.02% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_8............54.75% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII............54.52% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5.............40.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15......58.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16...........-0.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8............55.98% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII............47.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5.............36.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15......50.83% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16........... 8.63% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8............48.95% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII.............17.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5..............18.58% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15.......20.82% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16............ 
4.16% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8.............18.44% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII.............21.96% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5..............22.42% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15.......30.27% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16............-1.17% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8.............35.99% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII............. 6.19% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5.............. 7.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15....... 8.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16............-0.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8............. 6.80% This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/6723 From dcubed at openjdk.java.net Tue Dec 7 14:40:25 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 7 Dec 2021 14:40:25 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> Message-ID: On Thu, 2 Dec 2021 14:41:53 GMT, Roman Kennke wrote: >> The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). 
C2 doesn't have any such mechanism at all.
>> I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful.
>>
>> The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work.
>>
>> Testing:
>>  - [x] tier1
>>  - [x] tier2
>>  - [x] tier3
>>  - [ ] tier4
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   PPC port by @TheRealMDoerr

Mach5 Tier1:
- dcubed-8276901_for_jdk18.git-20211206-1656-26967213
- no test failures

Mach5 Tier2:
- dcubed-8276901_for_jdk18.git-20211206-2009-26973681
- 2 known, unrelated test failures

Mach5 Tier3:
- dcubed-8276901_for_jdk18.git-20211206-2009-26973700
- no failures

Mach5 Tier4:
- dcubed-8276901_for_jdk18.git-20211206-2124-26976450
- 11 known, unrelated test failures

Mach5 Tier5:
- dcubed-8276901_for_jdk18.git-20211206-2125-26976484
- 3 known, unrelated failures
- still running macosx-x64 tasks

Mach5 Tier6:
- dcubed-8276901_for_jdk18.git-20211206-2220-26978292
- still running

Mach5 Tier7:
- dcubed-8276901_for_jdk18.git-20211207-0044-26982843
- 2 known, unrelated failures
- still running macosx-x64 tasks

Mach5 Tier8:
- dcubed-8276901_for_jdk18.git-20211207-0045-26982871
- still running 24 hour tasks and macosx-x64 tasks

So Tier[1-4] are complete and look fine. Mach5 is a bit overloaded at the moment so Tier[5-8] are still running. I think the Tier[1-4] results and the partial Tier[5-8] are good enough to say that these bits are okay.
------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From rkennke at openjdk.java.net Tue Dec 7 14:40:25 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 7 Dec 2021 14:40:25 GMT Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> Message-ID: On Tue, 7 Dec 2021 14:34:48 GMT, Daniel D. Daugherty wrote: > So Tier[1-4] are complete and look fine. Mach5 is a bit over loaded at the moment so Tier[5-8] are still running. I think the Tier[1-4] results and the partial Tier[5-8] are good enough to say that these bits are okay. Thanks, David! I will then go ahead and integrate this PR. Cheers, Roman ------------- PR: https://git.openjdk.java.net/jdk/pull/6320 From rkennke at openjdk.java.net Tue Dec 7 14:45:20 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 7 Dec 2021 14:45:20 GMT Subject: Integrated: 8276901: Implement UseHeavyMonitors consistently In-Reply-To: References: Message-ID: On Tue, 9 Nov 2021 21:54:58 GMT, Roman Kennke wrote: > The flag UseHeavyMonitors seems to imply that it makes Hotspot always use inflated monitors, rather than stack locks. However, it is only implemented in the interpreter that way. When it calls into runtime, it would still happily stack-lock. Even worse, C1 uses another flag UseFastLocking to achieve something similar (with the same caveat that runtime would stack-lock anyway). C2 doesn't have any such mechanism at all. > I would like to experiment with disabling stack-locking, and thus, having this flag work as expected would seem very useful. > > The change removes the C1 flag UseFastLocking, and replaces its uses with equivalent (i.e. inverted) UseHeavyMonitors instead. I think it makes sense to make UseHeavyMonitors develop (I wouldn't want anybody to use this in production, not currently without this change, and not with this change). 
I also added a flag VerifyHeavyMonitors to be able to verify that stack-locking is really disabled. We can't currently verify this unconditionally (e.g. in debug builds) because all non-x86_64 platforms would need work.
>
> Testing:
>  - [x] tier1
>  - [x] tier2
>  - [x] tier3
>  - [ ] tier4

This pull request has now been integrated.

Changeset: 5b81d5ee
Author: Roman Kennke
URL: https://git.openjdk.java.net/jdk/commit/5b81d5eeb4124ff04dc3b9a96d0b53edcfa07c5f
Stats: 462 lines in 19 files changed: 144 ins; 20 del; 298 mod

8276901: Implement UseHeavyMonitors consistently

Reviewed-by: coleenp, mdoerr, dcubed

-------------

PR: https://git.openjdk.java.net/jdk/pull/6320

From mdoerr at openjdk.java.net Thu Dec 9 17:09:21 2021
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Thu, 9 Dec 2021 17:09:21 GMT
Subject: Integrated: 8253860: PPC: Relocation::pd_set_data_value conflates compressed oops and klasses
In-Reply-To:
References:
Message-ID:

On Mon, 6 Dec 2021 09:29:45 GMT, Martin Doerr wrote:

> Casting a narrow Klass pointer to a narrow Oop is problematic. Note that `NativeMovConstReg::set_narrow_oop` only supports narrow Oops.
> It turns out that the problematic code is unused. We never patch narrow Klass pointers in the instruction stream on PPC64 (`metadata_Relocation::pd_fix_value` has an empty implementation). In contrast to that, narrow Oops in the instruction stream always get patched when the nmethod gets installed (by `fix_oop_relocations`). This makes sense as Metadata doesn't get relocated, but Oops may be moved by GC and the instructions need to get the current value during nmethod installation.
> Note that the initial constants we were using for narrow Oops in the instruction stream were not correct (Oop compression missing, not updated by GC). So, I think it's better to use 0 to avoid confusion.

This pull request has now been integrated.
Changeset: 01b30bfa Author: Martin Doerr URL: https://git.openjdk.java.net/jdk/commit/01b30bfa99e95cf1e9209c8de1f3c3c762596708 Stats: 39 lines in 5 files changed: 1 ins; 24 del; 14 mod 8253860: PPC: Relocation::pd_set_data_value conflates compressed oops and klasses Reviewed-by: dlong, rrich ------------- PR: https://git.openjdk.java.net/jdk/pull/6716 From psandoz at openjdk.java.net Thu Dec 9 17:49:16 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 9 Dec 2021 17:49:16 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v4] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 10:52:36 GMT, Mai ??ng Qu?n Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai ??ng Qu?n Anh has updated the pull request incrementally with one additional commit since the last revision: > > fix cpu pattern on Neon Very nice work. Just a few comments. test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastAVX1.java line 50: > 48: String testMethods = String.join(",", TestCastMethods.AVX1_CAST_TESTS.stream() > 49: .map(VectorSpeciesPair::format) > 50: .toList()); Use a joining collector: Suggestion: String testMethods = TestCastMethods.AVX1_CAST_TESTS.stream() .map(VectorSpeciesPair::format) .collect(joining(",")); There is also a fair bit of repetition in each Test class, consider an abstract class with static method accepting the test class, additional flags, and test cases list? That would be more applicable if three general cases in `TestVectorReinterpret` were separated out, or at least the expand tests from the rebracket tests. 
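The joining-collector suggestion above can be sketched in isolation. This is only an illustration under stated assumptions: plain strings stand in for the test's `VectorSpeciesPair` values, and the class name `JoiningSketch`, the sample names, and both helper methods are hypothetical, not taken from the patch. It compares the list-then-join shape with joining directly in the collector.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JoiningSketch {
    // Hypothetical stand-ins for the formatted test-method names.
    static final List<String> NAMES =
            List.of("testB64toS128", "testB64toI128", "testB64toF128");

    // Shape used before the review: build an intermediate list, then join it.
    static String viaStringJoin(List<String> names) {
        return String.join(",", names.stream()
                .map(String::trim)
                .collect(Collectors.toList()));
    }

    // Suggested shape: let the collector do the joining, no intermediate list.
    static String viaCollector(List<String> names) {
        return names.stream()
                .map(String::trim)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(viaCollector(NAMES));
    }
}
```

Both helpers return the same string; the collector form simply avoids materializing the list.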
test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorReinterpret.java line 48: > 46: private static final List SHAPE_LIST = List.of( > 47: VectorShape.S_64_BIT, VectorShape.S_128_BIT, VectorShape.S_256_BIT, VectorShape.S_512_BIT > 48: ); Suggestion: private static final List SHAPE_LIST = List.of(VectorShape.values()); test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorReinterpret.java line 58: > 56: expandShrink.addHelperClasses(VectorReshapeHelper.class); > 57: expandShrink.addFlags("--add-modules=jdk.incubator.vector"); > 58: var expandShrinkTests = String.join(",", SHAPE_LIST.stream() Use `collect(joining(","))` here and in other places. test/hotspot/jtreg/compiler/vectorapi/reshape/tests/TestVectorCast.java line 39: > 37: * In each cast, the VectorCastNode is expected to appear exactly once. > 38: */ > 39: public class TestVectorCast { At some point i would like to explore autogenerating such files., from another Java program and a template mechanism (unfortunately we cannot use a library like Java Poet). Not now though! This also relates to redesigning the existing vector unit tests, moving away from a bash script to a more flexible Java program, from which we can generate unit tests and/or IR tests. The challenge with the IR tests is knowing associated IR nodes and the set supported on various platforms. I like how you enumerated the list for the conversions. Ideally we could go to the source of truth in C2 and determine that, but it does not seem easy to determine. test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 42: > 40: makePair(BSPEC64, ISPEC128), > 41: makePair(BSPEC64, FSPEC128), > 42: // makePair(BSPEC64, DSPEC256), May later we consider a negative test that we don't expect the required IR node(s). I am uncertain in these cases whether it's an implementation restriction or a hardware restriction, if the former when the restriction goes away the test will fail and we update it. 
test/hotspot/jtreg/compiler/vectorapi/reshape/utils/VectorReshapeHelper.java line 296: > 294: for (int i = 0; i < osp.vectorByteSize(); i++) { > 295: int expected = i < isp.vectorByteSize() ? UnsafeUtils.getByte(input, ibase, i) : 0; > 296: int actual = UnsafeUtils.getByte(output, obase, i); When `MemorySegment` previews we can remove the use of unsafe accesses, since we can view the array as a segment. Same i think also applies to direct VH accessors. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From psandoz at openjdk.java.net Thu Dec 9 17:52:21 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 9 Dec 2021 17:52:21 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v6] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 07:21:36 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. >> 2) X86 backend support for AVX512 and AVX2 targets. 
>> 3) New IR transformation to handle following patterns:- >> a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) >> b) Long2Mask + Mask2Long -> Long >> 4) Following performance data is collected for new JMH micro included with the patch:- >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor >> -- | -- | -- | -- | -- | -- | -- >> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 >> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 >> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 >> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 >> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969 >> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 >> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 >> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 >> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 >> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 >> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 >> 
MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 >> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 >> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 >> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 >> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 >> >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277997: Review comments resolution. 64-bit tests passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/6646 From jbhateja at openjdk.java.net Thu Dec 9 18:19:18 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 9 Dec 2021 18:19:18 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v6] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 16:15:17 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8277997: Review comments resolution. > > Looks like the issue only with AVX512. I don't see failures with AVX2 or AVX1. Hi @vnkozlov , thanks for reporting this, it look like a bug unrelated to this patch, its occurring because MaskAll instruction pattern is not supported for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector and after unboxing-boxing optimization this vector eventually reaches to XorVMask which see one operand in opmask register and other in vector. 
Current patch extends the C2 inline expansions broadcast Coerced routine to support VectorMask.fromLong operation. I have created a separate [bug entry](https://bugs.openjdk.java.net/browse/JDK-8278508) for it, targeted for JDK-18. Please let me know if there are other comments on this patch. ------------- PR: https://git.openjdk.java.net/jdk/pull/6646 From kvn at openjdk.java.net Thu Dec 9 19:16:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 9 Dec 2021 19:16:16 GMT Subject: RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v6] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 07:21:36 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. >> 2) X86 backend support for AVX512 and AVX2 targets. >> 3) New IR transformation to handle following patterns:- >> a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) >> b) Long2Mask + Mask2Long -> Long >> 4) Following performance data is collected for new JMH micro included with the patch:- >> >> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor >> -- | -- | -- | -- | -- | -- | -- >> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 >> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 >> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 >> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 >> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619
| 36397.005 | 1.478313969 >> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 >> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 >> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 >> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 >> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 >> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 >> MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 >> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 >> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 >> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 >> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 >> >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8277997: Review comments resolution. Good. ------------- Marked as reviewed by kvn (Reviewer). 
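The round trip behind pattern 3b in the quoted summary (Long2Mask followed by Mask2Long) can be modelled in plain Java. This is a sketch only: it assumes lane i corresponds to bit i of the long, uses a `boolean[]` in place of a real `VectorMask`, and the class and method names are hypothetical. It does not use the Vector API and is not the C2 transformation itself.

```java
public class MaskLongModel {
    // Model of Long2Mask: lane i is set when bit i of 'bits' is set.
    // Assumes 0 < lanes <= 64.
    static boolean[] fromLong(long bits, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = ((bits >>> i) & 1L) != 0;
        }
        return mask;
    }

    // Model of Mask2Long: pack lane i into bit i of the result.
    static long toLong(boolean[] mask) {
        long bits = 0L;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long in = 0b1011_0101L;
        // With 8 lanes, every set bit of 'in' maps to a lane and back again.
        System.out.println(Long.toBinaryString(toLong(fromLong(in, 8))));
    }
}
```

In this model the round trip returns the input whenever all set bits fall within the lane count; bits above the lane count are dropped. How the actual IR rule treats such bits is not stated in the thread.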
PR: https://git.openjdk.java.net/jdk/pull/6646 From kvn at openjdk.java.net Thu Dec 9 19:23:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 9 Dec 2021 19:23:14 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v4] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 10:52:36 GMT, Mai ??ng Qu?n Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai ??ng Qu?n Anh has updated the pull request incrementally with one additional commit since the last revision: > > fix cpu pattern on Neon @chhagedorn please, review these new IR framework tests. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From david.holmes at oracle.com Thu Dec 9 22:07:55 2021 From: david.holmes at oracle.com (David Holmes) Date: Fri, 10 Dec 2021 08:07:55 +1000 Subject: RFR: 8276901: Implement UseHeavyMonitors consistently [v11] In-Reply-To: References: <3ij7xShJzxFD9U79iHoJPnHJzg6rPv7T1r67gIqEuD4=.3b964bd4-1374-4d14-a4b3-e209c5954093@github.com> Message-ID: <7864a525-2d4b-9e97-de45-800e56ddd277@oracle.com> (re-sending for the mailing lists as previous mail seems lost) On 8/12/2021 12:40 am, Roman Kennke wrote: > On Tue, 7 Dec 2021 14:34:48 GMT, Daniel D. Daugherty wrote: > >> So Tier[1-4] are complete and look fine. Mach5 is a bit over loaded at the moment so Tier[5-8] are still running. I think the Tier[1-4] results and the partial Tier[5-8] are good enough to say that these bits are okay. > > Thanks, David! I will then go ahead and integrate this PR. That was Daniel not David. Sorry I didn't get back to this sooner but ... 
I am not happy with the argument processing in Arguments::check_vm_args_consistency. This method does not call fatal(), it reports a warning or error message and returns false if VM initialization should not continue. Also for platforms where UseHeavyMonitors is not fully implemented it should not just be a warning IMO but also cause initialization to not continue. If we are going to return false then the error message should go to the error stream not be issued as a warning(). Please file a follow-up issue and fix this. I remain concerned about the impact of this change on platforms that don't support UseHeavyMonitors fully. David > Cheers, > Roman > > ------------- > > PR: https://git.openjdk.java.net/jdk/pull/6320 From duke at openjdk.java.net Thu Dec 9 22:27:35 2021 From: duke at openjdk.java.net (Dan Lutker) Date: Thu, 9 Dec 2021 22:27:35 GMT Subject: RFR: 8278381: [GCC 11] Address::make_raw() does not initialize rspec Message-ID: The issue was encountered on OpenJDK11 with the error below, there is no error in tip because of the changes in [8240669](https://bugs.openjdk.java.net/browse/JDK-8240669). Fixing here for correctness and to backport.
.../src/hotspot/cpu/x86/assembler_x86.cpp: In static member function 'static Address Address::make_raw(int, int, int, int, relocInfo::relocType)': .../src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[3]' is used uninitialized [-Werror=uninitialized] 189 | RelocationHolder rspec; | ^~~~~ ...src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[2]' is used uninitialized [-Werror=uninitialized] ------------- Commit messages: - 8278381: [GCC 11] Address::make_raw() does not initialize rspec Changes: https://git.openjdk.java.net/jdk/pull/6785/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6785&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278381 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6785.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6785/head:pull/6785 PR: https://git.openjdk.java.net/jdk/pull/6785 From duke at openjdk.java.net Thu Dec 9 23:43:35 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 9 Dec 2021 23:43:35 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 Message-ID: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. I also added a test case to detect this overrun. 
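The length condition described above can be restated as a small predicate. This is only a restatement of the message's wording, not the stub generator's code; the class and method names are hypothetical.

```java
public class OverrunCondition {
    // Per the description: no overwrite when the encoded length is below 64,
    // and none when (length mod 64) >= 16. What remains is the affected case.
    static boolean mayOverrun(int encodedLength) {
        return encodedLength >= 64 && encodedLength % 64 < 16;
    }

    public static void main(String[] args) {
        // A few encoded lengths on either side of the boundaries.
        for (int len : new int[] {48, 64, 79, 80, 128, 140}) {
            System.out.println(len + " -> " + mayOverrun(len));
        }
    }
}
```

A regression test along these lines would want encoded inputs whose lengths satisfy the predicate, for example 64 or 79 bytes, with a guard region after the output buffer.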
------------- Commit messages: - Add buffer overrun check for decode - Add masked write Changes: https://git.openjdk.java.net/jdk/pull/6786/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6786&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8273108 Stats: 12 lines in 2 files changed: 7 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6786.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6786/head:pull/6786 PR: https://git.openjdk.java.net/jdk/pull/6786 From sviswanathan at openjdk.java.net Thu Dec 9 23:43:38 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 9 Dec 2021 23:43:38 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6264: > 6262: __ jcc(Assembler::lessEqual, L_finalBit); > 6263: > 6264: __ mov64(rax, 0x0000ffffffffffff); The constant should have an l suffix. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Thu Dec 9 23:54:12 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Thu, 9 Dec 2021 23:54:12 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 23:10:00 GMT, Sandhya Viswanathan wrote: >> The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. >> >> I also added a test case to detect this overrun. > > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6264: > >> 6262: __ jcc(Assembler::lessEqual, L_finalBit); >> 6263: >> 6264: __ mov64(rax, 0x0000ffffffffffff); > > The constant should have an l suffix. I do not believe this is necessary. There are multiple occurrences of mov64()s without the `l` suffix. For example, lines 687-688: __ mov64(c_rarg3, 0x8000000000000000); __ mov64(rax, 0x7fffffffffffffff); ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Thu Dec 9 23:56:41 2021 From: duke at openjdk.java.net (Dan Lutker) Date: Thu, 9 Dec 2021 23:56:41 GMT Subject: RFR: 8278525: Additional -Wnonnull errors happen with GCC 11 Message-ID: 8271515 introduced more issues with xmm0 ------------- Commit messages: - 8278525: Additional -Wnonnull errors happen with GCC 11. 
Changes: https://git.openjdk.java.net/jdk/pull/6788/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6788&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278525 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6788.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6788/head:pull/6788 PR: https://git.openjdk.java.net/jdk/pull/6788 From sviswanathan at openjdk.java.net Fri Dec 10 00:00:17 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Dec 2021 00:00:17 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 23:50:52 GMT, Scott Gibbons wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6264: >> >>> 6262: __ jcc(Assembler::lessEqual, L_finalBit); >>> 6263: >>> 6264: __ mov64(rax, 0x0000ffffffffffff); >> >> The constant should have an l suffix. > > I do not believe this is necessary. There are multiple occurrences of mov64()s without the `l` suffix. For example, lines 687-688: > > __ mov64(c_rarg3, 0x8000000000000000); > __ mov64(rax, 0x7fffffffffffffff); You are right, the code looks good. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From sviswanathan at openjdk.java.net Fri Dec 10 00:04:12 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Dec 2021 00:04:12 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. @asgibbons The change looks good to me. Could you please create this PR versus JDK 18 (https://github.com/openjdk/jdk18). ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From kvn at openjdk.java.net Fri Dec 10 00:11:12 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 00:11:12 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. 
So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. Yes, new PR have to be filed based on jdk18 repo pointed by Sandhya because we need to fix it in JDK 18. After integration the fix will be automatically pushed into JDK 19 (current repo). ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Fri Dec 10 00:23:14 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Fri, 10 Dec 2021 00:23:14 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: <3vmsCM9a-EVn58115yZR6TU4yKCDp_XmfUr1r9F38T4=.112cf3ad-3968-43e4-ad3e-d7b97af6438f@github.com> On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. I just created a PR (https://github.com/openjdk/jdk18/pull/4) on the jdk-18 branch. Thanks for the heads-up, ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Fri Dec 10 00:34:55 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Fri, 10 Dec 2021 00:34:55 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 Message-ID: The base64 decoder overwrites memory past the end of its output buffer in certain cases. 
It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. I also added a test case to detect this overrun. ------------- Commit messages: - Apply Base64 buffer overrun fix to JDK 18 Changes: https://git.openjdk.java.net/jdk18/pull/4/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=4&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8273108 Stats: 12 lines in 2 files changed: 7 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk18/pull/4.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/4/head:pull/4 PR: https://git.openjdk.java.net/jdk18/pull/4 From sviswanathan at openjdk.java.net Fri Dec 10 01:21:36 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 10 Dec 2021 01:21:36 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/4 From njian at openjdk.java.net Fri Dec 10 01:22:12 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Fri, 10 Dec 2021 01:22:12 GMT Subject: RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 8 Dec 2021 12:45:16 GMT, Hao Sun wrote: > If you want it in JDK 18 (and I think it makes sense), you need to integrate it by Dec 9, 16:00 UTC. This is P2 bug, so I think it's OK for jdk18. But you need to create a PR for https://github.com/openjdk/jdk18. The patch looks good to me. ------------- PR: https://git.openjdk.java.net/jdk/pull/6759 From jbhateja at openjdk.java.net Fri Dec 10 01:53:21 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 10 Dec 2021 01:53:21 GMT Subject: Integrated: 8277997: Intrinsic creation for VectorMask.fromLong API In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 18:23:27 GMT, Jatin Bhateja wrote: > Summary of changes: > > 1) Inline expansion of VectorMask.fromLong API, this includes Java API implementation and C2 IR changes. > 2) X86 backend support for AVX512 and AVX2 targets. 
> 3) New IR transformation to handle following patterns:- > a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal) > b) Long2Mask + Mask2Long -> Long > 4) Following performance data is collected for new JMH micro included with the patch:- > > System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor | Baseline AVX3 (ops/ms) | Withopt AVX3(ops/ms) | Gain factor > -- | -- | -- | -- | -- | -- | -- > MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287 > MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352 > MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693 > MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338 > MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969 > MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971 > MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897 > MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669 > MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529 > MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602 > MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714 > MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 
20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797 > MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191 > MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231 > MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036 > MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426 > > > > Kindly review and share feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 0113322a Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/0113322ac15e2441def3dec599199b98cbd02961 Stats: 618 lines in 81 files changed: 378 ins; 25 del; 215 mod 8277997: Intrinsic creation for VectorMask.fromLong API Reviewed-by: psandoz, kvn, sviswanathan ------------- PR: https://git.openjdk.java.net/jdk/pull/6646 From dlong at openjdk.java.net Fri Dec 10 01:57:19 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 01:57:19 GMT Subject: RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 8 Dec 2021 03:59:09 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. 
three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Marked as reviewed by dlong (Reviewer). This looks fine. I think the cmove rules could be consolidated into fewer rules if the flags operand types were collapsed into fewer types like x64 does, but that's a bigger change and probably deserves its own RFE. ------------- PR: https://git.openjdk.java.net/jdk/pull/6759 From haosun at openjdk.java.net Fri Dec 10 01:57:19 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 01:57:19 GMT Subject: RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:19:34 GMT, Ningsheng Jian wrote: > > If you want it in JDK 18 (and I think it makes sense), you need to integrate it by Dec 9, 16:00 UTC. > > This is P2 bug, so I think it's OK for jdk18. But you need to create a PR for https://github.com/openjdk/jdk18. > > The patch looks good to me. Thanks for your kind reminder. I just created one PR at JDK 18 repo. See https://github.com/openjdk/jdk18/pull/6 Hence, I'd like to close current PR now.
------------- PR: https://git.openjdk.java.net/jdk/pull/6759 From haosun at openjdk.java.net Fri Dec 10 01:57:19 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 01:57:19 GMT Subject: Withdrawn: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Wed, 8 Dec 2021 03:59:09 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6759 From haosun at openjdk.java.net Fri Dec 10 01:57:37 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 01:57:37 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 Message-ID: JDK-8276162 introduced an optimization that creates the following two kinds of IR shapes: 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based on the comparison result of two unsigned longs. 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers based on the comparison result of two unsigned longs. But the corresponding match rules are missing for arm32. JDK-8277324 and JDK-8277753 complemented the missing rules for x86-32. We do the same thing to arm32 in this patch. For IR shape 1, the missing rules in arm32 are in form of "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three condition code registers for unsigned long comparisons). For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where CC can be reg/imm, and BB has the same meaning with IR shape 1. Minor updates: 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". 3. Add "cmovIL_immMov_BB" rules. 4. Style issue: remove the extra space in the predicate statement for cmov* rules. Test: We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" errors are gone without introducing test regression. 
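For readers unfamiliar with the two IR shapes described above, the following is a minimal Java sketch of source code that could plausibly lead to them. It assumes that the pattern targeted by JDK-8276162 is the classic `MIN_VALUE`-offset idiom for unsigned comparison; the class and method names are hypothetical, and whether C2 actually emits `CMoveI (Bool (CmpUL ...))` or `CMoveP (Bool (CmpUL ...))` for these methods depends on profiling and JIT heuristics, so treat it as an illustration rather than a reproducer from the patch.

```java
public class UnsignedSelect {
    // Hypothetical example: pick one of two ints based on an unsigned
    // comparison of two longs (candidate for CMoveI fed by CmpUL).
    static int selectInt(long a, long b, int x, int y) {
        // Adding Long.MIN_VALUE flips the sign bit, so a signed '<' on the
        // shifted values behaves like an unsigned '<' on the originals.
        return (a + Long.MIN_VALUE < b + Long.MIN_VALUE) ? x : y;
    }

    // Hypothetical example: same comparison, but selecting between two
    // references (candidate for CMoveP fed by CmpUL).
    static Object selectRef(long a, long b, Object p, Object q) {
        return (a + Long.MIN_VALUE < b + Long.MIN_VALUE) ? p : q;
    }

    public static void main(String[] args) {
        // -1L is the largest value when interpreted as unsigned.
        System.out.println(selectInt(1L, -1L, 10, 20));
        System.out.println(selectInt(-1L, 1L, 10, 20));
        System.out.println(selectRef(1L, -1L, "first", "second"));
    }
}
```

The result of the idiom can be cross-checked against `Long.compareUnsigned(a, b) < 0`, which is the library form of the same comparison.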
------------- Commit messages: - 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 Changes: https://git.openjdk.java.net/jdk18/pull/6/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=6&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277621 Stats: 273 lines in 1 file changed: 240 ins; 0 del; 33 mod Patch: https://git.openjdk.java.net/jdk18/pull/6.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/6/head:pull/6 PR: https://git.openjdk.java.net/jdk18/pull/6 From haosun at openjdk.java.net Fri Dec 10 02:08:16 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 02:08:16 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. 
Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Hi @dean-long I'm afraid I was closing the PR https://github.com/openjdk/jdk/pull/6759 at the same time of receiving your comment/approval. Could you help to review this PR AGAIN? @shipilev and @dean-long Thanks in advance. Note that nothing is changed in this PR, compared to previous one. ------------- PR: https://git.openjdk.java.net/jdk18/pull/6 From mli at openjdk.java.net Fri Dec 10 02:36:39 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 02:36:39 GMT Subject: RFR: 8278532: Fix some typos in C1 comments Message-ID: This is a trivial patch to fix some typos in C1 comments. ------------- Commit messages: - Initial commit Changes: https://git.openjdk.java.net/jdk/pull/6793/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6793&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278532 Stats: 5 lines in 3 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6793.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6793/head:pull/6793 PR: https://git.openjdk.java.net/jdk/pull/6793 From jiefu at openjdk.java.net Fri Dec 10 02:47:13 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 10 Dec 2021 02:47:13 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 02:27:23 GMT, Hamlin Li wrote: > This is a trivial patch to fix some typos in C1 comments. Looks good. How about fixing this typo (`shoudl` --> `should`) together in this pr https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.hpp#L244 . Thanks. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From mli at openjdk.java.net Fri Dec 10 03:03:18 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 03:03:18 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: <2Kq8eB-R1wbQ0CHLR1_Hr8SlvV395cM8dU2-HD0Kn4Y=.39a9e319-1774-49f8-ab69-9addfd286914@github.com> On Fri, 10 Dec 2021 02:44:30 GMT, Jie Fu wrote: >> This is a trivial patch to fix some typos in C1 comments. > > Looks good. > > How about fixing this typo (`shoudl` --> `should`) together in this pr https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.hpp#L244 . > Thanks. Thanks @DamonFool , I will go through comments in C2 later, how about I fix them in another pr, I created https://bugs.openjdk.java.net/browse/JDK-8278535 to track the issue. This one is trivial, right? ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From dlong at openjdk.java.net Fri Dec 10 03:06:11 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 03:06:11 GMT Subject: RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> References: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> Message-ID: <064CCgMDxJfl1j1YM5BArePtzyhX4ce6obFKBvt9FcQ=.7445baa6-a0dc-4c23-9840-006dcb22d2c1@github.com> On Wed, 8 Dec 2021 23:49:25 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. I guess I'll need to retarget this for 18. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6776 From njian at openjdk.java.net Fri Dec 10 03:13:40 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Fri, 10 Dec 2021 03:13:40 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Marked as reviewed by njian (Committer). 
------------- PR: https://git.openjdk.java.net/jdk18/pull/6 From jiefu at openjdk.java.net Fri Dec 10 03:16:14 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 10 Dec 2021 03:16:14 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 02:44:30 GMT, Jie Fu wrote: >> This is a trivial patch to fix some typos in C1 comments. > > Looks good. > > How about fixing this typo (`shoudl` --> `should`) together in this pr https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/phaseX.hpp#L244 . > Thanks. > Thanks @DamonFool , I will go through comments in C2 later, how about I fix them in another pr, I created https://bugs.openjdk.java.net/browse/JDK-8278535 to track the issue. > > This one is trivial, right? I would suggest fixing them together in only one PR. The JBS title can be adjusted as something like: Fix some typos in compiler comments ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From dlong at openjdk.java.net Fri Dec 10 03:21:16 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 03:21:16 GMT Subject: Withdrawn: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> References: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> Message-ID: On Wed, 8 Dec 2021 23:49:25 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6776 From mli at openjdk.java.net Fri Dec 10 03:26:13 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 03:26:13 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 02:27:23 GMT, Hamlin Li wrote: > This is a trivial patch to fix some typos in C1 comments. Thanks for your opinion. Normally, we don't include code in many areas in one pr unless there are strong reasons to do that. I will hold on for another reviewer's opinion. ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From jiefu at openjdk.java.net Fri Dec 10 03:36:12 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 10 Dec 2021 03:36:12 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: <5FROrvN6Btub1q9z4Z0nSMH2qfDcIsjh0bb4tP4nrIs=.6c89cd2e-2ead-490c-99d2-7f70a7eb0332@github.com> On Fri, 10 Dec 2021 03:23:24 GMT, Hamlin Li wrote: > Normally, we don't include code in many areas in one pr unless there are strong reasons to do that. These typos belong to hotspot compiler area. So I don't think they are in different areas. ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From duke at openjdk.java.net Fri Dec 10 03:45:06 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Fri, 10 Dec 2021 03:45:06 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v5] In-Reply-To: References: Message-ID: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. > > While working on this patch, I spot some regressions regarding compilation on AVX1. > > Thank you very much. 
Mai Đặng Quân Anh has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: - remove imports - refactor main methods - Merge branch 'master' into vectorReshapeTests - fix cpu pattern on Neon - concretify parameter types - reduce invocation count, improve test cases - Merge branch 'master' into vectorReshapeTests - grammar in comment - missing copyright - add comments - ... and 3 more: https://git.openjdk.java.net/jdk/compare/b3acd1f4...5d3b23a7 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6724/files - new: https://git.openjdk.java.net/jdk/pull/6724/files/a2cee23d..5d3b23a7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=03-04 Stats: 8492 lines in 297 files changed: 4875 ins; 1782 del; 1835 mod Patch: https://git.openjdk.java.net/jdk/pull/6724.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6724/head:pull/6724 PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Fri Dec 10 03:48:14 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Fri, 10 Dec 2021 03:48:14 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v4] In-Reply-To: References: Message-ID: <8K-qiGzSRXoM-yL5ukH1lP7vDj8VRMKyvIgKzgx8PSM=.b13022a5-b7bb-4e20-bd9d-d1f8f9f6bbfb@github.com> On Thu, 9 Dec 2021 16:59:18 GMT, Paul Sandoz wrote: >> Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: >> >> fix cpu pattern on Neon > > test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastAVX1.java line 50: > >> 48: String testMethods = String.join(",", TestCastMethods.AVX1_CAST_TESTS.stream() >> 49: 
.map(VectorSpeciesPair::format) >> 50: .toList()); > > Use a joining collector: > Suggestion: > > String testMethods = TestCastMethods.AVX1_CAST_TESTS.stream() > .map(VectorSpeciesPair::format) > .collect(joining(",")); > > > There is also a fair bit of repetition in each Test class, consider an abstract class with static method accepting the test class, additional flags, and test cases list? That would be more applicable if three general cases in `TestVectorReinterpret` were separated out, or at least the expand tests from the rebracket tests. Done, I created a helper method in `VectorReshapeHelper` and pass the necessary information to it. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From dlong at openjdk.java.net Fri Dec 10 03:53:52 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 03:53:52 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" Message-ID: C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. 
Other changes: use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer improve VerifyStack failure output ------------- Commit messages: - runtime/BootstrapMethod/BSMCalledTwice.java is a reproducer for 8262134 - print current bytecode if VerifyStack fails - set reexecute flag on c1 patching stubs Changes: https://git.openjdk.java.net/jdk18/pull/7/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=7&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8262134 Stats: 23 lines in 5 files changed: 16 ins; 1 del; 6 mod Patch: https://git.openjdk.java.net/jdk18/pull/7.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/7/head:pull/7 PR: https://git.openjdk.java.net/jdk18/pull/7 From kvn at openjdk.java.net Fri Dec 10 04:00:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 04:00:14 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. Let me test it before approval. 
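The length condition given in the quoted description of 8273108 can be written down as a small predicate, which may help when choosing input sizes for a regression test. This is only a sketch of the condition as stated in the message (encoded length of at least 64 bytes, and encoded length mod 64 less than 16); the class and method names are hypothetical and it does not reproduce the decoder or the test added by the PR.

```java
public class Base64OverrunLengths {
    // Hypothetical helper: returns true when the encoded input length falls
    // in the range that the message above describes as able to write past
    // the end of the output buffer before the fix.
    static boolean mayOverrun(int encodedLength) {
        return encodedLength >= 64 && encodedLength % 64 < 16;
    }

    public static void main(String[] args) {
        // Print the classification for a few boundary lengths.
        int[] lengths = {0, 15, 63, 64, 79, 80, 127, 128, 143, 144};
        for (int len : lengths) {
            System.out.println(len + " -> " + mayOverrun(len));
        }
    }
}
```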
------------- PR: https://git.openjdk.java.net/jdk18/pull/4 From duke at openjdk.java.net Fri Dec 10 04:02:12 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Fri, 10 Dec 2021 04:02:12 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v4] In-Reply-To: References: Message-ID: <7SAhX4CEt1Lh2IJOAgqsOACgN2Y6fKT7gNzkQDXWiDs=.612d2bd2-f2e8-4f4e-86ad-8b82bc412350@github.com> On Thu, 9 Dec 2021 17:19:39 GMT, Paul Sandoz wrote: >> Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: >> >> fix cpu pattern on Neon > > test/hotspot/jtreg/compiler/vectorapi/reshape/tests/TestVectorCast.java line 39: > >> 37: * In each cast, the VectorCastNode is expected to appear exactly once. >> 38: */ >> 39: public class TestVectorCast { > > At some point I would like to explore autogenerating such files, from another Java program and a template mechanism (unfortunately we cannot use a library like Java Poet). Not now though! > > This also relates to redesigning the existing vector unit tests, moving away from a bash script to a more flexible Java program, from which we can generate unit tests and/or IR tests. > > The challenge with the IR tests is knowing associated IR nodes and the set supported on various platforms. I like how you enumerated the list for the conversions. Ideally we could go to the source of truth in C2 and determine that, but it does not seem easy to determine. I believe if we have a way to query for compiler support from Java the same way as we do in C2 then the task would be trivial. I don't know if we have a way to call into C2 from Java similar to JNI, though.
------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From kvn at openjdk.java.net Fri Dec 10 04:09:13 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 04:09:13 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: <5FROrvN6Btub1q9z4Z0nSMH2qfDcIsjh0bb4tP4nrIs=.6c89cd2e-2ead-490c-99d2-7f70a7eb0332@github.com> References: <5FROrvN6Btub1q9z4Z0nSMH2qfDcIsjh0bb4tP4nrIs=.6c89cd2e-2ead-490c-99d2-7f70a7eb0332@github.com> Message-ID: <4vAEABFW8DxiZSC8clV9GM0ZDvgvJr1P7hT5o7rUD1s=.b182b281-6b08-4284-9a6b-f15f652a858e@github.com> On Fri, 10 Dec 2021 03:33:34 GMT, Jie Fu wrote: > > Normally, we don't include code in many areas in one pr unless there are strong reasons to do that. > > These typos belong to hotspot compiler area. So I don't think they are in different areas. I agree. The changes are not so big to be separated. Next directories belong to compiler are: `c1, ci, code, compiler, opto`. If you find typos there, fix it here in one PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From dlong at openjdk.java.net Fri Dec 10 04:15:17 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 04:15:17 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. 
> > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk18/pull/6 From duke at openjdk.java.net Fri Dec 10 04:19:39 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Fri, 10 Dec 2021 04:19:39 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v2] In-Reply-To: References: Message-ID: > Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. > > `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - Merge master. - Merge master. - reorder optimizations in addnode so special cases appear before general cases. 
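The subsumption argument in the 8278471 description above can be illustrated with plain Java arithmetic. The sketch below checks the algebraic identities behind the three patterns; it assumes the general rule rewrites `(a - b) + (c - d)` to `(a + c) - (b + d)`, and the class and method names are hypothetical. It says nothing about how C2's `Ideal` code is structured, only why the two special cases yield simpler results and therefore need to be tried before the general one.

```java
import java.util.Random;

public class AddSubIdentities {
    // Special case 1: (a - b) + (b - c) collapses to a - c.
    static int special1(int a, int b, int c) {
        return a - c;
    }

    // Special case 2: (a - b) + (c - a) collapses to c - b.
    static int special2(int a, int b, int c) {
        return c - b;
    }

    // General case (assumed rewrite): (a - b) + (c - d) becomes (a + c) - (b + d).
    // It also matches both special shapes, but leaves two adds and a subtract.
    static int general(int a, int b, int c, int d) {
        return (a + c) - (b + d);
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 1000; i++) {
            int a = random.nextInt();
            int b = random.nextInt();
            int c = random.nextInt();
            int d = random.nextInt();
            // Java int arithmetic wraps, so the identities hold even on overflow.
            if ((a - b) + (b - c) != special1(a, b, c)
                    || (a - b) + (c - a) != special2(a, b, c)
                    || (a - b) + (c - d) != general(a, b, c, d)) {
                throw new AssertionError("identity violated");
            }
        }
        System.out.println("all identities hold");
    }
}
```

Because every input of the special shapes also matches the general shape, a matcher that tests the general shape first would never reach the cheaper single-subtract forms.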
------------- Changes: https://git.openjdk.java.net/jdk/pull/6752/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6752&range=01 Stats: 20 lines in 1 file changed: 10 ins; 10 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6752.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6752/head:pull/6752 PR: https://git.openjdk.java.net/jdk/pull/6752 From mli at openjdk.java.net Fri Dec 10 04:19:55 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 04:19:55 GMT Subject: RFR: 8278532: Fix some typos in C1 comments [v2] In-Reply-To: References: Message-ID: > This is a trivial patch to fix some typos in C1 comments. Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: Fix typo in C2 too ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6793/files - new: https://git.openjdk.java.net/jdk/pull/6793/files/64b54296..dec24a62 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6793&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6793&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6793.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6793/head:pull/6793 PR: https://git.openjdk.java.net/jdk/pull/6793 From mli at openjdk.java.net Fri Dec 10 04:19:56 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 04:19:56 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 02:27:23 GMT, Hamlin Li wrote: > This is a trivial patch to fix some typos in C1 comments. Thanks Vladimir for confirmation. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From jiefu at openjdk.java.net Fri Dec 10 04:32:11 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 10 Dec 2021 04:32:11 GMT Subject: RFR: 8278532: Fix some typos in C1 comments In-Reply-To: References: Message-ID: <5zW6dHwYWCIlEnIzcRk1CmnW7MU6R3mUH8vemmesOzw=.0e9fecad-a70a-47c3-b5e4-ba68ea9c939c@github.com> On Fri, 10 Dec 2021 04:15:57 GMT, Hamlin Li wrote: >> This is a trivial patch to fix some typos in C1 comments. > > Thanks Vladimir for confirmation. Thanks @Hamlin-Li for your update. It would be better to change the JBS title since you also fix a typo in C2 comments. ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From mli at openjdk.java.net Fri Dec 10 04:45:15 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 04:45:15 GMT Subject: RFR: 8278532: Fix some typos in compiler comments In-Reply-To: <5zW6dHwYWCIlEnIzcRk1CmnW7MU6R3mUH8vemmesOzw=.0e9fecad-a70a-47c3-b5e4-ba68ea9c939c@github.com> References: <5zW6dHwYWCIlEnIzcRk1CmnW7MU6R3mUH8vemmesOzw=.0e9fecad-a70a-47c3-b5e4-ba68ea9c939c@github.com> Message-ID: <32E6aNkMy0WStrT5uMnkMURGiAZZBX08INAI7eT9REI=.9488b28c-95b9-4f45-911a-c68eba3a6fad@github.com> On Fri, 10 Dec 2021 04:29:19 GMT, Jie Fu wrote: >> Thanks Vladimir for confirmation. > > Thanks @Hamlin-Li for your update. > > It would be better to change the JBS title since you also fix a typo in C2 comments. Thanks @DamonFool for reminding, it's updated. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From kvn at openjdk.java.net Fri Dec 10 04:54:37 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 04:54:37 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/7 From kvn at openjdk.java.net Fri Dec 10 04:56:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 04:56:14 GMT Subject: RFR: 8278532: Fix some typos in compiler comments [v2] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 04:19:55 GMT, Hamlin Li wrote: >> This is a trivial patch to fix some typos in C1 comments. > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in C2 too Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From jiefu at openjdk.java.net Fri Dec 10 05:10:16 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 10 Dec 2021 05:10:16 GMT Subject: RFR: 8278532: Fix some typos in compiler comments [v2] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 04:19:55 GMT, Hamlin Li wrote: >> This is a trivial patch to fix some typos in C1 comments. 
> > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in C2 too Looks good and trivial. ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6793 From duke at openjdk.java.net Fri Dec 10 05:44:12 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Fri, 10 Dec 2021 05:44:12 GMT Subject: RFR: 8278525: Additional -Wnonnull errors happen with GCC 11 In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 23:48:42 GMT, Dan Lutker wrote: > 8271515 introduced more issues with xmm0 I believe an x86 version of [8276563](https://bugs.openjdk.java.net/browse/JDK-8276563) would solve the problem. ------------- PR: https://git.openjdk.java.net/jdk/pull/6788 From mli at openjdk.java.net Fri Dec 10 06:10:16 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 06:10:16 GMT Subject: RFR: 8278532: Fix some typos in compiler comments In-Reply-To: <5zW6dHwYWCIlEnIzcRk1CmnW7MU6R3mUH8vemmesOzw=.0e9fecad-a70a-47c3-b5e4-ba68ea9c939c@github.com> References: <5zW6dHwYWCIlEnIzcRk1CmnW7MU6R3mUH8vemmesOzw=.0e9fecad-a70a-47c3-b5e4-ba68ea9c939c@github.com> Message-ID: On Fri, 10 Dec 2021 04:29:19 GMT, Jie Fu wrote: >> Thanks Vladimir for confirmation. > > Thanks @Hamlin-Li for your update. > > It would be better to change the JBS title since you also fix a typo in C2 comments. Thanks Vladimir, @DamonFool for your reviews. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From mli at openjdk.java.net Fri Dec 10 06:10:17 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 06:10:17 GMT Subject: Integrated: 8278532: Fix some typos in compiler comments In-Reply-To: References: Message-ID: <-RkPUkKR3lXoRDixUrGVY3KuSF0WKYzGI5cTuCJoXbY=.2199bf93-52c7-42f8-8802-ddfce5b4c8a2@github.com> On Fri, 10 Dec 2021 02:27:23 GMT, Hamlin Li wrote: > This is a trivial patch to fix some typos in C1 comments. This pull request has now been integrated. Changeset: 539fbbf8 Author: Hamlin Li URL: https://git.openjdk.java.net/jdk/commit/539fbbf8c7c6003af33fe148bc3ceb4e69966143 Stats: 7 lines in 4 files changed: 0 ins; 0 del; 7 mod 8278532: Fix some typos in compiler comments Reviewed-by: kvn, jiefu ------------- PR: https://git.openjdk.java.net/jdk/pull/6793 From dlong at openjdk.java.net Fri Dec 10 07:34:10 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 07:34:10 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output Thanks Vladimir. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From phh at openjdk.java.net Fri Dec 10 08:07:17 2021 From: phh at openjdk.java.net (Paul Hohensee) Date: Fri, 10 Dec 2021 08:07:17 GMT Subject: RFR: 8278525: Additional -Wnonnull errors happen with GCC 11 In-Reply-To: References: Message-ID: <_eY1l89gV0s2BDUnTUrZnGzjKyZQWh1mJVtln1HrDBk=.0aa77409-abc1-4b24-b2a6-d2257547726b@github.com> On Thu, 9 Dec 2021 23:48:42 GMT, Dan Lutker wrote: > 8271515 introduced more issues with xmm0 Lgtm. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6788 From phh at openjdk.java.net Fri Dec 10 08:09:22 2021 From: phh at openjdk.java.net (Paul Hohensee) Date: Fri, 10 Dec 2021 08:09:22 GMT Subject: RFR: 8278381: [GCC 11] Address::make_raw() does not initialize rspec In-Reply-To: References: Message-ID: <6NL_KDWG5tLCREiNfQmtmOMi9t91xmde22zwg6NGKik=.fd3a0cc1-76e2-404e-89bd-f2c2381c1d17@github.com> On Thu, 9 Dec 2021 22:19:57 GMT, Dan Lutker wrote: > The issue was encountered on OpenJDK11 with the error below, there is no error in tip because of the changes in [8240669](https://bugs.openjdk.java.net/browse/JDK-8240669). Fixing here for correctness and to backport. > > .../src/hotspot/cpu/x86/assembler_x86.cpp: In static member function 'static Address Address::make_raw(int, int, int, int, relocInfo::relocType)': > .../src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[3]' is used uninitialized [-Werror=uninitialized] > 189 | RelocationHolder rspec; > | ^~~~~ > ...src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[2]' is used uninitialized [-Werror=uninitialized] Lgtm. ------------- Marked as reviewed by phh (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6785 From shade at openjdk.java.net Fri Dec 10 10:55:32 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 10 Dec 2021 10:55:32 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Assuming this is the same patch, this still looks good. Yes, I think it should be in JDK 18. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/6 From chagedorn at openjdk.java.net Fri Dec 10 11:21:19 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 10 Dec 2021 11:21:19 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v5] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:45:06 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: > > - remove imports > - refactor main methods > - Merge branch 'master' into vectorReshapeTests > - fix cpu pattern on Neon > - concretify parameter types > - reduce invocation count, improve test cases > - Merge branch 'master' into vectorReshapeTests > - grammar in comment > - missing copyright > - add comments > - ... and 3 more: https://git.openjdk.java.net/jdk/compare/b537e319...5d3b23a7 Nice IR tests! They look good. test/hotspot/jtreg/compiler/vectorapi/reshape/tests/TestVectorDoubleExpandShrink.java line 39: > 37: public class TestVectorDoubleExpandShrink { > 38: @Test > 39: @IR(counts = {REINTERPRET_NODE, "0"}) You could use the equivalent attribute `failOn = REINTERPRET_NODE` for a zero count instead. 
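As a rough illustration of the remark above (this is not the IR framework's code): the sketch below declares a hypothetical stand-in `@IR` annotation that models only the `failOn` and `counts` attributes, plus a made-up helper `forbiddenNodes`, to show that a count of `"0"` and a `failOn` entry express the same constraint. The real annotation lives under `compiler/lib/ir_framework` and supports more than is modelled here; the node string is simplified as well.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class IrZeroCountSketch {
    // Hypothetical stand-in for the framework's @IR annotation (assumption: only these two attributes).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface IR {
        String[] failOn() default {};
        String[] counts() default {};   // flat list of (node, count) pairs
    }

    // Simplified node name; the real constant is a regex built from PREFIX/SUFFIX.
    static final String REINTERPRET_NODE = "VectorReinterpret";

    // Form used in the reviewed test: the node must appear zero times.
    @IR(counts = {REINTERPRET_NODE, "0"})
    static void withCounts() {}

    // Form suggested in the review: fail if the node appears at all.
    @IR(failOn = REINTERPRET_NODE)
    static void withFailOn() {}

    // Collects every node that must not occur, from failOn entries and from counts pairs equal to "0".
    public static Set<String> forbiddenNodes(String methodName) {
        try {
            IR ir = IrZeroCountSketch.class.getDeclaredMethod(methodName).getAnnotation(IR.class);
            Set<String> forbidden = new TreeSet<>(Arrays.asList(ir.failOn()));
            String[] counts = ir.counts();
            for (int i = 0; i + 1 < counts.length; i += 2) {
                if (counts[i + 1].trim().equals("0")) {
                    forbidden.add(counts[i]);
                }
            }
            return forbidden;
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both annotation forms reduce to the same set of forbidden nodes.
        System.out.println(forbiddenNodes("withCounts").equals(forbiddenNodes("withFailOn")));
    }
}
```

Under these assumptions both forms reduce to the same forbidden-node set, which is why `failOn` reads as the more direct spelling when the expected count is zero.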
test/hotspot/jtreg/compiler/vectorapi/reshape/utils/VectorReshapeHelper.java line 80: > 78: public static final String F2X_NODE = PREFIX + "VectorCastF2X" + SUFFIX; > 79: public static final String D2X_NODE = PREFIX + "VectorCastD2X" + SUFFIX; > 80: public static final String REINTERPRET_NODE = PREFIX + "VectorReinterpret" + SUFFIX; The vector regexes for the IR matching could be moved to `compiler/lib/ir_framework/IRNode.java` to have them at a common place if other tests want to use them as well. If you move them you should add a `VECTOR_`/`VECTOR_CAST` prefix or something like that to better distinguish them from other nodes. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6724 From haosun at openjdk.java.net Fri Dec 10 11:43:22 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 11:43:22 GMT Subject: [jdk18] RFR: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: <36DrzjcMEJjnJDn1xvJLBYjVfblaAXIJAKYnD_zmvXo=.5807cc91-df68-4678-a093-94557af3a909@github.com> On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). 
> > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. Thanks a lot for your review! ------------- PR: https://git.openjdk.java.net/jdk18/pull/6 From duke at openjdk.java.net Fri Dec 10 11:43:54 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Fri, 10 Dec 2021 11:43:54 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. > > While working on this patch, I spot some regressions regarding compilation on AVX1. > > Thank you very much. 
Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: address reviews ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6724/files - new: https://git.openjdk.java.net/jdk/pull/6724/files/5d3b23a7..2af640c1 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=04-05 Stats: 25 lines in 3 files changed: 9 ins; 3 del; 13 mod Patch: https://git.openjdk.java.net/jdk/pull/6724.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6724/head:pull/6724 PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Fri Dec 10 11:44:01 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Fri, 10 Dec 2021 11:44:01 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v5] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 11:16:24 GMT, Christian Hagedorn wrote: >> Mai Đặng Quân Anh has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: >> >> - remove imports >> - refactor main methods >> - Merge branch 'master' into vectorReshapeTests >> - fix cpu pattern on Neon >> - concretify parameter types >> - reduce invocation count, improve test cases >> - Merge branch 'master' into vectorReshapeTests >> - grammar in comment >> - missing copyright >> - add comments >> - ... 
and 3 more: https://git.openjdk.java.net/jdk/compare/2f77985d...5d3b23a7 > > test/hotspot/jtreg/compiler/vectorapi/reshape/utils/VectorReshapeHelper.java line 80: > >> 78: public static final String F2X_NODE = PREFIX + "VectorCastF2X" + SUFFIX; >> 79: public static final String D2X_NODE = PREFIX + "VectorCastD2X" + SUFFIX; >> 80: public static final String REINTERPRET_NODE = PREFIX + "VectorReinterpret" + SUFFIX; > > The vector regexes for the IR matching could be moved to `compiler/lib/ir_framework/IRNode.java` to have them at a common place if other tests want to use them as well. If you move them you should add a `VECTOR_`/`VECTOR_CAST` prefix or something like that to better distinguish them from other nodes. Done, thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From mli at openjdk.java.net Fri Dec 10 12:42:31 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 12:42:31 GMT Subject: RFR: 8278533: Remove some unused methods in c1_Instruction and c1_ValueMap Message-ID: This is a trivial patch to remove some unused methods in c1_Instruction and c1_ValueMap. 
------------- Commit messages: - Initial commit Changes: https://git.openjdk.java.net/jdk/pull/6798/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6798&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278533 Stats: 8 lines in 3 files changed: 0 ins; 7 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6798.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6798/head:pull/6798 PR: https://git.openjdk.java.net/jdk/pull/6798 From chagedorn at openjdk.java.net Fri Dec 10 13:02:16 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 10 Dec 2021 13:02:16 GMT Subject: RFR: 8278533: Remove some unused methods in c1_Instruction and c1_ValueMap In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 12:33:38 GMT, Hamlin Li wrote: > This is a trivial patch to remove some unused methods in c1_Instruction and c1_ValueMap. Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6798 From chagedorn at openjdk.java.net Fri Dec 10 13:03:18 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 10 Dec 2021 13:03:18 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 11:43:54 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6724 From mli at openjdk.java.net Fri Dec 10 14:55:15 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 14:55:15 GMT Subject: RFR: 8278533: Remove some unused methods in c1_Instruction and c1_ValueMap In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 12:33:38 GMT, Hamlin Li wrote: > This is a trivial patch to remove some unused methods in c1_Instruction and c1_ValueMap. Thanks Christian for your review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6798 From mli at openjdk.java.net Fri Dec 10 14:55:16 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 14:55:16 GMT Subject: Integrated: 8278533: Remove some unused methods in c1_Instruction and c1_ValueMap In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 12:33:38 GMT, Hamlin Li wrote: > This is a trivial patch to remove some unused methods in c1_Instruction and c1_ValueMap. This pull request has now been integrated. Changeset: 3e0b083f Author: Hamlin Li URL: https://git.openjdk.java.net/jdk/commit/3e0b083f2013f07b090af92a78c9a5f46f9fe427 Stats: 8 lines in 3 files changed: 0 ins; 7 del; 1 mod 8278533: Remove some unused methods in c1_Instruction and c1_ValueMap Reviewed-by: chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/6798 From mli at openjdk.java.net Fri Dec 10 15:07:31 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Fri, 10 Dec 2021 15:07:31 GMT Subject: RFR: 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks Message-ID: This is a minor patch to remove some unnecessary code in MethodLiveness::init_basic_blocks. 
------------- Commit messages: - Initial commit Changes: https://git.openjdk.java.net/jdk/pull/6799/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6799&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278534 Stats: 6 lines in 1 file changed: 0 ins; 6 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6799.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6799/head:pull/6799 PR: https://git.openjdk.java.net/jdk/pull/6799 From chagedorn at openjdk.java.net Fri Dec 10 15:51:52 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 10 Dec 2021 15:51:52 GMT Subject: [jdk18] RFR: 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect Message-ID: The test case fails with the assertion when an actual unreachable store node with only uses outside of the loop is tried to be sunk out of a dead loop in split-if. This is quite an edge case in which C2 is not able to remove the inner loop but the store for `iFldArr2` inside this loop dies due to improved type information after peeling. This removes some memory phis as well and leaves the store `iFldArr1` with only outside the loop uses. A more detailed explanation how we end up in this situation is shown in the comments of the test case. This suggests that the assertion is too strong. I propose to relax the assertion and bail out if we are trying to sink a store node. However, I don't think that we will reach this code with `LoadStore` nodes as they have other memory outputs inside a loop, preventing to reach this assertion code. 
Thanks, Christian ------------- Commit messages: - 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect Changes: https://git.openjdk.java.net/jdk18/pull/11/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=11&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278420 Stats: 123 lines in 2 files changed: 122 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk18/pull/11.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/11/head:pull/11 PR: https://git.openjdk.java.net/jdk18/pull/11 From haosun at openjdk.java.net Fri Dec 10 15:56:23 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 10 Dec 2021 15:56:23 GMT Subject: [jdk18] Integrated: 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 01:49:59 GMT, Hao Sun wrote: > JDK-8276162 introduced an optimization that creates the following two > kinds of IR shapes: > > 1. `CMoveI (Bool (CmpUL ...) ...)`: conditionally moving two ints based > on the comparison result of two unsigned longs. > > 2. `CMoveP (Bool (CmpUL ...) ...)`: conditionally moving two pointers > based on the comparison result of two unsigned longs. > > But the corresponding match rules are missing for arm32. JDK-8277324 and > JDK-8277753 complemented the missing rules for x86-32. We do the same > thing to arm32 in this patch. > > For IR shape 1, the missing rules in arm32 are in form of > "cmovIL_AA_BB_U", where AA can be reg/imm16/immMov(i.e. three variants > of moving immediate/register), and BB can be LTGE/EQNE/LEGT(i.e. three > condition code registers for unsigned long comparisons). > > For IR shape 2, the missing rules are in form of "cmovPL_CC_BB_U", where > CC can be reg/imm, and BB has the same meaning with IR shape 1. > > Minor updates: > 1. "cmpOpUL/comOpUL_commute" should be used for "cmovLL_AA_BB_U" rules. > 2. Rename "cmovIL_imm_BB" rules to "cmovIL_imm16_BB". > 3. 
Add "cmovIL_immMov_BB" rules. > 4. Style issue: remove the extra space in the predicate statement for > cmov* rules. > > Test: > We ran tier 1~3 tests on arm32 platform. With this patch, "bad AD file" > errors are gone without introducing test regression. This pull request has now been integrated. Changeset: 0602f4c4 Author: Hao Sun Committer: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk18/commit/0602f4c48b0ffe53a6081551988b417d7536efa0 Stats: 273 lines in 1 file changed: 240 ins; 0 del; 33 mod 8277621: ARM32: multiple fastdebug failures with "bad AD file" after JDK-8276162 Reviewed-by: njian, dlong, shade ------------- PR: https://git.openjdk.java.net/jdk18/pull/6 From roland at openjdk.java.net Fri Dec 10 16:13:14 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 10 Dec 2021 16:13:14 GMT Subject: [jdk18] RFR: 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 15:43:52 GMT, Christian Hagedorn wrote: > The test case fails with the assertion when an actual unreachable store node with only uses outside of the loop is tried to be sunk out of a dead loop in split-if. This is quite an edge case in which C2 is not able to remove the inner loop but the store for `iFldArr2` inside this loop dies due to improved type information after peeling. This removes some memory phis as well and leaves the store `iFldArr1` with only outside the loop uses. A more detailed explanation how we end up in this situation is shown in the comments of the test case. > > This suggests that the assertion is too strong. I propose to relax the assertion and bail out if we are trying to sink a store node. However, I don't think that we will reach this code with `LoadStore` nodes as they have other memory outputs inside a loop, preventing to reach this assertion code. > > Thanks, > Christian That doesn't seem right to me. 
The fact that a CastII dies on a path that doesn't die is a sign, in my opinion, that the graph is broken. So: iArrFld2[j] = 8; has a CastII that becomes top because the loop j <= -1. And the range check for that array access is a predicate. So the loop in which the store to iArrFld2 is is unreachable. That's why the CastII dies. Do I understand correctly? Isn't the problem that we would need skeleton predicates between the peel iteration and the loop body to catch that the loop is unreachable? ------------- PR: https://git.openjdk.java.net/jdk18/pull/11 From iveresov at openjdk.java.net Fri Dec 10 17:11:11 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 10 Dec 2021 17:11:11 GMT Subject: RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> References: <5pD6SNXU2pDZmAj-XhdOCmi7Fxj1K9vxeQqJxcRpBvc=.4efd3bf3-5286-49d1-9e0f-9946c03223fd@github.com> Message-ID: On Wed, 8 Dec 2021 23:49:25 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. Have you run tier7 where is it does -XX:DeoptimizeALot ? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6776 From iveresov at openjdk.java.net Fri Dec 10 17:12:14 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 10 Dec 2021 17:12:14 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output Have you run tier7 where is it does -XX:DeoptimizeALot ? ------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From xliu at openjdk.java.net Fri Dec 10 17:22:16 2021 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 10 Dec 2021 17:22:16 GMT Subject: RFR: 8278381: [GCC 11] Address::make_raw() does not initialize rspec In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 22:19:57 GMT, Dan Lutker wrote: > The issue was encountered on OpenJDK11 with the error below, there is no error in tip because of the changes in [8240669](https://bugs.openjdk.java.net/browse/JDK-8240669). Fixing here for correctness and to backport. 
> > .../src/hotspot/cpu/x86/assembler_x86.cpp: In static member function 'static Address Address::make_raw(int, int, int, int, relocInfo::relocType)': > .../src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[3]' is used uninitialized [-Werror=uninitialized] > 189 | RelocationHolder rspec; > | ^~~~~ > ...src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[2]' is used uninitialized [-Werror=uninitialized] LGTM. ------------- Marked as reviewed by xliu (Committer). PR: https://git.openjdk.java.net/jdk/pull/6785 From duke at openjdk.java.net Fri Dec 10 17:56:15 2021 From: duke at openjdk.java.net (Dan Lutker) Date: Fri, 10 Dec 2021 17:56:15 GMT Subject: Integrated: 8278381: [GCC 11] Address::make_raw() does not initialize rspec In-Reply-To: References: Message-ID: <0mKAXT1SlAdmoGriiJnOFAIruM3utEuEvKB55edKnAk=.06ef2011-04ce-4a1c-91b5-1a44123d7981@github.com> On Thu, 9 Dec 2021 22:19:57 GMT, Dan Lutker wrote: > The issue was encountered on OpenJDK11 with the error below, there is no error in tip because of the changes in [8240669](https://bugs.openjdk.java.net/browse/JDK-8240669). Fixing here for correctness and to backport. > > .../src/hotspot/cpu/x86/assembler_x86.cpp: In static member function 'static Address Address::make_raw(int, int, int, int, relocInfo::relocType)': > .../src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[3]' is used uninitialized [-Werror=uninitialized] > 189 | RelocationHolder rspec; > | ^~~~~ > ...src/hotspot/cpu/x86/assembler_x86.cpp:189:20: error: 'rspec.RelocationHolder::_relocbuf[2]' is used uninitialized [-Werror=uninitialized] This pull request has now been integrated. 
Changeset: 4f594e6a Author: Dan Lutker Committer: Paul Hohensee URL: https://git.openjdk.java.net/jdk/commit/4f594e6a28ad85d46d3252fb960f1c116f414899 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8278381: [GCC 11] Address::make_raw() does not initialize rspec Reviewed-by: phh, xliu ------------- PR: https://git.openjdk.java.net/jdk/pull/6785 From kcr at openjdk.java.net Fri Dec 10 18:48:22 2021 From: kcr at openjdk.java.net (Kevin Rushforth) Date: Fri, 10 Dec 2021 18:48:22 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. @asgibbons I see that [JDK-8275427](https://bugs.openjdk.java.net/browse/JDK-8275427) is closed as a duplicate. Normally, duplicates are not listed in the commit message of a fix. ------------- PR: https://git.openjdk.java.net/jdk18/pull/4 From duke at openjdk.java.net Fri Dec 10 18:55:17 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Fri, 10 Dec 2021 18:55:17 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 18:45:02 GMT, Kevin Rushforth wrote: >> The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. 
So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. >> >> I also added a test case to detect this overrun. > > @asgibbons I see that [JDK-8275427](https://bugs.openjdk.java.net/browse/JDK-8275427) is closed as a duplicate. Normally, duplicates are not listed in the commit message of a fix. @kevinrushforth Thanks for the tip. I believe it was marked as duplicate after I made this PR. I'll keep this in mind for future PRs. ------------- PR: https://git.openjdk.java.net/jdk18/pull/4 From psandoz at openjdk.java.net Fri Dec 10 20:23:11 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 10 Dec 2021 20:23:11 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v4] In-Reply-To: <7SAhX4CEt1Lh2IJOAgqsOACgN2Y6fKT7gNzkQDXWiDs=.612d2bd2-f2e8-4f4e-86ad-8b82bc412350@github.com> References: <7SAhX4CEt1Lh2IJOAgqsOACgN2Y6fKT7gNzkQDXWiDs=.612d2bd2-f2e8-4f4e-86ad-8b82bc412350@github.com> Message-ID: On Fri, 10 Dec 2021 03:59:34 GMT, Mai Đặng Quân Anh wrote: >> test/hotspot/jtreg/compiler/vectorapi/reshape/tests/TestVectorCast.java line 39: >> >>> 37: * In each cast, the VectorCastNode is expected to appear exactly once. >>> 38: */ >>> 39: public class TestVectorCast { >> >> At some point I would like to explore autogenerating such files from another Java program and a template mechanism (unfortunately we cannot use a library like Java Poet). Not now though! >> >> This also relates to redesigning the existing vector unit tests, moving away from a bash script to a more flexible Java program, from which we can generate unit tests and/or IR tests. >> >> The challenge with the IR tests is knowing associated IR nodes and the set supported on various platforms. I like how you enumerated the list for the conversions. Ideally we could go to the source of truth in C2 and determine that, but it does not seem easy to determine. 
> > I believe if we have a way to query for compiler support from Java the same way as we do in C2 then the task would be trivial. I don't know if we have a way to call into C2 from Java similar to JNI, though. Perhaps a good way to explore that is through the `WhiteBox` API? ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From dlong at openjdk.java.net Fri Dec 10 20:40:16 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 20:40:16 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: <2C-dBc_EkQ6FBLVIBLDwzSdsdO3cKEWvlMLZaKqqG4w=.24e359f4-c5e6-4fea-b301-cd8727009c4d@github.com> On Fri, 10 Dec 2021 17:09:31 GMT, Igor Veresov wrote: > Have you run tier7 where is it does -XX:DeoptimizeALot ? I'll do that. What I did before was run tiers1 - 3 with -XX:DeoptimizeALot -XX:+VerifyStack. That's how I found runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer. Unfortunately, running all tests with both flags finds too many other problems, and running with -XX:DeoptimizeALot but not -XX:+VerifyStack may give mysterious crashes instead of a nice assert if it finds a problem. ------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From kvn at openjdk.java.net Fri Dec 10 21:24:13 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 10 Dec 2021 21:24:13 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. 
So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. You should close the other bugs as duplicates if you think the fix applies to them. Also, you don't need to list them in the PR because they are listed in JBS anyway. Testing takes a long time because, as the test's name says, it runs for 24 hours. I want to make sure the test passes with this fix. ------------- PR: https://git.openjdk.java.net/jdk18/pull/4 From psandoz at openjdk.java.net Fri Dec 10 22:21:12 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 10 Dec 2021 22:21:12 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 11:43:54 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much.
> > Mai ??ng Qu?n Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastAVX512DQ.java line 38: > 36: * @requires vm.cpu.features ~= ".*avx512dq.*" > 37: * @library /test/lib / > 38: * @run driver compiler.vectorapi.reshape.TestVectorCast512DQ Suggestion: * @run driver compiler.vectorapi.reshape.TestVectorCastAVX512DQ ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From psandoz at openjdk.java.net Fri Dec 10 22:28:10 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 10 Dec 2021 22:28:10 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: <9dd8sRt62lV96I-deCNFEqC5KpgjvxpvDTg144Ospmw=.e3bd2d47-7778-4644-b843-454134735875@github.com> On Fri, 10 Dec 2021 11:43:54 GMT, Mai ??ng Qu?n Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. 
> > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews test/hotspot/jtreg/compiler/vectorapi/reshape/tests/TestVectorCast.java line 294: > 292: > 293: @Test > 294: @IR(counts = {B2X_NODE, "1"}) Suggestion: @IR(counts = {S2X_NODE, "1"}) ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From psandoz at openjdk.java.net Fri Dec 10 22:41:14 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 10 Dec 2021 22:41:14 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 11:43:54 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews I ran the tests on an avx512dq machine and found a few minor issues; see prior comments. Once those are fixed, I think it can be approved for integration. ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From dlong at openjdk.java.net Fri Dec 10 23:47:10 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 10 Dec 2021 23:47:10 GMT Subject: RFR: 8278525: Additional -Wnonnull errors happen with GCC 11 In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 23:48:42 GMT, Dan Lutker wrote: > 8271515 introduced more issues with xmm0 Marked as reviewed by dlong (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/6788 From duke at openjdk.java.net Fri Dec 10 23:53:16 2021 From: duke at openjdk.java.net (Dan Lutker) Date: Fri, 10 Dec 2021 23:53:16 GMT Subject: Integrated: 8278525: Additional -Wnonnull errors happen with GCC 11 In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 23:48:42 GMT, Dan Lutker wrote: > 8271515 introduced more issues with xmm0 This pull request has now been integrated. Changeset: 6eb6ec05 Author: Dan Lutker Committer: Paul Hohensee URL: https://git.openjdk.java.net/jdk/commit/6eb6ec05fd4f80e11d0b052b58190bc8b53f4b11 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod 8278525: Additional -Wnonnull errors happen with GCC 11 Reviewed-by: phh, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/6788 From duke at openjdk.java.net Sat Dec 11 02:57:49 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Sat, 11 Dec 2021 02:57:49 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v7] In-Reply-To: References: Message-ID: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. > > While working on this patch, I spot some regressions regarding compilation on AVX1. > > Thank you very much. 
Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: typos ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6724/files - new: https://git.openjdk.java.net/jdk/pull/6724/files/2af640c1..7acd2fa4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6724&range=05-06 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6724.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6724/head:pull/6724 PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Sat Dec 11 02:57:51 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Sat, 11 Dec 2021 02:57:51 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v6] In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 11:43:54 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Thanks a lot for spotting those; they are fixed now.
------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Sat Dec 11 16:36:17 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Sat, 11 Dec 2021 16:36:17 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v2] In-Reply-To: References: Message-ID: <65naI1tnl-0MOdNOKBZBNW6Xk3FYIhC0pfLRuQe9o80=.abdf2aeb-22b9-4623-bd57-d287fc00360d@github.com> On Fri, 10 Dec 2021 04:19:39 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merge master. > - Merge master. > - reorder optimizations in addnode so special cases appear before general cases. We actually don't need those since `(x - c) + (c - y)` would be transformed into `(x + c) - (c + y)` which will be transformed into `x - y` anyway. So you can safely remove them completely. https://github.com/openjdk/jdk/blob/10a51fdb05abe7f7be76a9c442bd4564b24010de/src/hotspot/share/opto/subnode.cpp#L256 ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From gli at openjdk.java.net Sun Dec 12 04:30:42 2021 From: gli at openjdk.java.net (Guoxiong Li) Date: Sun, 12 Dec 2021 04:30:42 GMT Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute' Message-ID: Hi all, Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. 
This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) And the pre-submit tests passed before submitting the PR. Thanks for taking the time to review. Best Regards, -- Guoxiong ------------- Commit messages: - 8278104: C1 should support the compiler directive 'BreakAtExecute' Changes: https://git.openjdk.java.net/jdk/pull/6807/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6807&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278104 Stats: 26 lines in 8 files changed: 12 ins; 0 del; 14 mod Patch: https://git.openjdk.java.net/jdk/pull/6807.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6807/head:pull/6807 PR: https://git.openjdk.java.net/jdk/pull/6807 From jiefu at openjdk.java.net Sun Dec 12 06:00:29 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Sun, 12 Dec 2021 06:00:29 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" Message-ID: Hi all, I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. The patch also removes an useless statement [4]. Testing: - vector api tests on Linux/x64-{AVX512, AVX2} Thanks. 
Best regards, Jie [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 ------------- Commit messages: - 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" Changes: https://git.openjdk.java.net/jdk/pull/6808/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278584 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6808.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6808/head:pull/6808 PR: https://git.openjdk.java.net/jdk/pull/6808 From kvn at openjdk.java.net Sun Dec 12 16:09:24 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 12 Dec 2021 16:09:24 GMT Subject: [jdk18] RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. All testing passed. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/4 From duke at openjdk.java.net Sun Dec 12 16:12:21 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Sun, 12 Dec 2021 16:12:21 GMT Subject: [jdk18] Integrated: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 00:17:36 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it will overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. This pull request has now been integrated. Changeset: 9a1bbaf8 Author: Scott Gibbons Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk18/commit/9a1bbaf8db0e869ab76be8ab1bd0ddeb23693e7e Stats: 12 lines in 2 files changed: 7 ins; 0 del; 5 mod 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 8272809: JFR thread sampler SI_KERNEL SEGV in metaspace::VirtualSpaceList::contains Reviewed-by: sviswanathan, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/4 From kvn at openjdk.java.net Sun Dec 12 16:32:09 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 12 Dec 2021 16:32:09 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 05:51:31 GMT, Jie Fu wrote: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. 
> > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 @jatin-bhateja should review it too. My main question is why we did not catch this in #6646 pre-integration testing. The test failed only when we ran it with the `-Xcomp -XX:+DeoptimizeALot` flags on AVX2 machines, which indicates that these instructions are not used during normal execution of this test or of any tests in `jdk/incubator/vector`. @DamonFool, how did you trigger the failure with the test? Maybe we should increase NUM_ITER (which is only 5000) to actually trigger C2 compilation during normal execution. @jatin-bhateja, why are there no `jdk/incubator/vector` tests that use these instructions? And if there are, why did they not trigger this failure during normal execution? ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6808 From kvn at openjdk.java.net Sun Dec 12 16:39:10 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 12 Dec 2021 16:39:10 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: <2uT7IoUwKq0_CZ2DHARuljOllHXGLsGEw6qMCqZ5lLg=.db2a5d72-0874-4e20-9f61-466d96a4e041@github.com> On Sun, 12 Dec 2021 05:51:31 GMT, Jie Fu wrote: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4].
> > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 I noticed that the test also uses `-XX:CompileThreshold=100` so `NUM_ITER = 5000` should be enough. So why? ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at openjdk.java.net Sun Dec 12 17:59:09 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Sun, 12 Dec 2021 17:59:09 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 05:51:31 GMT, Jie Fu wrote: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 It seems that our `fromLong` tests create a mask from `long` values then immediately convert it back to `long` values using `toLong` to verify the results. 
Unfortunately, a `VectorMaskToLongNode` and a `VectorLongToMaskNode` together being optimized out leads to us not being able to verify the code emission. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at openjdk.java.net Sun Dec 12 19:12:16 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Sun, 12 Dec 2021 19:12:16 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v2] In-Reply-To: <65naI1tnl-0MOdNOKBZBNW6Xk3FYIhC0pfLRuQe9o80=.abdf2aeb-22b9-4623-bd57-d287fc00360d@github.com> References: <65naI1tnl-0MOdNOKBZBNW6Xk3FYIhC0pfLRuQe9o80=.abdf2aeb-22b9-4623-bd57-d287fc00360d@github.com> Message-ID: On Sat, 11 Dec 2021 16:32:54 GMT, Mai ??ng Qu?n Anh wrote: > We actually don't need those since `(x - c) + (c - y)` would be transformed into `(x + c) - (c + y)` which will be transformed into `x - y` anyway. So you can safely remove them completely. > > https://github.com/openjdk/jdk/blob/10a51fdb05abe7f7be76a9c442bd4564b24010de/src/hotspot/share/opto/subnode.cpp#L256 Thank you for pointing this out! I have two questions: 1. Should I just go ahead and delete these two, (a - b) + (b - c) and (a - b) + (c - a), in this pr? 2. I do see you are familiar with the concatenation of different optimizations. I was wondering where I am able to find more information about the order the functions are called, for example Ideal(), Value(), Identity(), or even better the entire workflow of optimization. Thank you very much! ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From kvn at openjdk.java.net Sun Dec 12 23:38:12 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 12 Dec 2021 23:38:12 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 05:51:31 GMT, Jie Fu wrote: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. 
> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 To clarify, I want this fix to update the tests so that a case like this is triggered during normal execution (no `-Xcomp` required). ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 04:35:47 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 04:35:47 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v2] In-Reply-To: References: Message-ID: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks.
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Add a jtreg test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6808/files - new: https://git.openjdk.java.net/jdk/pull/6808/files/ecb8e76a..e3cabbac Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=00-01 Stats: 216 lines in 1 file changed: 216 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6808.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6808/head:pull/6808 PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 04:48:11 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 04:48:11 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 23:35:30 GMT, Vladimir Kozlov wrote: > To clarify, I want this fix to update the tests so that a case like this is triggered during normal execution (no `-Xcomp` required). Done. Thanks @vnkozlov and @merykitty for your review.
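For readers following the 8278584 thread, the bytes-versus-bits mix-up described in the PR text can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toyLengthEncoding`, `ToyEncoding`, and the set of accepted sizes are hypothetical stand-ins invented for illustration, not the actual `vector_length_encoding` code in `x86.ad`.

```java
// Toy model (hypothetical names): a helper that expects a vector size in BYTES.
final class VectorLengthUnitsSketch {
    // Illustrative stand-ins for the three vector-length encodings.
    enum ToyEncoding { BITS_128, BITS_256, BITS_512 }

    // Assumption: sizes are given in bytes; anything else is treated as unreachable.
    static ToyEncoding toyLengthEncoding(int vectorBytes) {
        switch (vectorBytes) {
            case 4:
            case 8:
            case 16:
                return ToyEncoding.BITS_128;
            case 32:
                return ToyEncoding.BITS_256;
            case 64:
                return ToyEncoding.BITS_512;
            default:
                // Stands in for a ShouldNotReachHere()-style failure.
                throw new IllegalStateException("ShouldNotReachHere: " + vectorBytes);
        }
    }

    // Returns true when the toy helper refuses the given size.
    static boolean rejects(int size) {
        try {
            toyLengthEncoding(size);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int vectorBytes = 32; // a 256-bit vector
        System.out.println(toyLengthEncoding(vectorBytes)); // correct unit
        System.out.println(rejects(vectorBytes * 8));       // bits passed by mistake
        System.out.println(toyLengthEncoding(8 * 8));       // silently picks a wider encoding
    }
}
```

In this toy model, passing a bit count either hits the unreachable branch (256) or silently selects a wider encoding than intended (8 bytes scaled to 64), which is why the unit of the argument matters.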
------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at openjdk.java.net Mon Dec 13 05:33:11 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Mon, 13 Dec 2021 05:33:11 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v2] In-Reply-To: References: Message-ID: <-EQ-LZQLMk1xz5rRQEPj7c3H4TJHvE0dQjWjrgHG-Qk=.7193b8f3-6c69-4def-be2b-1d09313e3ed7@github.com> On Mon, 13 Dec 2021 04:35:47 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. >> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Add a jtreg test Hi, we currently don't have any tests to verify the correctness of the generated code. Also, you could modify the current tests in `jdk/incubator/vector/*VectorTests.java` and `compiler/vectorapi/VectorMaskLoadStoreTest.java`, since they effectively don't do anything regarding compiled code under normal execution.
------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 06:27:09 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 06:27:09 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v2] In-Reply-To: <-EQ-LZQLMk1xz5rRQEPj7c3H4TJHvE0dQjWjrgHG-Qk=.7193b8f3-6c69-4def-be2b-1d09313e3ed7@github.com> References: <-EQ-LZQLMk1xz5rRQEPj7c3H4TJHvE0dQjWjrgHG-Qk=.7193b8f3-6c69-4def-be2b-1d09313e3ed7@github.com> Message-ID: On Mon, 13 Dec 2021 05:29:47 GMT, Mai Đặng Quân Anh wrote: > Hi, we currently don't have any tests to verify the correctness of the generated code. Hi @merykitty, the problem is that if I verify the results, the bug won't be triggered without `-Xcomp`. So do you have any idea how to reproduce this bug while keeping the verification? > Also, you could modify the current tests in `jdk/incubator/vector/*VectorTests.java` and `compiler/vectorapi/VectorMaskLoadStoreTest.java`, since they effectively don't do anything regarding compiled code under normal execution. This test is actually made from `compiler/vectorapi/VectorMaskLoadStoreTest.java` by removing the result-checking logic. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 06:52:42 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 06:52:42 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v3] In-Reply-To: References: Message-ID: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks.
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6808/files - new: https://git.openjdk.java.net/jdk/pull/6808/files/e3cabbac..11e2b2ca Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=01-02 Stats: 244 lines in 2 files changed: 1 ins; 217 del; 26 mod Patch: https://git.openjdk.java.net/jdk/pull/6808.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6808/head:pull/6808 PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 07:01:09 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 07:01:09 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v2] In-Reply-To: <-EQ-LZQLMk1xz5rRQEPj7c3H4TJHvE0dQjWjrgHG-Qk=.7193b8f3-6c69-4def-be2b-1d09313e3ed7@github.com> References: <-EQ-LZQLMk1xz5rRQEPj7c3H4TJHvE0dQjWjrgHG-Qk=.7193b8f3-6c69-4def-be2b-1d09313e3ed7@github.com> Message-ID: On Mon, 13 Dec 2021 05:29:47 GMT, Mai Đặng Quân Anh wrote: > Hi, we currently don't have any tests to verify the correctness of the generated code. Also, you could modify the current tests in `jdk/incubator/vector/*VectorTests.java` and `compiler/vectorapi/VectorMaskLoadStoreTest.java`, since they effectively don't do anything regarding compiled code under normal execution.
Hi @merykitty After more experiments, I finally found a way to reproduce this bug without `-Xcomp` and keeping the verification by modifying `compiler/vectorapi/VectorMaskLoadStoreTest.java`. Please review it. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From fgao at openjdk.java.net Mon Dec 13 08:11:14 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 13 Dec 2021 08:11:14 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> Message-ID: On Thu, 9 Dec 2021 08:00:25 GMT, Fei Gao wrote: >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. >> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. 
>> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. >> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: > > - Merge branch 'master' into fg8276673 > > Change-Id: I5e3898054b75f49653b8c3b37e4f5007675fa963 > - 8276673: Optimize abs operations in C2 compiler > > The patch aims to help optimize Math.abs() mainly from these three parts: > 1) Remove redundant instructions for abs with constant values > 2) Remove redundant instructions for abs with char type > 3) Convert some common abs operations to ideal forms > > 1. Remove redundant instructions for abs with constant values > > If we can decide the value of the input node for function Math.abs() > at compile-time, we can substitute the Abs node with the absolute > value of the constant and don't have to calculate it at runtime. > > For example, > int[] a > for (int i = 0; i < SIZE; i++) { > a[i] = Math.abs(-38); > } > > Before the patch, the generated code for the testcase above is: > ... > mov w10, #0xffffffda > cmp w10, wzr > cneg w17, w10, lt > dup v16.8h, w17 > ... > After the patch, the generated code for the testcase above is : > ... > movi v16.4s, #0x26 > ... > > 2. Remove redundant instructions for abs with char type > > In Java semantics, as the char type is always non-negative, we > could actually remove the absI node in the C2 middle end. > > As for vectorization part, in current SLP, the vectorization of > Math.abs() with char type is intentionally disabled after > JDK-8261022 because it generates incorrect result before. After > removing the AbsI node in the middle end, Math.abs(char) can be > vectorized naturally. > > For example, > > char[] a; > char[] b; > for (int i = 0; i < SIZE; i++) { > b[i] = (char) Math.abs(a[i]); > } > > Before the patch, the generated assembly code for the testcase > above is: > > B15: > add x13, x21, w20, sxtw #1 > ldrh w11, [x13, #16] > cmp w11, wzr > cneg w10, w11, lt > strh w10, [x13, #16] > ldrh w10, [x13, #18] > cmp w10, wzr > cneg w10, w10, lt > strh w10, [x13, #18] > ... 
> add w20, w20, #0x1 > cmp w20, w17 > b.lt B15 > > After the patch, the generated assembly code is: > B15: > sbfiz x18, x19, #1, #32 > add x0, x14, x18 > ldr q16, [x0, #16] > add x18, x21, x18 > str q16, [x18, #16] > ldr q16, [x0, #32] > str q16, [x18, #32] > ... > add w19, w19, #0x40 > cmp w19, w17 > b.lt B15 > > 3. Convert some common abs operations to ideal forms > > The patch overrides some virtual support functions for AbsNode > so that optimization of gvn can work on it. Here are the optimizable > forms: > > a) abs(0 - x) => abs(x) > > Before the patch: > ... > ldr w13, [x13, #16] > neg w13, w13 > cmp w13, wzr > cneg w14, w13, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... > > b) abs(abs(x)) => abs(x) > > Before the patch: > ... > ldr w12, [x12, #16] > cmp w12, wzr > cneg w12, w12, lt > cmp w12, wzr > cneg w12, w12, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... > > Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd The PR optimizes abs operations in the C2 middle end. Can I have your review please? ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From jiefu at openjdk.java.net Mon Dec 13 08:29:18 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 08:29:18 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> Message-ID: On Mon, 13 Dec 2021 08:08:28 GMT, Fei Gao wrote: > The PR optimizes abs operations in the C2 middle end. Can I have your review please? So what's the performance data before and after this patch? Does it also benefit on x86? It would be better to provide a jmh micro benchmark. Thanks. 
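For reference while reviewing, the three rewrites in the PR description rest on identities that can be checked in plain Java. The following is only a minimal sketch, not the PR's test and not a JMH benchmark; the class and method names are hypothetical.

```java
public class AbsIdentities {
    // abs(0 - x) == abs(x) for every int, including Integer.MIN_VALUE,
    // because both sides wrap around in the same way.
    public static boolean negIdentityHolds(int x) {
        return Math.abs(0 - x) == Math.abs(x);
    }

    // abs(abs(x)) == abs(x): applying abs a second time changes nothing.
    public static boolean nestedIdentityHolds(int x) {
        return Math.abs(Math.abs(x)) == Math.abs(x);
    }

    // A char widens to a non-negative int, so abs leaves it unchanged.
    public static boolean charIdentityHolds(char c) {
        return Math.abs(c) == c;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (!negIdentityHolds(x) || !nestedIdentityHolds(x)) {
                throw new AssertionError("int identity failed for " + x);
            }
        }
        // Exhaustive check over the whole char range.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (!charIdentityHolds((char) c)) {
                throw new AssertionError("char identity failed for " + c);
            }
        }
        // The constant case from the description: abs(-38) folds to 38 (0x26).
        System.out.println(Math.abs(-38));
    }
}
```

A check like this only confirms that the rewrites preserve semantics; it says nothing about the generated code or the performance, which is what the requested benchmark would measure.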
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From thartmann at openjdk.java.net Mon Dec 13 10:20:15 2021 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 13 Dec 2021 10:20:15 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. As Vladimir mentioned, the fix will be forward ported to JDK 19 automatically. This PR should be closed without integration. ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Mon Dec 13 10:28:14 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Mon, 13 Dec 2021 10:28:14 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v3] In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 06:52:42 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. 
>> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. >> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Hi, unfortunately, the tests still don't verify the correctness of the compiled code. What happens here is that the `IfNode` always goes in one direction, which makes the other path an uncommon trap. The uncommon trap captures all nodes before it, including the `VectorLongToMaskNode`, so `ShouldNotReachHere` is hit when the compiler tries to emit instructions for it. The result of the `VectorLongToMaskNode`, however, is only used to feed the uncommon trap and is not used in verification at all. I propose we just store the mask into a boolean array using `toArray()` and compare the result in a for loop.
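The proposed check can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the test that was committed: `MaskVerifySketch` and its methods are hypothetical names, and the array that a real test would obtain from the mask under test (for example via `VectorMask.fromLong(SPECIES, bits).toArray()`) is replaced here by a stand-in so that the sketch runs without the incubator module.

```java
public class MaskVerifySketch {
    // Reference model: lane i is set if and only if bit i of `bits` is set.
    public static boolean[] expectedLanes(long bits, int laneCount) {
        boolean[] lanes = new boolean[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = ((bits >>> i) & 1L) != 0;
        }
        return lanes;
    }

    // Compare lane by lane, so that the mask value is really consumed by the
    // verification and does not only feed an uncommon trap.
    public static void verify(boolean[] actual, long bits) {
        boolean[] expected = expectedLanes(bits, actual.length);
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) {
                throw new AssertionError("lane " + i + " mismatch for bits 0x"
                        + Long.toHexString(bits));
            }
        }
    }

    public static void main(String[] args) {
        long bits = 0b1010_0101L;
        // Stand-in for the array produced by the mask under test.
        boolean[] actual = expectedLanes(bits, 16);
        verify(actual, bits);
        System.out.println("all lanes match");
    }
}
```

Because every lane is read in the loop, the compiled code has to materialize the mask on the hot path, which is what makes the check meaningful.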
0x00007f3e254efa40: mov %eax,-0x16000(%rsp) 0x00007f3e254efa47: push %rbp 0x00007f3e254efa48: sub $0x30,%rsp ;*synchronization entry ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at -1 (line 67) 0x00007f3e254efa4c: movzbl 0x18(%rsi),%r10d ;*getfield checked {reexecute=0 rethrow=0 return_oop=0} ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 14 (line 69) 0x00007f3e254efa51: test %r10d,%r10d 0x00007f3e254efa54: je 0x00007f3e254efa69 // If (checked != 0), we return immediately ;; B2: # out( N45 ) <- in( B1 ) Freq: 1 0x00007f3e254efa56: add $0x30,%rsp 0x00007f3e254efa5a: pop %rbp 0x00007f3e254efa5b: cmp 0x388(%r15),%rsp ; {poll_return} 0x00007f3e254efa62: ja 0x00007f3e254efadc 0x00007f3e254efa68: ret ;; B3: # out( N45 ) <- in( B1 ) Freq: 4.76837e-07 0x00007f3e254efa69: mov $0xffff,%r11d 0x00007f3e254efa6f: and 0x10(%rsi),%r11 0x00007f3e254efa73: movabs $0x101010101010101,%r8 0x00007f3e254efa7d: pdep %r8,%r11,%r8 0x00007f3e254efa82: mov %r11,%r9 0x00007f3e254efa85: vpxor %xmm1,%xmm1,%xmm1 0x00007f3e254efa89: vmovq %r8,%xmm1 0x00007f3e254efa8e: vmovq %r8,%xmm0 0x00007f3e254efa93: movabs $0x101010101010101,%r8 0x00007f3e254efa9d: shr $0x8,%r9 0x00007f3e254efaa1: pdep %r8,%r9,%r8 0x00007f3e254efaa6: vpinsrq $0x1,%r8,%xmm1,%xmm1 0x00007f3e254efaac: vmovdqu %ymm1,%ymm0 0x00007f3e254efab0: mov %rsi,%rbp 0x00007f3e254efab3: mov %r10d,(%rsp) 0x00007f3e254efab7: vmovdqu %xmm0,0x10(%rsp) ;*invokestatic maskReductionCoerced {reexecute=0 rethrow=0 return_oop=0} ; - jdk.incubator.vector.Byte128Vector$Byte128Mask::toLong at 35 (line 735) ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 21 (line 70) 0x00007f3e254efabd: mov $0xffffff45,%esi 0x00007f3e254efac2: nop 0x00007f3e254efac3: call 0x00007f3e254bbf20 ; ImmutableOopMap {rbp=Oop } ;*ifeq {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) org.openjdk.bench.vm.compiler.Sample::testByte64 at 17 (line 69) ; {runtime_call UncommonTrapBlob} ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at 
openjdk.java.net Mon Dec 13 10:56:09 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Mon, 13 Dec 2021 10:56:09 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v2] In-Reply-To: References: Message-ID: <3E6-DIcs8gGaF-Odpoa3r3NdeT_GQui3hISZLgS2B8k=.53ec14bb-99ce-4632-b253-f0c79972e1f7@github.com> On Fri, 10 Dec 2021 04:19:39 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merge master. > - Merge master. > - reorder optimizations in addnode so special cases appear before general cases. Hi, I think just deleting those two should be okay. Regarding idealisation, if you look into `PhaseGVN::transform`, it is clear that `Node::Ideal` is called repeatedly until nothing changes anymore. Then `Node::Value` is called to do constant propagation before `Node::Identity` is called to get rid of redundant operations. Finally, value numbering is the transformation which deduplicates all equivalent `Node`s into a single one.
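The order described above can be modelled with a toy IR. This is a sketch under stated assumptions, not HotSpot code: `ToyGvn`, its `Node` class and the three rewrite rules are invented for illustration. Only the ordering follows the message: reshape to a fixed point, then fold constants, then drop redundant operations, then look the node up in a hash table.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyGvn {
    /** Toy IR node; op is one of "con", "var", "add", "sub". */
    public static final class Node {
        final String op;
        final int con;        // value, used when op is "con"
        final String name;    // name, used when op is "var"
        final Node in1, in2;  // inputs, used when op is "add" or "sub"

        Node(String op, int con, String name, Node in1, Node in2) {
            this.op = op; this.con = con; this.name = name;
            this.in1 = in1; this.in2 = in2;
        }

        boolean is(String o) { return op.equals(o); }

        /** Structural key used for value numbering. */
        public String key() {
            if (is("con")) return "#" + con;
            if (is("var")) return name;
            return "(" + in1.key() + " " + op + " " + in2.key() + ")";
        }
    }

    public static Node con(int c) { return new Node("con", c, null, null, null); }
    public static Node variable(String n) { return new Node("var", 0, n, null, null); }
    public static Node add(Node a, Node b) { return new Node("add", 0, null, a, b); }
    public static Node sub(Node a, Node b) { return new Node("sub", 0, null, a, b); }

    private final Map<String, Node> table = new HashMap<>();

    /** Reshape step: returns a rewritten node, or null when nothing changed. */
    static Node ideal(Node n) {
        if (n.is("sub") && n.in2.is("con")) {
            return add(n.in1, con(-n.in2.con));          // x - c => x + (-c)
        }
        if (n.is("add") && n.in1.is("con") && !n.in2.is("con")) {
            return add(n.in2, n.in1);                    // c + x => x + c
        }
        return null;
    }

    /** Constant folding: an add/sub of two constants becomes a constant. */
    static Node value(Node n) {
        if ((n.is("add") || n.is("sub")) && n.in1.is("con") && n.in2.is("con")) {
            return con(n.is("add") ? n.in1.con + n.in2.con : n.in1.con - n.in2.con);
        }
        return n;
    }

    /** Identity: x + 0 => x. */
    static Node identity(Node n) {
        return (n.is("add") && n.in2.is("con") && n.in2.con == 0) ? n.in1 : n;
    }

    public Node transform(Node n) {
        for (Node next = ideal(n); next != null; next = ideal(n)) {
            n = next;                                    // 1. reshape until a fixed point
        }
        n = value(n);                                    // 2. fold constants
        n = identity(n);                                 // 3. drop redundant operations
        Node existing = table.putIfAbsent(n.key(), n);   // 4. value numbering
        return existing != null ? existing : n;
    }

    public static void main(String[] args) {
        ToyGvn gvn = new ToyGvn();
        Node a = gvn.transform(sub(variable("x"), con(3)));
        Node b = gvn.transform(add(con(-3), variable("x")));
        System.out.println(a.key() + " shared: " + (a == b));
    }
}
```

In this toy model, `x - 3` and `-3 + x` reach the same canonical shape through repeated reshaping, so the final table lookup hands back one shared node for both.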
https://github.com/openjdk/jdk/blob/ccdb9f1b160a0f49ee86c7a2714d2381d68419cc/src/hotspot/share/opto/phaseX.cpp#L828 ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From jiefu at openjdk.java.net Mon Dec 13 11:40:34 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 11:40:34 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v3] In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 10:25:11 GMT, Mai Đặng Quân Anh wrote: > Hi, unfortunately, the tests still don't verify the correctness of the compiled codes, what happened here is that the `IfNode` always goes in one direction make the other path an uncommon trap, the uncommon trap captures all nodes before it including the `VectorLongToMaskNode`, thus throw `ShouldNotReachHere` when trying to emit instructions. The result of the `VectorLongToMaskNode`, however, is only used to feed into the uncommon trap and not used in verification at all. Nice catch! @merykitty What do you think of the updated version? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 11:40:33 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 11:40:33 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v4] In-Reply-To: References: Message-ID: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks.
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6808/files - new: https://git.openjdk.java.net/jdk/pull/6808/files/11e2b2ca..579363d8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=02-03 Stats: 38 lines in 1 file changed: 11 ins; 1 del; 26 mod Patch: https://git.openjdk.java.net/jdk/pull/6808.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6808/head:pull/6808 PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at openjdk.java.net Mon Dec 13 12:18:10 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Mon, 13 Dec 2021 12:18:10 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v4] In-Reply-To: References: Message-ID: <8C385-hTENLFdyRNdRhYXaB16NKiM_41lqxxlbhFihA=.c2997b85-0ce4-47c4-8c43-5c5d6dd80ca5@github.com> On Mon, 13 Dec 2021 11:40:33 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. 
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Elegant, indeed! ------------- Marked as reviewed by merykitty at github.com (no known OpenJDK username). PR: https://git.openjdk.java.net/jdk/pull/6808 From kvn at openjdk.java.net Mon Dec 13 16:26:15 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 13 Dec 2021 16:26:15 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v4] In-Reply-To: References: Message-ID: <37tGeccVdLbmk7rZIi5OVfO9qyUciHJ6KjFFFrjAVt8=.356e013e-ec8c-4771-b296-17ee3612743d@github.com> On Mon, 13 Dec 2021 11:40:33 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. 
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Test's change is clever! Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6808 From psandoz at openjdk.java.net Mon Dec 13 16:35:10 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 13 Dec 2021 16:35:10 GMT Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v7] In-Reply-To: References: Message-ID: On Sat, 11 Dec 2021 02:57:49 GMT, Mai Đặng Quân Anh wrote: >> Hi, >> >> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. >> >> While working on this patch, I spot some regressions regarding compilation on AVX1. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > typos Marked as reviewed by psandoz (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From duke at openjdk.java.net Mon Dec 13 16:39:16 2021 From: duke at openjdk.java.net (Mai =?UTF-8?B?xJDhurduZw==?= =?UTF-8?B?IA==?= =?UTF-8?B?UXXDom4=?= Anh) Date: Mon, 13 Dec 2021 16:39:16 GMT Subject: Integrated: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" In-Reply-To: References: Message-ID: <3lJG49yeT-NAa_vNKYlH4keEZTzQ4MtwoKt8UF4TQ7s=.03866c83-67af-4aff-81b0-41114275fbc0@github.com> On Mon, 6 Dec 2021 14:21:24 GMT, Mai ??ng Qu?n Anh wrote: > Hi, > > This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes. > > While working on this patch, I spot some regressions regarding compilation on AVX1. > > Thank you very much. This pull request has now been integrated. Changeset: ca8c58c7 Author: merykitty Committer: Paul Sandoz URL: https://git.openjdk.java.net/jdk/commit/ca8c58c731959e3a1b8fe02255ed44fc1d14d565 Stats: 4127 lines in 16 files changed: 4127 ins; 0 del; 0 mod 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" Reviewed-by: psandoz, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/6724 From psandoz at openjdk.java.net Mon Dec 13 16:46:14 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 13 Dec 2021 16:46:14 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v3] In-Reply-To: References: Message-ID: <6eGWyIMoLbabvF3_AneUpaf3PhiiWFOHoI_pMfP2hKA=.bcd12773-4825-4df1-b39a-307c78357b2b@github.com> On Mon, 13 Dec 2021 11:36:46 GMT, Jie Fu wrote: >> Hi, unfortunately, the tests still don't verify the correctness of the compiled codes, what happened here is that the `IfNode` always goes in one direction make the other path an uncommon trap, the 
uncommon trap captures all nodes before it including the `VectorLongToMaskNode`, thus throw `ShouldNotReachHere` when trying to emit instructions. The result of the `VectorLongToMaskNode`, however, is only used to feed into the uncommon trap and not used in verification at all. >> >> I propose we just store the mask into a boolean array using `toArray()` and compare the result in a for loop. >> >> 0x00007f3e254efa40: mov %eax,-0x16000(%rsp) >> 0x00007f3e254efa47: push %rbp >> 0x00007f3e254efa48: sub $0x30,%rsp ;*synchronization entry >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at -1 (line 67) >> 0x00007f3e254efa4c: movzbl 0x18(%rsi),%r10d ;*getfield checked {reexecute=0 rethrow=0 return_oop=0} >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 14 (line 69) >> 0x00007f3e254efa51: test %r10d,%r10d >> 0x00007f3e254efa54: je 0x00007f3e254efa69 // If (checked != 0), we return immediately >> ;; B2: # out( N45 ) <- in( B1 ) Freq: 1 >> 0x00007f3e254efa56: add $0x30,%rsp >> 0x00007f3e254efa5a: pop %rbp >> 0x00007f3e254efa5b: cmp 0x388(%r15),%rsp ; {poll_return} >> 0x00007f3e254efa62: ja 0x00007f3e254efadc >> 0x00007f3e254efa68: ret >> ;; B3: # out( N45 ) <- in( B1 ) Freq: 4.76837e-07 >> 0x00007f3e254efa69: mov $0xffff,%r11d >> 0x00007f3e254efa6f: and 0x10(%rsi),%r11 >> 0x00007f3e254efa73: movabs $0x101010101010101,%r8 >> 0x00007f3e254efa7d: pdep %r8,%r11,%r8 >> 0x00007f3e254efa82: mov %r11,%r9 >> 0x00007f3e254efa85: vpxor %xmm1,%xmm1,%xmm1 >> 0x00007f3e254efa89: vmovq %r8,%xmm1 >> 0x00007f3e254efa8e: vmovq %r8,%xmm0 >> 0x00007f3e254efa93: movabs $0x101010101010101,%r8 >> 0x00007f3e254efa9d: shr $0x8,%r9 >> 0x00007f3e254efaa1: pdep %r8,%r9,%r8 >> 0x00007f3e254efaa6: vpinsrq $0x1,%r8,%xmm1,%xmm1 >> 0x00007f3e254efaac: vmovdqu %ymm1,%ymm0 >> 0x00007f3e254efab0: mov %rsi,%rbp >> 0x00007f3e254efab3: mov %r10d,(%rsp) >> 0x00007f3e254efab7: vmovdqu %xmm0,0x10(%rsp) ;*invokestatic maskReductionCoerced {reexecute=0 rethrow=0 return_oop=0} >> ; - 
jdk.incubator.vector.Byte128Vector$Byte128Mask::toLong at 35 (line 735) >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 21 (line 70) >> 0x00007f3e254efabd: mov $0xffffff45,%esi >> 0x00007f3e254efac2: nop >> 0x00007f3e254efac3: call 0x00007f3e254bbf20 ; ImmutableOopMap {rbp=Oop } >> ;*ifeq {reexecute=1 rethrow=0 return_oop=0} >> ; - (reexecute) org.openjdk.bench.vm.compiler.Sample::testByte64 at 17 (line 69) >> ; {runtime_call UncommonTrapBlob} > >> Hi, unfortunately, the tests still don't verify the correctness of the compiled codes, what happened here is that the `IfNode` always goes in one direction make the other path an uncommon trap, the uncommon trap captures all nodes before it including the `VectorLongToMaskNode`, thus throw `ShouldNotReachHere` when trying to emit instructions. The result of the `VectorLongToMaskNode`, however, is only used to feed into the uncommon trap and not used in verification at all. > > Nice catch! @merykitty > What do you think of the updated version? > Thanks. @DamonFool neat test fix, could you add a brief comment as to why disabling `_VectorMaskOp` is required? ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From duke at openjdk.java.net Mon Dec 13 17:39:14 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Mon, 13 Dec 2021 17:39:14 GMT Subject: RFR: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: <3JLESi4g23s3QwQkcfPgPs7WPWcePKI1keGA4AMtqJA=.323f4848-f525-488d-b37d-52442385b6d2@github.com> On Mon, 13 Dec 2021 10:17:07 GMT, Tobias Hartmann wrote: >> The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. 
So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. >> >> I also added a test case to detect this overrun. > > As Vladimir mentioned, the fix will be forward ported to JDK 19 automatically. This PR should be closed without integration. Thank you, @TobiHartmann. Closing this PR now. ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Mon Dec 13 17:39:14 2021 From: duke at openjdk.java.net (Scott Gibbons) Date: Mon, 13 Dec 2021 17:39:14 GMT Subject: Withdrawn: 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 In-Reply-To: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> References: <9vnRdXesAbnQtJ2n2zxs1o8lmhDNnGUC2FCziDqa_E0=.cacf7679-c8f5-4c7f-8a36-26600f76a219@github.com> Message-ID: On Thu, 9 Dec 2021 22:43:28 GMT, Scott Gibbons wrote: > The base64 decoder overwrites memory past the end of its output buffer in certain cases. It will not overwrite if the encoded string length is < 64 bytes. It also will not overwrite if the encoded string length mod 64 is >= 16. So the case where it *will* overwrite is when the input string length (the encoded byte length) mod 64 is less than 16. > > I also added a test case to detect this overrun. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/6786 From duke at openjdk.java.net Mon Dec 13 18:24:42 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Mon, 13 Dec 2021 18:24:42 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v3] In-Reply-To: References: Message-ID: > Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. > > `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. 
Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work.

Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:

  remove unnecessary optimizations.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6752/files
  - new: https://git.openjdk.java.net/jdk/pull/6752/files/10a51fdb..d4d2b180

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6752&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6752&range=01-02

  Stats: 10 lines in 1 file changed: 0 ins; 10 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6752.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6752/head:pull/6752

PR: https://git.openjdk.java.net/jdk/pull/6752

From kvn at openjdk.java.net Mon Dec 13 18:39:37 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 13 Dec 2021 18:39:37 GMT
Subject: RFR: 8276455: C2: iterative EA
Message-ID: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>

Resolve C2 issue with nested initialization when Escape Analysis can not scalarize allocations: `new A(new B(new C()))`
Implemented iterative EA, where C2 invokes it again if there is progress and there are candidates.

I added a JMH microbenchmark with cases which show improvements. Improvements are due to removed allocation code.

Before:

Benchmark          Mode  Cnt     Score     Error  Units
IterativeEA.test1  avgt    5    11.489 ±   3.037  ns/op
IterativeEA.test2  avgt    5    16.103 ±   3.686  ns/op
IterativeEA.test3  avgt    5  1988.827 ± 217.831  ns/op

With these changes:

IterativeEA.test1  avgt    5     2.182 ±   0.022  ns/op
IterativeEA.test2  avgt    5     2.375 ±   0.001  ns/op
IterativeEA.test3  avgt    5   821.011 ±   8.268  ns/op

Another JMH test:

PointerBenchmarkFlat.test  avgt   30  23.232 ± 6.507  ms/op

vs

PointerBenchmarkFlat.test  avgt   30   0.299 ± 0.001  ms/op

-------------

Commit messages:
 - Remove trailing spaces
 - Added IR framework test and new JMH test
 - Added JMH test
 - Merge branch 'master' into JDK-8276455
 - missed change
 - Merge branch 'master' into JDK-8276455
 - 8276455: C2: iterative EA

Changes: https://git.openjdk.java.net/jdk/pull/6222/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6222&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8276455
  Stats: 571 lines in 10 files changed: 541 ins; 14 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6222.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6222/head:pull/6222

PR: https://git.openjdk.java.net/jdk/pull/6222

From iveresov at openjdk.java.net Mon Dec 13 18:39:37 2021
From: iveresov at openjdk.java.net (Igor Veresov)
Date: Mon, 13 Dec 2021 18:39:37 GMT
Subject: RFR: 8276455: C2: iterative EA
In-Reply-To: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
References: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
Message-ID: <17H-Dr3GevMTqymlVFD3xMieM8YcQWNkhtu95i3rxmg=.20e43a12-0469-4e59-ab8e-d80001043dff@github.com>

On Wed, 3 Nov 2021 01:44:49 GMT, Vladimir Kozlov wrote:

> Resolve C2 issue with nested initialization when Escape Analysis can not scalarize allocations: `new A(new B(new C()))`
> Implemented iterative EA, where C2 invokes it again if there is progress and there are candidates.
>
> I added a JMH microbenchmark with cases which show improvements. Improvements are due to removed allocation code.
>
> Before:
>
> Benchmark          Mode  Cnt     Score     Error  Units
> IterativeEA.test1  avgt    5    11.489 ±   3.037  ns/op
> IterativeEA.test2  avgt    5    16.103 ±   3.686  ns/op
> IterativeEA.test3  avgt    5  1988.827 ± 217.831  ns/op
>
> With these changes:
>
> IterativeEA.test1  avgt    5     2.182 ±   0.022  ns/op
> IterativeEA.test2  avgt    5     2.375 ±   0.001  ns/op
> IterativeEA.test3  avgt    5   821.011 ±   8.268  ns/op
>
>
> Another JMH test:
>
> PointerBenchmarkFlat.test  avgt   30  23.232 ± 6.507  ms/op
>
> vs
>
> PointerBenchmarkFlat.test  avgt   30   0.299 ± 0.001  ms/op

Looks reasonable.

-------------

Marked as reviewed by iveresov (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6222

From neliasso at openjdk.java.net Mon Dec 13 18:39:38 2021
From: neliasso at openjdk.java.net (Nils Eliasson)
Date: Mon, 13 Dec 2021 18:39:38 GMT
Subject: RFR: 8276455: C2: iterative EA
In-Reply-To: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
References: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
Message-ID: <2d-lNLe2ghu1xNAyWiKuyGiU8ds6DIFmUI_HdGlhBV8=.6fa10cf3-0e96-4f14-b7cb-e1953fc5adc3@github.com>

On Wed, 3 Nov 2021 01:44:49 GMT, Vladimir Kozlov wrote:

> Resolve C2 issue with nested initialization when Escape Analysis can not scalarize allocations: `new A(new B(new C()))`
> Implemented iterative EA, where C2 invokes it again if there is progress and there are candidates.
>
> I added a JMH microbenchmark with cases which show improvements. Improvements are due to removed allocation code.
>
> Before:
>
> Benchmark          Mode  Cnt     Score     Error  Units
> IterativeEA.test1  avgt    5    11.489 ±   3.037  ns/op
> IterativeEA.test2  avgt    5    16.103 ±   3.686  ns/op
> IterativeEA.test3  avgt    5  1988.827 ± 217.831  ns/op
>
> With these changes:
>
> IterativeEA.test1  avgt    5     2.182 ±   0.022  ns/op
> IterativeEA.test2  avgt    5     2.375 ±   0.001  ns/op
> IterativeEA.test3  avgt    5   821.011 ±   8.268  ns/op
>
>
> Another JMH test:
>
> PointerBenchmarkFlat.test  avgt   30  23.232 ± 6.507  ms/op
>
> vs
>
> PointerBenchmarkFlat.test  avgt   30   0.299 ± 0.001  ms/op

Looks good

-------------

Marked as reviewed by neliasso (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/6222 From dcubed at openjdk.java.net Mon Dec 13 20:39:47 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 13 Dec 2021 20:39:47 GMT Subject: Integrated: 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 Message-ID: A trivial fix to ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64. ------------- Commit messages: - 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 Changes: https://git.openjdk.java.net/jdk/pull/6817/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6817&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278630 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6817.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6817/head:pull/6817 PR: https://git.openjdk.java.net/jdk/pull/6817 From psandoz at openjdk.java.net Mon Dec 13 20:39:48 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 13 Dec 2021 20:39:48 GMT Subject: Integrated: 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 20:29:54 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64. Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6817 From dcubed at openjdk.java.net Mon Dec 13 20:39:49 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 13 Dec 2021 20:39:49 GMT Subject: Integrated: 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 In-Reply-To: References: Message-ID: <_nQyUOJpn1Zs-9PQvbW9wczzhfmZat-iQAnjRjXHCZw=.4a60d8d6-14fa-404f-b428-35f5049e758d@github.com> On Mon, 13 Dec 2021 20:31:32 GMT, Paul Sandoz wrote: >> A trivial fix to ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64. 
> > Marked as reviewed by psandoz (Reviewer). @PaulSandoz - Thanks for the fast review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6817 From dcubed at openjdk.java.net Mon Dec 13 20:39:50 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 13 Dec 2021 20:39:50 GMT Subject: Integrated: 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 20:29:54 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64. This pull request has now been integrated. Changeset: bdc784c0 Author: Daniel D. Daugherty URL: https://git.openjdk.java.net/jdk/commit/bdc784c0cb02d76c6d3a1608a89f4b64f86253eb Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 Reviewed-by: psandoz ------------- PR: https://git.openjdk.java.net/jdk/pull/6817 From dlong at openjdk.java.net Mon Dec 13 20:45:19 2021 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 13 Dec 2021 20:45:19 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: <_3h9tygAJr9Cs_B6woGTkr1YrpHwqIe1bxkveOhlEpc=.609ae56a-7a6e-468c-b7dc-d2e1fb90ea65@github.com> On Fri, 10 Dec 2021 17:09:31 GMT, Igor Veresov wrote: >> C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. >> Other changes: >> use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer >> improve VerifyStack failure output > > Have you run tier7 where is it does -XX:DeoptimizeALot ? @veresov, tier7 testing passed. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From iveresov at openjdk.java.net Mon Dec 13 20:52:25 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Mon, 13 Dec 2021 20:52:25 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output Marked as reviewed by iveresov (Reviewer). Great! Thanks for checking! ------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From rkennke at openjdk.java.net Mon Dec 13 21:01:42 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Mon, 13 Dec 2021 21:01:42 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors Message-ID: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call. 
Testing: - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) - [ ] tier1 (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) ------------- Commit messages: - 8278489: Preserve result in native wrapper with +UseHeavyMonitors Changes: https://git.openjdk.java.net/jdk18/pull/16/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=16&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278489 Stats: 12 lines in 2 files changed: 4 ins; 0 del; 8 mod Patch: https://git.openjdk.java.net/jdk18/pull/16.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/16/head:pull/16 PR: https://git.openjdk.java.net/jdk18/pull/16 From dholmes at openjdk.java.net Mon Dec 13 22:50:26 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Mon, 13 Dec 2021 22:50:26 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors In-Reply-To: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Message-ID: On Mon, 13 Dec 2021 20:54:55 GMT, Roman Kennke wrote: > Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call. > > Testing: > - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) > - [ ] tier1 (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) Not a review as not my area but saw some nits with the adjusted code. 
David src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1867: > 1865: } > 1866: > 1867: // Must save rax, if if it is live now because cmpxchg must use it Existing typo: if if src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1870: > 1868: if (ret_type != T_FLOAT && ret_type != T_DOUBLE && ret_type != T_VOID) { > 1869: save_native_result(masm, ret_type, stack_slots); > 1870: } Indentation looks wrong src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2069: > 2067: } > 2068: > 2069: // Must save rax if if it is live now because cmpxchg must use it existing typo: if if ------------- PR: https://git.openjdk.java.net/jdk18/pull/16 From jiefu at openjdk.java.net Mon Dec 13 23:36:49 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 23:36:49 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v5] In-Reply-To: References: Message-ID: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. 
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Add comment for -XX:DisableIntrinsic=_VectorMaskOp ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6808/files - new: https://git.openjdk.java.net/jdk/pull/6808/files/579363d8..392fd34d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6808&range=03-04 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6808.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6808/head:pull/6808 PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Mon Dec 13 23:36:50 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 13 Dec 2021 23:36:50 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v3] In-Reply-To: References: Message-ID: <44Klx4xvYhf-Ed1NVmUSm_9HPM4A-3GU815szLS4yuQ=.74db2e0c-e817-4429-8bea-ce861c73d0bc@github.com> On Mon, 13 Dec 2021 11:36:46 GMT, Jie Fu wrote: >> Hi, unfortunately, the tests still don't verify the correctness of the compiled codes, what happened here is that the `IfNode` always goes in one direction make the other path an uncommon trap, the uncommon trap captures all nodes before it including the `VectorLongToMaskNode`, thus throw `ShouldNotReachHere` when trying to emit instructions. The result of the `VectorLongToMaskNode`, however, is only used to feed into the uncommon trap and not used in verification at all. 
>> >> I propose we just store the mask into a boolean array using `toArray()` and compare the result in a for loop. >> >> 0x00007f3e254efa40: mov %eax,-0x16000(%rsp) >> 0x00007f3e254efa47: push %rbp >> 0x00007f3e254efa48: sub $0x30,%rsp ;*synchronization entry >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at -1 (line 67) >> 0x00007f3e254efa4c: movzbl 0x18(%rsi),%r10d ;*getfield checked {reexecute=0 rethrow=0 return_oop=0} >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 14 (line 69) >> 0x00007f3e254efa51: test %r10d,%r10d >> 0x00007f3e254efa54: je 0x00007f3e254efa69 // If (checked != 0), we return immediately >> ;; B2: # out( N45 ) <- in( B1 ) Freq: 1 >> 0x00007f3e254efa56: add $0x30,%rsp >> 0x00007f3e254efa5a: pop %rbp >> 0x00007f3e254efa5b: cmp 0x388(%r15),%rsp ; {poll_return} >> 0x00007f3e254efa62: ja 0x00007f3e254efadc >> 0x00007f3e254efa68: ret >> ;; B3: # out( N45 ) <- in( B1 ) Freq: 4.76837e-07 >> 0x00007f3e254efa69: mov $0xffff,%r11d >> 0x00007f3e254efa6f: and 0x10(%rsi),%r11 >> 0x00007f3e254efa73: movabs $0x101010101010101,%r8 >> 0x00007f3e254efa7d: pdep %r8,%r11,%r8 >> 0x00007f3e254efa82: mov %r11,%r9 >> 0x00007f3e254efa85: vpxor %xmm1,%xmm1,%xmm1 >> 0x00007f3e254efa89: vmovq %r8,%xmm1 >> 0x00007f3e254efa8e: vmovq %r8,%xmm0 >> 0x00007f3e254efa93: movabs $0x101010101010101,%r8 >> 0x00007f3e254efa9d: shr $0x8,%r9 >> 0x00007f3e254efaa1: pdep %r8,%r9,%r8 >> 0x00007f3e254efaa6: vpinsrq $0x1,%r8,%xmm1,%xmm1 >> 0x00007f3e254efaac: vmovdqu %ymm1,%ymm0 >> 0x00007f3e254efab0: mov %rsi,%rbp >> 0x00007f3e254efab3: mov %r10d,(%rsp) >> 0x00007f3e254efab7: vmovdqu %xmm0,0x10(%rsp) ;*invokestatic maskReductionCoerced {reexecute=0 rethrow=0 return_oop=0} >> ; - jdk.incubator.vector.Byte128Vector$Byte128Mask::toLong at 35 (line 735) >> ; - org.openjdk.bench.vm.compiler.Sample::testByte64 at 21 (line 70) >> 0x00007f3e254efabd: mov $0xffffff45,%esi >> 0x00007f3e254efac2: nop >> 0x00007f3e254efac3: call 0x00007f3e254bbf20 ; ImmutableOopMap {rbp=Oop } >> 
;*ifeq {reexecute=1 rethrow=0 return_oop=0} >> ; - (reexecute) org.openjdk.bench.vm.compiler.Sample::testByte64 at 17 (line 69) >> ; {runtime_call UncommonTrapBlob} > >> Hi, unfortunately, the tests still don't verify the correctness of the compiled codes, what happened here is that the `IfNode` always goes in one direction make the other path an uncommon trap, the uncommon trap captures all nodes before it including the `VectorLongToMaskNode`, thus throw `ShouldNotReachHere` when trying to emit instructions. The result of the `VectorLongToMaskNode`, however, is only used to feed into the uncommon trap and not used in verification at all. > > Nice catch! @merykitty > What do you think of the updated version? > Thanks. > @DamonFool neat test fix, could you add a brief comment as to why disabling `_VectorMaskOp` is required? Thanks @PaulSandoz for your testing. The comment had been added. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From psandoz at openjdk.java.net Tue Dec 14 01:19:18 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 14 Dec 2021 01:19:18 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v5] In-Reply-To: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> References: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> Message-ID: On Mon, 13 Dec 2021 23:36:49 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. 
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Add comment for -XX:DisableIntrinsic=_VectorMaskOp Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Tue Dec 14 01:41:16 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 01:41:16 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v5] In-Reply-To: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> References: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> Message-ID: On Mon, 13 Dec 2021 23:36:49 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. 
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Add comment for -XX:DisableIntrinsic=_VectorMaskOp Thank you all for the review and comment. @jatin-bhateja , are you OK with the fix? Will push it tomorrow if there is no objection. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jwilhelm at openjdk.java.net Tue Dec 14 01:49:19 2021 From: jwilhelm at openjdk.java.net (Jesper Wilhelmsson) Date: Tue, 14 Dec 2021 01:49:19 GMT Subject: RFR: Merge jdk18 Message-ID: Forwardport JDK 18 -> JDK 19 ------------- Commit messages: - Merge - 8132785: java/lang/management/ThreadMXBean/ThreadLists.java fails intermittently - 8273108: RunThese24H crashes with SEGV in markWord::displaced_mark_helper() after JDK-8268276 - 8278580: ProblemList javax/swing/JTree/4908142/bug4908142.java on macosx-x64 - 8277299: STACK_OVERFLOW in Java_sun_awt_shell_Win32ShellFolder2_getIconBits The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. 
Changes: https://git.openjdk.java.net/jdk/pull/6824/files Stats: 119 lines in 6 files changed: 85 ins; 1 del; 33 mod Patch: https://git.openjdk.java.net/jdk/pull/6824.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6824/head:pull/6824 PR: https://git.openjdk.java.net/jdk/pull/6824 From jwilhelm at openjdk.java.net Tue Dec 14 02:21:19 2021 From: jwilhelm at openjdk.java.net (Jesper Wilhelmsson) Date: Tue, 14 Dec 2021 02:21:19 GMT Subject: Integrated: Merge jdk18 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 01:35:43 GMT, Jesper Wilhelmsson wrote: > Forwardport JDK 18 -> JDK 19 This pull request has now been integrated. Changeset: 8401a059 Author: Jesper Wilhelmsson URL: https://git.openjdk.java.net/jdk/commit/8401a059bd01b32e3532f806d3d8b60e851c468a Stats: 119 lines in 6 files changed: 85 ins; 1 del; 33 mod Merge ------------- PR: https://git.openjdk.java.net/jdk/pull/6824 From jwilhelm at openjdk.java.net Tue Dec 14 02:21:17 2021 From: jwilhelm at openjdk.java.net (Jesper Wilhelmsson) Date: Tue, 14 Dec 2021 02:21:17 GMT Subject: RFR: Merge jdk18 [v2] In-Reply-To: References: Message-ID: > Forwardport JDK 18 -> JDK 19 Jesper Wilhelmsson has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 29 additional commits since the last revision: - Merge - 8278275: Initial nroff manpage generation for JDK 19 Reviewed-by: erikj, jjg, iris - 8278630: ProblemList compiler/vectorapi/reshape/TestVectorCastAVX512.java on X64 Reviewed-by: psandoz - 8269556: sun/tools/jhsdb/JShellHeapDumpTest.java fails with RuntimeException 'JShellToolProvider' missing from stdout/stderr Reviewed-by: kevinw, sspitsyn, amenkov - 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" Reviewed-by: psandoz, chagedorn - 8276241: JVM does not flag constant class entries ending in '/' Reviewed-by: dholmes, lfoltan - 8277481: Obsolete seldom used CDS flags Reviewed-by: iklam, ccheung, dholmes - 8271079: JavaFileObject#toUri and multi-release jars Reviewed-by: jjg, lancea, alanb - 8278482: G1: Improve HeapRegion::block_is_obj Reviewed-by: sjohanss, tschatzl, mli - 8278344: sun/security/pkcs12/KeytoolOpensslInteropTest.java test fails because of different openssl output Reviewed-by: mdoerr, goetz, stuefe - ... 
and 19 more: https://git.openjdk.java.net/jdk/compare/9d2883a1...dc02f07e

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6824/files
  - new: https://git.openjdk.java.net/jdk/pull/6824/files/dc02f07e..dc02f07e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6824&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6824&range=00-01

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6824.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6824/head:pull/6824

PR: https://git.openjdk.java.net/jdk/pull/6824

From duke at openjdk.java.net Tue Dec 14 03:04:13 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Tue, 14 Dec 2021 03:04:13 GMT
Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v2]
In-Reply-To: <3E6-DIcs8gGaF-Odpoa3r3NdeT_GQui3hISZLgS2B8k=.53ec14bb-99ce-4632-b253-f0c79972e1f7@github.com>
References: <3E6-DIcs8gGaF-Odpoa3r3NdeT_GQui3hISZLgS2B8k=.53ec14bb-99ce-4632-b253-f0c79972e1f7@github.com>
Message-ID:

On Mon, 13 Dec 2021 10:53:27 GMT, Mai Đặng Quân Anh wrote:

> Hi, I think just deleting those 2 should be okay.
>
> Regarding idealisation, if you look into `PhaseGVN::transform`, it is clear that `Node::Ideal` is called repeatedly until nothing changes anymore. Then `Node::Value` is called to do constant propagation before `Node::Identity` is called to get rid of redundant operations. Finally, value numbering is the transformation which deduplicates all equivalent `Node`s into a single one.
>
> https://github.com/openjdk/jdk/blob/ccdb9f1b160a0f49ee86c7a2714d2381d68419cc/src/hotspot/share/opto/phaseX.cpp#L828

Thank you so much for answering both my questions!
------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From dlong at openjdk.java.net Tue Dec 14 03:20:16 2021 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 14 Dec 2021 03:20:16 GMT Subject: [jdk18] RFR: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output Thanks Igor. ------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From dlong at openjdk.java.net Tue Dec 14 03:20:17 2021 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 14 Dec 2021 03:20:17 GMT Subject: [jdk18] Integrated: 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" In-Reply-To: References: Message-ID: <_5bKZz2Z9PjNsyNevpvEqoszrsT0SxKFZOECydTxJZ4=.b5ff194c-a8a9-46ea-9d46-7227325891d7@github.com> On Fri, 10 Dec 2021 03:41:24 GMT, Dean Long wrote: > C1 patching stubs use the Unpack_reexecute deoptimization type, but if an exception is thrown, that information is lost, causing the VerifyStack logic to fail. Rather than relaxing the VerifyStack logic to accept Unpack_exception in this case, this change sets the reexecute flag on the patch stub call site. > Other changes: > use runtime/BootstrapMethod/BSMCalledTwice.java as a reproducer > improve VerifyStack failure output This pull request has now been integrated. 
Changeset: 32139c1a Author: Dean Long URL: https://git.openjdk.java.net/jdk18/commit/32139c1a8aae51c0869f41be57580ff4463913d2 Stats: 23 lines in 5 files changed: 16 ins; 1 del; 6 mod 8262134: compiler/uncommontrap/TestDeoptOOM.java failed with "guarantee(false) failed: wrong number of expression stack elements during deopt" Reviewed-by: kvn, iveresov ------------- PR: https://git.openjdk.java.net/jdk18/pull/7 From dlong at openjdk.java.net Tue Dec 14 03:38:36 2021 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 14 Dec 2021 03:38:36 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 08:25:50 GMT, Roland Westrelin wrote: >> Root cause is identical to 8273165 AFIU: late inline of a virtual call >> can throw from 2 different paths (null check and the call >> itself). That breaks because the logic for exceptions expects the >> stack for all paths that throw exceptions to have the same stack size. >> >> AFAIU, the stack doesn't matter exception handling: either the >> exception is caught by a exception handler and then the stack is >> popped and the exception is pushed or, the exception is rethrown to >> the caller in which case the current stack is also popped (that is the >> jvm state for the current method). As a consequence the fix I propose >> is to ignore the stack in GraphKit::combine_exception_states(). >> >> AFAIU, the same fix would work for 8273165 but I left the current work >> around as is: not sure if we want to be conservative for now or not > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - comment > - Merge branch 'master' into JDK-8275638 > - alternate fix > - make test runnable with release build > - more > - fix @roland, I've asked John Rose to take a look at this too. I'm still staring at the code trying to figure out why parse-time inlining can avoid with this problem, but late inlining can't. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From jrose at openjdk.java.net Tue Dec 14 04:53:18 2021 From: jrose at openjdk.java.net (John R Rose) Date: Tue, 14 Dec 2021 04:53:18 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 08:25:50 GMT, Roland Westrelin wrote: >> Root cause is identical to 8273165 AFIU: late inline of a virtual call >> can throw from 2 different paths (null check and the call >> itself). That breaks because the logic for exceptions expects the >> stack for all paths that throw exceptions to have the same stack size. >> >> AFAIU, the stack doesn't matter exception handling: either the >> exception is caught by a exception handler and then the stack is >> popped and the exception is pushed or, the exception is rethrown to >> the caller in which case the current stack is also popped (that is the >> jvm state for the current method). As a consequence the fix I propose >> is to ignore the stack in GraphKit::combine_exception_states(). >> >> AFAIU, the same fix would work for 8273165 but I left the current work >> around as is: not sure if we want to be conservative for now or not > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - comment > - Merge branch 'master' into JDK-8275638 > - alternate fix > - make test runnable with release build > - more > - fix It is true that when an exception is being thrown the stack is clear. The only reason C2 sometimes keeps (or used to keep) stuff on the stack along an exception path is that, when an exception in being thrown by a bytecode that is re-executable, C2 might cleverly issue an uncommon trap that re-executes the throwing bytecode. In that case, having the stack unchanged is obviously important. But call instructions are not re-executable, and any exception coming from a call (directly or indirectly) must clear the stack. Even if there is a matching local handler in the method making the call, such as a `catch (NullPointerException _)` where a call might throw NPE, the correct move is to clear the stack when the throw is started (by the call), and then thread the JVMS (with empty stack) into the handler. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From svkamath at openjdk.java.net Tue Dec 14 06:23:48 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 14 Dec 2021 06:23:48 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 Message-ID: The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. 
------------- Commit messages: - Fix for JDK:8274323 TestAESMain fails with invalid offset Changes: https://git.openjdk.java.net/jdk18/pull/19/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8274323 Stats: 106 lines in 2 files changed: 31 ins; 26 del; 49 mod Patch: https://git.openjdk.java.net/jdk18/pull/19.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/19/head:pull/19 PR: https://git.openjdk.java.net/jdk18/pull/19 From xliu at openjdk.java.net Tue Dec 14 06:29:21 2021 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 14 Dec 2021 06:29:21 GMT Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute' In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 04:21:51 GMT, Guoxiong Li wrote: > Hi all, > > Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. > > The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) > And the pre-submit tests passed before submitting the PR. > > Thanks for taking the time to review. > > Best Regards, > -- Guoxiong LGTM. I am not a reviewer. We still need other reviewers to approve it. hi, @lgxbslgx your patch looks good to me. ------------- Marked as reviewed by xliu (Committer). 
PR: https://git.openjdk.java.net/jdk/pull/6807

From thartmann at openjdk.java.net Tue Dec 14 07:05:18 2021
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 14 Dec 2021 07:05:18 GMT
Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v7]
In-Reply-To:
References:
Message-ID:

On Sat, 11 Dec 2021 02:57:49 GMT, Mai Đặng Quân Anh wrote:

>> Hi,
>>
>> This patch adds several c2 tests for vector reshape operations. The tests verify the intrinsification of the corresponding operations by using the IR framework and verify the correctness of the results of compiled codes.
>>
>> While working on this patch, I spot some regressions regarding compilation on AVX1.
>>
>> Thank you very much.
>
> Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision:
>
> typos

This caused a regression: https://bugs.openjdk.java.net/browse/JDK-8278623

@merykitty Please have a look.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6724

From duke at openjdk.java.net Tue Dec 14 08:03:17 2021
From: duke at openjdk.java.net (Mai Đặng Quân Anh)
Date: Tue, 14 Dec 2021 08:03:17 GMT
Subject: RFR: 8259610: VectorReshapeTests are not effective due to failing to intrinsify "VectorSupport.convert" [v7]
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 07:02:27 GMT, Tobias Hartmann wrote:

>> Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision:
>>
>> typos
>
> This caused a regression: https://bugs.openjdk.java.net/browse/JDK-8278623
> @merykitty Please have a look.

Hi @TobiHartmann , fix understood, I will submit a fix soon. Thanks.
-------------

PR: https://git.openjdk.java.net/jdk/pull/6724

From jiefu at openjdk.java.net Tue Dec 14 08:23:16 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Tue, 14 Dec 2021 08:23:16 GMT
Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v3]
In-Reply-To:
References:
Message-ID: <8Qv4VPbE3iRhXa0hidjSVcG8T_j9zRh5r6sDSk4jDbk=.d1af02d4-f742-4bc0-acbe-1c08bcb745ed@github.com>

On Mon, 13 Dec 2021 18:24:42 GMT, Zhiqiang Zang wrote:

>> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would never be covered.
>>
>> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work.
>
> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>
> remove unnecessary optimizations.

LGTM

The removed rules could never be reached before, so removing them won't make performance any worse.
And the removed opts are already done by (AddNode::IdealIL + SubINode::Ideal).

However, it would be better to change the JBS title to something like 'Remove unreached rules in AddNode::IdealIL'.

Thanks.

-------------

Marked as reviewed by jiefu (Reviewer).
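
The subsumption argument quoted above can be checked with plain Java arithmetic. The sketch below is not the C2 code; the class and method names are hypothetical, and it assumes the general rule rewrites `(a - b) + (c - d)` into `(a + c) - (b + d)`. It only illustrates that the two special forms also match the general pattern, which is why a general rule tried first would shadow them.

```java
// Illustrative sketch only; names are hypothetical and this is not HotSpot code.
final class AddSubIdentities {
    // Assumed general rewrite: (a - b) + (c - d) ==> (a + c) - (b + d)
    static int general(int a, int b, int c, int d) {
        return (a + c) - (b + d);
    }

    // Special case 1: (a - b) + (b - c) ==> a - c
    static int sharedMiddle(int a, int c) {
        return a - c;
    }

    // Special case 2: (a - b) + (c - a) ==> c - b
    static int cancelOuter(int b, int c) {
        return c - b;
    }

    // Both special shapes are instances of the general shape, so each
    // result must agree with the general rewrite as well (int arithmetic
    // wraps around, so the identities hold even on overflow).
    static boolean check(int a, int b, int c) {
        int shape1 = (a - b) + (b - c);
        int shape2 = (a - b) + (c - a);
        return shape1 == sharedMiddle(a, c)
            && shape1 == general(a, b, b, c)
            && shape2 == cancelOuter(b, c)
            && shape2 == general(a, b, c, a);
    }

    public static void main(String[] args) {
        System.out.println(check(5, 3, 9));
        System.out.println(check(Integer.MAX_VALUE, Integer.MIN_VALUE, 7));
    }
}
```

The general rewrite still needs two additions and a subtraction, while the special rewrites need a single subtraction, so trying the special patterns first yields the cheaper form.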
PR: https://git.openjdk.java.net/jdk/pull/6752

From roland at openjdk.java.net Tue Dec 14 08:48:21 2021
From: roland at openjdk.java.net (Roland Westrelin)
Date: Tue, 14 Dec 2021 08:48:21 GMT
Subject: RFR: 8276455: C2: iterative EA
In-Reply-To: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
References: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com>
Message-ID:

On Wed, 3 Nov 2021 01:44:49 GMT, Vladimir Kozlov wrote:

> Resolve C2 issue with nested initialization when Escape Analysis can not scalarize allocations: `new A(new B(new C()))`
> Implemented iterative EA when C2 invokes it again if there are progress and candidates.
>
> I added JMH microbenchmark with cases which show improvements. Improvements are due to removed allocation code.
>
> Before:
>
> Benchmark          Mode  Cnt     Score     Error  Units
> IterativeEA.test1  avgt    5    11.489 ±   3.037  ns/op
> IterativeEA.test2  avgt    5    16.103 ±   3.686  ns/op
> IterativeEA.test3  avgt    5  1988.827 ± 217.831  ns/op
>
> With these changes:
>
> IterativeEA.test1  avgt    5     2.182 ±   0.022  ns/op
> IterativeEA.test2  avgt    5     2.375 ±   0.001  ns/op
> IterativeEA.test3  avgt    5   821.011 ±   8.268  ns/op
>
>
> Another JMH test:
>
> PointerBenchmarkFlat.test avgt 30 23.232 ± 6.507 ms/op
>
> vs
>
> PointerBenchmarkFlat.test avgt 30 0.299 ± 0.001 ms/op

That looks good to me. Have you measured how compile time is affected?

-------------

Marked as reviewed by roland (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6222

From pli at openjdk.java.net Tue Dec 14 08:56:43 2021
From: pli at openjdk.java.net (Pengfei Li)
Date: Tue, 14 Dec 2021 08:56:43 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization
Message-ID:

### Background

Post loop vectorization is a C2 compiler optimization in an experimental VM feature called PostLoopMultiversioning. It transforms the range-check eliminated post loop to a 1-iteration vectorized loop with vector mask.
This optimization was contributed by Intel in 2016 to support x86 AVX512 masked vector instructions. However, it was disabled soon after an issue was found. Due to insufficient maintenance since then, multiple bugs have accumulated in it. But we (Arm) still think this is a useful framework for vector mask support in C2 auto-vectorized loops, for both x86 AVX512 and AArch64 SVE. Hence, we propose this patch to fix and re-enable post loop vectorization.

### Changes in this patch

This patch reworks post loop vectorization. The most significant change is removing vector mask support in the C2 x86 backend and re-implementing it in the mid-end. With this, we can re-enable post loop vectorization for platforms other than x86.

The previous implementation hard-codes the x86 k1 register as a reserved AVX512 opmask register and defines two routines (setvectmask/restorevectmask) to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251), which encodes AVX512 instructions as unmasked by default, the generated vector masks are no longer used in AVX512 vector instructions. To fix the incorrect codegen and add vector mask support for more platforms, we instead add a vector mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode to generate a mask and replace all Load/Store nodes in the post loop with LoadVectorMasked/StoreVectorMasked nodes that take that mask as input. This IR form is exactly the same as the one used in VectorAPI mask support. For now, we only add mask inputs to Load/Store nodes because reduction operations are not supported in post loop vectorization. After this change, the x86 k1 register is no longer reserved and can be allocated when PostLoopMultiversioning is enabled.

Besides this change, we have fixed a compiler crash and five incorrect result issues with post loop vectorization.
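To make the masked form concrete, here is a minimal plain-Java simulation of what a 1-iteration masked post loop computes. It is a sketch under stated assumptions, not C2 code: `VLEN`, `maskGen`, `loadMasked` and `storeMasked` are hypothetical names standing in for the vector length and for the roles played by VectorMaskGenNode, LoadVectorMasked and StoreVectorMasked in the IR.

```java
import java.util.Arrays;

final class MaskedPostLoopSketch {
    static final int VLEN = 8; // assumed number of int lanes per vector

    // Stand-in for VectorMaskGenNode: lane j is active iff j < remaining.
    static boolean[] maskGen(int remaining) {
        boolean[] mask = new boolean[VLEN];
        for (int j = 0; j < VLEN; j++) {
            mask[j] = j < remaining;
        }
        return mask;
    }

    // Stand-in for LoadVectorMasked: inactive lanes read nothing and stay 0.
    static int[] loadMasked(int[] src, int base, boolean[] mask) {
        int[] v = new int[VLEN];
        for (int j = 0; j < VLEN; j++) {
            if (mask[j]) {
                v[j] = src[base + j];
            }
        }
        return v;
    }

    // Stand-in for StoreVectorMasked: inactive lanes leave memory untouched.
    static void storeMasked(int[] dst, int base, int[] v, boolean[] mask) {
        for (int j = 0; j < VLEN; j++) {
            if (mask[j]) {
                dst[base + j] = v[j];
            }
        }
    }

    // One vector-wide step of c[i] = a[i] * b[i] under the given mask.
    static void step(int[] a, int[] b, int[] c, int base, boolean[] mask) {
        int[] va = loadMasked(a, base, mask);
        int[] vb = loadMasked(b, base, mask);
        int[] vc = new int[VLEN];
        for (int j = 0; j < VLEN; j++) {
            vc[j] = va[j] * vb[j];
        }
        storeMasked(c, base, vc, mask);
    }

    // Full-width "main loop" followed by a single masked "post loop" iteration.
    static int[] multiply(int[] a, int[] b) {
        int n = a.length;
        int[] c = new int[n];
        boolean[] allLanes = maskGen(VLEN);
        int i = 0;
        for (; i + VLEN <= n; i += VLEN) {
            step(a, b, c, i, allLanes);
        }
        if (i < n) {
            step(a, b, c, i, maskGen(n - i)); // handles the remaining n - i elements
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = new int[19];
        int[] b = new int[19];
        for (int i = 0; i < a.length; i++) {
            a[i] = i + 1;
            b[i] = 2 * i - 5;
        }
        System.out.println(Arrays.toString(multiply(a, b)));
    }
}
```

Because inactive lanes neither read nor write memory, the last step never touches elements past the end of the arrays, which is why a single masked iteration can replace the scalar post loop.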
**I) C2 crashes with segmentation fault in strip-mined loops**

The previous implementation was done before C2 loop strip-mining was merged into JDK master, so it didn't take strip-mined loops into consideration. In C2's strip-mined loops, the post loop is not the sibling of the main loop in the ideal loop tree. Instead, it's the sibling of the main loop's parent. This patch fixes a SIGSEGV caused by a NULL pointer when locating the post loop from a strip-mined main loop.

**II) Incorrect result issues with post loop vectorization**

We have also fixed five incorrect vectorization issues. Some of them are hidden deep and can only be reproduced with corner cases. These issues share a common cause: the code assumes that the post loop can be vectorized if vectorization of the corresponding main loop succeeds. But in many cases this assumption is wrong. Details are below.

- **[Issue-1] Incorrect vectorization for partial vectorizable loops**

This issue can be reproduced by the loop below, where only some operations in the loop body are vectorizable.

    for (int i = 0; i < 10000; i++) {
      res[i] = a[i] * b[i];
      k = 3 * k + 1;
    }

In the main loop, superword works well if some of the operations in the loop body are not vectorizable, since those parts can simply be unrolled. But for post loops, we don't create vectors by combining scalar IRs generated from loop unrolling. Instead, we replace scalars with vectors for all operations in the loop body. Hence, all operations should be either vectorized together or not vectorized at all. To fix cases like this, we add an extra field "_slp_vector_pack_count" in CountedLoopNode to record the eventual count of vector packs in the main loop. This value is then passed to the post loop and compared with the post loop pack count. Vectorization bails out in the post loop if it creates more vector packs than in the main loop.
- **[Issue-2] Incorrect result in loops with growing-down vectors**

This issue appears with growing-down vectors, that is, vectors that grow towards smaller memory addresses as the loop iterates. It can be reproduced by the counting-up loop below, which has a negative scale value in the array index.

    for (int i = 0; i < 10000; i++) {
      a[MAX - i] = b[MAX - i];
    }

The cause of this issue is that for a growing-down vector, the generated vector mask value has reversed vector-lane order, so it masks the wrong vector lanes. Note that if a negative scale value appears in a counting-down loop, the vector grows up. Based on this rule, we fix the issue by only allowing positive array index scales in counting-up loops and negative array index scales in counting-down loops. This check is done with the help of SWPointer by comparing the scale value of each memory access in the loop with the loop stride value.

- **[Issue-3] Incorrect result in manually unrolled loops**

This issue can be reproduced by the manually unrolled loop below.

    for (int i = 0; i < 10000; i += 2) {
      c[i] = a[i] + b[i];
      c[i + 1] = a[i + 1] * b[i + 1];
    }

In this loop, operations in the 2nd statement duplicate those in the 1st statement with a small memory address offset. Vectorization in the main loop works well in this case because C2 does further unrolling and pack combination. But we cannot vectorize the post loop by replacing scalars with vectors because that creates duplicated vector operations. To fix this, we restrict post loop vectorization to loops with stride values of 1 or -1.

- **[Issue-4] Incorrect result in loops with mixed vector element sizes**

This issue was found after we enabled post loop vectorization for AArch64. It's reproducible by multiple array operations with different element sizes inside a loop. On x86, there is no issue because the values of x86 AVX512 opmasks only depend on which vector lanes are active. But AArch64 is different: the values of SVE predicates also depend on the lane size of the vector.
Hence, on AArch64 SVE, if a loop has mixed vector element sizes, we should use different vector masks. For now, we only support loops with a single vector element size, i.e., "int + float" vectors in a single loop are ok but "int + double" vectors in a single loop are not vectorizable. This fix also enables subword vector support to make all primitive type array operations vectorizable.

- **[Issue-5] Incorrect result in loops with potential data dependence**

This issue can be reproduced by the corner case below, on AArch64 only.

    for (int i = 0; i < 10000; i++) {
      a[i] = x;
      a[i + OFFSET] = y;
    }

In this case, the two stores in the loop have a data dependence if the OFFSET value is smaller than the vector length. So we cannot vectorize by replacing scalars with vectors. But the main loop vectorization in this case is successful on AArch64 because AArch64 has partial vector load/store support. It splits a vector fill with different values in lanes into several smaller-sized fills. In this patch, we add an additional data dependence check for this kind of case. The check is also done with the help of the SWPointer class. In this check, we require that every two memory accesses (with at least one store) of the same element type (or subword size) in the loop have the same array index expression.

### Tests

So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with the experimental VM option "PostLoopMultiversioning" turned on. We found no issues in any test.

We notice that the existing cases are not enough because they do not catch some of the issues above. We would like to add some new cases, but we found the existing vectorization tests a bit cumbersome: golden results must be pre-calculated and hard-coded in the test code for correctness verification. Thus, in this patch, we propose a new vectorization testing framework.

Our new framework brings a simpler way to add new cases. For a new test case, we only need to create a new method annotated with "@Test".
The test runner will invoke each annotated method twice automatically. The first time it runs in the interpreter, and the second time it is force-compiled by C2. Then the two return results are compared. So in this framework each test method should return a primitive value or an array of primitives. In this way, no extra verification code for vectorization correctness is required. This test runner is still jtreg-based and takes advantage of the jtreg WhiteBox API, which enables running test methods at specific compilation levels. Each test class inside is also jtreg-based. It just needs to inherit from the test runner class and run with two additional options, "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".

### Summary & Future work

In this patch, we reworked post loop vectorization. We made it platform independent and fixed several issues inside. We also implemented a new vectorization testing framework with many test cases inside. Meanwhile, we did some code cleanups.

This patch only touches C2 code guarded with PostLoopMultiversioning, except for a few data structure changes. So, there's no behavior change when the experimental VM option PostLoopMultiversioning is off. Also, to reduce risks, we still propose to keep post loop vectorization experimental for now. But if it receives positive feedback, we would like to change it to non-experimental in the future.
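The data dependence described in Issue-5 can be reproduced outside the JVM compiler with a small plain-Java simulation. This is only an illustrative sketch: `VLEN`, `scalar`, `naiveVector` and `sameResult` are hypothetical names, and `naiveVector` models a naive scalar-to-vector replacement in general, not what C2 or any AArch64 backend actually emits. Comparing a reference run against a transformed run loosely mirrors the "run twice and compare" idea of the proposed test framework.

```java
import java.util.Arrays;

final class StoreDependenceSketch {
    static final int VLEN = 4; // assumed vector length in lanes

    // Reference semantics: the scalar loop from Issue-5.
    static int[] scalar(int n, int offset, int x, int y) {
        int[] a = new int[n + offset];
        for (int i = 0; i < n; i++) {
            a[i] = x;
            a[i + offset] = y;
        }
        return a;
    }

    // Naive scalar-to-vector replacement: each statement becomes one
    // VLEN-wide store, so all x-stores of a chunk run before its y-stores.
    static int[] naiveVector(int n, int offset, int x, int y) {
        int[] a = new int[n + offset];
        for (int i = 0; i < n; i += VLEN) {
            int lanes = Math.min(VLEN, n - i); // the last chunk is partial (masked)
            for (int j = 0; j < lanes; j++) {
                a[i + j] = x;
            }
            for (int j = 0; j < lanes; j++) {
                a[i + j + offset] = y;
            }
        }
        return a;
    }

    // True when the naive vector form computes the same array as the scalar loop.
    static boolean sameResult(int n, int offset) {
        return Arrays.equals(scalar(n, offset, 1, 2), naiveVector(n, offset, 1, 2));
    }

    public static void main(String[] args) {
        for (int offset = 1; offset <= 2 * VLEN; offset++) {
            System.out.println("OFFSET=" + offset + " same=" + sameResult(10, offset));
        }
    }
}
```

With these assumptions, the two versions disagree for every OFFSET from 1 up to VLEN - 1 and agree once OFFSET reaches VLEN, which matches the statement that the stores depend on each other when OFFSET is smaller than the vector length.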
------------- Commit messages: - 8183390: Fix and re-enable post loop vectorization Changes: https://git.openjdk.java.net/jdk/pull/6828/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6828&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8183390 Stats: 4793 lines in 39 files changed: 4482 ins; 284 del; 27 mod Patch: https://git.openjdk.java.net/jdk/pull/6828.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6828/head:pull/6828 PR: https://git.openjdk.java.net/jdk/pull/6828 From fgao at openjdk.java.net Tue Dec 14 09:21:19 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 14 Dec 2021 09:21:19 GMT Subject: RFR: 8277619: AArch64: Incorrect parameter type in Advanced SIMD Copy assembler functions In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:49:23 GMT, Fei Gao wrote: > Mov (from general), incorrectly uses SIMD_Arrangement as the parameter > type of the assembler function. However, from Arm ARM [1], it's more > precise to use SIMD_RegVariant here. > > The situation is similar to Mov(to general) [2]. > > Note that as Mov(to general) is an alias of UMOV, we turn to re-use > UMOV encoding for Mov(to general) in this patch. > > [1] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--from-general---Move-general-purpose-register-to-a-vector-element--an-alias-of-INS--general-- > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--to-general---Move-vector-element-to-general-purpose-register--an-alias-of-UMOV- The PR does some code cleaning in AArch64 assembler. Can I have your review please? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6629 From aph at openjdk.java.net Tue Dec 14 10:27:13 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 14 Dec 2021 10:27:13 GMT Subject: RFR: 8277619: AArch64: Incorrect parameter type in Advanced SIMD Copy assembler functions In-Reply-To: References: Message-ID: <76e387uxQnax3plvykqrewS_MIhvJEK7BPcONrrtvII=.32f79677-e7d3-4938-8fff-3269b0792c3f@github.com> On Wed, 1 Dec 2021 09:49:23 GMT, Fei Gao wrote: > Mov (from general), incorrectly uses SIMD_Arrangement as the parameter > type of the assembler function. However, from Arm ARM [1], it's more > precise to use SIMD_RegVariant here. > > The situation is similar to Mov(to general) [2]. > > Note that as Mov(to general) is an alias of UMOV, we turn to re-use > UMOV encoding for Mov(to general) in this patch. > > [1] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--from-general---Move-general-purpose-register-to-a-vector-element--an-alias-of-INS--general-- > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--to-general---Move-vector-element-to-general-purpose-register--an-alias-of-UMOV- Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6629 From rkennke at openjdk.java.net Tue Dec 14 10:46:54 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 14 Dec 2021 10:46:54 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors [v2] In-Reply-To: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Message-ID: > Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call. 
>
> Testing:
> - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors)
> - [ ] tier1 (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors)

Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:

  Fix typos and intendation

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk18/pull/16/files
  - new: https://git.openjdk.java.net/jdk18/pull/16/files/34324134..c77c877d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=16&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=16&range=00-01

  Stats: 5 lines in 2 files changed: 0 ins; 0 del; 5 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/16.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/16/head:pull/16

PR: https://git.openjdk.java.net/jdk18/pull/16

From jiefu at openjdk.java.net Tue Dec 14 11:27:46 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Tue, 14 Dec 2021 11:27:46 GMT
Subject: [jdk18] RFR: 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134
Message-ID:

Hi all,

runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs with the error "'DeoptimizeALot' is develop" after JDK-8262134.

This is because `-XX:+DeoptimizeALot` and `-XX:+VerifyStack` are develop flags and are available only in debug builds.

The fix runs the test config added by JDK-8262134 only with debug VMs.

Thanks.
Best regards, Jie ------------- Commit messages: - 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134 Changes: https://git.openjdk.java.net/jdk18/pull/21/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=21&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278758 Stats: 11 lines in 1 file changed: 9 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk18/pull/21.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/21/head:pull/21 PR: https://git.openjdk.java.net/jdk18/pull/21 From tobias.hartmann at oracle.com Tue Dec 14 13:30:54 2021 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 14 Dec 2021 14:30:54 +0100 Subject: JDK-8231460: Java update from 11.0.11 to 11.0.13 changes JVM code cache behavior and results in more process cpu usage and unexpected profiled nmethods memory usage In-Reply-To: References: Message-ID: <1d111a35-b341-d69f-bbb9-f4084f82dab3@oracle.com> Hi Josef, Thanks for reporting this issue! Any chance you could run your application with different builds to narrow the problem down to a single change? @Lutz: As the author of 8223444 and 8231460, any ideas? Best regards, Tobias On 01.12.21 16:35, Josef Lehner wrote: > Dear OpenJDK team, > > as described in this StackOverflow question, I want to reach out to you and > question whether the JVM code cache / codeheap still works as designed. > https://stackoverflow.com/questions/70086548/java-update-from-11-0-11-to-11-0-13-changes-jvm-code-cache-behavior-and-results > > What we experience in our huge application is that with Java 11.0.13 and > -XX:ReservedCodeCacheSize=375m the code cache / codeheap 'profiled > nmethods' (C1 optimized, ~ 180 MB) drops after a very short time to a very > low level (less than 50 MB) and stays at this level forever while > 'non-profiled nmethods' (C2 optimized) is already at its limits. 
> After we tripled the -XX:ReservedCodeCacheSize to 1024 MB, both areas > 'profiled nmethods' and 'non-profiled nmethods' have stayed at a much > higher constant level (~ 258 MB) for over a week now instead of dropping to > less than 50 MB after 15 min (-XX:ReservedCodeCacheSize=375m) or dropping > after 3 hours (-XX:ReservedCodeCacheSize=512m). > From my point of view as a non-expert I would expect that the C1 optimized > code does not get removed (or at least not so much) from 'profiled > nmethods' as there is no space left in 'non-profiled nmethods' to optimize > it further. What do you think? > > Important changes in 11.0.12: > https://bugs.openjdk.java.net/browse/JDK-8223444 Improve CodeHeap Free > Space Management > https://bugs.openjdk.java.net/browse/JDK-8231460 Performance issue > (CodeHeap) with large free blocks > > Best regards > Josef Lehner > From roland at openjdk.java.net Tue Dec 14 14:20:08 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 14 Dec 2021 14:20:08 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 03:34:36 GMT, Dean Long wrote: > I'm still staring at the code trying to figure out why parse-time inlining can avoid with this problem, but late inlining can't. With parse time inlining, when the call is processed, 2 exception states are added. One for the null check before the call and one for exceptions thrown after the call. They have different sp so they can't be combined. Either the exception is caught in the current method in which case, the 2 exception states are processed independently in Parse::catch_inline_exceptions() at which point the expression stacks are popped for each one of them. Or the exception is passed on the caller in which case, the expression stacks are also popped in Parse::throw_to_exit(). 
With late inlining, the call to Parse::catch_inline_exceptions() or Parse::throw_to_exit() happened before the call site is processed and that one assumed a single exception state. So when late inlining occurs, exception states must be combined while with parse time inlining they don't need to be combined. FWIW, my understanding of what happens comes from running the tests of this PR with and without late inlining and stepping through the code. I'm unclear where we go from here. My opinion is that, by going over the history of the code, you figured out why expression handling preserves the stack. And AFAICT, you also found that the code that needed the stack preserved was removed. So I would go with a fix that pops the stack when an exception is thrown, have it go through as much testing as possible and integrate that fix if testing is clean. Whether we want to have the exception on stack in exception states or as an extra edge (as it is today) is something that could be revisited later. ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From phedlin at openjdk.java.net Tue Dec 14 14:56:32 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Tue, 14 Dec 2021 14:56:32 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 Message-ID: Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case. - Interleaved ISO and ASCII check code. - Avoid 'umaxv' in the ISO main flow. - Using post inc in main loop. - Retain 8-char loop. - Removing conditional prefetch (no upside). - Adding ISO-8859-1 to encode-decode benchmark. Testing: tier1-3 The revised version compares like this (master vs. update). 
Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
CharsetEncodeDecode.encode   16384        UTF-8  avgt   30   17.920 ± 0.229  us/op
CharsetEncodeDecode.encode   16384         BIG5  avgt   30   18.867 ± 0.356  us/op
CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30   17.419 ± 0.220  us/op
CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.200 ± 0.134  us/op
CharsetEncodeDecode.encode   16384        ASCII  avgt   30   17.149 ± 0.219  us/op
CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.115 ± 1.440  us/op

Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
CharsetEncodeDecode.encode   16384        UTF-8  avgt   30    9.018 ± 0.179  us/op
CharsetEncodeDecode.encode   16384         BIG5  avgt   30   10.550 ± 0.470  us/op
CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30    8.843 ± 0.187  us/op
CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.406 ± 0.155  us/op
CharsetEncodeDecode.encode   16384        ASCII  avgt   30    8.822 ± 0.173  us/op
CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.195 ± 1.432  us/op

-------------

Commit messages:
 - 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64

Changes: https://git.openjdk.java.net/jdk18/pull/20/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=20&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8274243
  Stats: 256 lines in 6 files changed: 126 ins; 90 del; 40 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/20.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/20/head:pull/20

PR: https://git.openjdk.java.net/jdk18/pull/20

From phedlin at openjdk.java.net Tue Dec 14 14:56:32 2021
From: phedlin at openjdk.java.net (Patric Hedlin)
Date: Tue, 14 Dec 2021 14:56:32 GMT
Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 10:45:28 GMT, Patric Hedlin wrote:

> Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support.
>
> Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case.
>
> - Interleaved ISO and ASCII check code.
> - Avoid 'umaxv' in the ISO main flow.
> - Using post inc in main loop.
> - Retain 8-char loop.
> - Removing conditional prefetch (no upside).
> - Adding ISO-8859-1 to encode-decode benchmark.
>
> Testing: tier1-3
>
> The revised version compares like this (master vs. update).
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30   17.920 ± 0.229  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   18.867 ± 0.356  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30   17.419 ± 0.220  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.200 ± 0.134  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30   17.149 ± 0.219  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.115 ± 1.440  us/op
>
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30    9.018 ± 0.179  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   10.550 ± 0.470  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30    8.843 ± 0.187  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.406 ± 0.155  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30    8.822 ± 0.173  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.195 ± 1.432  us/op

Benchmarks, master vs. update (run on Aurora/Ampere Altra):

openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII ........77.55%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5 .........76.71%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_1 ...-2.31%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15 ..75.58%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16 .......
1.04% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8 ........76.90% Note that ISO-8859-1 compares with the old intrinsic implementation (essentially the same) and that UTF-16 does not utilise the intrinsic. Runs that show the more pessimistic speed-up, when processing 2^n - 1 chars. openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:ASCII .........72.97% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:BIG5 ..........64.46% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:ISO_8859_1 ....-1.67% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:ISO_8859_15 ...70.85% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:UTF_16 ........-4.60% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2047-type:UTF_8 .........70.44% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:ASCII ..........60.35% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:BIG5 ...........52.61% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:ISO_8859_1 ..... 
1.75% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:ISO_8859_15 ....61.45% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:UTF_16 .........-1.01% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:511-type:UTF_8 ..........59.46% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII ..........54.26% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5 ...........42.82% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_1 .....-0.54% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15 ....64.86% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16 .........-0.09% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8 ..........60.44% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII ..........51.51% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5 ...........46.54% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_1 .....-0.32% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15 ....56.48% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16 ......... 0.44% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8 ..........54.84% Runs to illustrate the threshold effect between the loops in the implementation. openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:ASCII ...........32.30% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:BIG5 ............31.93% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:ISO_8859_1 ......-0.02% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:ISO_8859_15 .....37.92% openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:UTF_16 .......... 
4.45%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:32-type:UTF_8 ...........40.35%

openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII ...........20.06%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5 ............21.64%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_1 ......-1.13%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15 .....27.04%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16 .......... 1.20%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8 ...........24.72%

openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII ...........19.37%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5 ............20.20%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_1 ......-1.01%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15 .....29.16%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16 .......... 0.34%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8 ...........25.35%

openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII ...........13.03%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5 ............13.74%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_1 ......-0.13%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15 .....19.26%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16 .......... 0.78%
openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8 ...........17.70%

Using the microbenchmarks provided by @carterkozak here: https://github.com/carterkozak/stringbuilder-encoding-performance, comparing master vs. update as follows:

Benchmark (charsetName) (message) (timesToAppend) Mode Cnt Score Error Units
EncoderBenchmarks.charsetEncoder UTF-8 This is a simple ASCII message 3 avgt 4 151.025 ± 28.111 ns/op
EncoderBenchmarks.charsetEncoder UTF-8 This is a message with unicode ?? 3 avgt 4 323.254 ± 5.648 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a simple ASCII message 3 avgt 4 244.375 ± 98.844 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a message with unicode ?? 3 avgt 4 405.415 ± 5.947 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a simple ASCII message 3 avgt 4 728.172 ± 22.419 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a message with unicode ?? 3 avgt 4 859.015 ± 90.541 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a simple ASCII message 3 avgt 4 117.044 ± 11.484 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a message with unicode ?? 3 avgt 4 483.399 ± 38.614 ns/op

Benchmark (charsetName) (message) (timesToAppend) Mode Cnt Score Error Units
EncoderBenchmarks.charsetEncoder UTF-8 This is a simple ASCII message 3 avgt 4 113.954 ± 7.657 ns/op
EncoderBenchmarks.charsetEncoder UTF-8 This is a message with unicode ?? 3 avgt 4 353.266 ± 10.124 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a simple ASCII message 3 avgt 4 196.643 ± 52.954 ns/op
EncoderBenchmarks.charsetEncoderWithAllocation UTF-8 This is a message with unicode ?? 3 avgt 4 429.157 ± 11.506 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a simple ASCII message 3 avgt 4 728.138 ± 34.898 ns/op
EncoderBenchmarks.charsetEncoderWithAllocationWrappingBuilder UTF-8 This is a message with unicode ?? 3 avgt 4 859.697 ± 61.397 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a simple ASCII message 3 avgt 4 117.269 ± 6.623 ns/op
EncoderBenchmarks.toStringGetBytes UTF-8 This is a message with unicode ?? 3 avgt 4 491.559 ± 68.169 ns/op

Note: The above was run on a local dev-machine, which typically produces less than _perfectly_ consistent results.
------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From phedlin at openjdk.java.net Tue Dec 14 14:57:12 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Tue, 14 Dec 2021 14:57:12 GMT Subject: RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 14:09:07 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > Implementation focusing on balance between small footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check. > > Testing: tier1-6 > > Benchmarks, 18-b26 vs. update (ran on Aurora/Ampere Altra): > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ASCII..........72.23% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:BIG5...........70.38% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:ISO_8859_15....67.81% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_16......... 3.72% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16384-type:UTF_8..........68.50% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ASCII...........65.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:BIG5............60.59% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:ISO_8859_15.....63.79% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_16.......... 
1.04% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:2048-type:UTF_8...........63.33% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ASCII............57.25% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:BIG5.............49.33% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:ISO_8859_15......61.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_16........... 0.02% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:512-type:UTF_8............54.75% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ASCII............54.52% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:BIG5.............40.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:ISO_8859_15......58.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_16...........-0.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:255-type:UTF_8............55.98% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ASCII............47.37% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:BIG5.............36.41% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:ISO_8859_15......50.83% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_16........... 8.63% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:127-type:UTF_8............48.95% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ASCII.............17.55% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:BIG5..............18.58% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:ISO_8859_15.......20.82% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_16............ 
4.16% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:31-type:UTF_8.............18.44% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ASCII.............21.96% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:BIG5..............22.42% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:ISO_8859_15.......30.27% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_16............-1.17% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:16-type:UTF_8.............35.99% > > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ASCII............. 6.19% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:BIG5.............. 7.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:ISO_8859_15....... 8.34% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_16............-0.46% > openjdk.bench.java.nio.CharsetEncodeDecode.encode-size:15-type:UTF_8............. 6.80% New PR on JDK-18 repo. https://github.com/openjdk/jdk18/pull/20 ------------- PR: https://git.openjdk.java.net/jdk/pull/6723 From duke at openjdk.java.net Tue Dec 14 15:48:07 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Tue, 14 Dec 2021 15:48:07 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v3] In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 18:24:42 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove unnecessary optimizations. > Thank you for your review. 
@DamonFool Would you mind making the change to JBS for me? I don't have an account there, so I am not able to edit. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From dcubed at openjdk.java.net Tue Dec 14 15:54:13 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 14 Dec 2021 15:54:13 GMT Subject: [jdk18] RFR: 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 11:18:54 GMT, Jie Fu wrote: > Hi all, > > runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs due to 'DeoptimizeALot' is develop after JDK-8262134. > This is because `-XX:+DeoptimizeALot` and `-XX:+VerifyStack` are develop and are available only in debug version. > > The fix only run the test config added by JDK-8262134 with debug VMs. > > Thanks. > Best regards, > Jie Thumbs up. This is a trivial fix. ------------- Marked as reviewed by dcubed (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/21 From kvn at openjdk.java.net Tue Dec 14 17:47:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Dec 2021 17:47:16 GMT Subject: RFR: 8276455: C2: iterative EA In-Reply-To: References: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com> Message-ID: On Tue, 14 Dec 2021 08:45:11 GMT, Roland Westrelin wrote: > That looks good to me. > Have you measured how compile time is affected? Thank you, Roland, for review. Originally I was also concerned about that. I even added bailout if following iterations take a time. But after collecting times data I found that majority of time is spent in first iteration (as before). Following iterations takes only fraction of time because smaller EA's Connection Graph after allocations were removed during first iteration. 
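For readers following the iterative EA thread, the nested-initialization shape named in the PR description for 8276455 (`new A(new B( new C)))`) can be sketched in plain Java. This is only an illustration under stated assumptions: the class names `A`, `B`, `C`, the wrapper class `NestedAllocationSketch` and the method `sum` are made up here and do not come from the patch or its microbenchmark, and whether C2 actually removes any of these allocations is a JIT decision that the program's result does not reveal.

```java
public class NestedAllocationSketch {
    // Three tiny immutable holders, nested one inside the other (hypothetical names).
    static final class C { final int v; C(int v) { this.v = v; } }
    static final class B { final C c; B(C c) { this.c = c; } }
    static final class A { final B b; A(B b) { this.b = b; } }

    // None of the three objects leaves this method, so each one is at least
    // a candidate for allocation removal; the outcome is up to the VM.
    static int sum(int x) {
        A a = new A(new B(new C(x)));
        return a.b.c.v + 1;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call the method often enough that a JIT compiler may pick it up.
        for (int i = 0; i < 100_000; i++) {
            total += sum(i);
        }
        System.out.println(total);
    }
}
```

The result is the same with or without escape analysis, so a sketch like this is only useful as input to a benchmark harness or to compiler logging, not as a correctness test.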
------------- PR: https://git.openjdk.java.net/jdk/pull/6222 From chagedorn at openjdk.java.net Tue Dec 14 18:00:04 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 14 Dec 2021 18:00:04 GMT Subject: [jdk18] RFR: 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 15:43:52 GMT, Christian Hagedorn wrote: > The test case fails with the assertion when an actual unreachable store node with only uses outside of the loop is tried to be sunk out of a dead loop in split-if. This is quite an edge case in which C2 is not able to remove the inner loop but the store for `iFldArr2` inside this loop dies due to improved type information after peeling. This removes some memory phis as well and leaves the store `iFldArr1` with only outside the loop uses. A more detailed explanation how we end up in this situation is shown in the comments of the test case. > > This suggests that the assertion is too strong. I propose to relax the assertion and bail out if we are trying to sink a store node. However, I don't think that we will reach this code with `LoadStore` nodes as they have other memory outputs inside a loop, preventing to reach this assertion code. > > Thanks, > Christian Yes, exactly. You're right, this fixes the wrong thing. But using skeleton predicates instead sounds like a good idea! After having looked at it again, I also think that we can indeed use skeleton predicates to handle this corner case. I'm reworking the fix. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/11 From sviswanathan at openjdk.java.net Tue Dec 14 18:23:31 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 14 Dec 2021 18:23:31 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/19 From sviswanathan at openjdk.java.net Tue Dec 14 18:27:34 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 14 Dec 2021 18:27:34 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. @vnkozlov Could you please also review this patch? ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From jbhateja at openjdk.java.net Tue Dec 14 19:25:58 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 14 Dec 2021 19:25:58 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. Message-ID: - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. 
- In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. Changes: https://git.openjdk.java.net/jdk18/pull/24/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278508 Stats: 175 lines in 7 files changed: 128 ins; 34 del; 13 mod Patch: https://git.openjdk.java.net/jdk18/pull/24.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24 PR: https://git.openjdk.java.net/jdk18/pull/24 From kvn at openjdk.java.net Tue Dec 14 19:28:40 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Dec 2021 19:28:40 GMT Subject: Integrated: 8276455: C2: iterative EA In-Reply-To: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com> References: <6JWGnSJyY6M2F0BG1_WCKTAU08sJQ7fFePu8KS7lIjE=.67c1c3e9-b4f8-4ffd-823e-d27f55c434a3@github.com> Message-ID: On Wed, 3 Nov 2021 01:44:49 GMT, Vladimir Kozlov wrote: > Resolve C2 issue with nested initialization when Escape Analysis can not scalarize allocations: `new A(new B( new C)))` > Implemented iterative EA when C2 invokes it again if there are progress and candidates. > > I added JMH microbenchmark with cases which show improvements. Improvements are due to removed allocation code. > > Before: > > Benchmark Mode Cnt Score Error Units > IterativeEA.test1 avgt 5 11.489 ? 3.037 ns/op > IterativeEA.test2 avgt 5 16.103 ? 3.686 ns/op > IterativeEA.test3 avgt 5 1988.827 ? 217.831 ns/op > > With these changes: > > IterativeEA.test1 avgt 5 2.182 ? 0.022 ns/op > IterativeEA.test2 avgt 5 2.375 ? 0.001 ns/op > IterativeEA.test3 avgt 5 821.011 ? 8.268 ns/op > > > An other JMH test: > > PointerBenchmarkFlat.test avgt 30 23.232 ? 
6.507 ms/op > > vs > > PointerBenchmarkFlat.test avgt 30 0.299 ? 0.001 ms/op This pull request has now been integrated. Changeset: a1dfe572 Author: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/a1dfe57249db15c0c05d33a0014ac914a7093089 Stats: 571 lines in 10 files changed: 541 ins; 14 del; 16 mod 8276455: C2: iterative EA Reviewed-by: iveresov, neliasso, roland ------------- PR: https://git.openjdk.java.net/jdk/pull/6222 From kvn at openjdk.java.net Tue Dec 14 19:35:38 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Dec 2021 19:35:38 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. In-Reply-To: References: Message-ID: <9PhB-2gHEvsB1GPc4yS0w2aD7E-5HWbWPwgPpKR_Q38=.bc90e8d7-86ab-4ff2-b58f-fbb0cc0112e5@github.com> On Tue, 14 Dec 2021 19:18:47 GMT, Jatin Bhateja wrote: > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin Seems reasonable. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From kvn at openjdk.java.net Tue Dec 14 20:08:41 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Dec 2021 20:08:41 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. 
Yes, we need to reexecute because code could be deoptimized during `new_array()` allocation. But why we allocate this temp array in Java heap? Why not on stack in stub code? Also I noticed next return from intrinsics code could be moved up before we generate new nodes in graph: `if (Matcher::htbl_entries == -1) return false;` ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/19 From kvn at openjdk.java.net Tue Dec 14 21:33:08 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 14 Dec 2021 21:33:08 GMT Subject: RFR: 8277893: Arraycopy stress tests [v4] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 07:12:47 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. 
>> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: > > - Bump timeout to 7200 > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Package declarations > - Add safety check for small systems > - Renames > - Single driver for all the tests > - Safer timeout settings > - Post-merge TEST.groups cleanup > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - ... and 3 more: https://git.openjdk.java.net/jdk/compare/382293c9...b749c367 Good. Something happened with notifications. I also did not get your response. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6594 From dcubed at openjdk.java.net Tue Dec 14 22:02:59 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 14 Dec 2021 22:02:59 GMT Subject: RFR: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" [v5] In-Reply-To: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> References: <-LFrdsGVbk1xLZt5WdxzHgHxVyGQbFf5sxenbdX_NsE=.dc35826f-cf11-4ef0-8da5-92dc8ef499bb@github.com> Message-ID: On Mon, 13 Dec 2021 23:36:49 GMT, Jie Fu wrote: >> Hi all, >> >> I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. >> Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. >> >> The patch also removes an useless statement [4]. >> >> Testing: >> - vector api tests on Linux/x64-{AVX512, AVX2} >> >> Thanks. >> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Add comment for -XX:DisableIntrinsic=_VectorMaskOp I'm hoping to see this fix integrated soon. We're seeing 3-4 Tier7 failures per job set... 
------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From dlong at openjdk.java.net Tue Dec 14 22:08:04 2021 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 14 Dec 2021 22:08:04 GMT Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 08:25:50 GMT, Roland Westrelin wrote: >> Root cause is identical to 8273165 AFAIU: late inline of a virtual call >> can throw from 2 different paths (null check and the call >> itself). That breaks because the logic for exceptions expects the >> stack for all paths that throw exceptions to have the same stack size. >> >> AFAIU, the stack doesn't matter for exception handling: either the >> exception is caught by an exception handler and then the stack is >> popped and the exception is pushed or, the exception is rethrown to >> the caller in which case the current stack is also popped (that is the >> jvm state for the current method). As a consequence the fix I propose >> is to ignore the stack in GraphKit::combine_exception_states(). >> >> AFAIU, the same fix would work for 8273165 but I left the current workaround >> as is: not sure if we want to be conservative for now or not > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - comment > - Merge branch 'master' into JDK-8275638 > - alternate fix > - make test runnable with release build > - more > - fix Marked as reviewed by dlong (Reviewer). Yes, I agree that's the current difference between parse-time inlining and late inlining as implemented today. I'd just like to know why late inlining must combine the exception states into a single state. But that's not a blocker for your fix. John wrote me and said your fix looks OK (though not a general solution). 
For your test, he suggested adding a 3rd case with extra call depth: "test3 -> mcaller -> m". Also, I believe @vnkozlov will take a look too. It would be nice if some of the assumptions the code is making could be checked, which could be done in a separate RFE with perhaps a more general solution: - an exception state with cleared stack won't be used to reexecute a bytecode (uncommon trap) - an exception state with uncleared stack must be used to reexecute - safepoint stack size matches computed interpreter oopmap stack size (VerifyStack logic) ------------- PR: https://git.openjdk.java.net/jdk/pull/6572 From jiefu at openjdk.java.net Tue Dec 14 22:52:03 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 22:52:03 GMT Subject: Integrated: 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 05:51:31 GMT, Jie Fu wrote: > Hi all, > > I'd like to fix the vector_length_encoding error in `long_to_maskLE8_avx` and `long_to_maskGT8_avx`. > Since the input parameter of `vector_length_encoding` [1] is the number of vector bytes (not number of vector bits), I believe we shouldn't `mask_len*8` [2][3]. > > The patch also removes an useless statement [4]. > > Testing: > - vector api tests on Linux/x64-{AVX512, AVX2} > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1219 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9552 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9568 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L9580 This pull request has now been integrated. 
Changeset: 2def7e91 Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/2def7e913207af788e582ed5bde21b28883183de Stats: 15 lines in 2 files changed: 12 ins; 1 del; 2 mod 8278584: compiler/vectorapi/VectorMaskLoadStoreTest.java failed with "Error: ShouldNotReachHere()" Reviewed-by: kvn, psandoz ------------- PR: https://git.openjdk.java.net/jdk/pull/6808 From jiefu at openjdk.java.net Tue Dec 14 22:54:10 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 22:54:10 GMT Subject: [jdk18] RFR: 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 15:50:52 GMT, Daniel D. Daugherty wrote: > Thumbs up. This is a trivial fix. Thanks @dcubed-ojdk . ------------- PR: https://git.openjdk.java.net/jdk18/pull/21 From jiefu at openjdk.java.net Tue Dec 14 22:54:11 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 22:54:11 GMT Subject: [jdk18] Integrated: 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134 In-Reply-To: References: Message-ID: <-0C6lb1MgaeQ6Ryd2IdxfkEkvU-Rw_AbgiszTVQYAMo=.e2e312d2-8fe6-41a3-b263-8ec6f8d0f23d@github.com> On Tue, 14 Dec 2021 11:18:54 GMT, Jie Fu wrote: > Hi all, > > runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs due to 'DeoptimizeALot' is develop after JDK-8262134. > This is because `-XX:+DeoptimizeALot` and `-XX:+VerifyStack` are develop and are available only in debug version. > > The fix only run the test config added by JDK-8262134 with debug VMs. > > Thanks. > Best regards, > Jie This pull request has now been integrated. 
Changeset: f48a3e86 Author: Jie Fu URL: https://git.openjdk.java.net/jdk18/commit/f48a3e86d0274912160f3c415f92741eefa1cb1d Stats: 11 lines in 1 file changed: 9 ins; 1 del; 1 mod 8278758: runtime/BootstrapMethod/BSMCalledTwice.java fails with release VMs after JDK-8262134 Reviewed-by: dcubed ------------- PR: https://git.openjdk.java.net/jdk18/pull/21 From jiefu at openjdk.java.net Tue Dec 14 23:00:08 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 23:00:08 GMT Subject: RFR: 8278471: Reorder optimizations in addnode Ideal [v3] In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 15:45:13 GMT, Zhiqiang Zang wrote: > > > > Thank you for your review. @DamonFool Would you mind making the change to JBS for me? I don't have an account there, so I am not able to edit. Thanks. The JBS title had been updated. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From jiefu at openjdk.java.net Tue Dec 14 23:17:59 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 14 Dec 2021 23:17:59 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: On Mon, 13 Dec 2021 18:24:42 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > remove unnecessary optimizations. > /integrate Hi @CptGit , you need at least two reviews for hotspot changes. So one more review is required. 
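The subsumption argument quoted above rests on two integer identities: `(a - b) + (b - c)` equals `a - c`, and `(a - b) + (c - a)` equals `c - b`. Both also hold under Java's wrapping `int` arithmetic. A minimal sketch that spot-checks them follows; the class and method names are hypothetical, and it says nothing about how `AddNode` itself is written.

```java
import java.util.Random;

public class AddSubIdentities {
    // (a - b) + (b - c) simplifies to a - c, even when intermediate values overflow.
    static boolean firstHolds(int a, int b, int c) {
        return (a - b) + (b - c) == a - c;
    }

    // (a - b) + (c - a) simplifies to c - b, even when intermediate values overflow.
    static boolean secondHolds(int a, int b, int c) {
        return (a - b) + (c - a) == c - b;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed so the run is repeatable
        for (int i = 0; i < 1_000_000; i++) {
            int a = rnd.nextInt();
            int b = rnd.nextInt();
            int c = rnd.nextInt();
            if (!firstHolds(a, b, c) || !secondHolds(a, b, c)) {
                throw new AssertionError("identity failed for " + a + ", " + b + ", " + c);
            }
        }
        System.out.println("identities hold on all sampled inputs");
    }
}
```

A check like this only confirms the arithmetic; confirming that the compiler applies the rewrite needs a test that inspects the generated IR.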
------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From duke at openjdk.java.net Tue Dec 14 23:22:57 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Tue, 14 Dec 2021 23:22:57 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: <92JLQnAALPZQkenmyMl9cGoFX0HOIEIa_oN82GVIvP8=.71e0afd4-29db-4eb9-9a73-7da109f49f36@github.com> On Tue, 14 Dec 2021 23:15:12 GMT, Jie Fu wrote: > > /integrate > > Hi @CptGit , you need at least two reviews for hotspot changes. So one more review is required. I thought it could be integrated since the bot told me so. Thanks for letting me know the requirement. ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From kvn at openjdk.java.net Wed Dec 15 01:41:59 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 01:41:59 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 19:18:47 GMT, Jatin Bhateja wrote: > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin Regular testing (x64) results are good. But I also ran the `jdk/incubator/vector/` tests locally with a 32-bit VM (fastdebug). Most tests passed but some hit the timeout (ran for more than the default 2 min): jdk/incubator/vector/Byte512VectorTests.java jdk/incubator/vector/ByteMaxVectorTests.java jdk/incubator/vector/VectorReshapeTests.java I assume such vectors are not supported in the 32-bit VM and the code is slow. 
Simple solution is to add `/timeout=240` (4 min) to these tests. They passed for me after that. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/24 From gli at openjdk.java.net Wed Dec 15 01:54:58 2021 From: gli at openjdk.java.net (Guoxiong Li) Date: Wed, 15 Dec 2021 01:54:58 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 23:15:12 GMT, Jie Fu wrote: >> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unnecessary optimizations. > >> /integrate > > Hi @CptGit , you need at least two reviews for hotspot changes. > So one more review is required. @DamonFool Now we have the command **reviewers** to change the number of the required reviewers. Hope more developers know this. ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From jiefu at openjdk.java.net Wed Dec 15 02:04:56 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 15 Dec 2021 02:04:56 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 23:15:12 GMT, Jie Fu wrote: >> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: >> >> remove unnecessary optimizations. > >> /integrate > > Hi @CptGit , you need at least two reviews for hotspot changes. > So one more review is required. > @DamonFool Now we have the command **reviewers** to change the number of the required reviewers. Hope more developers know this. Nice! Got it and thanks @lgxbslgx . 
------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From pli at openjdk.java.net Wed Dec 15 02:26:00 2021 From: pli at openjdk.java.net (Pengfei Li) Date: Wed, 15 Dec 2021 02:26:00 GMT Subject: RFR: 8277619: AArch64: Incorrect parameter type in Advanced SIMD Copy assembler functions In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 09:49:23 GMT, Fei Gao wrote: > Mov (from general), incorrectly uses SIMD_Arrangement as the parameter > type of the assembler function. However, from Arm ARM [1], it's more > precise to use SIMD_RegVariant here. > > The situation is similar to Mov(to general) [2]. > > Note that as Mov(to general) is an alias of UMOV, we turn to re-use > UMOV encoding for Mov(to general) in this patch. > > [1] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--from-general---Move-general-purpose-register-to-a-vector-element--an-alias-of-INS--general-- > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--to-general---Move-vector-element-to-general-purpose-register--an-alias-of-UMOV- Marked as reviewed by pli (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6629 From fgao at openjdk.java.net Wed Dec 15 02:30:03 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 15 Dec 2021 02:30:03 GMT Subject: Integrated: 8277619: AArch64: Incorrect parameter type in Advanced SIMD Copy assembler functions In-Reply-To: References: Message-ID: <1pddjU3ikeReukL53dsA-WpA36scAt-kxVEk4vYLCzQ=.c9605839-9c1c-440a-9bd5-dabf8eb8a1ae@github.com> On Wed, 1 Dec 2021 09:49:23 GMT, Fei Gao wrote: > Mov (from general), incorrectly uses SIMD_Arrangement as the parameter > type of the assembler function. However, from Arm ARM [1], it's more > precise to use SIMD_RegVariant here. > > The situation is similar to Mov(to general) [2]. > > Note that as Mov(to general) is an alias of UMOV, we turn to re-use > UMOV encoding for Mov(to general) in this patch. 
> > [1] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--from-general---Move-general-purpose-register-to-a-vector-element--an-alias-of-INS--general-- > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SIMD-FP-Instructions/MOV--to-general---Move-vector-element-to-general-purpose-register--an-alias-of-UMOV- This pull request has now been integrated. Changeset: c442587f Author: Fei Gao Committer: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/c442587f1e72a614302cd76c20e13f1cb1703641 Stats: 52 lines in 7 files changed: 1 ins; 3 del; 48 mod 8277619: AArch64: Incorrect parameter type in Advanced SIMD Copy assembler functions Reviewed-by: aph, pli ------------- PR: https://git.openjdk.java.net/jdk/pull/6629 From gli at openjdk.java.net Wed Dec 15 02:47:59 2021 From: gli at openjdk.java.net (Guoxiong Li) Date: Wed, 15 Dec 2021 02:47:59 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 02:01:46 GMT, Jie Fu wrote: >>> /integrate >> >> Hi @CptGit , you need at least two reviews for hotspot changes. >> So one more review is required. > >> @DamonFool Now we have the command **reviewers** to change the number of the required reviewers. Hope more developers know this. > > Nice! > Got it and thanks @lgxbslgx . > @DamonFool Usage: /reviewers [] where is the number of required reviewers. If role is set, the reviewers need to have that project role. If omitted, role defaults to authors. > Number of required reviewers of role reviewers cannot be decreased below 3 @DamonFool Unfortunately, it seems that you meet a bug. I filed [SKARA-1288](https://bugs.openjdk.java.net/browse/SKARA-1288) to follow up. 
-------------

PR: https://git.openjdk.java.net/jdk/pull/6752

From gli at openjdk.java.net Wed Dec 15 02:58:59 2021
From: gli at openjdk.java.net (Guoxiong Li)
Date: Wed, 15 Dec 2021 02:58:59 GMT
Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute'
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 06:24:21 GMT, Xin Liu wrote:

>> Hi all,
>>
>> Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added.
>>
>> The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug)
>> And the pre-submit tests passed before submitting the PR.
>>
>> Thanks for taking the time to review.
>>
>> Best Regards,
>> -- Guoxiong
>
> hi, @lgxbslgx
> your patch looks good to me.

@navyxliu Thanks for your review. Still waiting for other reviewers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6807

From kvn at openjdk.java.net Wed Dec 15 04:04:57 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 15 Dec 2021 04:04:57 GMT
Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 13 Dec 2021 18:24:42 GMT, Zhiqiang Zang wrote:

>> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered.
>>
>> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work.
>
> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>
> remove unnecessary optimizations.
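
[Editorial sketch] The subsumption described in the quoted summary can be checked at the Java level. The class below is a hypothetical illustration, not code from the patch and not an IR framework test: the class and method names are invented, and the regrouping `(a + c) - (b + d)` used for the general pattern is an assumption made for the example. It only confirms that the two special cases and the general pattern agree for `int` arithmetic, including wrap-around on overflow.

```java
// Hypothetical illustration only; names are invented and this is not HotSpot code.
public class AddSubIdentities {
    // Special case 1: (a - b) + (b - c) simplifies to a - c.
    public static int viaB(int a, int b, int c) {
        return (a - b) + (b - c);
    }

    // Special case 2: (a - b) + (c - a) simplifies to c - b.
    public static int viaA(int a, int b, int c) {
        return (a - b) + (c - a);
    }

    // General pattern (a - b) + (c - d), regrouped as (a + c) - (b + d).
    // The regrouping is an assumption made for this sketch.
    public static int general(int a, int b, int c, int d) {
        return (a + c) - (b + d);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            for (int b : samples) {
                for (int c : samples) {
                    // Both special cases hold modulo 2^32, so overflow is harmless.
                    if (viaB(a, b, c) != a - c) throw new AssertionError("viaB");
                    if (viaA(a, b, c) != c - b) throw new AssertionError("viaA");
                    // The general pattern also matches the special inputs, but it
                    // keeps three operations where a single subtraction suffices.
                    if (general(a, b, b, c) != a - c) throw new AssertionError("general/viaB");
                    if (general(a, b, c, a) != c - b) throw new AssertionError("general/viaA");
                }
            }
        }
        System.out.println("all identities hold");
    }
}
```

Because the general pattern matches every input the special cases match, a rule list that tries it first would never reach the cheaper single-subtraction forms, which is the ordering problem the quoted summary describes.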
Before you integrate this, please add an IR framework test that checks the conversions really happen, especially the ones you removed.

-------------

Changes requested by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6752

From thartmann at openjdk.java.net Wed Dec 15 07:06:59 2021
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 15 Dec 2021 07:06:59 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 08:48:25 GMT, Pengfei Li wrote:

> ### Background
>
> Post loop vectorization is a C2 compiler optimization in an experimental
> VM feature called PostLoopMultiversioning. It transforms the range-check
> eliminated post loop to a 1-iteration vectorized loop with vector mask.
> This optimization was contributed by Intel in 2016 to support x86 AVX512
> masked vector instructions. However, it was disabled soon after an issue
> was found. Due to insufficient maintenance in these years, multiple bugs
> have been accumulated inside. But we (Arm) still think this is a useful
> framework for vector mask support in C2 auto-vectorized loops, for both
> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable
> post loop vectorization.
>
> ### Changes in this patch
>
> This patch reworks post loop vectorization. The most significant change
> is removing vector mask support in C2 x86 backend and re-implementing
> it in the mid-end. With this, we can re-enable post loop vectorization
> for platforms other than x86.
>
> Previous implementation hard-codes x86 k1 register as a reserved AVX512
> opmask register and defines two routines (setvectmask/restorevectmask)
> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes
> AVX512 instructions as unmasked by default, generated vector masks are
> no longer used in AVX512 vector instructions.
> To fix incorrect codegen
> and add vector mask support for more platforms, we turn to add a vector
> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode
> to generate a mask and replace all Load/Store nodes in the post loop
> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This
> IR form is exactly the same to those which are used in VectorAPI mask
> support. For now, we only add mask inputs for Load/Store nodes because
> we don't have reduction operations supported in post loop vectorization.
> After this change, the x86 k1 register is no longer reserved and can be
> allocated when PostLoopMultiversioning is enabled.
>
> Besides this change, we have fixed a compiler crash and five incorrect
> result issues with post loop vectorization.
>
> **I) C2 crashes with segmentation fault in strip-mined loops**
>
> Previous implementation was done before C2 loop strip-mining was merged
> into JDK master so it didn't take strip-mined loops into consideration.
> In C2's strip mined loops, post loop is not the sibling of the main loop
> in ideal loop tree. Instead, it's the sibling of the main loop's parent.
> This patch fixed a SIGSEGV issue caused by NULL pointer when locating
> post loop from strip-mined main loop.
>
> **II) Incorrect result issues with post loop vectorization**
>
> We have also fixed five incorrect vectorization issues. Some of them are
> hidden deep and can only be reproduced with corner cases. These issues
> have a common cause that it assumes the post loop can be vectorized if
> the vectorization in corresponding main loop is successful. But in many
> cases this assumption is wrong. Below are details.
>
> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>
> This issue can be reproduced by below loop where only some operations in
> the loop body are vectorizable.
>
>     for (int i = 0; i < 10000; i++) {
>       res[i] = a[i] * b[i];
>       k = 3 * k + 1;
>     }
>
> In the main loop, superword can work well if parts of the operations in
> loop body are not vectorizable since those parts can be unrolled only.
> But for post loops, we don't create vectors through combining scalar IRs
> generated from loop unrolling. Instead, we are doing scalars to vectors
> replacement for all operations in the loop body. Hence, all operations
> should be either vectorized together or not vectorized at all. To fix
> this kind of cases, we add an extra field "_slp_vector_pack_count" in
> CountedLoopNode to record the eventual count of vector packs in the main
> loop. This value is then passed to post loop and compared with post loop
> pack count. Vectorization will be bailed out in post loop if it creates
> more vector packs than in the main loop.
>
> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>
> This issue appears with growing-down vectors, that is, vectors that grow
> to smaller memory address as the loop iterates. It can be reproduced by
> below counting-up loop with negative scale value in array index.
>
>     for (int i = 0; i < 10000; i++) {
>       a[MAX - i] = b[MAX - i];
>     }
>
> Cause of this issue is that for a growing-down vector, generated vector
> mask value has reversed vector-lane order so it masks incorrect vector
> lanes. Note that if negative scale value appears in counting-down loops,
> the vector will be growing up. With this rule, we fix the issue by only
> allowing positive array index scales in counting-up loops and negative
> array index scales in counting-down loops. This check is done with the
> help of SWPointer by comparing scale values in each memory access in the
> loop with loop stride value.
>
> - **[Issue-3] Incorrect result in manually unrolled loops**
>
> This issue can be reproduced by below manually unrolled loop.
>
>     for (int i = 0; i < 10000; i += 2) {
>       c[i] = a[i] + b[i];
>       c[i + 1] = a[i + 1] * b[i + 1];
>     }
>
> In this loop, operations in the 2nd statement duplicate those in the 1st
> statement with a small memory address offset. Vectorization in the main
> loop works well in this case because C2 does further unrolling and pack
> combination. But we cannot vectorize the post loop through replacement
> from scalars to vectors because it creates duplicated vector operations.
> To fix this, we restrict post loop vectorization to loops with stride
> values of 1 or -1.
>
> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>
> This issue is found after we enable post loop vectorization for AArch64.
> It's reproducible by multiple array operations with different element
> sizes inside a loop. On x86, there is no issue because the values of x86
> AVX512 opmasks only depend on which vector lanes are active. But AArch64
> is different - the values of SVE predicates also depend on lane size of
> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
> sizes, we should use different vector masks. For now, we just support
> loops with only one vector element size, i.e., "int + float" vectors in
> a single loop is ok but "int + double" vectors in a single loop is not
> vectorizable. This fix also enables subword vectors support to make all
> primitive type array operations vectorizable.
>
> - **[Issue-5] Incorrect result in loops with potential data dependence**
>
> This issue can be reproduced by below corner case on AArch64 only.
>
>     for (int i = 0; i < 10000; i++) {
>       a[i] = x;
>       a[i + OFFSET] = y;
>     }
>
> In this case, two stores in the loop have data dependence if the OFFSET
> value is smaller than the vector length. So we cannot do vectorization
> through replacing scalars to vectors. But the main loop vectorization
> in this case is successful on AArch64 because AArch64 has partial vector
> load/store support.
> It splits vector fill with different values in lanes
> to several smaller-sized fills. In this patch, we add additional data
> dependence check for this kind of cases. The check is also done with the
> help of SWPointer class. In this check, we require that every two memory
> accesses (with at least one store) of the same element type (or subword
> size) in the loop has the same array index expression.
>
> ### Tests
>
> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
> experimental VM option "PostLoopMultiversioning" turned on. We found no
> issue in all tests. We notice that those existing cases are not enough
> because some of above issues are not spotted by them. We would like to
> add some new cases but we found existing vectorization tests are a bit
> cumbersome - golden results must be pre-calculated and hard-coded in the
> test code for correctness verification. Thus, in this patch, we propose
> a new vectorization testing framework.
>
> Our new framework brings a simpler way to add new cases. For a new test
> case, we only need to create a new method annotated with "@Test". The
> test runner will invoke each annotated method twice automatically. First
> time it runs in the interpreter and second time it's forced compiled by
> C2. Then the two return results are compared. So in this framework each
> test method should return a primitive value or an array of primitives.
> In this way, no extra verification code for vectorization correctness is
> required. This test runner is still jtreg-based and takes advantages of
> the jtreg WhiteBox API, which enables test methods running at specific
> compilation levels. Each test class inside is also jtreg-based. It just
> need to inherit from the test runner class and run with two additional
> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>
> ### Summary & Future work
>
> In this patch, we reworked post loop vectorization.
> We made it platform
> independent and fixed several issues inside. We also implemented a new
> vectorization testing framework with many test cases inside. Meanwhile,
> we did some code cleanups.
>
> This patch only touches C2 code guarded with PostLoopMultiversioning,
> except a few data structure changes. So, there's no behavior change when
> experimental VM option PostLoopMultiversioning is off. Also, to reduce
> risks, we still propose to keep post loop vectorization experimental for
> now. But if it receives positive feedback, we would like to change it to
> non-experimental in the future.

I haven't looked at the code yet but just gave this a quick run through our testing. I'm seeing several hundred failures:

    java.lang.ClassNotFoundException: compiler/vectorization/runner/ArrayCopyTest
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:383)
        at java.base/java.lang.Class.forName(Class.java:376)
        at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:183)
        at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:833)
    java.lang.RuntimeException: Cannot create test instance for class compiler/vectorization/runner/ArrayCopyTest
        at compiler.vectorization.runner.VectorizationTestRunner.fail(VectorizationTestRunner.java:195)
        at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:188)
        at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:833)

Similar ClassNotFoundExceptions happen with the other tests. The new tests also intermittently time out with `-Xcomp`.

    java.lang.RuntimeException: Test failed in compiler.vectorization.runner.ArrayIndexFillTest.fillByteArray: Method is not compiled after 30s.
        at compiler.vectorization.runner.VectorizationTestRunner.run(VectorizationTestRunner.java:72)
        at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:200)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:833)

Running with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`. Similar failures happen with the other tests.

There are also failures in the pre-submit tests.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828

From pli at openjdk.java.net Wed Dec 15 07:27:59 2021
From: pli at openjdk.java.net (Pengfei Li)
Date: Wed, 15 Dec 2021 07:27:59 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization
In-Reply-To:
References:
Message-ID: <8xK2PmI8ViNhtdSJhC2OEHjbLEc2rKCQRBAFGLw13G8=.ec7e1206-98b3-4777-a1aa-9247dfcb7bd6@github.com>

On Wed, 15 Dec 2021 07:04:12 GMT, Tobias Hartmann wrote:

>> ### Background
>>
>> Post loop vectorization is a C2 compiler optimization in an experimental
>> VM feature called PostLoopMultiversioning. It transforms the range-check
>> eliminated post loop to a 1-iteration vectorized loop with vector mask.
>> This optimization was contributed by Intel in 2016 to support x86 AVX512
>> masked vector instructions.
However, it was disabled soon after an issue >> was found. Due to insufficient maintenance in these years, multiple bugs >> have been accumulated inside. But we (Arm) still think this is a useful >> framework for vector mask support in C2 auto-vectorized loops, for both >> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable >> post loop vectorization. >> >> ### Changes in this patch >> >> This patch reworks post loop vectorization. The most significant change >> is removing vector mask support in C2 x86 backend and re-implementing >> it in the mid-end. With this, we can re-enable post loop vectorization >> for platforms other than x86. >> >> Previous implementation hard-codes x86 k1 register as a reserved AVX512 >> opmask register and defines two routines (setvectmask/restorevectmask) >> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes >> AVX512 instructions as unmasked by default, generated vector masks are >> no longer used in AVX512 vector instructions. To fix incorrect codegen >> and add vector mask support for more platforms, we turn to add a vector >> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode >> to generate a mask and replace all Load/Store nodes in the post loop >> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This >> IR form is exactly the same to those which are used in VectorAPI mask >> support. For now, we only add mask inputs for Load/Store nodes because >> we don't have reduction operations supported in post loop vectorization. >> After this change, the x86 k1 register is no longer reserved and can be >> allocated when PostLoopMultiversioning is enabled. >> >> Besides this change, we have fixed a compiler crash and five incorrect >> result issues with post loop vectorization. 
>> >> **I) C2 crashes with segmentation fault in strip-mined loops** >> >> Previous implementation was done before C2 loop strip-mining was merged >> into JDK master so it didn't take strip-mined loops into consideration. >> In C2's strip mined loops, post loop is not the sibling of the main loop >> in ideal loop tree. Instead, it's the sibling of the main loop's parent. >> This patch fixed a SIGSEGV issue caused by NULL pointer when locating >> post loop from strip-mined main loop. >> >> **II) Incorrect result issues with post loop vectorization** >> >> We have also fixed five incorrect vectorization issues. Some of them are >> hidden deep and can only be reproduced with corner cases. These issues >> have a common cause that it assumes the post loop can be vectorized if >> the vectorization in corresponding main loop is successful. But in many >> cases this assumption is wrong. Below are details. >> >> - **[Issue-1] Incorrect vectorization for partial vectorizable loops** >> >> This issue can be reproduced by below loop where only some operations in >> the loop body are vectorizable. >> >> for (int i = 0; i < 10000; i++) { >> res[i] = a[i] * b[i]; >> k = 3 * k + 1; >> } >> >> In the main loop, superword can work well if parts of the operations in >> loop body are not vectorizable since those parts can be unrolled only. >> But for post loops, we don't create vectors through combining scalar IRs >> generated from loop unrolling. Instead, we are doing scalars to vectors >> replacement for all operations in the loop body. Hence, all operations >> should be either vectorized together or not vectorized at all. To fix >> this kind of cases, we add an extra field "_slp_vector_pack_count" in >> CountedLoopNode to record the eventual count of vector packs in the main >> loop. This value is then passed to post loop and compared with post loop >> pack count. Vectorization will be bailed out in post loop if it creates >> more vector packs than in the main loop. 
>> >> - **[Issue-2] Incorrect result in loops with growing-down vectors** >> >> This issue appears with growing-down vectors, that is, vectors that grow >> to smaller memory address as the loop iterates. It can be reproduced by >> below counting-up loop with negative scale value in array index. >> >> for (int i = 0; i < 10000; i++) { >> a[MAX - i] = b[MAX - i]; >> } >> >> Cause of this issue is that for a growing-down vector, generated vector >> mask value has reversed vector-lane order so it masks incorrect vector >> lanes. Note that if negative scale value appears in counting-down loops, >> the vector will be growing up. With this rule, we fix the issue by only >> allowing positive array index scales in counting-up loops and negative >> array index scales in counting-down loops. This check is done with the >> help of SWPointer by comparing scale values in each memory access in the >> loop with loop stride value. >> >> - **[Issue-3] Incorrect result in manually unrolled loops** >> >> This issue can be reproduced by below manually unrolled loop. >> >> for (int i = 0; i < 10000; i += 2) { >> c[i] = a[i] + b[i]; >> c[i + 1] = a[i + 1] * b[i + 1]; >> } >> >> In this loop, operations in the 2nd statement duplicate those in the 1st >> statement with a small memory address offset. Vectorization in the main >> loop works well in this case because C2 does further unrolling and pack >> combination. But we cannot vectorize the post loop through replacement >> from scalars to vectors because it creates duplicated vector operations. >> To fix this, we restrict post loop vectorization to loops with stride >> values of 1 or -1. >> >> - **[Issue-4] Incorrect result in loops with mixed vector element sizes** >> >> This issue is found after we enable post loop vectorization for AArch64. >> It's reproducible by multiple array operations with different element >> sizes inside a loop. 
On x86, there is no issue because the values of x86 >> AVX512 opmasks only depend on which vector lanes are active. But AArch64 >> is different - the values of SVE predicates also depend on lane size of >> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element >> sizes, we should use different vector masks. For now, we just support >> loops with only one vector element size, i.e., "int + float" vectors in >> a single loop is ok but "int + double" vectors in a single loop is not >> vectorizable. This fix also enables subword vectors support to make all >> primitive type array operations vectorizable. >> >> - **[Issue-5] Incorrect result in loops with potential data dependence** >> >> This issue can be reproduced by below corner case on AArch64 only. >> >> for (int i = 0; i < 10000; i++) { >> a[i] = x; >> a[i + OFFSET] = y; >> } >> >> In this case, two stores in the loop have data dependence if the OFFSET >> value is smaller than the vector length. So we cannot do vectorization >> through replacing scalars to vectors. But the main loop vectorization >> in this case is successful on AArch64 because AArch64 has partial vector >> load/store support. It splits vector fill with different values in lanes >> to several smaller-sized fills. In this patch, we add additional data >> dependence check for this kind of cases. The check is also done with the >> help of SWPointer class. In this check, we require that every two memory >> accesses (with at least one store) of the same element type (or subword >> size) in the loop has the same array index expression. >> >> ### Tests >> >> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with >> experimental VM option "PostLoopMultiversioning" turned on. We found no >> issue in all tests. We notice that those existing cases are not enough >> because some of above issues are not spotted by them. 
We would like to >> add some new cases but we found existing vectorization tests are a bit >> cumbersome - golden results must be pre-calculated and hard-coded in the >> test code for correctness verification. Thus, in this patch, we propose >> a new vectorization testing framework. >> >> Our new framework brings a simpler way to add new cases. For a new test >> case, we only need to create a new method annotated with "@Test". The >> test runner will invoke each annotated method twice automatically. First >> time it runs in the interpreter and second time it's forced compiled by >> C2. Then the two return results are compared. So in this framework each >> test method should return a primitive value or an array of primitives. >> In this way, no extra verification code for vectorization correctness is >> required. This test runner is still jtreg-based and takes advantages of >> the jtreg WhiteBox API, which enables test methods running at specific >> compilation levels. Each test class inside is also jtreg-based. It just >> need to inherit from the test runner class and run with two additional >> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI". >> >> ### Summary & Future work >> >> In this patch, we reworked post loop vectorization. We made it platform >> independent and fixed several issues inside. We also implemented a new >> vectorization testing framework with many test cases inside. Meanwhile, >> we did some code cleanups. >> >> This patch only touches C2 code guarded with PostLoopMultiversioning, >> except a few data structure changes. So, there's no behavior change when >> experimental VM option PostLoopMultiversioning is off. Also, to reduce >> risks, we still propose to keep post loop vectorization experimental for >> now. But if it receives positive feedback, we would like to change it to >> non-experimental in the future. > > I haven't looked at the code yet but just gave this a quick run through our testing. 
> I'm seeing several hundred failures:
>
>     java.lang.ClassNotFoundException: compiler/vectorization/runner/ArrayCopyTest
>         at java.base/java.lang.Class.forName0(Native Method)
>         at java.base/java.lang.Class.forName(Class.java:383)
>         at java.base/java.lang.Class.forName(Class.java:376)
>         at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:183)
>         at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
>         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>         at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
>         at java.base/java.lang.Thread.run(Thread.java:833)
>     java.lang.RuntimeException: Cannot create test instance for class compiler/vectorization/runner/ArrayCopyTest
>         at compiler.vectorization.runner.VectorizationTestRunner.fail(VectorizationTestRunner.java:195)
>         at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:188)
>         at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
>         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>         at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
>         at java.base/java.lang.Thread.run(Thread.java:833)
>
> Similar ClassNotFoundExceptions happen with the other tests. The new tests also intermittently time out with `-Xcomp`.
>
>     java.lang.RuntimeException: Test failed in compiler.vectorization.runner.ArrayIndexFillTest.fillByteArray: Method is not compiled after 30s.
>         at compiler.vectorization.runner.VectorizationTestRunner.run(VectorizationTestRunner.java:72)
>         at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:200)
>         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:577)
>         at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
>         at java.base/java.lang.Thread.run(Thread.java:833)
>
> Running with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`. Similar failures happen with the other tests.
>
> There are also failures in the pre-submit tests.

Hi @TobiHartmann , thanks for running the tests. I have already noticed the failures. So far, all the failures I have found are under `compiler/vectorization/runner` - these are the new tests added by this patch. The cause is that my new test framework uses `WhiteBox` APIs to control the compilation level for the correctness check, and this may not work if additional compiler control VM options are specified. I will fix it soon.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828

From pli at openjdk.java.net Wed Dec 15 09:21:39 2021
From: pli at openjdk.java.net (Pengfei Li)
Date: Wed, 15 Dec 2021 09:21:39 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization [v2]
In-Reply-To:
References:
Message-ID:

> ### Background
>
> Post loop vectorization is a C2 compiler optimization in an experimental
> VM feature called PostLoopMultiversioning. It transforms the range-check
> eliminated post loop to a 1-iteration vectorized loop with vector mask.
> This optimization was contributed by Intel in 2016 to support x86 AVX512
> masked vector instructions. However, it was disabled soon after an issue
> was found. Due to insufficient maintenance in these years, multiple bugs
> have been accumulated inside.
But we (Arm) still think this is a useful > framework for vector mask support in C2 auto-vectorized loops, for both > x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable > post loop vectorization. > > ### Changes in this patch > > This patch reworks post loop vectorization. The most significant change > is removing vector mask support in C2 x86 backend and re-implementing > it in the mid-end. With this, we can re-enable post loop vectorization > for platforms other than x86. > > Previous implementation hard-codes x86 k1 register as a reserved AVX512 > opmask register and defines two routines (setvectmask/restorevectmask) > to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes > AVX512 instructions as unmasked by default, generated vector masks are > no longer used in AVX512 vector instructions. To fix incorrect codegen > and add vector mask support for more platforms, we turn to add a vector > mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode > to generate a mask and replace all Load/Store nodes in the post loop > into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This > IR form is exactly the same to those which are used in VectorAPI mask > support. For now, we only add mask inputs for Load/Store nodes because > we don't have reduction operations supported in post loop vectorization. > After this change, the x86 k1 register is no longer reserved and can be > allocated when PostLoopMultiversioning is enabled. > > Besides this change, we have fixed a compiler crash and five incorrect > result issues with post loop vectorization. > > **I) C2 crashes with segmentation fault in strip-mined loops** > > Previous implementation was done before C2 loop strip-mining was merged > into JDK master so it didn't take strip-mined loops into consideration. > In C2's strip mined loops, post loop is not the sibling of the main loop > in ideal loop tree. 
> Instead, it's the sibling of the main loop's parent.
> This patch fixes a SIGSEGV caused by a NULL pointer when locating the
> post loop from a strip-mined main loop.
>
> **II) Incorrect result issues with post loop vectorization**
>
> We have also fixed five incorrect vectorization issues. Some of them are
> hidden deep and can only be reproduced with corner cases. These issues
> have a common cause: the code assumes the post loop can be vectorized if
> vectorization of the corresponding main loop is successful. But in many
> cases this assumption is wrong. Details are below.
>
> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>
> This issue can be reproduced by the loop below, where only some operations
> in the loop body are vectorizable.
>
>     for (int i = 0; i < 10000; i++) {
>       res[i] = a[i] * b[i];
>       k = 3 * k + 1;
>     }
>
> In the main loop, superword works well even if some operations in the
> loop body are not vectorizable, since those operations can simply be unrolled.
> But for post loops, we don't create vectors by combining scalar IRs
> generated from loop unrolling. Instead, we replace scalars with vectors
> for all operations in the loop body. Hence, all operations
> should be either vectorized together or not vectorized at all. To fix
> this kind of case, we add an extra field "_slp_vector_pack_count" in
> CountedLoopNode to record the eventual count of vector packs in the main
> loop. This value is then passed to the post loop and compared with the post
> loop pack count. Post loop vectorization bails out if it creates
> more vector packs than the main loop.
>
> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>
> This issue appears with growing-down vectors, that is, vectors that grow
> toward smaller memory addresses as the loop iterates. It can be reproduced by
> the counting-up loop below, which has a negative scale value in the array index.
>
>     for (int i = 0; i < 10000; i++) {
>       a[MAX - i] = b[MAX - i];
>     }
>
> The cause of this issue is that for a growing-down vector, the generated vector
> mask value has reversed vector-lane order, so it masks incorrect vector
> lanes. Note that if a negative scale value appears in a counting-down loop,
> the vector will be growing up. With this rule, we fix the issue by only
> allowing positive array index scales in counting-up loops and negative
> array index scales in counting-down loops. This check is done with the
> help of SWPointer by comparing the scale value of each memory access in the
> loop with the loop stride value.
>
> - **[Issue-3] Incorrect result in manually unrolled loops**
>
> This issue can be reproduced by the manually unrolled loop below.
>
>     for (int i = 0; i < 10000; i += 2) {
>       c[i] = a[i] + b[i];
>       c[i + 1] = a[i + 1] * b[i + 1];
>     }
>
> In this loop, operations in the 2nd statement duplicate those in the 1st
> statement with a small memory address offset. Vectorization in the main
> loop works well in this case because C2 does further unrolling and pack
> combination. But we cannot vectorize the post loop by replacing
> scalars with vectors because that creates duplicated vector operations.
> To fix this, we restrict post loop vectorization to loops with stride
> values of 1 or -1.
>
> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>
> This issue was found after we enabled post loop vectorization for AArch64.
> It's reproducible with multiple array operations of different element
> sizes inside a loop. On x86, there is no issue because the values of x86
> AVX512 opmasks only depend on which vector lanes are active. But AArch64
> is different - the values of SVE predicates also depend on the lane size of
> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
> sizes, we should use different vector masks.
> For now, we just support
> loops with only one vector element size, i.e., "int + float" vectors in
> a single loop are OK but "int + double" vectors in a single loop are not
> vectorizable. This fix also enables subword vector support to make all
> primitive type array operations vectorizable.
>
> - **[Issue-5] Incorrect result in loops with potential data dependence**
>
> This issue can be reproduced by the corner case below, on AArch64 only.
>
>     for (int i = 0; i < 10000; i++) {
>       a[i] = x;
>       a[i + OFFSET] = y;
>     }
>
> In this case, the two stores in the loop have a data dependence if the OFFSET
> value is smaller than the vector length. So we cannot vectorize
> by replacing scalars with vectors. But the main loop vectorization
> in this case is successful on AArch64 because AArch64 has partial vector
> load/store support. It splits a vector fill with different values in lanes
> into several smaller-sized fills. In this patch, we add an additional data
> dependence check for this kind of case. The check is also done with the
> help of the SWPointer class. In this check, we require that every two memory
> accesses (with at least one store) of the same element type (or subword
> size) in the loop have the same array index expression.
>
> ### Tests
>
> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
> the experimental VM option "PostLoopMultiversioning" turned on. We found no
> issues in any of the tests. We notice that the existing cases are not enough
> because some of the above issues are not spotted by them. We would like to
> add some new cases, but we found the existing vectorization tests a bit
> cumbersome - golden results must be pre-calculated and hard-coded in the
> test code for correctness verification. Thus, in this patch, we propose
> a new vectorization testing framework.
>
> Our new framework brings a simpler way to add new cases. For a new test
> case, we only need to create a new method annotated with "@Test". The
> test runner will invoke each annotated method twice automatically. The first
> time it runs in the interpreter and the second time it is force-compiled by
> C2. Then the two return results are compared. So in this framework each
> test method should return a primitive value or an array of primitives.
> In this way, no extra verification code for vectorization correctness is
> required. This test runner is still jtreg-based and takes advantage of
> the jtreg WhiteBox API, which enables test methods to run at specific
> compilation levels. Each test class inside is also jtreg-based. It just
> needs to inherit from the test runner class and run with two additional
> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>
> ### Summary & Future work
>
> In this patch, we reworked post loop vectorization. We made it platform
> independent and fixed several issues inside. We also implemented a new
> vectorization testing framework with many test cases inside. Meanwhile,
> we did some code cleanups.
>
> This patch only touches C2 code guarded with PostLoopMultiversioning,
> except for a few data structure changes. So, there's no behavior change when
> the experimental VM option PostLoopMultiversioning is off. Also, to reduce
> risks, we still propose to keep post loop vectorization experimental for
> now. But if it receives positive feedback, we would like to change it to
> non-experimental in the future.

Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
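For readers following the testing discussion in the quoted description, the run-twice-and-compare idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class `TwoRunCompareSketch`, its nested `@Test` annotation and the helpers `runAll` and `sameResult` are hypothetical names and are not the framework from the pull request. The real runner is described as using the jtreg WhiteBox API to run the first invocation in the interpreter and the second one after forced C2 compilation; this sketch has no such control and simply invokes each annotated method twice.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;

public class TwoRunCompareSketch {
    // Hypothetical marker annotation standing in for the framework's "@Test".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Test {}

    static final int[] A = {1, 2, 3, 4, 5};
    static final int[] B = {10, 20, 30, 40, 50};

    // A test method returns an array of primitives ...
    @Test
    public int[] mul() {
        int[] res = new int[A.length];
        for (int i = 0; i < A.length; i++) {
            res[i] = A[i] * B[i];
        }
        return res;
    }

    // ... or a single primitive value (boxed by reflection).
    @Test
    public long sum() {
        long s = 0;
        for (int v : A) {
            s += v;
        }
        return s;
    }

    // Compares boxed primitives with equals() and primitive arrays element-wise.
    public static boolean sameResult(Object x, Object y) {
        return Arrays.deepEquals(new Object[] {x}, new Object[] {y});
    }

    // Invokes every @Test method twice and compares the two results.
    // Assumption: a real runner would control the compilation level of each run.
    public static int runAll(Object tests) {
        int checked = 0;
        for (Method m : tests.getClass().getDeclaredMethods()) {
            if (!m.isAnnotationPresent(Test.class)) {
                continue;
            }
            try {
                Object first = m.invoke(tests);   // stands in for the interpreted run
                Object second = m.invoke(tests);  // stands in for the C2-compiled run
                if (!sameResult(first, second)) {
                    throw new AssertionError("Result mismatch in " + m.getName());
                }
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println(runAll(new TwoRunCompareSketch()) + " test methods checked");
    }
}
```

Returning the result instead of asserting inside the test method is what removes the need for hard-coded golden values: the first run serves as the reference for the second.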
The pull request contains three additional commits since the last revision:

 - Fix issues in newly added test framework
   
   Change-Id: I6e61abf05e9665325cb3abaf407360b18355c6b1
 - Merge branch 'master' into postloop
   
   Change-Id: I9bb5a808d7540426dedb141fd198d25eb1f569e6
 - 8183390: Fix and re-enable post loop vectorization
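To make the masked post loop described in this pull request more concrete, here is a scalar emulation in plain Java. It is a sketch under stated assumptions, not C2 code: `VLEN`, `maskGen`, `loadMasked` and `storeMasked` are hypothetical stand-ins for the vector length and for the roles played by VectorMaskGenNode, LoadVectorMasked and StoreVectorMasked. It only shows the intended semantics for a counting-up loop with stride 1: full vectors in the main loop, then a single masked iteration for the leftover elements.

```java
import java.util.Arrays;

public class MaskedPostLoopSketch {
    // Assumed lane count, for illustration only.
    static final int VLEN = 8;

    // Emulates mask generation: lane j is active iff j < activeLanes.
    public static boolean[] maskGen(int activeLanes) {
        boolean[] mask = new boolean[VLEN];
        for (int j = 0; j < VLEN; j++) {
            mask[j] = j < activeLanes;
        }
        return mask;
    }

    // Emulates a masked vector load: inactive lanes are never read and stay 0.
    public static int[] loadMasked(int[] src, int base, boolean[] mask) {
        int[] v = new int[VLEN];
        for (int j = 0; j < VLEN; j++) {
            if (mask[j]) {
                v[j] = src[base + j];
            }
        }
        return v;
    }

    // Emulates a masked vector store: inactive lanes are never written.
    public static void storeMasked(int[] dst, int base, int[] v, boolean[] mask) {
        for (int j = 0; j < VLEN; j++) {
            if (mask[j]) {
                dst[base + j] = v[j];
            }
        }
    }

    // res[i] = a[i] * b[i], split into a "main loop" and one masked "post loop" iteration.
    public static int[] mul(int[] a, int[] b) {
        int n = a.length;
        int[] res = new int[n];
        int i = 0;
        // Main loop: whole vectors of VLEN lanes, no mask needed.
        for (; i + VLEN <= n; i += VLEN) {
            for (int j = 0; j < VLEN; j++) {
                res[i + j] = a[i + j] * b[i + j];
            }
        }
        // Post loop: at most one iteration, with only the remaining lanes active.
        if (i < n) {
            boolean[] mask = maskGen(n - i);
            int[] va = loadMasked(a, i, mask);
            int[] vb = loadMasked(b, i, mask);
            int[] vr = new int[VLEN];
            for (int j = 0; j < VLEN; j++) {
                vr[j] = va[j] * vb[j];
            }
            storeMasked(res, i, vr, mask);
        }
        return res;
    }

    // Scalar reference used to check the emulation.
    public static int[] mulScalar(int[] a, int[] b) {
        int[] res = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            res[i] = a[i] * b[i];
        }
        return res;
    }

    public static void main(String[] args) {
        int[] a = new int[21];
        int[] b = new int[21];
        for (int i = 0; i < a.length; i++) {
            a[i] = i + 1;
            b[i] = 2 * i - 5;
        }
        // 21 elements: two full vectors plus a post loop with 5 active lanes.
        System.out.println(Arrays.equals(mul(a, b), mulScalar(a, b))); // prints "true"
    }
}
```

Because the mask activates the lowest lanes first, this form only matches memory order when the accessed addresses grow with the loop index, which is consistent with the restriction described for Issue-2.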
-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6828/files
  - new: https://git.openjdk.java.net/jdk/pull/6828/files/cae3b16b..85ce597d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6828&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6828&range=00-01

  Stats: 922 lines in 75 files changed: 688 ins; 84 del; 150 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6828.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6828/head:pull/6828

PR: https://git.openjdk.java.net/jdk/pull/6828

From lutz.schmidt at sap.com Wed Dec 15 09:27:18 2021
From: lutz.schmidt at sap.com (Schmidt, Lutz)
Date: Wed, 15 Dec 2021 09:27:18 +0000
Subject: JDK-8231460: Java update from 11.0.11 to 11.0.13 changes JVM code cache behavior and results in more process cpu usage and unexpected profiled nmethods memory usage
In-Reply-To: <1d111a35-b341-d69f-bbb9-f4084f82dab3@oracle.com>
References: <1d111a35-b341-d69f-bbb9-f4084f82dab3@oracle.com>
Message-ID: <0E149152-347A-4D5E-9A1F-7084A988CBFC@sap.com>

Hi Josef, Tobias,

sorry for responding with some delay. Busy with other things, I did not closely monitor all the OpenJDK mail traffic.

The CodeCache behavior you describe sounds interesting. However, I do not see how the mentioned changes (JDK-8223444 and JDK-8231460) would cause or have any effect on that behavior. Why? The changes only modify the inner workings of the code cache. They have no effect on the contents (# of methods, method size, method lifetime, ...) of the code cache. Just the internal (free) space management is made more efficient and effective.

To gain some insight into what's going on, you may want to use CodeHeap Analytics. The documentation for this feature can be found at
https://docs.oracle.com/en/java/javase/11/tools/java.html#GUID-4856361B-8BFD-4964-AE84-121F5F6CF111
Click the download link to access the full documentation.
To start with,
   jcmd <pid> Compiler.CodeHeap_Analytics aggregate
will provide some general state information. I would suggest collecting the information at regular intervals during startup, until you see the drop in 'profiled nmethods' space. Please issue
   jcmd 23923 Compiler.CodeHeap_Analytics discard
at the end to free up JVM memory.

Hope that helps!
Best Regards,
Lutz

On 14.12.21, 14:31, "Tobias Hartmann" wrote:

    Hi Josef,

    Thanks for reporting this issue! Any chance you could run your application with different builds to narrow the problem down to a single change?

    @Lutz: As the author of 8223444 and 8231460, any ideas?

    Best regards,
    Tobias

    On 01.12.21 16:35, Josef Lehner wrote:
    > Dear OpenJDK team,
    >
    > as described in this StackOverflow question, I want to reach out to you and
    > question whether the JVM code cache / codeheap still works as designed.
    > https://stackoverflow.com/questions/70086548/java-update-from-11-0-11-to-11-0-13-changes-jvm-code-cache-behavior-and-results
    >
    > What we experience in our huge application is that with Java 11.0.13 and
    > -XX:ReservedCodeCacheSize=375m the code cache / codeheap 'profiled
    > nmethods' (C1 optimized, ~ 180 MB) drops after a very short time to a very
    > low level (less than 50 MB) and stays at this level forever while
    > 'non-profiled nmethods' (C2 optimized) is already at its limits.
    > After we tripled the -XX:ReservedCodeCacheSize to 1024 MB, both areas
    > 'profiled nmethods' and 'non-profiled nmethods' have stayed at a much
    > higher constant level (~ 258 MB) for over a week now instead of dropping to
    > less than 50 MB after 15 min (-XX:ReservedCodeCacheSize=375m) or dropping
    > after 3 hours (-XX:ReservedCodeCacheSize=512m).
    > From my point of view as a non-expert I would expect that the C1 optimized
    > code does not get removed (or at least not so much) from 'profiled
    > nmethods' as there is no space left in 'non-profiled nmethods' to optimize
    > it further. What do you think?
    >
    > Important changes in 11.0.12:
    > https://bugs.openjdk.java.net/browse/JDK-8223444 Improve CodeHeap Free
    > Space Management
    > https://bugs.openjdk.java.net/browse/JDK-8231460 Performance issue
    > (CodeHeap) with large free blocks
    >
    > Best regards
    > Josef Lehner
    >

From jbhateja at openjdk.java.net Wed Dec 15 10:19:31 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 15 Dec 2021 10:19:31 GMT
Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86
Message-ID:

- An incorrect operand is being passed to the insertps instruction, which causes incorrect results in the FloatVector.withLane operation.
- Existing JTREG test cases have been modified appropriately with a non-zero insertion index.

Kindly review and share your comments.

Best Regards,
Jatin

-------------

Commit messages:
 - 8278796: Incorrect behavior of FloatVector.withLane on X86

Changes: https://git.openjdk.java.net/jdk18/pull/28/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=28&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8278796
  Stats: 66 lines in 33 files changed: 1 ins; 0 del; 65 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/28.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/28/head:pull/28

PR: https://git.openjdk.java.net/jdk18/pull/28

From roland at openjdk.java.net Wed Dec 15 10:25:03 2021
From: roland at openjdk.java.net (Roland Westrelin)
Date: Wed, 15 Dec 2021 10:25:03 GMT
Subject: RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert [v3]
In-Reply-To:
References:
Message-ID: <4zpe1GthmCB13XrslkG0o1hghjYWAJ-RAdJeazlIz24=.b1134724-e8d7-463f-9030-d5d75a670081@github.com>

On Tue, 14 Dec 2021 22:04:19 GMT, Dean Long wrote:

> John wrote me and said your fix looks OK (though not a general solution). For your test, he suggested adding a 3rd case with extra call depth: "test3 -> mcaller -> m". Also, I believe @vnkozlov will take a look too.

Thanks @dean-long. I need to open a PR for jdk 18 and close that one.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6572

From roland at openjdk.java.net Wed Dec 15 10:25:03 2021
From: roland at openjdk.java.net (Roland Westrelin)
Date: Wed, 15 Dec 2021 10:25:03 GMT
Subject: Withdrawn: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert
In-Reply-To:
References:
Message-ID:

On Fri, 26 Nov 2021 10:09:29 GMT, Roland Westrelin wrote:

> Root cause is identical to 8273165 AFAIU: late inline of a virtual call
> can throw from 2 different paths (null check and the call
> itself).
> That breaks because the logic for exceptions expects the
> stack for all paths that throw exceptions to have the same stack size.
>
> AFAIU, the stack doesn't matter for exception handling: either the
> exception is caught by an exception handler and then the stack is
> popped and the exception is pushed or, the exception is rethrown to
> the caller in which case the current stack is also popped (that is the
> jvm state for the current method). As a consequence the fix I propose
> is to ignore the stack in GraphKit::combine_exception_states().
>
> AFAIU, the same fix would work for 8273165 but I left the current
> workaround as is: not sure if we want to be conservative for now or not

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6572

From roland at openjdk.java.net Wed Dec 15 10:34:23 2021
From: roland at openjdk.java.net (Roland Westrelin)
Date: Wed, 15 Dec 2021 10:34:23 GMT
Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert
Message-ID:

The bug and fix were discussed in a previous PR:

https://github.com/openjdk/jdk/pull/6572

I pushed all commits from that PR on top of jdk 18 and added a couple of extra tests as suggested in:

https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590

-------------

Commit messages:
 - whitespaces
 - extra tests
 - comment
 - alternate fix
 - make test runnable with release build
 - more
 - fix

Changes: https://git.openjdk.java.net/jdk18/pull/29/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=29&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8275638
  Stats: 117 lines in 3 files changed: 117 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/29.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/29/head:pull/29

PR: https://git.openjdk.java.net/jdk18/pull/29

From lutz.schmidt at sap.com Wed Dec 15 10:38:27 2021
From: lutz.schmidt at sap.com (Schmidt, Lutz)
Date: Wed, 15 Dec
2021 10:38:27 +0000
Subject: JDK-8231460: Java update from 11.0.11 to 11.0.13 changes JVM code cache behavior and results in more process cpu usage and unexpected profiled nmethods memory usage
In-Reply-To: <0E149152-347A-4D5E-9A1F-7084A988CBFC@sap.com>
References: <1d111a35-b341-d69f-bbb9-f4084f82dab3@oracle.com> <0E149152-347A-4D5E-9A1F-7084A988CBFC@sap.com>
Message-ID: <32209B75-302A-429A-8C18-08BD277FE9C5@sap.com>

Some additional thoughts:

Obviously, the code cache is sized too tight to reach a steady state.

While the system is warming up, methods are first C1-compiled (with profiling). That fills the code cache segment for profiled nmethods. Over time, many of these methods become hotter and their C2-compilation is triggered, filling the segment for non-profiled nmethods. Once a C2-compiled copy is available, the related C1-compiled copy becomes obsolete and will be swept away over time.

As time progresses, more and more of the Java code is considered (really) hot and gets C2-compiled. There is not much left in the C1-compiled state. That explains why you see a usage drop in the 'profiled nmethods' segment.

The 'non-profiled nmethods' segment, on the other hand, is packed full. You should see permanent recompilation events and permanent high code cache sweeper activity. Simply said, sizzling hot methods are C2-(re)compiled, pushing out methods which have cooled off a bit. That is the "good" case. If sweeping can't keep up, JIT compilation might as well get disabled altogether. Your application would then continue to run with that static state of the code cache. Because all methods are hot, they are not considered for recompilation by C1.

Those compiler gurus out there are welcome to correct me if I'm wrong!

Best Regards,
Lutz

From aph at openjdk.java.net Wed Dec 15 10:40:00 2021
From: aph at openjdk.java.net (Andrew Haley)
Date: Wed, 15 Dec 2021 10:40:00 GMT
Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 10:45:28 GMT, Patric Hedlin wrote:

> Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support.
>
> Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case.
>
> - Interleaved ISO and ASCII check code.
> - Avoid 'umaxv' in the ISO main flow.
> - Using post inc in main loop.
> - Retain 8-char loop.
> - Removing conditional prefetch (no upside).
> - Adding ISO-8859-1 to encode-decode benchmark.
>
> Testing: tier1-6
>
> The revised version compares like this (master vs. update).
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30   17.920 ± 0.229  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   18.867 ± 0.356  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30   17.419 ± 0.220  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.200 ± 0.134  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30   17.149 ± 0.219  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.115 ± 1.440  us/op
>
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30    9.018 ± 0.179  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   10.550 ± 0.470  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30    8.843 ± 0.187  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.406 ± 0.155  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30    8.822 ± 0.173  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.195 ± 1.432  us/op

I don't think this should go straight into the 18 release branch. It looks OK for mainline.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/20

From roland at openjdk.java.net Wed Dec 15 12:43:27 2021
From: roland at openjdk.java.net (Roland Westrelin)
Date: Wed, 15 Dec 2021 12:43:27 GMT
Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large
Message-ID: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com>

On the fallthrough path from an AllocateArray, the length of the allocated array is cast (with a CastII) to [0, max_size], with max_size some number that depends on the array type and can be less than max_jint.

Allocating an array of a length that's not in [0, max_size] causes the CastII to become top. The fallthrough path must be killed as well in that case, otherwise this can lead to a broken graph. Currently c2 has logic to protect against an allocation of an array of negative size in AllocateArrayNode::Ideal().
That call replaces the fallthrough path with a Halt node. But if the size is too big, then the fallthrough path is left as is.

This patch fixes that issue. It also reworks the negative-length case. I added a Bool/CmpU input to the AllocateArray that tests for a valid length. If that input becomes false, CatchNode::Value() kills the fallthrough path. That logic is similar to that for a virtual call with a null receiver. I also removed AllocateArrayNode::Ideal() now that CatchNode::Value() takes care of the same corner case. The code in AllocateArrayNode::Ideal() was added by Vladimir and he told me he tried extending CatchNode::Value() at the time but that caused test failures. I had no issues in my testing so I assume doing it that way is ok now.

The new input to AllocateArray is moved to the CallStaticJava runtime call for array allocation on macro expansion as a precedence edge. The reason for that is that final graph reshape needs a way to tell whether the missing path out of the allocation is legal or not. Final graph reshape then removes the precedence edge, which is useless by that point.

-------------

Commit messages:
 - alloc array fix

Changes: https://git.openjdk.java.net/jdk18/pull/30/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=30&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8278413
Stats: 200 lines in 9 files changed: 125 ins; 53 del; 22 mod
Patch: https://git.openjdk.java.net/jdk18/pull/30.diff
Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/30/head:pull/30

PR: https://git.openjdk.java.net/jdk18/pull/30

From jbhateja at openjdk.java.net Wed Dec 15 13:17:06 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 15 Dec 2021 13:17:06 GMT
Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.
In-Reply-To:
References:
Message-ID: <5Rm0oIFi7Ju0ZTN2mSUQk0ek-n3p-eLJXZphiG8NAOA=.2af84ad7-7665-43fd-b9b3-cc9aefaf1a50@github.com>

On Wed, 15 Dec 2021 01:38:32 GMT, Vladimir Kozlov wrote:

> Regular testing (x64) results are good.
>
> But I also run `jdk/incubator/vector/` tests locally with 32-bit VM (fastdebug). Most tests passed but some hit timeout (run for more than default 2 min):
>
> ```
> jdk/incubator/vector/Byte512VectorTests.java
> jdk/incubator/vector/ByteMaxVectorTests.java
> jdk/incubator/vector/VectorReshapeTests.java
> ```
>
> I assume such vectors are not supported in 32-bit VM and code is slow. Simple solution is to add `/timeout=240` (4 min) to these tests. They passed for me after that.

Hi @vnkozlov, the 32-bit x86 target constrains the number of vector registers to 8 (zmm0-zmm7), so the above test cases should still be able to intrinsify the operation. I ran the above tests on an AVX512 target and they execute in less than 2 minutes. On a non-AVX512 target, any operation based on the Byte512 species is not inline expanded. Since the changes in this patch relate only to AVX512, my testing went clean with UseAVX=3.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From jbhateja at openjdk.java.net Wed Dec 15 13:36:58 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 15 Dec 2021 13:36:58 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization
In-Reply-To: <8xK2PmI8ViNhtdSJhC2OEHjbLEc2rKCQRBAFGLw13G8=.ec7e1206-98b3-4777-a1aa-9247dfcb7bd6@github.com>
References: <8xK2PmI8ViNhtdSJhC2OEHjbLEc2rKCQRBAFGLw13G8=.ec7e1206-98b3-4777-a1aa-9247dfcb7bd6@github.com>
Message-ID:

On Wed, 15 Dec 2021 07:25:09 GMT, Pengfei Li wrote:

>> I haven't looked at the code yet but just gave this a quick run through our testing.
I'm seeing several hundred failures: >> >> java.lang.ClassNotFoundException: compiler/vectorization/runner/ArrayCopyTest >> at java.base/java.lang.Class.forName0(Native Method) >> at java.base/java.lang.Class.forName(Class.java:383) >> at java.base/java.lang.Class.forName(Class.java:376) >> at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:183) >> at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:577) >> at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) >> at java.base/java.lang.Thread.run(Thread.java:833) >> java.lang.RuntimeException: Cannot create test instance for class compiler/vectorization/runner/ArrayCopyTest >> at compiler.vectorization.runner.VectorizationTestRunner.fail(VectorizationTestRunner.java:195) >> at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:188) >> at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:577) >> at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) >> at java.base/java.lang.Thread.run(Thread.java:833) >> >> Similar ClassNotFoundExceptions happen with the other tests. The new tests also intermittently time out with `-Xcomp`. >> >> >> java.lang.RuntimeException: Test failed in compiler.vectorization.runner.ArrayIndexFillTest.fillByteArray: Method is not compiled after 30s. 
>> at compiler.vectorization.runner.VectorizationTestRunner.run(VectorizationTestRunner.java:72) >> at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:200) >> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) >> at java.base/java.lang.reflect.Method.invoke(Method.java:577) >> at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) >> at java.base/java.lang.Thread.run(Thread.java:833) >> >> >> Running with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`. Similar failures happen with the other tests. >> >> There are also failures in the pre-submit tests. > > Hi @TobiHartmann , thanks for your test work. I have already noticed the failure issues. So far, all the failures I have found are under `compiler/vectorization/runner` - these are new tests added by me from this patch. Cause is that in my new test framework, I use `WhiteBox` APIs to do compilation level control for the correctness check. But it may not work if additional compiler control VM options are specified. I will fix it soon. Hi @pfustc , thanks will check the behavior on AVX-512 target with your patch. ------------- PR: https://git.openjdk.java.net/jdk/pull/6828 From phedlin at openjdk.java.net Wed Dec 15 14:12:59 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Wed, 15 Dec 2021 14:12:59 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: <5EmZGX3oIMMll_eKwWVNV1xkPCQRwXjnb_QMRb2tcjA=.c00aa130-d0ef-49bb-b513-32e4709b28cb@github.com> On Wed, 15 Dec 2021 10:37:04 GMT, Andrew Haley wrote: > I don't think this should go straight into the 18 release branch. It looks OK for mainline. Any particular reason it should not be included in JDK-18? 
------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From aph at openjdk.java.net Wed Dec 15 15:03:08 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Dec 2021 15:03:08 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: <5EmZGX3oIMMll_eKwWVNV1xkPCQRwXjnb_QMRb2tcjA=.c00aa130-d0ef-49bb-b513-32e4709b28cb@github.com> References: <5EmZGX3oIMMll_eKwWVNV1xkPCQRwXjnb_QMRb2tcjA=.c00aa130-d0ef-49bb-b513-32e4709b28cb@github.com> Message-ID: <2uKBlDzz5yUtlKD3lUXJj7nX1IW620fZomAKHzHEXvg=.da646d0d-a28f-4605-ab24-93a245a85e9b@github.com> On Wed, 15 Dec 2021 14:09:25 GMT, Patric Hedlin wrote: > > I don't think this should go straight into the 18 release branch. It looks OK for mainline. > > Any particular reason it should not be included in JDK-18? We're in RDP1 since December 9. This means that any enhancement is covered by the Late-Enhancement Request Process. https://openjdk.java.net/jeps/3#Late-Enhancement-Request-Process . I don't think this patch is urgent enough for that. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From fgao at openjdk.java.net Wed Dec 15 15:07:39 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 15 Dec 2021 15:07:39 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v5] In-Reply-To: References: Message-ID: > The patch aims to help optimize Math.abs() mainly from these three parts: > 1) Remove redundant instructions for abs with constant values > 2) Remove redundant instructions for abs with char type > 3) Convert some common abs operations to ideal forms > > 1. Remove redundant instructions for abs with constant values > > If we can decide the value of the input node for function Math.abs() > at compile-time, we can substitute the Abs node with the absolute > value of the constant and don't have to calculate it at runtime. 
> > For example, > int[] a > for (int i = 0; i < SIZE; i++) { > a[i] = Math.abs(-38); > } > > Before the patch, the generated code for the testcase above is: > ... > mov w10, #0xffffffda > cmp w10, wzr > cneg w17, w10, lt > dup v16.8h, w17 > ... > After the patch, the generated code for the testcase above is : > ... > movi v16.4s, #0x26 > ... > > 2. Remove redundant instructions for abs with char type > > In Java semantics, as the char type is always non-negative, we > could actually remove the absI node in the C2 middle end. > > As for vectorization part, in current SLP, the vectorization of > Math.abs() with char type is intentionally disabled after > JDK-8261022 because it generates incorrect result before. After > removing the AbsI node in the middle end, Math.abs(char) can be > vectorized naturally. > > For example, > > char[] a; > char[] b; > for (int i = 0; i < SIZE; i++) { > b[i] = (char) Math.abs(a[i]); > } > > Before the patch, the generated assembly code for the testcase > above is: > > B15: > add x13, x21, w20, sxtw #1 > ldrh w11, [x13, #16] > cmp w11, wzr > cneg w10, w11, lt > strh w10, [x13, #16] > ldrh w10, [x13, #18] > cmp w10, wzr > cneg w10, w10, lt > strh w10, [x13, #18] > ... > add w20, w20, #0x1 > cmp w20, w17 > b.lt B15 > > After the patch, the generated assembly code is: > B15: > sbfiz x18, x19, #1, #32 > add x0, x14, x18 > ldr q16, [x0, #16] > add x18, x21, x18 > str q16, [x18, #16] > ldr q16, [x0, #32] > str q16, [x18, #32] > ... > add w19, w19, #0x40 > cmp w19, w17 > b.lt B15 > > 3. Convert some common abs operations to ideal forms > > The patch overrides some virtual support functions for AbsNode > so that optimization of gvn can work on it. Here are the optimizable > forms: > > a) abs(0 - x) => abs(x) > > Before the patch: > ... > ldr w13, [x13, #16] > neg w13, w13 > cmp w13, wzr > cneg w14, w13, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... 
> > b) abs(abs(x)) => abs(x) > > Before the patch: > ... > ldr w12, [x12, #16] > cmp w12, wzr > cneg w12, w12, lt > cmp w12, wzr > cneg w12, w12, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - Add a jmh benchmark case Change-Id: I64938d543126c2e3f9fad8ffc4a50e25e4473d8f - Merge branch 'master' of github.com:fg1417/jdk into fg8276673 Change-Id: I71987594e9288a489a04de696e69a62f4ad19357 - Merge branch 'master' into fg8276673 Change-Id: I5e3898054b75f49653b8c3b37e4f5007675fa963 - 8276673: Optimize abs operations in C2 compiler The patch aims to help optimize Math.abs() mainly from these three parts: 1) Remove redundant instructions for abs with constant values 2) Remove redundant instructions for abs with char type 3) Convert some common abs operations to ideal forms 1. Remove redundant instructions for abs with constant values If we can decide the value of the input node for function Math.abs() at compile-time, we can substitute the Abs node with the absolute value of the constant and don't have to calculate it at runtime. For example, int[] a for (int i = 0; i < SIZE; i++) { a[i] = Math.abs(-38); } Before the patch, the generated code for the testcase above is: ... mov w10, #0xffffffda cmp w10, wzr cneg w17, w10, lt dup v16.8h, w17 ... After the patch, the generated code for the testcase above is : ... movi v16.4s, #0x26 ... 2. Remove redundant instructions for abs with char type In Java semantics, as the char type is always non-negative, we could actually remove the absI node in the C2 middle end. As for vectorization part, in current SLP, the vectorization of Math.abs() with char type is intentionally disabled after JDK-8261022 because it generates incorrect result before. After removing the AbsI node in the middle end, Math.abs(char) can be vectorized naturally. 
For example, char[] a; char[] b; for (int i = 0; i < SIZE; i++) { b[i] = (char) Math.abs(a[i]); } Before the patch, the generated assembly code for the testcase above is: B15: add x13, x21, w20, sxtw #1 ldrh w11, [x13, #16] cmp w11, wzr cneg w10, w11, lt strh w10, [x13, #16] ldrh w10, [x13, #18] cmp w10, wzr cneg w10, w10, lt strh w10, [x13, #18] ... add w20, w20, #0x1 cmp w20, w17 b.lt B15 After the patch, the generated assembly code is: B15: sbfiz x18, x19, #1, #32 add x0, x14, x18 ldr q16, [x0, #16] add x18, x21, x18 str q16, [x18, #16] ldr q16, [x0, #32] str q16, [x18, #32] ... add w19, w19, #0x40 cmp w19, w17 b.lt B15 3. Convert some common abs operations to ideal forms The patch overrides some virtual support functions for AbsNode so that optimization of gvn can work on it. Here are the optimizable forms: a) abs(0 - x) => abs(x) Before the patch: ... ldr w13, [x13, #16] neg w13, w13 cmp w13, wzr cneg w14, w13, lt ... After the patch: ... ldr w13, [x13, #16] cmp w13, wzr cneg w13, w13, lt ... b) abs(abs(x)) => abs(x) Before the patch: ... ldr w12, [x12, #16] cmp w12, wzr cneg w12, w12, lt cmp w12, wzr cneg w12, w12, lt ... After the patch: ... ldr w13, [x13, #16] cmp w13, wzr cneg w13, w13, lt ... 
Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd ------------- Changes: https://git.openjdk.java.net/jdk/pull/6755/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6755&range=04 Stats: 230 lines in 5 files changed: 223 ins; 2 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6755.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6755/head:pull/6755 PR: https://git.openjdk.java.net/jdk/pull/6755 From fgao at openjdk.java.net Wed Dec 15 15:07:40 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 15 Dec 2021 15:07:40 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> Message-ID: On Mon, 13 Dec 2021 08:25:56 GMT, Jie Fu wrote: > > The PR optimizes abs operations in the C2 middle end. Can I have your review please? > > So what's the performance data before and after this patch? Does it also benefit on x86? > > It would be better to provide a jmh micro benchmark. Thanks. Thanks, @DamonFool . Yes, it's supposed to benefit all archs. For example, here is the performance data on x86. Before the patch: Benchmark (seed) Mode Cnt Score Error Units MathBench.absConstantInt 0 thrpt 5 291960.380 ? 10724.572 ops/ms After the patch: Benchmark (seed) Mode Cnt Score Error Units MathBench.absConstantInt 0 thrpt 5 336271.533 ? 3778.210 ops/ms The jmh micro benchmark testcase has been added in the latest commit. 
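
The simplifications discussed in this thread (abs(0 - x) => abs(x), abs(abs(x)) => abs(x), and dropping abs on a char input) rely on value-level identities of Math.abs that can be checked from plain Java. The following is a minimal sketch, not taken from the patch: the class name AbsIdentities and its helper methods are hypothetical, and the code only checks that both sides of each identity agree, including for Integer.MIN_VALUE. It says nothing about the code C2 actually generates.

```java
public class AbsIdentities {
    // Left-hand side of form a): abs(0 - x).
    static int absOfNegated(int x) {
        return Math.abs(0 - x);
    }

    // Left-hand side of form b): abs(abs(x)).
    static int absOfAbs(int x) {
        return Math.abs(Math.abs(x));
    }

    // abs on a char input: the char is promoted to a non-negative int first.
    static char absOfChar(char c) {
        return (char) Math.abs(c);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (absOfNegated(x) != Math.abs(x)) {
                throw new AssertionError("abs(0 - x) != abs(x) for x = " + x);
            }
            if (absOfAbs(x) != Math.abs(x)) {
                throw new AssertionError("abs(abs(x)) != abs(x) for x = " + x);
            }
        }
        // Every char value is non-negative once promoted to int, so abs is a no-op.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            if (absOfChar(c) != c) {
                throw new AssertionError("abs changed char value " + i);
            }
        }
        System.out.println("All identities hold for the sampled inputs.");
    }
}
```

Note that Integer.MIN_VALUE is a fixed point of both negation and Math.abs in two's-complement int arithmetic, which is why the identities still hold at that boundary.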
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From phedlin at openjdk.java.net Wed Dec 15 15:11:08 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Wed, 15 Dec 2021 15:11:08 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: <2uKBlDzz5yUtlKD3lUXJj7nX1IW620fZomAKHzHEXvg=.da646d0d-a28f-4605-ab24-93a245a85e9b@github.com> References: <5EmZGX3oIMMll_eKwWVNV1xkPCQRwXjnb_QMRb2tcjA=.c00aa130-d0ef-49bb-b513-32e4709b28cb@github.com> <2uKBlDzz5yUtlKD3lUXJj7nX1IW620fZomAKHzHEXvg=.da646d0d-a28f-4605-ab24-93a245a85e9b@github.com> Message-ID: On Wed, 15 Dec 2021 14:59:47 GMT, Andrew Haley wrote: > We're in RDP1 since December 9. This means that any enhancement is covered by the Late-Enhancement Request Process. https://openjdk.java.net/jeps/3#Late-Enhancement-Request-Process . I don't think this patch is urgent enough for that. It has been classified as a performance regression (bug) in line with the x86 issue (JDK-8274242). Do you mean we should change this _now_? Aarch64 would be the only platform not to address the issue in JDK-18. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From aph at openjdk.java.net Wed Dec 15 15:27:06 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 15 Dec 2021 15:27:06 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: <5EmZGX3oIMMll_eKwWVNV1xkPCQRwXjnb_QMRb2tcjA=.c00aa130-d0ef-49bb-b513-32e4709b28cb@github.com> <2uKBlDzz5yUtlKD3lUXJj7nX1IW620fZomAKHzHEXvg=.da646d0d-a28f-4605-ab24-93a245a85e9b@github.com> Message-ID: On Wed, 15 Dec 2021 15:07:41 GMT, Patric Hedlin wrote: > > We're in RDP1 since December 9. This means that any enhancement is covered by the Late-Enhancement Request Process. https://openjdk.java.net/jeps/3#Late-Enhancement-Request-Process . I don't think this patch is urgent enough for that. 
> > It has been classified as a performance regression (bug) in line with the x86 issue (JDK-8274242). Do you mean we should change this _now_? Aarch64 would be the only platform not to address the issue in JDK-18. I see your point, and that is a significant difference. It's unfortunate that AArch64 got this patch late in the process, but as it's a bug, and Java performance isn't supposed to regress in a release, it does make sense to fix it. Having said that, I've had a lot of bad experiences with late patches, and it'd be a dreadful shame if we broke AArch64 string handling. I suppose the best way to proceed with this is to have two AArch64 reviewers go through the patch instruction by instruction, just to make sure the test covers all corner cases. Thanks. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From duke at openjdk.java.net Wed Dec 15 15:47:13 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Wed, 15 Dec 2021 15:47:13 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive Message-ID: IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modifies set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. Some changes have been made in the verifiers along the way. 
Some amount of refactoring, and even added invariants (see validate_edge_mutiality). ------------- Commit messages: - Use delayed asserts - add expand_with_neighborhood - first use verify_local - add verify_local - Assert edge mutuality in IR::verify - rename PredecessorValidator - XentryFlag closure split from PredecessorValidator - Comment: validation goals - add validate edge mutuality - add blockEnd::is_sux - ... and 6 more: https://git.openjdk.java.net/jdk/compare/3f9638d1...716803bf Changes: https://git.openjdk.java.net/jdk/pull/6850/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8258603 Stats: 163 lines in 4 files changed: 134 ins; 14 del; 15 mod Patch: https://git.openjdk.java.net/jdk/pull/6850.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6850/head:pull/6850 PR: https://git.openjdk.java.net/jdk/pull/6850 From rriggs at openjdk.java.net Wed Dec 15 16:12:56 2021 From: rriggs at openjdk.java.net (Roger Riggs) Date: Wed, 15 Dec 2021 16:12:56 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 10:45:28 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > The motivation is found in the original x86 issue ([JDK-8274242](https://bugs.openjdk.java.net/browse/JDK-8274242)). > > Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case. > > - Interleaved ISO and ASCII check code. > - Avoid 'umaxv' in the ISO main flow. > - Using post inc in main loop. > - Retain 8-char loop. > - Removing conditional prefetch (no upside). > - Adding ISO-8859-1 to encode-decode benchmark. 
>
> Testing: tier1-6
>
> The revised version compares like this (master vs. update).
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30   17.920 ± 0.229  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   18.867 ± 0.356  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30   17.419 ± 0.220  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.200 ± 0.134  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30   17.149 ± 0.219  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.115 ± 1.440  us/op
>
>
> Benchmark                   (size)       (type)  Mode  Cnt    Score   Error  Units
> CharsetEncodeDecode.encode   16384        UTF-8  avgt   30    9.018 ± 0.179  us/op
> CharsetEncodeDecode.encode   16384         BIG5  avgt   30   10.550 ± 0.470  us/op
> CharsetEncodeDecode.encode   16384  ISO-8859-15  avgt   30    8.843 ± 0.187  us/op
> CharsetEncodeDecode.encode   16384   ISO-8859-1  avgt   30    6.406 ± 0.155  us/op
> CharsetEncodeDecode.encode   16384        ASCII  avgt   30    8.822 ± 0.173  us/op
> CharsetEncodeDecode.encode   16384       UTF-16  avgt   30  135.195 ± 1.432  us/op

It's viable to commit to the main line and allow it to have some bake time before requesting it to be backported. That would allow some time to build confidence about the change; it might not make the first JDK 18 release but would come in later. There is no problem requesting approval for the change but it should go through that process.
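
For readers following the quoted description, the Java-level operation behind the ISO/ASCII check is whether every char of the input fits the target charset. The sketch below is only an illustration using the public java.nio.charset API; the class name AsciiCompatibleEncoders and its helpers are hypothetical, and it does not reproduce the intrinsic or the CharsetEncodeDecode benchmark.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCompatibleEncoders {
    // True if every char of s is in the 7-bit US-ASCII range.
    static boolean fitsAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    // True if every char of s is in the 8-bit ISO-8859-1 range.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // Encodes s as ISO-8859-1; unmappable chars are replaced, not rejected.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String plain = "hello";
        String accented = "caf\u00e9";   // U+00E9 is in ISO-8859-1 but not in ASCII
        String euro = "\u20ac";          // U+20AC is in neither charset

        System.out.println("ASCII accepts plain:     " + fitsAscii(plain));
        System.out.println("ASCII accepts accented:  " + fitsAscii(accented));
        System.out.println("Latin-1 accepts accented: " + fitsLatin1(accented));
        System.out.println("Latin-1 accepts euro:     " + fitsLatin1(euro));
        System.out.println("Latin-1 bytes for accented: " + encodeLatin1(accented).length);
    }
}
```

Both charsets map each accepted char to exactly one byte; they differ only in the upper bound of the accepted range, which is why a single fast path can serve both with an extra range check.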
-------------

PR: https://git.openjdk.java.net/jdk18/pull/20

From dlong at openjdk.java.net Wed Dec 15 16:25:59 2021
From: dlong at openjdk.java.net (Dean Long)
Date: Wed, 15 Dec 2021 16:25:59 GMT
Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert
In-Reply-To:
References:
Message-ID:

On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote:

> The bug and fix were discussed in a previous PR:
>
> https://github.com/openjdk/jdk/pull/6572
>
> I pushed all commits from that PR on top of jdk 18 and added a couple
> extra tests as suggested in:
>
> https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590

Marked as reviewed by dlong (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk18/pull/29

From duke at openjdk.java.net Wed Dec 15 16:46:17 2021
From: duke at openjdk.java.net (Mai Đặng Quân Anh)
Date: Wed, 15 Dec 2021 16:46:17 GMT
Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610
In-Reply-To:
References:
Message-ID:

On Wed, 15 Dec 2021 16:37:19 GMT, Mai Đặng Quân Anh wrote:

> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. The patch simply changes the argument of vector cast methods to correct types.
>
> Thank you very much.

@PaulSandoz @theRealELiu @chhagedorn Could you please test and review this fix. Thank you very much.
-------------

PR: https://git.openjdk.java.net/jdk/pull/6852

From duke at openjdk.java.net Wed Dec 15 16:46:17 2021
From: duke at openjdk.java.net (Mai Đặng Quân Anh)
Date: Wed, 15 Dec 2021 16:46:17 GMT
Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610
Message-ID:

The problem is that loading a vector from a byte array requires the vector shape to support byte vectors before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, which leads to an IR verification failure. The patch simply changes the arguments of the vector cast methods to the correct types.

Thank you very much.

-------------

Commit messages:
 - concretize vector cast methods arguments

Changes: https://git.openjdk.java.net/jdk/pull/6852/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6852&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8278623
Stats: 439 lines in 10 files changed: 123 ins; 69 del; 247 mod
Patch: https://git.openjdk.java.net/jdk/pull/6852.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/6852/head:pull/6852

PR: https://git.openjdk.java.net/jdk/pull/6852

From svkamath at openjdk.java.net Wed Dec 15 17:56:52 2021
From: svkamath at openjdk.java.net (Smita Kamath)
Date: Wed, 15 Dec 2021 17:56:52 GMT
Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297
In-Reply-To:
References:
Message-ID:

On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote:

> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs.

Thanks for your comments, Vladimir. I will make the change and move the check.
------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From kvn at openjdk.java.net Wed Dec 15 18:03:55 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 18:03:55 GMT Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote: > The bug and fix were discussed in a previous PR: > > https://github.com/openjdk/jdk/pull/6572 > > I pushed all commits from that PR on top of jdk 18 and added a couple > extra tests as suggested in: > > https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590 I agree with this conservative fix. John agreed with it too. @dean-long please run testing for it before Roland pushed it. Also file RFE to clean these up - we will assign an engineer to look on it later. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/29 From sviswanathan at openjdk.java.net Wed Dec 15 18:43:53 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 15 Dec 2021 18:43:53 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:11:20 GMT, Jatin Bhateja wrote: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin Marked as reviewed by sviswanathan (Reviewer). @jatin-bhateja Thanks for fixing this issue. The patch looks good to me. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From kvn at openjdk.java.net Wed Dec 15 19:06:53 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 19:06:53 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:11:20 GMT, Jatin Bhateja wrote: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin Looks fine to me too. I will run testing. @jatin-bhateja please, update Description in bug report (it is empty now). ------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From kvn at openjdk.java.net Wed Dec 15 21:20:58 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 21:20:58 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 19:18:47 GMT, Jatin Bhateja wrote: > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin I ran these tests with `-XX:-TieredCompilation -Xbatch -XX:UseAVX=3` flags and 32-bit VM on Skylake. 
-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From kvn at openjdk.java.net Wed Dec 15 21:55:57 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 15 Dec 2021 21:55:57 GMT
Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610
In-Reply-To:
References:
Message-ID:

On Wed, 15 Dec 2021 16:37:19 GMT, Mai Đặng Quân Anh wrote:

> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. The patch simply changes the argument of vector cast methods to correct types.
>
> Thank you very much.

Yes, splitting the bytes testing is a good idea because jtreg's `@requires` filter uses command-line flags. I thought that the Vector API filters would filter such tests. But it looks like `VectorReshapeHelper.runMainHelper()` does not run with command-line flags. I used `@run main` to solve that but the filters still do not work. I am not sure whether it is a bug or a feature.

Anyway, I ran your patch locally and testing passed. I submitted regular testing and I will let you know the results before approval.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6852

From psandoz at openjdk.java.net Wed Dec 15 22:06:58 2021
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Wed, 15 Dec 2021 22:06:58 GMT
Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610
In-Reply-To:
References:
Message-ID:

On Wed, 15 Dec 2021 16:37:19 GMT, Mai Đặng Quân Anh wrote:

> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. The patch simply changes the argument of vector cast methods to correct types.
>
> Thank you very much.
I think this update likely works around a bug when using `-XX:+UseKNLSetting` and the loading of 512-bit vectors, such as `IntVector` from `byte[]`. Hopefully the testing will confirm the workaround. @jatin-bhateja could review the set of tests for each AVX variant, esp. for 512 bits and check if we are missing cases? ------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From kvn at openjdk.java.net Wed Dec 15 22:08:59 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 22:08:59 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:11:20 GMT, Jatin Bhateja wrote: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/28 From svkamath at openjdk.java.net Wed Dec 15 22:53:06 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Wed, 15 Dec 2021 22:53:06 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 20:05:59 GMT, Vladimir Kozlov wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Yes, we need to reexecute because code could be deoptimized during `new_array()` allocation. > But why we allocate this temp array in Java heap? Why not on stack in stub code? 
> > Also I noticed next return from intrinsics code could be moved up before we generate new nodes in graph: `if (Matcher::htbl_entries == -1) return false;` @vnkozlov Allocating the array in the stub will cause few changes on x86-64 side as well as change in aarch64 stubGenerator code as subkeyHtbl_48_entries will no longer be passed as an argument. Do let me know if you think it is okay to proceed with these changes. Thank you. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From dlong at openjdk.java.net Wed Dec 15 22:56:10 2021 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 15 Dec 2021 22:56:10 GMT Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote: > The bug and fix were discussed in a previous PR: > > https://github.com/openjdk/jdk/pull/6572 > > I pushed all commits from that PR on top of jdk 18 and added a couple > extra tests as suggested in: > > https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590 I filed JDK-8278873 and started testing. ------------- PR: https://git.openjdk.java.net/jdk18/pull/29 From psandoz at openjdk.java.net Wed Dec 15 23:05:02 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 15 Dec 2021 23:05:02 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 16:37:19 GMT, Mai ??ng Qu?n Anh wrote: > The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. The patch simply changes the argument of vector cast methods to correct types. > > Thanh you very much. 
test/hotspot/jtreg/compiler/vectorapi/reshape/utils/VectorReshapeHelper.java line 311: > 309: @ForceInline > 310: private static void writeVector(VectorSpecies osp, Vector vector, Object output) { > 311: var otype = osp.elementType(); This code suggests we are missing an Object-based array store method to complement the Object-based array load method. ------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From kvn at openjdk.java.net Wed Dec 15 23:08:10 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 15 Dec 2021 23:08:10 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: <-plTHCWteo2rESpVYcx6lAh1sG4ofFm3b9qDQbvs28Q=.af21ab52-6f95-4e8e-b9dc-08caec4dd826@github.com> On Tue, 14 Dec 2021 20:05:59 GMT, Vladimir Kozlov wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Yes, we need to reexecute because code could be deoptimized during `new_array()` allocation. > But why we allocate this temp array in Java heap? Why not on stack in stub code? > > Also I noticed next return from intrinsics code could be moved up before we generate new nodes in graph: `if (Matcher::htbl_entries == -1) return false;` > @vnkozlov Allocating the array in the stub will cause few changes on x86-64 side as well as change in aarch64 stubGenerator code as subkeyHtbl_48_entries will no longer be passed as an argument. > Do let me know if you think it is okay to proceed with these changes. Thank you. @smita-kamath Thank you for looking into it. Okay, let's proceed with your current fix for JDK 18. Please file an RFE for JDK 19 to rework the code. Meanwhile, I will test this fix. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From duke at openjdk.java.net Thu Dec 16 00:03:23 2021 From: duke at openjdk.java.net (Vamsi Parasa) Date: Thu, 16 Dec 2021 00:03:23 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() Message-ID: Vectorization support of Integer.bitCount() already exists but currently the same support is lacking for Long.bitCount(). Similar to the C2 PopCountVI node, we created a C2 PopCountVL node and used vpopcntq x86 instruction to enable vectorized Long.bitCount(). ------------- Commit messages: - 8278868:Add x86 vectorization support for Long.bitCount() Changes: https://git.openjdk.java.net/jdk/pull/6857/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6857&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278868 Stats: 154 lines in 10 files changed: 152 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6857.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6857/head:pull/6857 PR: https://git.openjdk.java.net/jdk/pull/6857 From psandoz at openjdk.java.net Thu Dec 16 00:04:05 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 16 Dec 2021 00:04:05 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:11:20 GMT, Jatin Bhateja wrote: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin The changes to the tests look good, ideally we should test over all lane indexes, but i believe the insert intrinsic currently requires the lane index be a constant. Unsure if that is a restriction that can be lifted. 
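To make the point about lane coverage concrete, here is a minimal scalar sketch. It does not use the Vector API or the intrinsic: `WithLaneModel`, `withLane` and `checkAllLanes` are hypothetical names that only model the expected semantics (a copy of the vector with exactly one lane replaced) and check them for every lane index, including the non-zero ones the updated tests now exercise.

```java
// Hypothetical scalar reference model, not jdk.incubator.vector code.
final class WithLaneModel {
    // Returns a copy of v in which lane i holds e; all other lanes are unchanged.
    static float[] withLane(float[] v, int i, float e) {
        float[] r = v.clone();
        r[i] = e;
        return r;
    }

    // Checks the model at every lane index, not only at index 0.
    static boolean checkAllLanes(float[] v, float e) {
        for (int i = 0; i < v.length; i++) {
            float[] r = withLane(v, i, e);
            for (int j = 0; j < v.length; j++) {
                float expected = (j == i) ? e : v[j];
                if (Float.compare(r[j], expected) != 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] v = {1f, 2f, 3f, 4f}; // four lanes, as in a 128-bit float vector
        System.out.println(checkAllLanes(v, 9f));
    }
}
```

A real test would compare the result of `FloatVector.withLane` against such a model; whether the lane index can be a loop variable there depends on the constant-index restriction mentioned above.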
------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From duke at openjdk.java.net Thu Dec 16 01:24:45 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Thu, 16 Dec 2021 01:24:45 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v4] In-Reply-To: References: Message-ID: > Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. > > `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. Zhiqiang Zang has updated the pull request incrementally with two additional commits since the last revision: - Add missing optimizations "(A+X) - (X+B)" and "(X+A) - (B+X)" for long type. - Include an ir test to verify the removed optimizations really happended already. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6752/files - new: https://git.openjdk.java.net/jdk/pull/6752/files/d4d2b180..841d404f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6752&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6752&range=02-03 Stats: 86 lines in 3 files changed: 86 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6752.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6752/head:pull/6752 PR: https://git.openjdk.java.net/jdk/pull/6752 From duke at openjdk.java.net Thu Dec 16 01:29:04 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Thu, 16 Dec 2021 01:29:04 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v3] In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 04:02:12 GMT, Vladimir Kozlov wrote: > Before you integrate this, please, add IR framework test which checks that conversions are really happened. Especially ones you removed. Thanks a lot @vnkozlov for the suggestion of IR test. I added a test and found long type not working. 
Then I realized "long" did not have two conversions as "int" so I added them. These IR tests helped! Thank you! ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From kvn at openjdk.java.net Thu Dec 16 01:37:02 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 01:37:02 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v4] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 01:24:45 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. > > Zhiqiang Zang has updated the pull request incrementally with two additional commits since the last revision: > > - Add missing optimizations "(A+X) - (X+B)" and "(X+A) - (B+X)" for long type. > - Include an ir test to verify the removed optimizations really happended already. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6752 From jiefu at openjdk.java.net Thu Dec 16 01:48:00 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 16 Dec 2021 01:48:00 GMT Subject: RFR: 8278471: Remove unreached rules in AddNode::IdealIL [v4] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 01:24:45 GMT, Zhiqiang Zang wrote: >> Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. >> >> `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. 
> > Zhiqiang Zang has updated the pull request incrementally with two additional commits since the last revision: > > - Add missing optimizations "(A+X) - (X+B)" and "(X+A) - (B+X)" for long type. > - Include an ir test to verify the removed optimizations really happended already. Marked as reviewed by jiefu (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From duke at openjdk.java.net Thu Dec 16 04:01:07 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Thu, 16 Dec 2021 04:01:07 GMT Subject: Integrated: 8278471: Remove unreached rules in AddNode::IdealIL In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 23:34:06 GMT, Zhiqiang Zang wrote: > Reorder optimizations in addnode so special cases appear before general cases; otherwise the special cases would be never covered. > > `(a - b) + (c - d)` subsumes both `(a - b) + (b - c)` and `(a - b) + (c - a)`. Therefore `(a - b) + (b - c)` and `(a - b) + (c - a)` have to be placed before `(a - b) + (c - d)` so that they can work. This pull request has now been integrated. Changeset: f6fbb5a8 Author: Zhiqiang Zang Committer: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/f6fbb5a80cfe630e76917397d21649709485d31d Stats: 96 lines in 4 files changed: 86 ins; 10 del; 0 mod 8278471: Remove unreached rules in AddNode::IdealIL Reviewed-by: jiefu, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/6752 From kvn at openjdk.java.net Thu Dec 16 06:53:58 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 06:53:58 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 16:37:19 GMT, Mai ??ng Qu?n Anh wrote: > The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. 
The patch simply changes the argument of vector cast methods to correct types. > > Thanh you very much. Tier1-3 testing passed without issues. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6852 From kvn at openjdk.java.net Thu Dec 16 07:26:57 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 07:26:57 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Unfortunately I hit assert during CTW testing on Windows-x64 when compiling `com.sun.crypto.provider.GaloisCounterMode::implGCMCrypt`. # Internal Error (t:\workspace\open\src\hotspot\share\opto\graphKit.cpp:250), pid=41320, tid=43308 # assert(ex_map->jvms()->same_calls_as(_exceptions->jvms())) failed: all collected exceptions must come from the same place Current CompileTask: C2: 7834 2950 b 4 com.sun.crypto.provider.GaloisCounterMode::implGCMCrypt (100 bytes) Stack: [0x000000c86a600000,0x000000c86a700000] Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [jvm.dll+0xbb90d1] os::platform_print_native_stack+0xf1 (os_windows_x86.cpp:235) V [jvm.dll+0xdf904e] VMError::report+0x101e (vmError.cpp:828) V [jvm.dll+0xdfaa4e] VMError::report_and_die+0x7fe (vmError.cpp:1656) V [jvm.dll+0xdfb1d4] VMError::report_and_die+0x64 (vmError.cpp:1437) V [jvm.dll+0x536f47] report_vm_error+0xb7 (debug.cpp:280) V [jvm.dll+0x6da439] GraphKit::add_exception_states_from+0x119 (graphKit.cpp:286) V [jvm.dll+0x414375] PredicatedIntrinsicGenerator::generate+0x7f5 (callGenerator.cpp:1358) V [jvm.dll+0x5c3b62] Parse::do_call+0x9c2 (doCall.cpp:651) V [jvm.dll+0xbe3745] Parse::do_one_bytecode+0x32b5 
(parse2.cpp:2704) V [jvm.dll+0xbd5ae7] Parse::do_one_block+0x437 (parse1.cpp:1557) V [jvm.dll+0xbd463c] Parse::do_all_blocks+0x5cc (parse1.cpp:710) V [jvm.dll+0xbd0c3d] Parse::Parse+0xc1d (parse1.cpp:616) V [jvm.dll+0x413a15] ParseGenerator::generate+0xa5 (callGenerator.cpp:103) V [jvm.dll+0x4eba90] Compile::Compile+0x1110 (compile.cpp:714) The call stack has `PredicatedIntrinsicGenerator` so it seems related to changes. I got replay file and will try to reproduce it tomorrow. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From jiefu at openjdk.java.net Thu Dec 16 08:52:55 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 16 Dec 2021 08:52:55 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> Message-ID: On Wed, 15 Dec 2021 15:02:56 GMT, Fei Gao wrote: >>> The PR optimizes abs operations in the C2 middle end. Can I have your review please? >> >> So what's the performance data before and after this patch? >> Does it also benefit on x86? >> >> It would be better to provide a jmh micro benchmark. >> Thanks. > >> > The PR optimizes abs operations in the C2 middle end. Can I have your review please? >> >> So what's the performance data before and after this patch? Does it also benefit on x86? >> >> It would be better to provide a jmh micro benchmark. Thanks. > > Thanks, @DamonFool . Yes, it's supposed to benefit all archs. > For example, here is the performance data on x86. > > Before the patch: > Benchmark (seed) Mode Cnt Score Error Units > MathBench.absConstantInt 0 thrpt 5 291960.380 ? 10724.572 ops/ms > > After the patch: > Benchmark (seed) Mode Cnt Score Error Units > MathBench.absConstantInt 0 thrpt 5 336271.533 ? 3778.210 ops/ms > > The jmh micro benchmark testcase has been added in the latest commit. Hi @fg1417 , Thanks for your update. 
Now I see that you are trying to optimize the following three abs() patterns: 1) Math.abs(-38) 2) (char) Math.abs((char) c) 3) Math.abs(0 - x) But did you see these code patterns in real programs? I'm a bit worried that we would only increase the complexity of C2 with (almost) no performance gain in the real world. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From roland at openjdk.java.net Thu Dec 16 08:53:23 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 16 Dec 2021 08:53:23 GMT Subject: [jdk18] RFR: 8278790: Inner loop of long loop nest runs for too few iterations Message-ID: Given a counted loop that iterates in [A, Z), when long range checks are transformed into int range checks, a loop nest is created and the inner loop iterates in [0, Z2). The limits of the inner loop are adjusted to guarantee no overflow for the range of values of the inner loop. That is for a range check: i * scale + offset References: Message-ID: > Given a counted loop that iterates in [A, Z), when long range checks > are transformed into int range checks, a loop nest is created and > the inner loop iterates in [0, Z2). > > The limits of the inner loop are adjusted to guarantee no overflow for > the range of values of the inner loop. That is for a range check: > > i * scale + offset > 1) the bounds of the inner loop are adjusted to roughly [0, > max_jint/scale). > > Also, we don't want to loose what we know about the bounds of the loop > being transformed. > > 2) So the bound of the inner loop are also adjusted to [0, min(Z2, Z - A)) > > The bug here is that 2) is performed before 1). This was spotted with > a micro benchmarks where the initial loop had only ~2000 > iterations. The transformed loop is expected to run for the same 2000 > iterations but instead ran for 2000/scale iterations. Roland Westrelin has refreshed the contents of this pull request, and previous commits have been removed. 
The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: fix ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/35/files - new: https://git.openjdk.java.net/jdk18/pull/35/files/7bd126da..75722665 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=35&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=35&range=00-01 Stats: 200 lines in 9 files changed: 53 ins; 125 del; 22 mod Patch: https://git.openjdk.java.net/jdk18/pull/35.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/35/head:pull/35 PR: https://git.openjdk.java.net/jdk18/pull/35 From chagedorn at openjdk.java.net Thu Dec 16 09:20:56 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 09:20:56 GMT Subject: [jdk18] RFR: 8278790: Inner loop of long loop nest runs for too few iterations [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 09:13:40 GMT, Roland Westrelin wrote: >> Given a counted loop that iterates in [A, Z), when long range checks >> are transformed into int range checks, a loop nest is created and >> the inner loop iterates in [0, Z2). >> >> The limits of the inner loop are adjusted to guarantee no overflow for >> the range of values of the inner loop. That is for a range check: >> >> i * scale + offset > >> 1) the bounds of the inner loop are adjusted to roughly [0, >> max_jint/scale). >> >> Also, we don't want to loose what we know about the bounds of the loop >> being transformed. >> >> 2) So the bound of the inner loop are also adjusted to [0, min(Z2, Z - A)) >> >> The bug here is that 2) is performed before 1). This was spotted with >> a micro benchmarks where the initial loop had only ~2000 >> iterations. The transformed loop is expected to run for the same 2000 >> iterations but instead ran for 2000/scale iterations. 
> > Roland Westrelin has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/35 From chagedorn at openjdk.java.net Thu Dec 16 09:24:53 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 09:24:53 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 21:53:20 GMT, Vladimir Kozlov wrote: > I thought that Vector API filters would filter such tests. But looks like `VectorReshapeHelper.runMainHelper()` does not run with command line flags. I used `@run main` to solve that but filters still does not work. I am not sure it is a bug or feature. Generally, the idea behind recommending `@run driver` for IR tests is to avoid affecting the VM that runs the test framework setup, which does not run the actual user-written `@Test` methods. The framework creates a child VM and adds the JTreg options only there, where the actual testing code is executed. But it sounds like this set of vector tests is an exception, because the actual test selection relies on the JTreg options. So your suggestion of changing it to `@run main` would make sense here if the filters depend on them. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From chagedorn at openjdk.java.net Thu Dec 16 09:39:00 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 09:39:00 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 In-Reply-To: References: Message-ID: <1Ol9KOWrGmqTIDxE-D_HITuSdPyTmQZXfp8LfCSTLec=.234046c6-0781-4e79-a2cc-9efdb99219e1@github.com> On Wed, 15 Dec 2021 16:37:19 GMT, Mai ??ng Qu?n Anh wrote: > The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leads to IR verification failure. The patch simply changes the argument of vector cast methods to correct types. > > Thanh you very much. IR tests look good! test/hotspot/jtreg/compiler/vectorapi/reshape/TestVectorCastAVX512BW.java line 32: > 30: /* > 31: * @test > 32: * @bug 8259610 Should be changed to 8278623 or added as second bug number. test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 34: > 32: /** > 33: * The cast intrinsics implemented on each platform, commented out tests are the ones that are > 34: * supposed to work but currently don't. Maybe you can add again the reason for the commented out `makePair()` calls. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6852 From neliasso at openjdk.java.net Thu Dec 16 09:41:55 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 16 Dec 2021 09:41:55 GMT Subject: [jdk18] RFR: 8278790: Inner loop of long loop nest runs for too few iterations [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 09:13:40 GMT, Roland Westrelin wrote: >> Given a counted loop that iterates in [A, Z), when long range checks >> are transformed into int range checks, a loop nest is created and >> the inner loop iterates in [0, Z2). 
>> >> The limits of the inner loop are adjusted to guarantee no overflow for >> the range of values of the inner loop. That is for a range check: >> >> i * scale + offset > >> 1) the bounds of the inner loop are adjusted to roughly [0, >> max_jint/scale). >> >> Also, we don't want to loose what we know about the bounds of the loop >> being transformed. >> >> 2) So the bound of the inner loop are also adjusted to [0, min(Z2, Z - A)) >> >> The bug here is that 2) is performed before 1). This was spotted with >> a micro benchmarks where the initial loop had only ~2000 >> iterations. The transformed loop is expected to run for the same 2000 >> iterations but instead ran for 2000/scale iterations. > > Roland Westrelin has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. Looks good! ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/35 From aph at openjdk.java.net Thu Dec 16 10:05:02 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 16 Dec 2021 10:05:02 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v5] In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 15:07:39 GMT, Fei Gao wrote: >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. 
>> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. >> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. >> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... 
>> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - Add a jmh benchmark case > > Change-Id: I64938d543126c2e3f9fad8ffc4a50e25e4473d8f > - Merge branch 'master' of github.com:fg1417/jdk into fg8276673 > > Change-Id: I71987594e9288a489a04de696e69a62f4ad19357 > - Merge branch 'master' into fg8276673 > > Change-Id: I5e3898054b75f49653b8c3b37e4f5007675fa963 > - 8276673: Optimize abs operations in C2 compiler > > The patch aims to help optimize Math.abs() mainly from these three parts: > 1) Remove redundant instructions for abs with constant values > 2) Remove redundant instructions for abs with char type > 3) Convert some common abs operations to ideal forms > > 1. Remove redundant instructions for abs with constant values > > If we can decide the value of the input node for function Math.abs() > at compile-time, we can substitute the Abs node with the absolute > value of the constant and don't have to calculate it at runtime. > > For example, > int[] a > for (int i = 0; i < SIZE; i++) { > a[i] = Math.abs(-38); > } > > Before the patch, the generated code for the testcase above is: > ... > mov w10, #0xffffffda > cmp w10, wzr > cneg w17, w10, lt > dup v16.8h, w17 > ... > After the patch, the generated code for the testcase above is : > ... > movi v16.4s, #0x26 > ... > > 2. Remove redundant instructions for abs with char type > > In Java semantics, as the char type is always non-negative, we > could actually remove the absI node in the C2 middle end. 
> > As for vectorization part, in current SLP, the vectorization of > Math.abs() with char type is intentionally disabled after > JDK-8261022 because it generates incorrect result before. After > removing the AbsI node in the middle end, Math.abs(char) can be > vectorized naturally. > > For example, > > char[] a; > char[] b; > for (int i = 0; i < SIZE; i++) { > b[i] = (char) Math.abs(a[i]); > } > > Before the patch, the generated assembly code for the testcase > above is: > > B15: > add x13, x21, w20, sxtw #1 > ldrh w11, [x13, #16] > cmp w11, wzr > cneg w10, w11, lt > strh w10, [x13, #16] > ldrh w10, [x13, #18] > cmp w10, wzr > cneg w10, w10, lt > strh w10, [x13, #18] > ... > add w20, w20, #0x1 > cmp w20, w17 > b.lt B15 > > After the patch, the generated assembly code is: > B15: > sbfiz x18, x19, #1, #32 > add x0, x14, x18 > ldr q16, [x0, #16] > add x18, x21, x18 > str q16, [x18, #16] > ldr q16, [x0, #32] > str q16, [x18, #32] > ... > add w19, w19, #0x40 > cmp w19, w17 > b.lt B15 > > 3. Convert some common abs operations to ideal forms > > The patch overrides some virtual support functions for AbsNode > so that optimization of gvn can work on it. Here are the optimizable > forms: > > a) abs(0 - x) => abs(x) > > Before the patch: > ... > ldr w13, [x13, #16] > neg w13, w13 > cmp w13, wzr > cneg w14, w13, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... > > b) abs(abs(x)) => abs(x) > > Before the patch: > ... > ldr w12, [x12, #16] > cmp w12, wzr > cneg w12, w12, lt > cmp w12, wzr > cneg w12, w12, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... > > Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd src/hotspot/share/opto/subnode.cpp line 1854: > 1852: // Special case for min_jint: Math.abs(min_jint) = min_jint. > 1853: // Do not use C++ abs() for min_jint to avoid undefined behavior. > 1854: return (ti->is_con(min_jint)) ? 
TypeInt::MIN : TypeInt::make(abs(ti->get_con())); Suggestion: return TypeInt::make(uabs(ti->get_con())); We have uabs() for julong and unsigned int. ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From neliasso at openjdk.java.net Thu Dec 16 10:05:57 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 16 Dec 2021 10:05:57 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote: > On the fallthrough path from an AllocateArray, the length of the > allocated array is casted (with a CastII) to [0, max_size] with > max_size some number that depends on the array type and can be less > than max_jint. > > Allocating an array of a length that's not in [0, max_size] causes the > CastII to become top. The fallthrough path must be killed as well in > that case otherwise this can lead to a broken graph. Currently c2 has > logic to protect against an allocation of array of negative size in > AllocateArrayNode::Ideal(). That call replaces the fallthrough path > with an Halt node. But if the size is too big, then the fallthrough > path is left as is. > > This patch fixes that issues. It also reworks the length negative > case. I added a Bool/CmpU input to the AllocateArray that tests for a > valid length. If that input becomes false, CatchNode::Value() kills > the fallthrough path. That logic is similar to that for a virtual call > with a null receiver. I also removed AllocateArrayNode::Ideal() now > that CatchNode::Value() takes care of the same corner case. The code > in AllocateArrayNode::Ideal() was added by Vladimir and he told me he > tried extending CatchNode::Value() at the time but that caused test > failures. 
I had no issues in my testing so I assume doing it that way > is ok now. > > The new input to AllocateArray is moved to the CallStaticJava runtime > call for array allocation on macro expansion as a precedence edge. The > reason for that is that final graph reshape needs a way to tell > whether the missing path out of the allocation is legal or not. final > graph reshape then removes the then useless precedence edge. Have you run any micros to verify that nothing regresses? ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From mli at openjdk.java.net Thu Dec 16 10:13:00 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Thu, 16 Dec 2021 10:13:00 GMT Subject: RFR: 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 14:58:17 GMT, Hamlin Li wrote: > This is a minor patch to remove some unnecessary code in MethodLiveness::init_basic_blocks. Kindly reminder ~ ------------- PR: https://git.openjdk.java.net/jdk/pull/6799 From neliasso at openjdk.java.net Thu Dec 16 10:26:58 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 16 Dec 2021 10:26:58 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: <_8dk8ZC5p-mFf_Urb-qnNmxvXxbXiFoCU8TajzXESPA=.efb2f26c-d4a2-4db4-9082-5237e71e8529@github.com> On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote: > On the fallthrough path from an AllocateArray, the length of the > allocated array is casted (with a CastII) to [0, max_size] with > max_size some number that depends on the array type and can be less > than max_jint. > > Allocating an array of a length that's not in [0, max_size] causes the > CastII to become top. 
The fallthrough path must be killed as well in > that case otherwise this can lead to a broken graph. Currently c2 has > logic to protect against an allocation of array of negative size in > AllocateArrayNode::Ideal(). That call replaces the fallthrough path > with a Halt node. But if the size is too big, then the fallthrough > path is left as is. > > This patch fixes that issue. It also reworks the negative-length > case. I added a Bool/CmpU input to the AllocateArray that tests for a > valid length. If that input becomes false, CatchNode::Value() kills > the fallthrough path. That logic is similar to that for a virtual call > with a null receiver. I also removed AllocateArrayNode::Ideal() now > that CatchNode::Value() takes care of the same corner case. The code > in AllocateArrayNode::Ideal() was added by Vladimir and he told me he > tried extending CatchNode::Value() at the time but that caused test > failures. I had no issues in my testing so I assume doing it that way > is ok now. > > The new input to AllocateArray is moved to the CallStaticJava runtime > call for array allocation on macro expansion as a precedence edge. The > reason for that is that final graph reshape needs a way to tell > whether the missing path out of the allocation is legal or not. final > graph reshape then removes the then useless precedence edge. In general I really like the change. It simplifies things nicely. 
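As an illustration only (not part of the patch): the description above says the new input is a Bool/CmpU test for a valid length. A minimal Java sketch of why one unsigned comparison can cover both the negative and the too-large case follows. The class name, method name, and the `maxSize` value are hypothetical, and the exact shape of the test in C2 is an assumption here.

```java
// Hypothetical sketch; none of these names come from HotSpot.
final class ValidLengthSketch {
    // Assumes maxSize is non-negative. A negative length, viewed as an
    // unsigned value, is larger than any non-negative maxSize, so a single
    // unsigned comparison rejects both negative and too-large lengths.
    static boolean isValidLength(int length, int maxSize) {
        return Integer.compareUnsigned(length, maxSize) <= 0;
    }

    public static void main(String[] args) {
        int maxSize = 1 << 20; // assumed limit, for illustration only
        System.out.println(isValidLength(16, maxSize));
        System.out.println(isValidLength(-1, maxSize));
        System.out.println(isValidLength(Integer.MAX_VALUE, maxSize));
    }
}
```

When such a test is known to be false at compile time, the fallthrough path cannot be taken, which matches the condition under which the description says CatchNode::Value() kills that path.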
src/hotspot/share/opto/cfgnode.cpp line 2700: > 2698: Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); > 2699: const Type* valid_length_test_t = phase->type(valid_length_test); > 2700: if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { Here you do: Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); const Type* valid_length_test_t = phase->type(valid_length_test); if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { But in compile.cpp:3766 you do: Node* valid_length_test = call->in(call->req()); call->rm_prec(call->req()); if (valid_length_test->find_int_con(1) == 0) { Why "call->req()" and not "call->in(AllocateNode::ValidLengthTest)"? And why not "if (valid_length_test->find_int_con(1) == 0) {" in both places? ------------- Changes requested by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/30 From fgao at openjdk.java.net Thu Dec 16 10:33:54 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 16 Dec 2021 10:33:54 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> Message-ID: <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> On Wed, 15 Dec 2021 15:02:56 GMT, Fei Gao wrote: >>> The PR optimizes abs operations in the C2 middle end. Can I have your review please? >> >> So what's the performance data before and after this patch? >> Does it also benefit on x86? >> >> It would be better to provide a jmh micro benchmark. >> Thanks. > >> > The PR optimizes abs operations in the C2 middle end. Can I have your review please? >> >> So what's the performance data before and after this patch? Does it also benefit on x86? >> >> It would be better to provide a jmh micro benchmark. Thanks. > > Thanks, @DamonFool . Yes, it's supposed to benefit all archs. 
> For example, here is the performance data on x86. > > Before the patch: > Benchmark (seed) Mode Cnt Score Error Units > MathBench.absConstantInt 0 thrpt 5 291960.380 ± 10724.572 ops/ms > > After the patch: > Benchmark (seed) Mode Cnt Score Error Units > MathBench.absConstantInt 0 thrpt 5 336271.533 ± 3778.210 ops/ms > > The jmh micro benchmark testcase has been added in the latest commit. > Hi @fg1417 , > > Thanks for your update. > > Now I see that you are trying to optimize the following three abs() patterns: > > 1. Math.abs(-38) > 2. (char) Math.abs((char) c) > 3. Math.abs(0 - x) > > But did you see these code patterns in real programs? I'm a bit worried that we just increase the complexity of C2 with (almost) no performance gain in the real world. Thanks. Thanks for your review, @DamonFool . I really understand your concern. In terms of complexity, the change only involves AbsNode and doesn't modify any other part of the code. I don't think it will make C2 more complex. As for performance gain in the real world, as we all know, the ability of GVN to optimize a node often depends on the optimized result of its input nodes. For example, if the input node of an AbsNode is recognized as a constant after the last round of GVN optimization, we can now fold abs(constant) into a simple constant value. Like C2 did in https://github.com/openjdk/jdk/blob/0bddd8af61b6c731f16b857c09de57ceefd72d06/src/hotspot/share/opto/subnode.cpp#L55, we may not see -(-x) or (x+y)-y in any java program directly, but it's possible after C2 optimization. Whether applied to sub or to abs, the optimization is trivial and low-cost but useful. Why not apply it :) Math.abs(-38) , (char) Math.abs((char) c) and Math.abs(0 - x) are just conformance testcases. As you said, maybe nobody writes these cases in the real world. These testcases are just simulating all possible scenarios that AbsNode may meet, to guarantee the correctness of the optimization. What do you think :) Thanks. 
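As an illustration only (not taken from the PR or its tests): the identities discussed in this thread, abs(0 - x) => abs(x) and abs(abs(x)) => abs(x), can be checked at the Java level, including the min_jint corner case where Math.abs returns its argument unchanged. The class and method names below are hypothetical.

```java
// Hypothetical sketch; it only checks Java-level semantics and says nothing
// about how C2 implements the transformations.
final class AbsIdentitiesSketch {
    // abs(0 - x) == abs(x); for Integer.MIN_VALUE the subtraction wraps
    // around to Integer.MIN_VALUE, so both sides are still equal.
    static boolean negationIsRedundant(int x) {
        return Math.abs(0 - x) == Math.abs(x);
    }

    // abs(abs(x)) == abs(x); Math.abs(Integer.MIN_VALUE) stays
    // Integer.MIN_VALUE, so applying abs twice changes nothing.
    static boolean nestedAbsIsRedundant(int x) {
        return Math.abs(Math.abs(x)) == Math.abs(x);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + ": " + negationIsRedundant(x) + ", " + nestedAbsIsRedundant(x));
        }
    }
}
```

Both identities hold for every int, which is why rewriting them in the ideal graph does not change program behavior.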
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From chagedorn at openjdk.java.net Thu Dec 16 10:42:59 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 10:42:59 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 15:39:26 GMT, Ludvig Janiuk wrote: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). I think it's a good idea to narrow the verification down to the blocks which are actually changed and also add some additional verification steps. Even with the newly added checks, it should still be better than running a complete check each time. Did you observe a big gain in the C1 compilation time for debug builds with your new patch? Maybe you also want to check again with the tests from https://github.com/openjdk/jdk16/pull/44. 
src/hotspot/share/c1/c1_IR.cpp line 1265: > 1263: > 1264: inline void validate_end_not_null(BlockBegin* block) { > 1265: assert(block->end() != NULL, "Expect block end to exist."); As this method is just a single assertion and used from one place, you could directly inline it into `block_do()` below. src/hotspot/share/c1/c1_IR.cpp line 1277: > 1275: typedef GrowableArray BlockListList; > 1276: > 1277: void verify_successor_xentry_flag(const BlockBegin* block) { Could this method also be directly inlined into `block_do` below? The class name and assertion message should give enough information about the purpose of the check. src/hotspot/share/c1/c1_IR.cpp line 1294: > 1292: > 1293: // Validation goals: > 1294: // * code() length == blocks length Just a minor thing, might be better to use `-` for the itemization instead of `*` as it looks like they belong to a block comment.. src/hotspot/share/c1/c1_IR.cpp line 1374: > 1372: > 1373: inline void verify_block_begin_field(BlockBegin* block) { > 1374: for ( Instruction *cur = block; cur != NULL; cur = cur->next()) { Same here, could this method also be directly inlined into `block_do` below? Spacing and asterisk position can be improved here. src/hotspot/share/c1/c1_IR.cpp line 1386: > 1384: }; > 1385: > 1386: void validate_edge_mutuality(BlockBegin* block) { Same here, could this method also be directly inlined into `block_do` below? src/hotspot/share/c1/c1_IR.cpp line 1417: > 1415: > 1416: for (int i = 0; i < block->end()->number_of_sux(); i++) { > 1417: if (blocks.contains(block->end()->sux_at(i))) continue; To avoid confusion, I think it's better to always use braces for one-line ifs and loops instead. src/hotspot/share/c1/c1_IR.cpp line 1434: > 1432: > 1433: void IR::verify_local(BlockList& blocks) { > 1434: #ifdef ASSERT Do we need an `ASSERT` here? The code is already guarded with `ifndef PRODUCT`. Same for L1447. 
src/hotspot/share/c1/c1_IR.cpp line 1449: > 1447: #ifdef ASSERT > 1448: XentryFlagValidator xe; > 1449: this->iterate_postorder(&xe); You can remove `this->`. src/hotspot/share/c1/c1_IR.hpp line 341: > 339: static void print(BlockBegin* start, bool cfg_only, bool live_only = false) PRODUCT_RETURN; > 340: void print(bool cfg_only, bool live_only = false) PRODUCT_RETURN; > 341: void expand_with_neighborhood(BlockList& blocks); Also needs to be guarded with `PRODUCT_RETURN`. src/hotspot/share/c1/c1_Optimizer.cpp line 183: > 181: } > 182: > 183: #ifdef ASSERT `expand_with_neighborhood()` is guarded with `ifndef PRODUCT`. You should use the same here. src/hotspot/share/c1/c1_Optimizer.cpp line 261: > 259: > 260: #ifdef ASSERT > 261: _hir->verify_local(blocks_to_verify_later); You can use `NOT_PRODUCT()` for single line statements (assuming you change the `ASSERT` into `ifndef PRODUCT`). src/hotspot/share/c1/c1_Optimizer.cpp line 411: > 409: #endif // ASSERT > 410: > 411: #ifdef ASSERT Can be merged into one `ifdef` block. ------------- Changes requested by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6850 From chagedorn at openjdk.java.net Thu Dec 16 10:52:04 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 10:52:04 GMT Subject: RFR: 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 14:58:17 GMT, Hamlin Li wrote: > This is a minor patch to remove some unnecessary code in MethodLiveness::init_basic_blocks. Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6799 From jiefu at openjdk.java.net Thu Dec 16 11:21:57 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 16 Dec 2021 11:21:57 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> Message-ID: On Thu, 16 Dec 2021 10:31:00 GMT, Fei Gao wrote: > > Hi @fg1417 , > > Thanks for your update. > > Now I see that you are trying to optimize the following three abs() patterns: > > > > 1. Math.abs(-38) > > 2. (char) Math.abs((char) c) > > 3. Math.abs(0 - x) > > > > But did you see these code patterns in real programs? I'm a bit worried that we just increase the complexity of C2 with (almost) no performance gain in the real world. Thanks. > > Thanks for your review, @DamonFool . I really understand your concern. > > In terms of complexity, the change only involves AbsNode and doesn't modify any other part of the code. I don't think it will make C2 more complex. > > As for performance gain in the real world, as we all know, the ability of GVN to optimize a node often depends on the optimized result of its input nodes. For example, if the input node of an AbsNode is recognized as a constant after the last round of GVN optimization, we can now fold abs(constant) into a simple constant value. Like C2 did in > > https://github.com/openjdk/jdk/blob/0bddd8af61b6c731f16b857c09de57ceefd72d06/src/hotspot/share/opto/subnode.cpp#L55 > > , we may not see -(-x) or (x+y)-y in any java program directly, but it's possible after C2 optimization. Whether applied to sub or to abs, the optimization is trivial and low-cost but useful. Why not apply it :) > Math.abs(-38) , (char) Math.abs((char) c) and Math.abs(0 - x) are just conformance testcases. 
As you said, maybe nobody writes these cases in the real world. These testcases are just simulating all possible scenarios that AbsNode may meet, to guarantee the correctness of the optimization. > > What do you think :) > > Thanks. Then, shall we also optimize cases like `Math.abs(-1 * x)`, `Math.abs(x / (-1))`, and so on? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From phedlin at openjdk.java.net Thu Dec 16 11:37:02 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Thu, 16 Dec 2021 11:37:02 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 10:45:28 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > The motivation is found in the original x86 issue ([JDK-8274242](https://bugs.openjdk.java.net/browse/JDK-8274242)). > > Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case. > > - Interleaved ISO and ASCII check code. > - Avoid 'umaxv' in the ISO main flow. > - Using post inc in main loop. > - Retain 8-char loop. > - Removing conditional prefetch (no upside). > - Adding ISO-8859-1 to encode-decode benchmark. > > Testing: tier1-6 > > The revised version compares like this (master vs. update). > > Benchmark (size) (type) Mode Cnt Score Error Units > CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 17.920 ± 0.229 us/op > CharsetEncodeDecode.encode 16384 BIG5 avgt 30 18.867 ± 0.356 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 17.419 ± 0.220 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-1 avgt 30 6.200 ± 0.134 us/op > CharsetEncodeDecode.encode 16384 ASCII avgt 30 17.149 ± 0.219 us/op > CharsetEncodeDecode.encode 16384 UTF-16 avgt 30 135.115 ±
1.440 us/op > > > Benchmark (size) (type) Mode Cnt Score Error Units > CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 9.018 ± 0.179 us/op > CharsetEncodeDecode.encode 16384 BIG5 avgt 30 10.550 ± 0.470 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 8.843 ± 0.187 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-1 avgt 30 6.406 ± 0.155 us/op > CharsetEncodeDecode.encode 16384 ASCII avgt 30 8.822 ± 0.173 us/op > CharsetEncodeDecode.encode 16384 UTF-16 avgt 30 135.195 ± 1.432 us/op So this will obviously be prolonged. Closing and moving to 19. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From phedlin at openjdk.java.net Thu Dec 16 11:37:03 2021 From: phedlin at openjdk.java.net (Patric Hedlin) Date: Thu, 16 Dec 2021 11:37:03 GMT Subject: [jdk18] Withdrawn: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 10:45:28 GMT, Patric Hedlin wrote: > Implementation of ISO/ASCII char set encoding, extending current implementation with ASCII encoding support. > > The motivation is found in the original x86 issue ([JDK-8274242](https://bugs.openjdk.java.net/browse/JDK-8274242)). > > Implementation with slight focus on balance between footprint and efficiency, trying to utilise a dual SIMD path (e.g. Neoverse N1) for the additional Ascii-check and avoid performance loss in the ISO-only case. > > - Interleaved ISO and ASCII check code. > - Avoid 'umaxv' in the ISO main flow. > - Using post inc in main loop. > - Retain 8-char loop. > - Removing conditional prefetch (no upside). > - Adding ISO-8859-1 to encode-decode benchmark. > > Testing: tier1-6 > > The revised version compares like this (master vs. update). > > Benchmark (size) (type) Mode Cnt Score Error Units > CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 17.920 ± 0.229 us/op > CharsetEncodeDecode.encode 16384 BIG5 avgt 30 18.867 ± 0.356 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 17.419 ±
0.220 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-1 avgt 30 6.200 ± 0.134 us/op > CharsetEncodeDecode.encode 16384 ASCII avgt 30 17.149 ± 0.219 us/op > CharsetEncodeDecode.encode 16384 UTF-16 avgt 30 135.115 ± 1.440 us/op > > > Benchmark (size) (type) Mode Cnt Score Error Units > CharsetEncodeDecode.encode 16384 UTF-8 avgt 30 9.018 ± 0.179 us/op > CharsetEncodeDecode.encode 16384 BIG5 avgt 30 10.550 ± 0.470 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-15 avgt 30 8.843 ± 0.187 us/op > CharsetEncodeDecode.encode 16384 ISO-8859-1 avgt 30 6.406 ± 0.155 us/op > CharsetEncodeDecode.encode 16384 ASCII avgt 30 8.822 ± 0.173 us/op > CharsetEncodeDecode.encode 16384 UTF-16 avgt 30 135.195 ± 1.432 us/op This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From mli at openjdk.java.net Thu Dec 16 11:38:58 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Thu, 16 Dec 2021 11:38:58 GMT Subject: RFR: 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 14:58:17 GMT, Hamlin Li wrote: > This is a minor patch to remove some unnecessary code in MethodLiveness::init_basic_blocks. Thanks Christian for your review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6799 From mli at openjdk.java.net Thu Dec 16 11:38:58 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Thu, 16 Dec 2021 11:38:58 GMT Subject: Integrated: 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks In-Reply-To: References: Message-ID: <9BrKZnn43DLKGdaiBKz7kPVQ5gtoa6GthICd8VcFkEI=.16510fe8-2fbf-4ede-8982-5db5941a10ec@github.com> On Fri, 10 Dec 2021 14:58:17 GMT, Hamlin Li wrote: > This is a minor patch to remove some unnecessary code in MethodLiveness::init_basic_blocks. This pull request has now been integrated. 
Changeset: 7edcd348 Author: Hamlin Li URL: https://git.openjdk.java.net/jdk/commit/7edcd348699b47050e4c5e3181c66fd0ee72830f Stats: 6 lines in 1 file changed: 0 ins; 6 del; 0 mod 8278534: Remove some unnecessary code in MethodLiveness::init_basic_blocks Reviewed-by: chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/6799 From duke at openjdk.java.net Thu Dec 16 12:02:36 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 16 Dec 2021 12:02:36 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: References: Message-ID: > The problem is that loading a vector from a byte array requires the vector shape to support byte vectors before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, which leads to an IR verification failure. The patch simply changes the arguments of the vector cast methods to the correct types. > > Thank you very much. 
Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: address reviews ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6852/files - new: https://git.openjdk.java.net/jdk/pull/6852/files/30a466ef..c853ce6b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6852&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6852&range=00-01 Stats: 3 lines in 2 files changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6852.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6852/head:pull/6852 PR: https://git.openjdk.java.net/jdk/pull/6852 From duke at openjdk.java.net Thu Dec 16 12:02:38 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 16 Dec 2021 12:02:38 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: <1Ol9KOWrGmqTIDxE-D_HITuSdPyTmQZXfp8LfCSTLec=.234046c6-0781-4e79-a2cc-9efdb99219e1@github.com> References: <1Ol9KOWrGmqTIDxE-D_HITuSdPyTmQZXfp8LfCSTLec=.234046c6-0781-4e79-a2cc-9efdb99219e1@github.com> Message-ID: On Thu, 16 Dec 2021 09:34:05 GMT, Christian Hagedorn wrote: >> Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: >> >> address reviews > > test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java line 34: > >> 32: /** >> 33: * The cast intrinsics implemented on each platform, commented out tests are the ones that are >> 34: * supposed to work but currently don't. > > Maybe you can add again the reason for the commented out `makePair()` calls. Addressed the comments, thanks a lot. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From roland at openjdk.java.net Thu Dec 16 12:29:00 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 16 Dec 2021 12:29:00 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: <-34D8kbPp7Ssy1b7ki6RknqmIbTFf1uUmJ1DEHrtblQ=.0ce71dd2-6d95-4760-a1b1-0a12e86ff33f@github.com> On Thu, 16 Dec 2021 10:02:38 GMT, Nils Eliasson wrote: > Have you run any micros to verify that nothing regresses? Thanks for looking at this. This fixes a crash for a corner case (array size close to max_jint). I don't think it can affect performance so I haven't run any performance testing. ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From roland at openjdk.java.net Thu Dec 16 12:35:55 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 16 Dec 2021 12:35:55 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: <_8dk8ZC5p-mFf_Urb-qnNmxvXxbXiFoCU8TajzXESPA=.efb2f26c-d4a2-4db4-9082-5237e71e8529@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> <_8dk8ZC5p-mFf_Urb-qnNmxvXxbXiFoCU8TajzXESPA=.efb2f26c-d4a2-4db4-9082-5237e71e8529@github.com> Message-ID: On Thu, 16 Dec 2021 10:21:31 GMT, Nils Eliasson wrote: > Here you do: > > ``` > Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); > const Type* valid_length_test_t = phase->type(valid_length_test); > if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { > ``` > > But in compile.cpp:3766 you do: > > ``` > Node* valid_length_test = call->in(call->req()); > call->rm_prec(call->req()); > if (valid_length_test->find_int_con(1) == 0) { > ``` > > Why "call->req()" and not "call->in(AllocateNode::ValidLengthTest)"? 
call->in(AllocateNode::ValidLengthTest) only works for AllocateArrayNode. The code I modified in compile.cpp runs after macro expansion so there's no AllocateArrayNode anymore. Instead there's a call to the runtime. When the AllocateArrayNode is macro expanded I move in(AllocateNode::ValidLengthTest) to the new runtime call as a precedence edge that is at req(). I tried adding it as an extra parameter that would be removed in compile.cpp but that messed up the debug info and I hit some asserts. > > And why not "if (valid_length_test->find_int_con(1) == 0) {" in both places? So that the Value() call does the right thing during CCP where a node can be registered as a constant but not transformed yet to a ConNode. ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From duke at openjdk.java.net Thu Dec 16 12:35:57 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 12:35:57 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 10:40:15 GMT, Christian Hagedorn wrote: >> IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. >> >> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. >> >> Some changes have been made in the verifiers along the way. 
Some amount of refactoring, and even added invariants (see validate_edge_mutuality). > > I think it's a good idea to narrow the verification down to the blocks which are actually changed and also add some additional verification steps. Even with the newly added checks, it should still be better than running a complete check each time. Did you observe a big gain in the C1 compilation time for debug builds with your new patch? Maybe you also want to check again with the tests from https://github.com/openjdk/jdk16/pull/44. @chhagedorn Thank you for the review! I will get to cleaning up the comments, and I'll also get back about changes in compilation times. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From chagedorn at openjdk.java.net Thu Dec 16 12:56:57 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 12:56:57 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 12:02:36 GMT, Mai Đặng Quân Anh wrote: >> The problem is that loading a vector from a byte array requires the vector shape to support byte vectors before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, which leads to an IR verification failure. The patch simply changes the arguments of the vector cast methods to the correct types. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Marked as reviewed by chagedorn (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From jbhateja at openjdk.java.net Thu Dec 16 13:08:29 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 16 Dec 2021 13:08:29 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8278796: Extending the test cases to use dynamic insert index and value. ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/28/files - new: https://git.openjdk.java.net/jdk18/pull/28/files/eac1660e..0539b593 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=28&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=28&range=00-01 Stats: 279 lines in 33 files changed: 93 ins; 0 del; 186 mod Patch: https://git.openjdk.java.net/jdk18/pull/28.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/28/head:pull/28 PR: https://git.openjdk.java.net/jdk18/pull/28 From jbhateja at openjdk.java.net Thu Dec 16 13:08:31 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 16 Dec 2021 13:08:31 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 00:01:11 GMT, Paul Sandoz wrote: > The changes to the tests look good, ideally we should test over all lane indexes, but i believe the insert intrinsic currently requires the lane index be a constant. Unsure if that is a restriction that can be lifted. Hi @PaulSandoz , Your comments have been addressed. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From aph at openjdk.java.net Thu Dec 16 13:18:59 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 16 Dec 2021 13:18:59 GMT Subject: [jdk18] RFR: 8274243: Implement fast-path for ASCII-compatible CharsetEncoders on aarch64 In-Reply-To: References: Message-ID: <9yDa1xxnZh3yLMjNhmxlf4lPduEbNKWVKDJdNQ3_-s4=.4fd7a715-8258-4605-9f6f-c7865db99571@github.com> On Thu, 16 Dec 2021 11:33:11 GMT, Patric Hedlin wrote: > So this will obviously be prolonged. Closing and moving to 19. That sounds right. Let's get it into mainline soon, and we can do a backport to 18.1. ------------- PR: https://git.openjdk.java.net/jdk18/pull/20 From duke at openjdk.java.net Thu Dec 16 13:29:53 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 13:29:53 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 15:39:26 GMT, Ludvig Janiuk wrote: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). 
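As an illustration only (not the C1 code): the idea described above, capturing the neighbors of the modified set before modification and afterwards checking invariants on that expanded set only, can be sketched in Java. The `Block` class, `link`, `expandWithNeighbors`, and `verifyLocal` below are hypothetical stand-ins, and edge mutuality is the only invariant checked.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of local verification over a small block graph.
final class LocalVerifySketch {
    static final class Block {
        final String name;
        final List<Block> successors = new ArrayList<>();
        final List<Block> predecessors = new ArrayList<>();
        Block(String name) { this.name = name; }
    }

    // Adds a control-flow edge and records it on both ends.
    static void link(Block from, Block to) {
        from.successors.add(to);
        to.predecessors.add(from);
    }

    // Returns the given blocks plus their direct predecessors and successors,
    // so the set can be captured before the blocks are modified.
    static Set<Block> expandWithNeighbors(Set<Block> blocks) {
        Set<Block> result = new LinkedHashSet<>(blocks);
        for (Block b : blocks) {
            result.addAll(b.successors);
            result.addAll(b.predecessors);
        }
        return result;
    }

    // Checks edge mutuality for the given blocks only: every successor must
    // list the block as a predecessor, and every predecessor must list it as
    // a successor. The cost depends on the set size, not on the whole graph.
    static boolean verifyLocal(Set<Block> blocks) {
        for (Block b : blocks) {
            for (Block s : b.successors) {
                if (!s.predecessors.contains(b)) {
                    return false;
                }
            }
            for (Block p : b.predecessors) {
                if (!p.successors.contains(b)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Block a = new Block("a");
        Block b = new Block("b");
        Block c = new Block("c");
        Block d = new Block("d");
        link(a, b);
        link(b, c);
        link(c, d);
        Set<Block> toVerify = expandWithNeighbors(Set.of(b)); // a, b and c
        System.out.println(toVerify.size());
        System.out.println(verifyLocal(toVerify));
        c.predecessors.remove(b); // simulate a broken edge
        System.out.println(verifyLocal(toVerify));
    }
}
```

The design point this tries to show is that the set to verify is computed ahead of the modification, since a modification may remove the very edges that would otherwise identify the affected neighbors.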
A small local interspersed test, with the first 2 rounds discarded, seems to show a slight increase in compilation time, which can be explained by the added checks:

| | C1 compile time (this PR) | C1 compile time (master) | Emit LIR (this PR) | Emit LIR (master) |
| -- | -- | -- | -- | -- |
| | 1,63 | 1,705 | 1,34 | 1,412 |
| | 1,701 | 1,563 | 1,412 | 1,276 |
| | 1,694 | 1,564 | 1,405 | 1,27 |
| | 1,704 | 1,593 | 1,406 | 1,306 |
| | 1,696 | 1,709 | 1,407 | 1,415 |
| | 1,704 | 1,622 | 1,418 | 1,335 |
| | 1,72 | 1,732 | 1,427 | 1,433 |
| | 1,686 | 1,712 | 1,396 | 1,42 |
| avg | 1,691875 | 1,65 | 1,401375 | 1,358375 |

Invocation: ` for i in {1..10}; do .../jdk-myfix/bin/java -Xbatch -XX:+CITime compiler.c2.cr6340864.TestLongVect > time/myfix-$i.txt; .../jdk-master/bin/java -Xbatch -XX:+CITime compiler.c2.cr6340864.TestLongVect > time/master-$i.txt; done` Note that the timeouts of [openjdk/jdk16#44](https://github.com/openjdk/jdk16/pull/44) were remediated, likely by accident, in https://bugs.openjdk.java.net/browse/JDK-8267806, thus making it a worse candidate for testing this. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Thu Dec 16 15:15:59 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 15:15:59 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 15:39:26 GMT, Ludvig Janiuk wrote: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified.
As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). The increase in the previous test was likely not statistically significant. I ran a new test with 100 iterations and the difference seems to be gone.

| 100-2 iterations | C1 compile time (my fix) | C1 compile time (master) | Emit LIR (my fix) | Emit LIR (master) | Linear Scan (my fix) | Linear Scan (master) |
| -- | -- | -- | -- | -- | -- | -- |
| avg | 1,6151 | 1,617 | 1,3228 | 1,3238 | 1,2886 | 1,2893 |
| variance | 0,0024 | 0,002 | 0,0022 | 0,0019 | 0,0021 | 0,0018 |

------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From chagedorn at openjdk.java.net Thu Dec 16 15:44:04 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 15:44:04 GMT Subject: [jdk18] RFR: 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: References: Message-ID: On Fri, 10 Dec 2021 15:43:52 GMT, Christian Hagedorn wrote: > The test case fails with the assertion when an actual unreachable store node with only uses outside of the loop is tried to be sunk out of a dead loop in split-if. This is quite an edge case in which C2 is not able to remove the inner loop but the store for `iFldArr2` inside this loop dies due to improved type information after peeling. This removes some memory phis as well and leaves the store `iFldArr1` with only outside the loop uses. A more detailed explanation how we end up in this situation is shown in the comments of the test case. > > This suggests that the assertion is too strong.
I propose to relax the assertion and bail out if we are trying to sink a store node. However, I don't think that we will reach this code with `LoadStore` nodes as they have other memory outputs inside a loop, preventing to reach this assertion code. > > Thanks, > Christian After digging deeper into this, I found more issues with loop peeling and loop predication by writing additional tests. I basically found all the issues related to skeleton predicates again but this time with an additional peeling before which makes it quite rare and hard to trigger in practice. I think we need the full range of fixes between the loop and the peeled iteration that we have done before: 1. Initialized skeleton predicate as suggested above to remove a dead loop after peeling. 2. Initial skeleton predicates for the initial value and one for the changing stride with unrolling. We could create pre/main/post and then over-unroll the main loop which lets `CastII` nodes die while the main loop is not removed ([JDK-8193130](https://bugs.openjdk.java.net/browse/JDK-8193130), [JDK-8203915](https://bugs.openjdk.java.net/browse/JDK-8203915)). 3. Update data nodes dependent on the predicates before the peeled iteration down to the loop. Because when later unrolling the loop, the cloned data nodes get again the same control input to the predicates before the peeled iteration. This could schedule loads from the main loop even though the main loop is never entered ([JDK-8237859](https://bugs.openjdk.java.net/browse/JDK-8237859) but with peeling before). In addition, we could also think about adding empty predicates between the loop and the peeled iteration to enable more rounds of loop predication similar to what we are doing for long counted loops. But this is more an idea for an RFE. There is currently no easy workaround for 1. - 3. and fixing them completely with skeleton predicates is more complex. I don't think it is worth the risk to do this in 18 given how rare 1. - 3. are. 
On top of that, these are not recent regressions. I will therefore close this PR and re-target this bug to 19/mainline. ------------- PR: https://git.openjdk.java.net/jdk18/pull/11 From chagedorn at openjdk.java.net Thu Dec 16 15:44:05 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 16 Dec 2021 15:44:05 GMT Subject: [jdk18] Withdrawn: 8278420: C2: assert(!n->is_Store() && !n->is_LoadStore()) failed: no node with a side effect In-Reply-To: References: Message-ID: <9Ga81CYLUZkt3sPWp_yOlNHWDCdt6Dlzcv6Wq6LnQbA=.87c13404-e2ee-46d0-a077-64a312060a3f@github.com> On Fri, 10 Dec 2021 15:43:52 GMT, Christian Hagedorn wrote: > The test case fails with the assertion when an actual unreachable store node with only uses outside of the loop is tried to be sunk out of a dead loop in split-if. This is quite an edge case in which C2 is not able to remove the inner loop but the store for `iFldArr2` inside this loop dies due to improved type information after peeling. This removes some memory phis as well and leaves the store `iFldArr1` with only outside the loop uses. A more detailed explanation how we end up in this situation is shown in the comments of the test case. > > This suggests that the assertion is too strong. I propose to relax the assertion and bail out if we are trying to sink a store node. However, I don't think that we will reach this code with `LoadStore` nodes as they have other memory outputs inside a loop, preventing to reach this assertion code. > > Thanks, > Christian This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk18/pull/11 From duke at openjdk.java.net Thu Dec 16 16:11:55 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 16:11:55 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 15:39:26 GMT, Ludvig Janiuk wrote: > IR::verify iterates the whole object graph. 
This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). A test on -version shows a clear reduction in Optimize time, leading to a small reduction in c1 compile time. I call this a "boosted test" because `-XX:RepeatCompilation=10 -Xcomp -XX:TieredStopAtLevel=1` is used to make the time spent compiling in c1 longer and thereby measurable.

| 100-2 iterations, boosted -version | C1 comp time (this PR) | C1 comp time (master) | Build HIR (this PR) | Build HIR (master) | Optimize (this PR) | Optimize (master) |
| -- | -- | -- | -- | -- | -- | -- |
| diff | -3% | | -6% | | -66,7% | |
| avg | 1,6185 | 1,6685 | 0,7651 | 0,8143 | 0,0325 | 0,0976 |
| variance | 0,0019 | 0,0019 | 0,0004 | 0,0005 | 0 | 0 |

Invocation: ` for i in {1..100}; do .../jdk-master/bin/java -XX:RepeatCompilation=10 -Xcomp -XX:+CITime -XX:TieredStopAtLevel=1 -version > time/mast-$i.txt; .../jdk-myfix/bin/java -XX:RepeatCompilation=10 -Xcomp -XX:+CITime -XX:TieredStopAtLevel=1 -version > time/myfix-$i.txt; echo $i; done` And again, it's important to remember that the point of this fix is to protect against "worst cases" with e.g.
functions with very long bytecode. The compile time improvement in the general case, if any, is only a bonus. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From psandoz at openjdk.java.net Thu Dec 16 16:21:03 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 16 Dec 2021 16:21:03 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 13:08:29 GMT, Jatin Bhateja wrote: >> - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. >> - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. >> >> Kindly review and share your comments. >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278796: Extending the test cases to use dynamic insert index and value. I think that is a little more robust, API-wise, but do i misunderstand the insert intrinsic with regards to requiring the lane index be a constant? ------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From duke at openjdk.java.net Thu Dec 16 16:32:02 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 16:32:02 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 16:44:22 GMT, Christian Hagedorn wrote: >> IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. >> >> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. 
As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. >> >> Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). > > src/hotspot/share/c1/c1_Optimizer.cpp line 411: > >> 409: #endif // ASSERT >> 410: >> 411: #ifdef ASSERT > > Can be merged into one `ifdef` block. I'm removing that specific verify_local call anyway, left in by accident ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Thu Dec 16 16:37:27 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 16:37:27 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v2] In-Reply-To: References: Message-ID: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. 
Some amount of refactoring, and even added invariants (see validate_edge_mutiality). Ludvig Janiuk has updated the pull request incrementally with two additional commits since the last revision: - cleanup - remove unneeded check ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6850/files - new: https://git.openjdk.java.net/jdk/pull/6850/files/716803bf..8b5f5111 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=00-01 Stats: 101 lines in 3 files changed: 21 ins; 54 del; 26 mod Patch: https://git.openjdk.java.net/jdk/pull/6850.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6850/head:pull/6850 PR: https://git.openjdk.java.net/jdk/pull/6850 From psandoz at openjdk.java.net Thu Dec 16 16:37:57 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 16 Dec 2021 16:37:57 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 12:02:36 GMT, Mai Đặng Quân Anh wrote: >> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leading to IR verification failure. The patch simply changes the argument of vector cast methods to correct types. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From duke at openjdk.java.net Thu Dec 16 16:42:35 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 16:42:35 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v3] In-Reply-To: References: Message-ID: > IR::verify iterates the whole object graph.
This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. > > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: cleanup2 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6850/files - new: https://git.openjdk.java.net/jdk/pull/6850/files/8b5f5111..587840cf Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6850.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6850/head:pull/6850 PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Thu Dec 16 16:42:37 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Thu, 16 Dec 2021 16:42:37 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v3] In-Reply-To: References: Message-ID: <1qS5EfK74TZNdrYLyb56z3xCRlN_ovlxVS6snM6UFr8=.468f78a1-75e0-4404-9065-e61dbe149605@github.com> On Thu, 16 Dec 2021 10:40:15 GMT, Christian Hagedorn 
wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup2 > > I think it's a good idea to narrow the verification down to the blocks which are actually changed and also add some additional verification steps. Even with the newly added checks, it should still be better than running a complete check each time. Did you observe a big gain in the C1 compilation time for debug builds with your new patch? Maybe you also want to check again with the tests from https://github.com/openjdk/jdk16/pull/44. @chhagedorn I believe I have addressed all your comments now. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Thu Dec 16 17:39:03 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Thu, 16 Dec 2021 17:39:03 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3] In-Reply-To: References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: <7237WS6boDnPMjFCmhEbJ4lKF4Ve3uXCnd-uPKGY_XI=.6059ecd5-1a4c-4e1b-aff6-2a4e4bb93cfb@github.com> On Wed, 8 Dec 2021 19:31:35 GMT, Zhiqiang Zang wrote: >> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". >> >> >> // Convert "x + x" into "x << 1" >> if (in1 == in2) { >> return new LShiftINode(in1, phase->intcon(1)); >> } > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merge master. > - enrich tests and do the same transformation for "long". > - include a new optimization for ideal in addnode: Convert "x + x" into "x << 1", and associated tests. @merykitty If you get a chance could you please take a quick look at this PR? I was wondering if this conversion makes sense. I am concerned if this obvious optimization has been done somewhere else which I was not able to find. Thank you very much. 
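For readers weighing the proposed rewrite, the equivalence it relies on can be checked in plain Java. What follows is a minimal sketch, not code from the PR: the class `AddToShiftDemo` and its method names are hypothetical, and it only confirms that `x + x` and `x << 1` produce the same value for `int` and `long` (wrap-around on overflow included). It says nothing about whether C2 should perform the transformation.

```java
public class AddToShiftDemo {
    // Doubling by addition: the expression shape the proposed rule would match.
    static int doubleByAdd(int x) { return x + x; }

    // Doubling by shift: the expression shape the proposed rule would produce.
    static int doubleByShift(int x) { return x << 1; }

    // Same pair for long, since the PR applies the rewrite to long as well.
    static long doubleByAdd(long x) { return x + x; }

    static long doubleByShift(long x) { return x << 1; }

    // Compares both forms on a few samples, including values that overflow.
    static boolean agreeOnSamples() {
        int[] ints = {0, 1, -1, 42, 0x40000000, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : ints) {
            if (doubleByAdd(x) != doubleByShift(x)) {
                return false;
            }
        }
        long[] longs = {0L, 1L, -1L, 0x4000000000000000L, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long x : longs) {
            if (doubleByAdd(x) != doubleByShift(x)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("forms agree on samples: " + agreeOnSamples());
    }
}
```

Both forms wrap identically under two's-complement arithmetic, so the open question in the thread is about profitability and downstream optimizations, not correctness.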
------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From jbhateja at openjdk.java.net Thu Dec 16 17:46:35 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 16 Dec 2021 17:46:35 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2] In-Reply-To: References: Message-ID: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8278508: Review comments resolution. ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/24/files - new: https://git.openjdk.java.net/jdk18/pull/24/files/583e5e48..4fb1ea1b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=00-01 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk18/pull/24.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24 PR: https://git.openjdk.java.net/jdk18/pull/24 From jbhateja at openjdk.java.net Thu Dec 16 17:46:37 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 16 Dec 2021 17:46:37 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. 
[v2] In-Reply-To: <5Rm0oIFi7Ju0ZTN2mSUQk0ek-n3p-eLJXZphiG8NAOA=.2af84ad7-7665-43fd-b9b3-cc9aefaf1a50@github.com> References: <5Rm0oIFi7Ju0ZTN2mSUQk0ek-n3p-eLJXZphiG8NAOA=.2af84ad7-7665-43fd-b9b3-cc9aefaf1a50@github.com> Message-ID: <6k21acmYFs6y8-BOxAH1FNc1TytYHpmhMwAOcOyU6DU=.bbfa7b7f-120e-4771-9028-9309b5a7fee9@github.com> On Wed, 15 Dec 2021 13:13:43 GMT, Jatin Bhateja wrote: > ``` > jdk/incubator/vector/VectorReshapeTests.java > ``` Hi @vnkozlov , I have added explicit time out check as suggested in above 3 testcases. ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From sviswanathan at openjdk.java.net Thu Dec 16 18:31:04 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 16 Dec 2021 18:31:04 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 16:17:27 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278796: Extending the test cases to use dynamic insert index and value. > > I think that is a little more robust, API-wise, but do i misunderstand the insert intrinsic with regards to requiring the lane index be a constant? @PaulSandoz The withLane implementation calls the withLaneHelper with constant index. e.g. please see Int128Vector withLane implementation has a switch statement to achieve this. 
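The switch-based dispatch mentioned above can be illustrated without the incubator module. This is a hedged sketch in plain Java, assuming a 4-lane vector modelled as an `int[]`; the class `ConstantLaneDispatch` and its `withLaneHelper` are hypothetical stand-ins and are not the `jdk.incubator.vector` sources. The point is only the shape: a dynamic index is turned into a literal at each call site of the helper.

```java
import java.util.Arrays;

public class ConstantLaneDispatch {
    // Assumed lane count for this sketch (four 32-bit lanes, as in a 128-bit int vector).
    static final int VLENGTH = 4;

    // Stand-in for a helper that, in a real implementation, would expect a constant index.
    static int[] withLaneHelper(int[] lanes, int constantIndex, int value) {
        int[] result = Arrays.copyOf(lanes, lanes.length);
        result[constantIndex] = value;
        return result;
    }

    // Public entry point: accepts any index, but each helper call passes a literal.
    static int[] withLane(int[] lanes, int i, int value) {
        switch (i) {
            case 0: return withLaneHelper(lanes, 0, value);
            case 1: return withLaneHelper(lanes, 1, value);
            case 2: return withLaneHelper(lanes, 2, value);
            case 3: return withLaneHelper(lanes, 3, value);
            default:
                throw new IllegalArgumentException(
                    "Index " + i + " must be zero or positive, and less than " + VLENGTH);
        }
    }

    public static void main(String[] args) {
        int[] updated = withLane(new int[] {1, 2, 3, 4}, 2, 9);
        System.out.println(Arrays.toString(updated));
    }
}
```

With this shape a test can vary the lane index freely while the helper still sees a constant, which appears to be why the extended tests can use a dynamic insert index.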
------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From psandoz at openjdk.java.net Thu Dec 16 18:31:04 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 16 Dec 2021 18:31:04 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 16:17:27 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278796: Extending the test cases to use dynamic insert index and value. > > I think that is a little more robust, API-wise, but do i misunderstand the insert intrinsic with regards to requiring the lane index be a constant? > @PaulSandoz The withLane implementation calls the withLaneHelper with constant index. e.g. please see Int128Vector withLane implementation has a switch statement to achieve this. Oh yes, of course, thanks! ------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From kvn at openjdk.java.net Thu Dec 16 18:34:03 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 18:34:03 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 12:02:36 GMT, Mai Đặng Quân Anh wrote: >> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leading to IR verification failure. The patch simply changes the argument of vector cast methods to correct types. >> >> Thank you very much. > > Mai Đặng Quân Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Marked as reviewed by kvn (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From duke at openjdk.java.net Thu Dec 16 18:40:04 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 16 Dec 2021 18:40:04 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3] In-Reply-To: References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com> On Wed, 8 Dec 2021 19:31:35 GMT, Zhiqiang Zang wrote: >> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". >> >> >> // Convert "x + x" into "x << 1" >> if (in1 == in2) { >> return new LShiftINode(in1, phase->intcon(1)); >> } > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merge master. > - enrich tests and do the same transformation for "long". > - include a new optimization for ideal in addnode: Convert "x + x" into "x << 1", and associated tests. Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`. Cheers. ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From kvn at openjdk.java.net Thu Dec 16 18:43:03 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 18:43:03 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.
[v2] In-Reply-To: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> Message-ID: On Thu, 16 Dec 2021 17:46:35 GMT, Jatin Bhateja wrote: >> - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. >> - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278508: Review comments resolution. Good. You need second review for this. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/24 From kvn at openjdk.java.net Thu Dec 16 19:11:01 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 19:11:01 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 13:08:29 GMT, Jatin Bhateja wrote: >> - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. >> - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. >> >> Kindly review and share your comments. >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278796: Extending the test cases to use dynamic insert index and value. I have to retest before update approval. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From redestad at openjdk.java.net Thu Dec 16 19:14:00 2021 From: redestad at openjdk.java.net (Claes Redestad) Date: Thu, 16 Dec 2021 19:14:00 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3] In-Reply-To: <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com> Message-ID: <6UEIgb07IvoDvmtD17RTJXY5lvMaWtb8XeW6Tu4JYAU=.f08a9246-3c05-4970-9fae-5a2fa955e4db@github.com> On Thu, 16 Dec 2021 18:36:31 GMT, Mai Đặng Quân Anh wrote: > Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`. Cheers. I agree this looks to be of dubious value on the face of it. A microbenchmark to prove it's beneficial in some scenario feels like a requirement here. A targeted microbenchmark could explore if we already do or could better allow constant folding of expressions that do subsequent shifts, e.g. by turning `(x + x) << 3` into `x << 4`. ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From duke at openjdk.java.net Thu Dec 16 19:34:59 2021 From: duke at openjdk.java.net (Mai Đặng Quân Anh) Date: Thu, 16 Dec 2021 19:34:59 GMT Subject: RFR: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 12:02:36 GMT, Mai Đặng Quân Anh wrote: >> The problem is that loading vector from byte array requires the vector shape to support byte vector before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leading to IR verification failure.
The patch simply changes the argument of vector cast methods to correct types. >> >> Thanh you very much. > > Mai ??ng Qu?n Anh has updated the pull request incrementally with one additional commit since the last revision: > > address reviews Thank you very much for your testing and review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6852 From enikitin at openjdk.java.net Thu Dec 16 20:20:41 2021 From: enikitin at openjdk.java.net (Evgeny Nikitin) Date: Thu, 16 Dec 2021 20:20:41 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> > This PR contains a relatively simple test which verifies that JVMTI-agents are correctly informed about exceptions caught in C2-compiled code. The 8269574 introduces pre-allocated exceptions in some paths, so the test tries to produce a number of various exceptions and check that provided small JVMTI agent got notified about all of them. Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Get rid of while-breaks - Add JIT requirements - Remove explicit type specifiers for own class static calls - Remove redundant build directive - Merge branch 'master' into JDK-8274982/public - 8274982: Add a test for 8269574. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/5889/files - new: https://git.openjdk.java.net/jdk/pull/5889/files/3d2dfed9..6336f6af Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5889&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5889&range=00-01 Stats: 980305 lines in 3370 files changed: 519681 ins; 440480 del; 20144 mod Patch: https://git.openjdk.java.net/jdk/pull/5889.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/5889/head:pull/5889 PR: https://git.openjdk.java.net/jdk/pull/5889 From enikitin at openjdk.java.net Thu Dec 16 20:20:49 2021 From: enikitin at openjdk.java.net (Evgeny Nikitin) Date: Thu, 16 Dec 2021 20:20:49 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: On Thu, 11 Nov 2021 05:56:37 GMT, David Holmes wrote: >> Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Get rid of while-breaks >> - Add JIT requirements >> - Remove explicit type specifiers for own class static calls >> - Remove redundant build directive >> - Merge branch 'master' into JDK-8274982/public >> - 8274982: Add a test for 8269574. > > test/hotspot/jtreg/compiler/jvmti/TriggerBuiltinExceptionsTest.java line 28: > >> 26: * @bug 8269574 >> 27: * @summary Verifies that exceptions are reported correctly to JVMTI in the compiled code >> 28: * @requires vm.jvmti > > You also require the JIT Added a requirement for the c1 or c2. > test/hotspot/jtreg/compiler/jvmti/TriggerBuiltinExceptionsTest.java line 32: > >> 30: * >> 31: * @build sun.hotspot.WhiteBox >> 32: * @build compiler.jvmti.TriggerBuiltinExceptionsTest > > Explicit build directive should not be needed. 
Fixed, thanks > test/hotspot/jtreg/compiler/jvmti/TriggerBuiltinExceptionsTest.java line 59: > >> 57: public class TriggerBuiltinExceptionsTest { >> 58: private static final WhiteBox WB = WhiteBox.getWhiteBox(); >> 59: private static final int ITERATIONS = 30; //Arbitrary value, feel free to change > > Style nit: space after // Fixed. ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From enikitin at openjdk.java.net Thu Dec 16 20:20:53 2021 From: enikitin at openjdk.java.net (Evgeny Nikitin) Date: Thu, 16 Dec 2021 20:20:53 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: On Wed, 10 Nov 2021 22:32:34 GMT, Serguei Spitsyn wrote: >> Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Get rid of while-breaks >> - Add JIT requirements >> - Remove explicit type specifiers for own class static calls >> - Remove redundant build directive >> - Merge branch 'master' into JDK-8274982/public >> - 8274982: Add a test for 8269574. > > test/hotspot/jtreg/compiler/jvmti/TriggerBuiltinExceptionsTest.java line 128: > >> 126: >> 127: Asserts.assertEQ( >> 128: TriggerBuiltinExceptionsTest.caughtByJVMTIAgent(), caughtByJavaTest, > > What is the reason to use the class name prefix for methods? : > TriggerBuiltinExceptionsTest.compileMethodOrThrow > TriggerBuiltinExceptionsTest.methodToCompile > TriggerBuiltinExceptionsTest.caughtByJVMTIAgent > It is not really needed, right? Style habits acquired in a previous job... fixed. > test/hotspot/jtreg/compiler/jvmti/libTriggerBuiltinExceptions.cpp line 77: > >> 75: } >> 76: >> 77: } while (false); > > I'm not sure why the while (false) loop is needed. 
> You can always return JNI_ERR instead of break in all places where the > result != JVMTI_ERROR_NONE is detected and return JNI_OK at the end. > Is it to for one-return style? Remnants from a previous, draft version. Fixed (along with unnecessary 'successfull' variable removal). ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From lmesnik at openjdk.java.net Thu Dec 16 20:55:09 2021 From: lmesnik at openjdk.java.net (Leonid Mesnik) Date: Thu, 16 Dec 2021 20:55:09 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> References: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> Message-ID: On Thu, 16 Dec 2021 20:20:41 GMT, Evgeny Nikitin wrote: >> This PR contains a relatively simple test which verifies that JVMTI-agents are correctly informed about exceptions caught in C2-compiled code. The 8269574 introduces pre-allocated exceptions in some paths, so the test tries to produce a number of various exceptions and check that provided small JVMTI agent got notified about all of them. > > Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Get rid of while-breaks > - Add JIT requirements > - Remove explicit type specifiers for own class static calls > - Remove redundant build directive > - Merge branch 'master' into JDK-8274982/public > - 8274982: Add a test for 8269574. Marked as reviewed by lmesnik (Reviewer). test/hotspot/jtreg/compiler/jvmti/libTriggerBuiltinExceptions.cpp line 79: > 77: } while (false); > 78: > 79: return (result == JNI_OK) || (result == JVMTI_ERROR_NONE) ? 
JNI_OK : JNI_ERR; Seems (result == JNI_OK) || (result == JVMTI_ERROR_NONE) should be (result == JNI_OK) && (result == JVMTI_ERROR_NONE). Really, I think it would be better to replace do { ... break; .. } while (false) with multiple returns. It makes the logic simpler and the style consistent with the other JVMTI tests. ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From lmesnik at openjdk.java.net Thu Dec 16 20:55:09 2021 From: lmesnik at openjdk.java.net (Leonid Mesnik) Date: Thu, 16 Dec 2021 20:55:09 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 20:16:39 GMT, Evgeny Nikitin wrote: >> test/hotspot/jtreg/compiler/jvmti/TriggerBuiltinExceptionsTest.java line 28: >> >>> 26: * @bug 8269574 >>> 27: * @summary Verifies that exceptions are reported correctly to JVMTI in the compiled code >>> 28: * @requires vm.jvmti >> >> You also require the JIT > > Added a requirement for the c1 or c2. Really, we don't add @requires for jit compiler. There are a lot of tests that fail with Xnt. ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From dlong at openjdk.java.net Thu Dec 16 21:20:03 2021 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 16 Dec 2021 21:20:03 GMT Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote: > The bug and fix were discussed in a previous PR: > > https://github.com/openjdk/jdk/pull/6572 > > I pushed all commits from that PR on top of jdk 18 and added a couple > extra tests as suggested in: > > https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590 Test results look good. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/29 From psandoz at openjdk.java.net Thu Dec 16 21:35:00 2021 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 16 Dec 2021 21:35:00 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 13:08:29 GMT, Jatin Bhateja wrote: >> - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. >> - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. >> >> Kindly review and share your comments. >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278796: Extending the test cases to use dynamic insert index and value. Testing passed on latest commit. ------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From dholmes at openjdk.java.net Thu Dec 16 22:17:09 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 16 Dec 2021 22:17:09 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> References: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> Message-ID: On Thu, 16 Dec 2021 20:20:41 GMT, Evgeny Nikitin wrote: >> This PR contains a relatively simple test which verifies that JVMTI-agents are correctly informed about exceptions caught in C2-compiled code. The 8269574 introduces pre-allocated exceptions in some paths, so the test tries to produce a number of various exceptions and check that provided small JVMTI agent got notified about all of them. > > Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - Get rid of while-breaks > - Add JIT requirements > - Remove explicit type specifiers for own class static calls > - Remove redundant build directive > - Merge branch 'master' into JDK-8274982/public > - 8274982: Add a test for 8269574. A few more comments below. Thanks, David test/hotspot/jtreg/compiler/jvmti/libTriggerBuiltinExceptions.cpp line 37: > 35: jmethodID catch_method, jlocation catch_location) { > 36: exceptions_caught += 1; > 37: } IIUC this will count all exceptions occurring in any thread and in relation to any method. There are behind-the-scenes exceptions that can occur which may cause this to count exceptions not related to the test. I think you need to only count those thrown in the method of interest, for reliability. ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From dholmes at openjdk.java.net Thu Dec 16 22:17:09 2021 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 16 Dec 2021 22:17:09 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 20:51:17 GMT, Leonid Mesnik wrote: >> Added a requirement for the c1 or c2. > > Really, we don't add @requires for jit compiler. There are a lot of tests that fail with Xnt. Tests outside of the compiler area which explicitly use features like ` WB.enqueueMethodForCompilation` which explicitly will fail if there is no JIT either require the JIT or exclude running with Zero. See for example: ./runtime/Nestmates/protectionDomain/TestDifferentProtectionDomains.java ./runtime/Unsafe/InternalErrorTest.java ./runtime/exceptionMsgs/AbstractMethodError/AbstractMethodErrorTest.java EDIT: except of course this test is in the compiler area . Okay perhaps overkill - sorry for the noise. 
------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From iveresov at openjdk.java.net Thu Dec 16 23:03:56 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Thu, 16 Dec 2021 23:03:56 GMT Subject: [jdk18] RFR: 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop Message-ID: There are a bunch of problems with `BlockListBuilder::mark_loops()` and how it handles irreducible loops. It doesn't really seem to be explicitly designed to handle those, however, it does handle most. One shape emitted by the Kotlin compiler in this particular case gives it trouble. The proper fix is to rewrite loop detection, detect irreducible loops, and switch off `SelectivePhiFunctions` if any are present. But given that we're close to the release, I'd like to add a bailout during phi insertion, and file an RFE to do the proper fix later. I wrote a minimal test to demonstrate the issue. Testing with hs-tier{1-7} is squeaky clean. ------------- Commit messages: - Remove tabs - Bailout during phi placement if odd control flow is present Changes: https://git.openjdk.java.net/jdk18/pull/40/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=40&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277447 Stats: 121 lines in 3 files changed: 121 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/40.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/40/head:pull/40 PR: https://git.openjdk.java.net/jdk18/pull/40 From kvn at openjdk.java.net Thu Dec 16 23:25:57 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 23:25:57 GMT Subject: [jdk18] RFR: 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 22:47:23 GMT, Igor Veresov wrote: > There are a bunch of problems with `BlockListBuilder::mark_loops()` and how it handles irreducible loops. It doesn't really seem to be explicitly designed to handle those, however, it does handle most. 
One shape emitted by the Kotlin compiler in this particular case gives it trouble. > > The proper fix is to rewrite loop detection, detect irreducible loops, and switch off `SelectivePhiFunctions` if any are present. But given that we're close to the release, I'd like to add a bailout during phi insertion, and file an RFE to do the proper fix later. > > I wrote a minimal test to demonstrate the issue. > > Testing with hs-tier{1-7} is squeaky clean. Okay. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/40 From kvn at openjdk.java.net Thu Dec 16 23:26:12 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 16 Dec 2021 23:26:12 GMT Subject: [jdk18] RFR: 8278796: Incorrect behavior of FloatVector.withLane on X86 [v2] In-Reply-To: References: Message-ID: <0-Q8W7fFWJgwCl8t-IcVbXZavkdO_LZesKStN1-TrLU=.a75da3d4-de27-4fc8-8872-6530143a4bd5@github.com> On Thu, 16 Dec 2021 13:08:29 GMT, Jatin Bhateja wrote: >> - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. >> - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. >> >> Kindly review and share your comments. >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278796: Extending the test cases to use dynamic insert index and value. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/28 From dlong at openjdk.java.net Thu Dec 16 23:58:25 2021 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 16 Dec 2021 23:58:25 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. 
I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Vladimir, the assert sounds like JDK-6868269. Is there an earlier null check that also needs the reexecute flag? ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From haosun at openjdk.java.net Fri Dec 17 00:44:54 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 17 Dec 2021 00:44:54 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR Message-ID: On ARM32, the "VSHL (register)" instruction [1] is shared by vector left shift and vector right shift, and the condition to distinguish them is whether the shift count value is positive or negative. Hence, a negation operation is needed before conducting a vector right shift. For vector right shift, the shift count can be a RShiftCntV or a normal vector node. Take test case Byte64VectorTests.java [2][3] as an example. Note that RShiftCntV is already negated via rules "vsrcntD" and "vsrcntX" whereas the normal vector node is NOT, since we don't know whether a normal vector node is used as a vector shift count or not. This is the root cause for these vector test failures. The fix is simple: move the negation from "vsrcntD|X" to the corresponding vector right shift rules. Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift related rules are in the form "vsAABB_CC", where 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic right shift). 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) and 2L (long type). 3) CC can be reg (register case) and immI (immediate case). Minor updates: 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules conduct the same duplication operation now. 2) Update the "match" primitive for vsraBB_immI rules. 3) Style issue: remove the surrounding space for "ins_pipe" primitive. Tests: We ran tier 1~3 tests on ARM32 platform. 
With this patch, previously failed vector test cases can pass now without introducing test regression. [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 ------------- Commit messages: - 8278267: ARM32: several vector test failures for ASHR Changes: https://git.openjdk.java.net/jdk18/pull/41/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=41&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278267 Stats: 490 lines in 1 file changed: 120 ins; 179 del; 191 mod Patch: https://git.openjdk.java.net/jdk18/pull/41.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/41/head:pull/41 PR: https://git.openjdk.java.net/jdk18/pull/41 From svkamath at openjdk.java.net Fri Dec 17 01:14:26 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Fri, 17 Dec 2021 01:14:26 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 07:24:11 GMT, Vladimir Kozlov wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Unfortunately I hit assert during CTW testing on Windows-x64 when compiling `com.sun.crypto.provider.GaloisCounterMode::implGCMCrypt`. 
> > > # Internal Error (t:\workspace\open\src\hotspot\share\opto\graphKit.cpp:250), pid=41320, tid=43308 > # assert(ex_map->jvms()->same_calls_as(_exceptions->jvms())) failed: all collected exceptions must come from the same place > > Current CompileTask: > C2: 7834 2950 b 4 com.sun.crypto.provider.GaloisCounterMode::implGCMCrypt (100 bytes) > > Stack: [0x000000c86a600000,0x000000c86a700000] > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [jvm.dll+0xbb90d1] os::platform_print_native_stack+0xf1 (os_windows_x86.cpp:235) > V [jvm.dll+0xdf904e] VMError::report+0x101e (vmError.cpp:828) > V [jvm.dll+0xdfaa4e] VMError::report_and_die+0x7fe (vmError.cpp:1656) > V [jvm.dll+0xdfb1d4] VMError::report_and_die+0x64 (vmError.cpp:1437) > V [jvm.dll+0x536f47] report_vm_error+0xb7 (debug.cpp:280) > V [jvm.dll+0x6da439] GraphKit::add_exception_states_from+0x119 (graphKit.cpp:286) > V [jvm.dll+0x414375] PredicatedIntrinsicGenerator::generate+0x7f5 (callGenerator.cpp:1358) > V [jvm.dll+0x5c3b62] Parse::do_call+0x9c2 (doCall.cpp:651) > V [jvm.dll+0xbe3745] Parse::do_one_bytecode+0x32b5 (parse2.cpp:2704) > V [jvm.dll+0xbd5ae7] Parse::do_one_block+0x437 (parse1.cpp:1557) > V [jvm.dll+0xbd463c] Parse::do_all_blocks+0x5cc (parse1.cpp:710) > V [jvm.dll+0xbd0c3d] Parse::Parse+0xc1d (parse1.cpp:616) > V [jvm.dll+0x413a15] ParseGenerator::generate+0xa5 (callGenerator.cpp:103) > V [jvm.dll+0x4eba90] Compile::Compile+0x1110 (compile.cpp:714) > > > The call stack has `PredicatedIntrinsicGenerator` so it seems related to changes. > > I got replay file and will try to reproduce it tomorrow. @vnkozlov Since, array allocation on heap is causing issues, would it be better if I update the code to have the temp array allocation on stack? The updated code will eliminate the need for re-execution. I have the change ready and can push it if you think it is alright to do so. Please do let me know your thoughts. Thanks. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From jbhateja at openjdk.java.net Fri Dec 17 03:10:28 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 17 Dec 2021 03:10:28 GMT Subject: [jdk18] Integrated: 8278796: Incorrect behavior of FloatVector.withLane on X86 In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:11:20 GMT, Jatin Bhateja wrote: > - Incorrect operand is being passed to insertps instruction which causes incorrectness issues in FloatVector.withLane operation. > - Existing JTREG test cases have been modified appropriately with a non-zero insertion index. > > Kindly review and share your comments. > Best Regards, > Jatin This pull request has now been integrated. Changeset: 8494fec6 Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk18/commit/8494fec665bfa51d1702827bd0aa4f4547e67729 Stats: 283 lines in 34 files changed: 94 ins; 0 del; 189 mod 8278796: Incorrect behavior of FloatVector.withLane on X86 Reviewed-by: sviswanathan, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/28 From kvn at openjdk.java.net Fri Dec 17 03:17:23 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 03:17:23 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 23:54:50 GMT, Dean Long wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Vladimir, the assert sounds like JDK-6868269. Is there an earlier null check that also needs the reexecute flag? @dean-long, there is a null check in the predicate code: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/callGenerator.cpp#L1268 This is the first time we use `PreserveReexecuteState` for an intrinsic with a predicate. 
But I am actually not sure if the failure I see is caused by these changes or an existing bug. But it causes concern. @smita-kamath yes, please prepare new changes with stack allocation and I will test it. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From kvn at openjdk.java.net Fri Dec 17 03:17:23 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 03:17:23 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Note, we can't change code in `PredicatedIntrinsicGenerator` because it is used by all intrinsics with a predicate and none do re-execution. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From jbhateja at openjdk.java.net Fri Dec 17 03:21:34 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 17 Dec 2021 03:21:34 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2] In-Reply-To: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> Message-ID: On Thu, 16 Dec 2021 17:46:35 GMT, Jatin Bhateja wrote: >> - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall back to the replicateB operation which broadcasts the mask value in a vector. >> - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. 
>> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278508: Review comments resolution. @svisva7 can you kindly be second reviewer. ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From sspitsyn at openjdk.java.net Fri Dec 17 03:26:32 2021 From: sspitsyn at openjdk.java.net (Serguei Spitsyn) Date: Fri, 17 Dec 2021 03:26:32 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> References: <0_iROrhelCBAS2sNgY6vkvDjCq0sh_5skuY41Zi_Lhg=.a1f21483-cdc1-40cd-98c3-c93adbdacbdb@github.com> Message-ID: On Thu, 16 Dec 2021 20:20:41 GMT, Evgeny Nikitin wrote: >> This PR contains a relatively simple test which verifies that JVMTI-agents are correctly informed about exceptions caught in C2-compiled code. The 8269574 introduces pre-allocated exceptions in some paths, so the test tries to produce a number of various exceptions and check that provided small JVMTI agent got notified about all of them. > > Evgeny Nikitin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Get rid of while-breaks > - Add JIT requirements > - Remove explicit type specifiers for own class static calls > - Remove redundant build directive > - Merge branch 'master' into JDK-8274982/public > - 8274982: Add a test for 8269574. Marked as reviewed by sspitsyn (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From dlong at openjdk.java.net Fri Dec 17 04:41:24 2021 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 17 Dec 2021 04:41:24 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Based on past bugs, it looks like to use new_array here, we need to pass /*deoptimize_on_exception=*/true as the optional 5th argument, just like inline_native_clone does. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From roland at openjdk.java.net Fri Dec 17 07:45:25 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 07:45:25 GMT Subject: [jdk18] RFR: 8278790: Inner loop of long loop nest runs for too few iterations [v2] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 09:17:40 GMT, Christian Hagedorn wrote: >> Roland Westrelin has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. > > Looks good! @chhagedorn @neliasso thanks for the reviews. ------------- PR: https://git.openjdk.java.net/jdk18/pull/35 From roland at openjdk.java.net Fri Dec 17 07:48:27 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 07:48:27 GMT Subject: [jdk18] Integrated: 8278790: Inner loop of long loop nest runs for too few iterations In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 08:44:59 GMT, Roland Westrelin wrote: > Given a counted loop that iterates in [A, Z), when long range checks > are transformed into int range checks, a loop nest is created and > the inner loop iterates in [0, Z2). 
> > The limits of the inner loop are adjusted to guarantee no overflow for > the range of values of the inner loop. That is for a range check: > > i * scale + offset > 1) the bounds of the inner loop are adjusted to roughly [0, > max_jint/scale). > > Also, we don't want to lose what we know about the bounds of the loop > being transformed. > > 2) So the bounds of the inner loop are also adjusted to [0, min(Z2, Z - A)) > > The bug here is that 2) is performed before 1). This was spotted with > a microbenchmark where the initial loop had only ~2000 > iterations. The transformed loop is expected to run for the same 2000 > iterations but instead ran for 2000/scale iterations. This pull request has now been integrated. Changeset: bb7efb35 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk18/commit/bb7efb3517b0ac66a55607c14aae3aef1f11c892 Stats: 9 lines in 1 file changed: 5 ins; 4 del; 0 mod 8278790: Inner loop of long loop nest runs for too few iterations Reviewed-by: chagedorn, neliasso ------------- PR: https://git.openjdk.java.net/jdk18/pull/35 From roland at openjdk.java.net Fri Dec 17 07:50:40 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 07:50:40 GMT Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 21:16:43 GMT, Dean Long wrote: >> The bug and fix were discussed in a previous PR: >> >> https://github.com/openjdk/jdk/pull/6572 >> >> I pushed all commits from that PR on top of jdk 18 and added a couple >> extra tests as suggested in: >> >> https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590 > > Test results look good. @dean-long @vnkozlov thanks for the reviews. 
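[Editorial sketch for the 8278790 integration notice above.] The ordering problem described there (steps 1) and 2) of the inner-loop bound adjustment) can be illustrated with plain integer arithmetic. This is only a sketch under assumptions: the class and method names are hypothetical, `span` stands for Z - A, `scale` is assumed positive, and the "wrong order" variant merely models the reported symptom (a trip count of roughly span/scale) rather than reproducing the actual C2 code.

```java
final class InnerLoopLimitDemo {
    // Intended order: first cap the limit so i * scale cannot overflow an int
    // (step 1), then clamp it by what is known about the original loop (step 2).
    public static int innerLimit(int span, int scale) {
        int overflowSafe = Integer.MAX_VALUE / scale; // step 1: roughly max_jint/scale
        return Math.min(overflowSafe, span);          // step 2: min(Z2, Z - A)
    }

    // Model of the symptom only: the span clamp is applied first and the scale
    // adjustment afterwards, so an already small limit shrinks again.
    public static int innerLimitWrongOrder(int span, int scale) {
        int clamped = Math.min(Integer.MAX_VALUE, span); // step 2 applied too early
        return clamped / scale;                          // step 1 applied to the clamped value
    }

    public static void main(String[] args) {
        // Values taken from the message: about 2000 iterations; scale 4 is an arbitrary choice.
        System.out.println(innerLimit(2000, 4));
        System.out.println(innerLimitWrongOrder(2000, 4));
    }
}
```

With a span of 2000 the first method keeps the full trip count, while the second one drops to span/scale, which matches the behaviour the message reports for the microbenchmark.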
------------- PR: https://git.openjdk.java.net/jdk18/pull/29 From roland at openjdk.java.net Fri Dec 17 07:50:42 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 07:50:42 GMT Subject: [jdk18] Integrated: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote: > The bug and fix were discussed in a previous PR: > > https://github.com/openjdk/jdk/pull/6572 > > I pushed all commits from that PR on top of jdk 18 and added a couple > extra tests as suggested in: > > https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590 This pull request has now been integrated. Changeset: b9a477bf Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk18/commit/b9a477bf19d9f276f6b1da8984eb56d7bd5fc137 Stats: 117 lines in 3 files changed: 117 ins; 0 del; 0 mod 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert Reviewed-by: dlong, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/29 From neliasso at openjdk.java.net Fri Dec 17 07:58:35 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Fri, 17 Dec 2021 07:58:35 GMT Subject: [jdk18] RFR: 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 22:47:23 GMT, Igor Veresov wrote: > There are a bunch of problems with `BlockListBuilder::mark_loops()` and how it handles irreducible loops. It doesn't really seem to be explicitly designed to handle those, however, it does handle most. One shape emitted by the Kotlin compiler in this particular case gives it trouble. > > The proper fix is to rewrite loop detection, detect irreducible loops, and switch off `SelectivePhiFunctions` if any are present. But given that we're close to the release, I'd like to add a bailout during phi insertion, and file an RFE to do the proper fix later. 
> > I wrote a minimal test to demonstrate the issue. > > Testing with hs-tier{1-7} is squeaky clean. Looks good. ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/40 From roland at openjdk.java.net Fri Dec 17 08:11:47 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 08:11:47 GMT Subject: RFR: 8278949: Cleanups for 8277850 Message-ID: When 8277850 (C2: optimize mask checks in counted loops) was reviewed, John made a number of comments and suggestions after the change was integrated. This change includes all of his comments, extra tests to cover all cases. I also moved the AndIL_add_shift_and_mask() call in AndXNode::Ideal() up so the expression with a non constant mask can be optimized as well. ------------- Commit messages: - whitespaces - John's comments Changes: https://git.openjdk.java.net/jdk/pull/6876/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6876&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278949 Stats: 333 lines in 3 files changed: 291 ins; 16 del; 26 mod Patch: https://git.openjdk.java.net/jdk/pull/6876.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6876/head:pull/6876 PR: https://git.openjdk.java.net/jdk/pull/6876 From sergei.tsypanov at yandex.ru Fri Dec 17 08:27:25 2021 From: sergei.tsypanov at yandex.ru (=?utf-8?B?0KHQtdGA0LPQtdC5INCm0YvQv9Cw0L3QvtCy?=) Date: Fri, 17 Dec 2021 10:27:25 +0200 Subject: C2 produces redundant (?) 
assembly for while-loop in certain cases (JDK-8278518) Message-ID: <22646071639729645@iva3-49b9eb45691c.qloud-c.yandex.net> Hello, a week ago I filed the issue called JDK-8278518 String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination for the case originally discovered by Amir Hadadi in https://stackoverflow.com/questions/70272651/missing-bounds-checking-elimination-in-string-constructor Consider the part of the code of String(byte[], int, int, Charset): ----------------------------------------------------------------------------------- while (offset < sl) { int b1 = bytes[offset]; if (b1 >= 0) { dst[dp++] = (byte)b1; offset++; // <--- continue; } if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) && offset + 1 < sl) { int b2 = bytes[offset + 1]; if (!isNotContinuation(b2)) { dst[dp++] = (byte)decode2(b1, b2); offset += 2; continue; } } // anything not a latin1, including the repl // we have to go with the utf16 break; } ----------------------------------------------------------------------------------- Originally we were sure, that compiler doesn't eliminate bounds check for accessing byte[] regardless of predictable value of 'offset', see a part of LinuxPerfAsmProfiler output: ----------------------------------------------------------------------------------- 3.62% ?? ? 0x00007fed70eb4c1c: mov %ebx,%ecx 2.29% ?? ? 0x00007fed70eb4c1e: mov %edx,%r9d 2.22% ?? ? 0x00007fed70eb4c21: mov (%rsp),%r8 ;*iload_2 {reexecute=0 rethrow=0 return_oop=0} ?? ? ; - java.lang.String::<init>@107 (line 537) 2.32% ?? ? 0x00007fed70eb4c25: cmp %r13d,%ecx ? ? 0x00007fed70eb4c28: jge 0x00007fed70eb5388 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ? ? ; - java.lang.String::<init>@110 (line 537) 3.05% ? ? 0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx ? ? 0x00007fed70eb4c32: jae 0x00007fed70eb5319 2.38% ? ? 0x00007fed70eb4c38: mov %r8,(%rsp) 2.64% ? ? 0x00007fed70eb4c3c: movslq %ecx,%r8 2.46% ? ? 0x00007fed70eb4c3f: mov %rax,%rbx 3.44% ? ? 
0x00007fed70eb4c42: sub %r8,%rbx 2.62% ? ? 0x00007fed70eb4c45: add $0x1,%rbx 2.64% ? ? 0x00007fed70eb4c49: and $0xfffffffffffffffe,%rbx 2.30% ? ? 0x00007fed70eb4c4d: mov %ebx,%r8d 3.08% ? ? 0x00007fed70eb4c50: add %ecx,%r8d 2.55% ? ? 0x00007fed70eb4c53: movslq %r8d,%r8 2.45% ? ? 0x00007fed70eb4c56: add $0xfffffffffffffffe,%r8 2.13% ? ? 0x00007fed70eb4c5a: cmp (%rsp),%r8 ? ? 0x00007fed70eb4c5e: jae 0x00007fed70eb5319 3.36% ? ? 0x00007fed70eb4c64: mov %ecx,%edi ;*aload_1 {reexecute=0 rethrow=0 return_oop=0} ? ? ; - java.lang.String::<init>@113 (line 538) 2.86% ? ?? 0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0} ? ?? ; - java.lang.String::<init>@115 (line 538) 2.48% ? ?? 0x00007fed70eb4c6c: mov %r9d,%edx 2.26% ? ?? 0x00007fed70eb4c6f: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ? ?? ; - java.lang.String::<init>@127 (line 540) 3.28% ? ?? 0x00007fed70eb4c71: mov %edi,%ebx 2.44% ? ?? 0x00007fed70eb4c73: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ? ?? ; - java.lang.String::<init>@134 (line 541) 2.35% ? ?? 0x00007fed70eb4c75: test %r8d,%r8d ? ?? 0x00007fed70eb4c78: jge 0x00007fed70eb4c04 ;*iflt {reexecute=0 rethrow=0 return_oop=0} ?? ; - java.lang.String::<init>@120 (line 539) ----------------------------------------------------------------------------------- If we change while-loop condition from while (offset < sl) to while (offset >= 0 && offset < sl) the part of the instructions between if_icmpge and aload_1 in baseline disappears: ----------------------------------------------------------------------------------- 17.28% ?? 0x00007f6b88eb6061: mov %edx,%r10d ;*iload_2 {reexecute=0 rethrow=0 return_oop=0} ?? ; - java.lang.String::<init>@107 (line 537) 0.11% ?? 0x00007f6b88eb6064: test %r10d,%r10d ? 0x00007f6b88eb6067: jl 0x00007f6b88eb669c ;*iflt {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@108 (line 537) 0.39% ? 0x00007f6b88eb606d: cmp %r13d,%r10d ? 
0x00007f6b88eb6070: jge 0x00007f6b88eb66d0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@114 (line 537) 0.66% ? 0x00007f6b88eb6076: mov %ebx,%r9d 13.70% ? 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d 0.01% ? 0x00007f6b88eb607e: jae 0x00007f6b88eb6671 0.14% ? 0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi ;*baload {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@119 (line 538) 0.37% ? 0x00007f6b88eb608a: mov %r9d,%ebx 0.99% ? 0x00007f6b88eb608d: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@131 (line 540) 12.88% ? 0x00007f6b88eb608f: movslq %r9d,%rsi ;*bastore {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@196 (line 548) 0.17% ? 0x00007f6b88eb6092: mov %r10d,%edx 0.39% ? 0x00007f6b88eb6095: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ? ; - java.lang.String::<init>@138 (line 541) 0.96% ? 0x00007f6b88eb6097: test %edi,%edi 0.02% ? 0x00007f6b88eb6099: jl 0x00007f6b88eb60dc ;*iflt {reexecute=0 rethrow=0 return_oop=0} ----------------------------------------------------------------------------------- While this can be fixed on Java side (https://github.com/openjdk/jdk/pull/6812), the author of SO question pointed out (and I agree) that this is likely to be a HotSpot compiler issue and should be fixed on JVM side. Moreover, in the comments to my PR it was mentioned, that my original speculation is wrong, and bounds check is still here: 13.70% ? 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d 0.01% ? 0x00007f6b88eb607e: jae 0x00007f6b88eb6671 so the strange assembly code is something redundant. For this I've got two questions: 1) Is my original theory about bounds check wrong and if it indeed is, then what is that disappearing assembly part about? 2) Should this be fixed on JVM or Java side? If we choose JVM then the issue needs to be reassigned to someone else, because I'm not able to elaborate and test the proper fix. 
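For anyone who wants to look at the two loop shapes outside of java.lang.String, a reduced reproducer could look like the sketch below. This is only an illustration under assumptions: the class and method names are made up, only the ASCII fast path of the loop is kept, and it has not been verified that C2 emits the same assembly for it as for the String constructor.

```java
final class LoopShapeSketch {
    // Baseline shape: the loop is guarded only by the upper bound,
    // as in the String(byte[], int, int, Charset) excerpt above.
    static int copyAsciiPrefix(byte[] bytes, int offset, int sl, byte[] dst) {
        int dp = 0;
        while (offset < sl) {
            int b1 = bytes[offset];
            if (b1 < 0) {
                break; // first non-ASCII byte ends the fast path
            }
            dst[dp++] = (byte) b1;
            offset++;
        }
        return dp;
    }

    // Variant with the explicit lower-bound test in the loop condition.
    // For non-negative offsets it returns the same result as the baseline.
    static int copyAsciiPrefixGuarded(byte[] bytes, int offset, int sl, byte[] dst) {
        int dp = 0;
        while (offset >= 0 && offset < sl) {
            int b1 = bytes[offset];
            if (b1 < 0) {
                break;
            }
            dst[dp++] = (byte) b1;
            offset++;
        }
        return dp;
    }

    public static void main(String[] args) {
        byte[] input = {'a', 'b', 'c', (byte) 0xc3, (byte) 0xa9, 'd'};
        byte[] dst = new byte[input.length];
        System.out.println(copyAsciiPrefix(input, 0, input.length, dst));
        System.out.println(copyAsciiPrefixGuarded(input, 0, input.length, dst));
    }
}
```

Running each method in a warmed-up loop under a profiler such as LinuxPerfAsmProfiler would be one way to compare the generated code for the two conditions.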
Regards, Sergey Tsypanov From neliasso at openjdk.java.net Fri Dec 17 08:56:47 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Fri, 17 Dec 2021 08:56:47 GMT Subject: RFR: 8278909: Unproblemlist AdaptiveBlocking001 Message-ID: Hi, vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java was problemlisted in 17. This problem might have been fixed. I suggest unproblemlisting it now in early 19 time frame so that we might get enough data if this problem still occurs. Regards, Nils ------------- Commit messages: - Unproblemlist AdaptiveBlocking001 Changes: https://git.openjdk.java.net/jdk/pull/6865/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6865&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278909 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6865.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6865/head:pull/6865 PR: https://git.openjdk.java.net/jdk/pull/6865 From fgao at openjdk.java.net Fri Dec 17 09:08:24 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 17 Dec 2021 09:08:24 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4] In-Reply-To: <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> Message-ID: <-KGTSuoH4p3Pm-30uHPk2tS2JqijS-CMw-wHFUed0Aw=.9853d68e-010c-48cc-80b5-c722d1777a67@github.com> On Thu, 16 Dec 2021 10:31:00 GMT, Fei Gao wrote: >>> > The PR optimizes abs operations in the C2 middle end. Can I have your review please? >>> >>> So what's the performance data before and after this patch? Does it also benefit on x86? >>> >>> It would be better to provide a jmh micro benchmark. Thanks. >> >> Thanks, @DamonFool . Yes, it's supposed to benefit all archs. 
>> For example, here is the performance data on x86.
>>
>> Before the patch:
>> Benchmark                 (seed)   Mode  Cnt       Score       Error   Units
>> MathBench.absConstantInt       0  thrpt    5  291960.380 ± 10724.572  ops/ms
>>
>> After the patch:
>> Benchmark                 (seed)   Mode  Cnt       Score       Error   Units
>> MathBench.absConstantInt       0  thrpt    5  336271.533 ±  3778.210  ops/ms
>>
>> The jmh micro benchmark testcase has been added in the latest commit.
>
>> Hi @fg1417 ,
>>
>> Thanks for your update.
>>
>> Now I see that you are trying to optimize the following three abs() patterns:
>>
>> 1. Math.abs(-38)
>> 2. (char) Math.abs((char) c)
>> 3. Math.abs(0 - x)
>>
>> But did you see these code patterns in real programs? I'm a bit worried that we just improve the complexity of C2 with (almost) no performance gain in the real world. Thanks.
>
> Thanks for your review, @DamonFool . I really understand your concern.
>
> In terms of complexity, the change only involves AbsNode and doesn't modify any other code part. I don't think it will make C2 more complex.
>
> As for performance gain in the real world, as we all know, the ability of GVN to optimize a node often depends on the optimized result of its input nodes. For example, if the input node of one AbsNode is recognized as a constant after last round of GVN optimization, now, we can optimize abs(constant) to a simple constant value. Like C2 did in https://github.com/openjdk/jdk/blob/0bddd8af61b6c731f16b857c09de57ceefd72d06/src/hotspot/share/opto/subnode.cpp#L55, we may not see -(-x) or (x+y)-y in any java program directly, but it's possible after C2 optimization. Whether the optimization to sub or to abs is trivial, low-cost but useful. Why not apply it :)
>
> Math.abs(-38) , (char) Math.abs((char) c) and Math.abs(0 - x) are just conformance testcases. As you said, maybe nobody writes these cases in the real world. These testcases are just simulating all possible scenarios that AbsNode may meet, to guarantee the correctness of the optimization.
>
> What do you think :)
>
> Thanks.
>
> Then, shall we also opt cases like `Math.abs(-1 * x)`, `Math.abs(x / (-1))`, and so on? Thanks.

Hi, @DamonFool . Actually, the cases you listed above, `Math.abs(-1 * x)`, `Math.abs(x / (-1))`, are covered by the optimized pattern, `Math.abs(0-x)`. In C2, `-1 * x` is going to be `0 -
x` after GVN optimization in MulNode. C2 takes `-1 * x` as `0 - (1*x)` first, and Identity here, https://github.com/openjdk/jdk/blob/0bddd8af61b6c731f16b857c09de57ceefd72d06/src/hotspot/share/opto/mulnode.cpp#L50, will combine `1*x` to x. After that, `0 - x` matches the pattern that AbsNode can recognize. `Math.abs(x / (-1))` , too.

As I mentioned before, the AbsNode optimization doesn't work as a standalone pass and it's combined very closely to its input nodes. Instruction sequence changes as GVN repeats and our ideal pattern may occur.

Your cases also help prove that these several patterns, like abs(0-x) and abs(positive_value), are very fundamental and common. That's why we choose them.

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6755

From chagedorn at openjdk.java.net Fri Dec 17 09:09:30 2021
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 17 Dec 2021 09:09:30 GMT
Subject: RFR: 8278909: Unproblemlist AdaptiveBlocking001
In-Reply-To:
References:
Message-ID:

On Thu, 16 Dec 2021 15:43:41 GMT, Nils Eliasson wrote:

> Hi,
>
> vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java was problemlisted in 17. This problem might have been fixed. I suggest unproblemlisting it now in early 19 time frame so that we might get enough data if this problem still occurs.
>
> Regards,
> Nils

That seems reasonable.

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6865

From chagedorn at openjdk.java.net Fri Dec 17 09:23:26 2021
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 17 Dec 2021 09:23:26 GMT
Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 16 Dec 2021 16:42:35 GMT, Ludvig Janiuk wrote:

>> IR::verify iterates the whole object graph. This proves costly when used in e.g.
BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. >> >> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. >> >> Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). > > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > cleanup2 Thanks for doing the updates and pulling the numbers together! > And again, it's important to remember that the point of this fix is to protect against "worst cases" with e.g. functions with very long bytecode. The compile time improvement in the general case, if any, is only a bonus. Yes, exactly. I think as long as are not noticeably worse on average (which we seem not to be according to your evaluation), we're good to go since this only affects debug builds. Maybe someone else can also comment on the `ASSERT` vs. `NOT_PRODUCT` usage. I think we should either convert everything to one or the other but not mix it since all the verification code belongs together. Thinking about it again, `ASSERT` might be preferable to not affect the optimized build (as `IR::verify()` was guarded by `ASSERT` before). 
src/hotspot/share/c1/c1_IR.cpp line 1401:

> 1399:
> 1400:   for (int i = 0; i < block->end()->number_of_sux(); i++) {
> 1401:     if (blocks.contains(block->end()->sux_at(i))) { continue; }

I think it's better to move the `continue` to a new line.

src/hotspot/share/c1/c1_Optimizer.cpp line 441:

> 439:
> 440: #ifndef PRODUCT
> 441:   _hir->verify_local(blocks_to_verify_later);

You can also use `NOT_PRODUCT()` here.

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6850

From jiefu at openjdk.java.net Fri Dec 17 09:42:26 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 17 Dec 2021 09:42:26 GMT
Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v4]
In-Reply-To: <-KGTSuoH4p3Pm-30uHPk2tS2JqijS-CMw-wHFUed0Aw=.9853d68e-010c-48cc-80b5-c722d1777a67@github.com>
References: <5tL6bmguT-7Wcm_WpWXUhxIQhjtGp9UqZvgH1jD6FbU=.9cc0d6ce-d322-409b-8ce4-066a18997f43@github.com> <301NmpzddpO60UwM69o7a-R2lofFMk00IpYhYKbCQP8=.24d4defe-d043-4803-aa0a-d6967fb3301d@github.com> <-KGTSuoH4p3Pm-30uHPk2tS2JqijS-CMw-wHFUed0Aw=.9853d68e-010c-48cc-80b5-c722d1777a67@github.com>
Message-ID:

On Fri, 17 Dec 2021 09:04:39 GMT, Fei Gao wrote:

> Your cases also help prove that these several patterns, like abs(0-x) and abs(positive_value), are very fundamental and common.

I don't think so.

In real programs, I will never write code like `Math.abs(0 - x)`, `Math.abs(-1 * x)` and `Math.abs(x / (-1))`.

How about improving your micro-benchmark to show the performance gain? To be honest, benchmarking with `Math.abs(-3)` seems strange to me since I don't think people will write that code. So I would suggest writing a jmh test which may be used in real programs.

Thanks.
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From neliasso at openjdk.java.net Fri Dec 17 10:46:29 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Fri, 17 Dec 2021 10:46:29 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote: > On the fallthrough path from an AllocateArray, the length of the > allocated array is casted (with a CastII) to [0, max_size] with > max_size some number that depends on the array type and can be less > than max_jint. > > Allocating an array of a length that's not in [0, max_size] causes the > CastII to become top. The fallthrough path must be killed as well in > that case otherwise this can lead to a broken graph. Currently c2 has > logic to protect against an allocation of array of negative size in > AllocateArrayNode::Ideal(). That call replaces the fallthrough path > with an Halt node. But if the size is too big, then the fallthrough > path is left as is. > > This patch fixes that issues. It also reworks the length negative > case. I added a Bool/CmpU input to the AllocateArray that tests for a > valid length. If that input becomes false, CatchNode::Value() kills > the fallthrough path. That logic is similar to that for a virtual call > with a null receiver. I also removed AllocateArrayNode::Ideal() now > that CatchNode::Value() takes care of the same corner case. The code > in AllocateArrayNode::Ideal() was added by Vladimir and he told me he > tried extending CatchNode::Value() at the time but that caused test > failures. I had no issues in my testing so I assume doing it that way > is ok now. 
> > The new input to AllocateArray is moved to the CallStaticJava runtime > call for array allocation on macro expansion as a precedence edge. The > reason for that is that final graph reshape needs a way to tell > whether the missing path out of the allocation is legal or not. final > graph reshape then removes the then useless precedence edge. Passes testing tier1-3 ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From fgao at openjdk.java.net Fri Dec 17 11:29:56 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 17 Dec 2021 11:29:56 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6] In-Reply-To: References: Message-ID: > The patch aims to help optimize Math.abs() mainly from these three parts: > 1) Remove redundant instructions for abs with constant values > 2) Remove redundant instructions for abs with char type > 3) Convert some common abs operations to ideal forms > > 1. Remove redundant instructions for abs with constant values > > If we can decide the value of the input node for function Math.abs() > at compile-time, we can substitute the Abs node with the absolute > value of the constant and don't have to calculate it at runtime. > > For example, > int[] a > for (int i = 0; i < SIZE; i++) { > a[i] = Math.abs(-38); > } > > Before the patch, the generated code for the testcase above is: > ... > mov w10, #0xffffffda > cmp w10, wzr > cneg w17, w10, lt > dup v16.8h, w17 > ... > After the patch, the generated code for the testcase above is : > ... > movi v16.4s, #0x26 > ... > > 2. Remove redundant instructions for abs with char type > > In Java semantics, as the char type is always non-negative, we > could actually remove the absI node in the C2 middle end. > > As for vectorization part, in current SLP, the vectorization of > Math.abs() with char type is intentionally disabled after > JDK-8261022 because it generates incorrect result before. 
After > removing the AbsI node in the middle end, Math.abs(char) can be > vectorized naturally. > > For example, > > char[] a; > char[] b; > for (int i = 0; i < SIZE; i++) { > b[i] = (char) Math.abs(a[i]); > } > > Before the patch, the generated assembly code for the testcase > above is: > > B15: > add x13, x21, w20, sxtw #1 > ldrh w11, [x13, #16] > cmp w11, wzr > cneg w10, w11, lt > strh w10, [x13, #16] > ldrh w10, [x13, #18] > cmp w10, wzr > cneg w10, w10, lt > strh w10, [x13, #18] > ... > add w20, w20, #0x1 > cmp w20, w17 > b.lt B15 > > After the patch, the generated assembly code is: > B15: > sbfiz x18, x19, #1, #32 > add x0, x14, x18 > ldr q16, [x0, #16] > add x18, x21, x18 > str q16, [x18, #16] > ldr q16, [x0, #32] > str q16, [x18, #32] > ... > add w19, w19, #0x40 > cmp w19, w17 > b.lt B15 > > 3. Convert some common abs operations to ideal forms > > The patch overrides some virtual support functions for AbsNode > so that optimization of gvn can work on it. Here are the optimizable > forms: > > a) abs(0 - x) => abs(x) > > Before the patch: > ... > ldr w13, [x13, #16] > neg w13, w13 > cmp w13, wzr > cneg w14, w13, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... > > b) abs(abs(x)) => abs(x) > > Before the patch: > ... > ldr w12, [x12, #16] > cmp w12, wzr > cneg w12, w12, lt > cmp w12, wzr > cneg w12, w12, lt > ... > After the patch: > ... > ldr w13, [x13, #16] > cmp w13, wzr > cneg w13, w13, lt > ... 
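The rewrites described in the quoted PR text rest on a few Java-level identities of Math.abs(int). As a rough sanity check, they can be exercised with a small sketch like the one below. The class and method names are hypothetical and are not part of the patch or its tests; the sketch only checks the arithmetic, not what C2 actually generates.

```java
final class AbsIdentitiesSketch {
    // abs(0 - x) == abs(x): holds even for Integer.MIN_VALUE, where both
    // sides wrap around and evaluate to Integer.MIN_VALUE.
    static boolean negationIsAbsorbed(int x) {
        return Math.abs(0 - x) == Math.abs(x);
    }

    // abs(abs(x)) == abs(x)
    static boolean absIsIdempotent(int x) {
        return Math.abs(Math.abs(x)) == Math.abs(x);
    }

    // A char is never negative after promotion to int, so abs is a no-op.
    static boolean absOfCharIsIdentity(char c) {
        return (char) Math.abs(c) == c;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (!negationIsAbsorbed(x) || !absIsIdempotent(x)) {
                throw new AssertionError("identity failed for " + x);
            }
        }
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (!absOfCharIsIdentity((char) c)) {
                throw new AssertionError("char identity failed for " + c);
            }
        }
        // The min_jint special case: Math.abs does not produce a positive value.
        if (Math.abs(Integer.MIN_VALUE) != Integer.MIN_VALUE) {
            throw new AssertionError("unexpected abs(MIN_VALUE)");
        }
        System.out.println("all identities hold");
    }
}
```

The Integer.MIN_VALUE sample matters because it is the one input where a naive constant fold through C++ abs() would be undefined behavior.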
Fei Gao has updated the pull request incrementally with one additional commit since the last revision: Use uabs() to calculate the absolute value of constant Change-Id: Ie6f37ab159fb7092e1443b9af8d620562a45ae47 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6755/files - new: https://git.openjdk.java.net/jdk/pull/6755/files/2d122a2d..e254d9f7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6755&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6755&range=04-05 Stats: 6 lines in 1 file changed: 0 ins; 4 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6755.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6755/head:pull/6755 PR: https://git.openjdk.java.net/jdk/pull/6755 From fgao at openjdk.java.net Fri Dec 17 11:30:00 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 17 Dec 2021 11:30:00 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v5] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 10:01:41 GMT, Andrew Haley wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Add a jmh benchmark case >> >> Change-Id: I64938d543126c2e3f9fad8ffc4a50e25e4473d8f >> - Merge branch 'master' of github.com:fg1417/jdk into fg8276673 >> >> Change-Id: I71987594e9288a489a04de696e69a62f4ad19357 >> - Merge branch 'master' into fg8276673 >> >> Change-Id: I5e3898054b75f49653b8c3b37e4f5007675fa963 >> - 8276673: Optimize abs operations in C2 compiler >> >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. 
Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. >> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. >> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. >> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. 
Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd > > src/hotspot/share/opto/subnode.cpp line 1854: > >> 1852: // Special case for min_jint: Math.abs(min_jint) = min_jint. >> 1853: // Do not use C++ abs() for min_jint to avoid undefined behavior. >> 1854: return (ti->is_con(min_jint)) ? TypeInt::MIN : TypeInt::make(abs(ti->get_con())); > > Suggestion: > > return TypeInt::make(uabs(ti->get_con()); > > We have uabs() for julong and unsigned int. Thanks for your review. Fixed. ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From duke at openjdk.java.net Fri Dec 17 12:29:20 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Fri, 17 Dec 2021 12:29:20 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v3] In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 16:42:35 GMT, Ludvig Janiuk wrote: >> IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. 
>> >> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. >> >> Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). > > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > cleanup2 I'll wait for the final word on ASSERT vs PRODUCT ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From enikitin at openjdk.java.net Fri Dec 17 13:07:19 2021 From: enikitin at openjdk.java.net (Evgeny Nikitin) Date: Fri, 17 Dec 2021 13:07:19 GMT Subject: RFR: 8274982: Add a test for 8269574. [v2] In-Reply-To: References: Message-ID: <6ChWczwTF_sDpFJ3oGDKt_nMNuX2d638hG0kMimhkWM=.8626f9e6-34aa-4518-8358-8557c0e83798@github.com> On Wed, 10 Nov 2021 20:49:54 GMT, Vladimir Kozlov wrote: > What testing was done? What testing tiers were run? I've run with intentionally broken prod. code (the 8269574 unfixed) - the test catches the problem. Additionally, I've run the test and tiers 1-2-3. ------------- PR: https://git.openjdk.java.net/jdk/pull/5889 From roland at openjdk.java.net Fri Dec 17 14:01:38 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 14:01:38 GMT Subject: RFR: 8278228: C2: Improve identical back-to-back if elimination Message-ID: C2 has had the ability to optimize: (1) if (some_condition) { // body 1 } else { // body 2 } if (some_condition) { // body 3 } else { // body 4 } into: (4) if (some_condition) { // body 1 // body 3 } else { // body 2 // body 4 } for a while. 
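At the Java source level, shapes (1) and (4) can be sketched as below. This is only an illustration under assumptions: the class and method names are made up, and the string appends merely stand in for "body 1" to "body 4". It shows that the two shapes are observably equivalent, not how C2 performs the transformation.

```java
final class BackToBackIfSketch {
    // Shape (1): two consecutive ifs that test the same condition.
    static String separate(boolean someCondition) {
        StringBuilder sb = new StringBuilder();
        if (someCondition) {
            sb.append("1"); // body 1
        } else {
            sb.append("2"); // body 2
        }
        if (someCondition) {
            sb.append("3"); // body 3
        } else {
            sb.append("4"); // body 4
        }
        return sb.toString();
    }

    // Shape (4): the second if has been folded into the first one.
    static String merged(boolean someCondition) {
        StringBuilder sb = new StringBuilder();
        if (someCondition) {
            sb.append("1"); // body 1
            sb.append("3"); // body 3
        } else {
            sb.append("2"); // body 2
            sb.append("4"); // body 4
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(separate(true) + " " + merged(true));
        System.out.println(separate(false) + " " + merged(false));
    }
}
```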
This is achieved by the intermediate step:

(2)

if (some_condition) {
  // body 1
  some_condition2 = true;
} else {
  // body 2
  some_condition2 = false;
}
if (some_condition2) {
  // body 3
} else {
  // body 4
}

which then allows the use of the existing split if optimization. As a result, the graph is transformed to:

(3)

if (some_condition) {
  // body 1
  some_condition2 = true;
  if (some_condition2) {
    body3: // a Region here
    // body3
  } else {
    goto body4;
  }
} else {
  // body 2
  some_condition2 = false;
  if (some_condition2) {
    goto body3;
  } else {
    body4: // another Region here
    // body4;
  }
}

and finally to (4) above.

Recently, 8275610 has shown that this can break if some_condition is a null check. If, say, body 3 has a control dependent CastPP, then when body 1 and body 3 are merged, the CastPP of body 3 doesn't become control dependent on the dominating if (because in step (2), the CastPP hides behind a Region). As a result, the CastPP loses its dependency on the null check.

After discussing this with Christian, it seemed this was caused by the way this transformation relies on split if: having custom code that wouldn't create Regions at body3 and body4 that are then optimized out would solve the problem. Anyway, after looking at the split if code, trying to figure out how to tease it apart in smaller steps and reusing some of them to build a new transformation, it seemed too complicated.

So instead, I propose reusing split if in a slightly different way: skip step (2) but perform split if anyway to obtain:

if (some_condition) {
  // body 1
  if (some_condition) {
    body3: // Region1
    // CastPP is here, control dependent on Region1
    // body3
  } else {
    goto body4;
  }
} else {
  // body 2
  if (some_condition) {
    goto body3;
  } else {
    body4: // Region2
    // body4;
  }
}

- A CastPP node would still be behind a Region.
So the next step is to push control dependent nodes through Region1 and Region2:

if (some_condition) {
  // body 1
  if (some_condition) {
    // A CastPP here
    body3: // Region1
    // body3
  } else {
    goto body4;
  }
} else {
  // body 2
  if (some_condition) {
    // A CastPP here
    goto body3;
  } else {
    body4: // Region2
    // body4;
  }
}

- And then call dominated_by() to optimize the dominated:
  if (some_condition) {
  in both branches of the dominating
  if (some_condition) {.
  That also causes the CastPP to become dependent on the dominating if.

-------------

Commit messages:
 - more
 - more
 - more
 - more
 - exps

Changes: https://git.openjdk.java.net/jdk/pull/6882/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6882&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8278228
  Stats: 157 lines in 8 files changed: 121 ins; 15 del; 21 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6882.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6882/head:pull/6882

PR: https://git.openjdk.java.net/jdk/pull/6882

From neliasso at openjdk.java.net Fri Dec 17 14:37:26 2021
From: neliasso at openjdk.java.net (Nils Eliasson)
Date: Fri, 17 Dec 2021 14:37:26 GMT
Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large
In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com>
References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com>
Message-ID:

On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote:

> On the fallthrough path from an AllocateArray, the length of the
> allocated array is casted (with a CastII) to [0, max_size] with
> max_size some number that depends on the array type and can be less
> than max_jint.
>
> Allocating an array of a length that's not in [0, max_size] causes the
> CastII to become top. The fallthrough path must be killed as well in
> that case otherwise this can lead to a broken graph.
Currently c2 has > logic to protect against an allocation of array of negative size in > AllocateArrayNode::Ideal(). That call replaces the fallthrough path > with an Halt node. But if the size is too big, then the fallthrough > path is left as is. > > This patch fixes that issues. It also reworks the length negative > case. I added a Bool/CmpU input to the AllocateArray that tests for a > valid length. If that input becomes false, CatchNode::Value() kills > the fallthrough path. That logic is similar to that for a virtual call > with a null receiver. I also removed AllocateArrayNode::Ideal() now > that CatchNode::Value() takes care of the same corner case. The code > in AllocateArrayNode::Ideal() was added by Vladimir and he told me he > tried extending CatchNode::Value() at the time but that caused test > failures. I had no issues in my testing so I assume doing it that way > is ok now. > > The new input to AllocateArray is moved to the CallStaticJava runtime > call for array allocation on macro expansion as a precedence edge. The > reason for that is that final graph reshape needs a way to tell > whether the missing path out of the allocation is legal or not. final > graph reshape then removes the then useless precedence edge. Looks good! ------------- Marked as reviewed by neliasso (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/30 From neliasso at openjdk.java.net Fri Dec 17 14:37:27 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Fri, 17 Dec 2021 14:37:27 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> <_8dk8ZC5p-mFf_Urb-qnNmxvXxbXiFoCU8TajzXESPA=.efb2f26c-d4a2-4db4-9082-5237e71e8529@github.com> Message-ID: On Thu, 16 Dec 2021 12:32:31 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/cfgnode.cpp line 2700: >> >>> 2698: Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); >>> 2699: const Type* valid_length_test_t = phase->type(valid_length_test); >>> 2700: if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { >> >> Here you do: >> >> Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); >> const Type* valid_length_test_t = phase->type(valid_length_test); >> if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { >> >> >> But in compile.cpp:3766 you do: >> >> Node* valid_length_test = call->in(call->req()); >> call->rm_prec(call->req()); >> if (valid_length_test->find_int_con(1) == 0) { >> >> >> Why "call->req()" and not "call->in(AllocateNode::ValidLengthTest)"? >> >> And why not "if (valid_length_test->find_int_con(1) == 0) {" in both places? > >> Here you do: >> >> ``` >> Node* valid_length_test = call->in(AllocateNode::ValidLengthTest); >> const Type* valid_length_test_t = phase->type(valid_length_test); >> if (valid_length_test_t->isa_int() && valid_length_test_t->is_int()->is_con(0)) { >> ``` >> >> But in compile.cpp:3766 you do: >> >> ``` >> Node* valid_length_test = call->in(call->req()); >> call->rm_prec(call->req()); >> if (valid_length_test->find_int_con(1) == 0) { >> ``` >> >> Why "call->req()" and not "call->in(AllocateNode::ValidLengthTest)"? 
> > call->in(AllocateNode::ValidLengthTest) only works for AllocateArrayNode. The code I modified in compile.cpp runs after macro expansion so there's no AllocateArrayNode anymore. Instead there's a call to the runtime. When the AllocateArrayNode is macro expanded I move in(AllocateNode::ValidLengthTest) to the new runtime call as a precedence edge that is at req(). I tried adding it as an extra parameter that would be removed in compile.cpp but that messed up debug info and I hit some asserts. > >> > And why not "if (valid_length_test->find_int_con(1) == 0) {" in both places? > > So that the Value() call does the right thing during CCP where a node can be registered as a constant but not transformed yet to a ConNode. That makes sense. Thanks for explaining! ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From eosterlund at openjdk.java.net Fri Dec 17 14:41:26 2021 From: eosterlund at openjdk.java.net (Erik Österlund) Date: Fri, 17 Dec 2021 14:41:26 GMT Subject: RFR: 8278909: Unproblemlist AdaptiveBlocking001 In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 15:43:41 GMT, Nils Eliasson wrote: > Hi, > > vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java was problemlisted in 17. This problem might have been fixed. I suggest unproblemlisting it now in early 19 time frame so that we might get enough data if this problem still occurs. > > Regards, > Nils Marked as reviewed by eosterlund (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6865 From stefank at openjdk.java.net Fri Dec 17 14:41:26 2021 From: stefank at openjdk.java.net (Stefan Karlsson) Date: Fri, 17 Dec 2021 14:41:26 GMT Subject: RFR: 8278909: Unproblemlist AdaptiveBlocking001 In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 15:43:41 GMT, Nils Eliasson wrote: > Hi, > > vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java was problemlisted in 17.
This problem might have been fixed. I suggest unproblemlisting it now in early 19 time frame so that we might get enough data if this problem still occurs. > > Regards, > Nils Marked as reviewed by stefank (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/6865 From neliasso at openjdk.java.net Fri Dec 17 15:12:25 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Fri, 17 Dec 2021 15:12:25 GMT Subject: Integrated: 8278909: Unproblemlist AdaptiveBlocking001 In-Reply-To: References: Message-ID: <54NlBdZ8sySDBmNjFV_kWtwiYdH5EMgp_ovFxnB4qAU=.4d447414-72ea-4410-81e3-a5914a04613a@github.com> On Thu, 16 Dec 2021 15:43:41 GMT, Nils Eliasson wrote: > Hi, > > vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java was problemlisted in 17. This problem might have been fixed. I suggest unproblemlisting it now in early 19 time frame so that we might get enough data if this problem still occurs. > > Regards, > Nils This pull request has now been integrated. Changeset: a68f28ce Author: Nils Eliasson URL: https://git.openjdk.java.net/jdk/commit/a68f28cea6a726aa57c04a4fc5a665cae3513154 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod 8278909: Unproblemlist AdaptiveBlocking001 Reviewed-by: chagedorn, eosterlund, stefank ------------- PR: https://git.openjdk.java.net/jdk/pull/6865 From duke at openjdk.java.net Fri Dec 17 15:38:51 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Fri, 17 Dec 2021 15:38:51 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v4] In-Reply-To: References: Message-ID: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. 
> > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: cleanup 3 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6850/files - new: https://git.openjdk.java.net/jdk/pull/6850/files/587840cf..35c1bee7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=02-03 Stats: 10 lines in 1 file changed: 0 ins; 2 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/6850.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6850/head:pull/6850 PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Fri Dec 17 15:42:09 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Fri, 17 Dec 2021 15:42:09 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: References: Message-ID: > IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. 
> > This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. > > Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: flip ifs ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6850/files - new: https://git.openjdk.java.net/jdk/pull/6850/files/35c1bee7..0f49a69b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6850&range=03-04 Stats: 9 lines in 1 file changed: 3 ins; 0 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/6850.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6850/head:pull/6850 PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Fri Dec 17 15:42:16 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Fri, 17 Dec 2021 15:42:16 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v3] In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 08:35:47 GMT, Christian Hagedorn wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup2 > > src/hotspot/share/c1/c1_IR.cpp line 1401: > >> 1399: >> 1400: for (int i = 0; i < block->end()->number_of_sux(); i++) { >> 1401: if (blocks.contains(block->end()->sux_at(i))) { continue; } > > I think it's better to move the `continue` to a new line. 
I went ahead and flipped the ifs instead > src/hotspot/share/c1/c1_Optimizer.cpp line 441: > >> 439: >> 440: #ifndef PRODUCT >> 441: _hir->verify_local(blocks_to_verify_later); > > You can also use `NOT_PRODUCT()` here. Went with DEBUG_ONLY ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From roland at openjdk.java.net Fri Dec 17 15:55:10 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 15:55:10 GMT Subject: RFR: 8278228: C2: Improve identical back-to-back if elimination [v2] In-Reply-To: References: Message-ID: > C2 has had the ability to optimize: > > (1) > > if (some_condition) { > // body 1 > } else { > // body 2 > } > if (some_condition) { > // body 3 > } else { > // body 4 > } > > into: > > (4) > > if (some_condition) { > // body 1 > // body 3 > } else { > // body 2 > // body 4 > } > > for a while. > > This is achieved by the intermediate step: > > (2) > > if (some_condition) { > // body 1 > some_condition2 = true; > } else { > // body 2 > some_condition2 = false; > } > if (some_condition2) { > // body 3 > } else { > // body 4 > } > > which then allows the use of the exiting split if optimization. As a > result, the graph is transformed to: > > (3) > > if (some_condition) { > // body 1 > some_condition2 = true; > if (some_condition2) { > body3: // a Region here > // body3 > } else { > goto body4; > } > } else { > // body 2 > some_condition2 = false; > if (some_condition2) { > goto body3; > } else { > body4: // another Region here > // body4; > } > } > > and finally to (4) above. > > Recently, 8275610 has shown that this can break if some_condition is a > null check. If, say, body 3 has a control dependent CastPP, then when > body 1 and body 3 are merged, the CastPP of body 3 doesn't become > control dependent on the dominating if (because in step (2), the > CastPP hides behind a Region). As a result, the CastPP loses its > dependency on the null check. 
> > After discussing this with Christian, it seemed this was caused by the > way this transformation relies on split if: having custom code that > wouldn't create Regions at body3 and body4 that are then optimized out > would solve the problem. Anyway, after looking at the split if code, > trying to figure out how to tease it apart in smaller steps and > reusing some of them to build a new transformation, it seemed too > complicated. So instead, I propose reusing split if in a slightly > different way: > > skip step (2) but perform split if anyway to obtain: > > if (some_condition) { > // body 1 > if (some_condition) { > body3: // Region1 > // CastPP is here, control dependent on Region1 > // body3 > } else { > goto body4; > } > } else { > // body 2 > if (some_condition) { > goto body3; > } else { > body4: // Region2 > // body4; > } > } > > - A CastPP node would still be behind a Region. So next step, is to push > control dependent nodes through Region1 and Region2: > > if (some_condition) { > // body 1 > if (some_condition) { > // A CastPP here > body3: // Region1 > // body3 > } else { > goto body4; > } > } else { > // body 2 > if (some_condition) { > // A CastPP here > goto body3; > } else { > body4: // Region2 > // body4; > } > } > > - And then call dominated_by() to optimize the dominated: > > if (some_condition) { > > in both branches of the dominating if (some_condition) {. That also > causes the CastPP to become dependent on the dominating if. 
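
For readers who want a source-level picture of shape (1), here is a minimal Java sketch with two identical back-to-back null checks. The class, field and method names are hypothetical and not taken from the PR or its tests, and whether C2 actually merges the two ifs for this exact method depends on inlining and profiling. It only illustrates the pattern; it is not a reproducer for 8275610.

```java
// Illustration only: all names here are made up for this sketch.
final class BackToBackIf {
    static final class Box {
        final int value;
        Box(int value) { this.value = value; }
    }

    // Side effect so that "body 1" and "body 2" are not empty.
    static int sink;

    // Shape (1): the same condition is tested twice in a row.
    static int backToBack(Box b) {
        if (b != null) {
            sink++;          // body 1
        } else {
            sink--;          // body 2
        }
        if (b != null) {
            return b.value;  // body 3: the load is only safe under the null check
        } else {
            return -1;       // body 4
        }
    }

    // Shape (4): what the merged control flow looks like at source level.
    static int merged(Box b) {
        if (b != null) {
            sink++;          // body 1
            return b.value;  // body 3, now directly under the dominating check
        } else {
            sink--;          // body 2
            return -1;       // body 4
        }
    }

    public static void main(String[] args) {
        Box box = new Box(7);
        System.out.println(backToBack(box) == merged(box));
        System.out.println(backToBack(null) == merged(null));
    }
}
```

In body 3 the field load must stay dependent on the null check, which is the role the description above assigns to the control dependent CastPP.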
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: only for nodes that depends_only_on_test() ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6882/files - new: https://git.openjdk.java.net/jdk/pull/6882/files/d64204e6..5cdef65f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6882&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6882&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6882.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6882/head:pull/6882 PR: https://git.openjdk.java.net/jdk/pull/6882 From roland at openjdk.java.net Fri Dec 17 15:55:12 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 17 Dec 2021 15:55:12 GMT Subject: RFR: 8278228: C2: Improve identical back-to-back if elimination In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 13:52:56 GMT, Roland Westrelin wrote: > C2 has had the ability to optimize: > > (1) > > if (some_condition) { > // body 1 > } else { > // body 2 > } > if (some_condition) { > // body 3 > } else { > // body 4 > } > > into: > > (4) > > if (some_condition) { > // body 1 > // body 3 > } else { > // body 2 > // body 4 > } > > for a while. > > This is achieved by the intermediate step: > > (2) > > if (some_condition) { > // body 1 > some_condition2 = true; > } else { > // body 2 > some_condition2 = false; > } > if (some_condition2) { > // body 3 > } else { > // body 4 > } > > which then allows the use of the exiting split if optimization. As a > result, the graph is transformed to: > > (3) > > if (some_condition) { > // body 1 > some_condition2 = true; > if (some_condition2) { > body3: // a Region here > // body3 > } else { > goto body4; > } > } else { > // body 2 > some_condition2 = false; > if (some_condition2) { > goto body3; > } else { > body4: // another Region here > // body4; > } > } > > and finally to (4) above. 
> > Recently, 8275610 has shown that this can break if some_condition is a > null check. If, say, body 3 has a control dependent CastPP, then when > body 1 and body 3 are merged, the CastPP of body 3 doesn't become > control dependent on the dominating if (because in step (2), the > CastPP hides behind a Region). As a result, the CastPP loses its > dependency on the null check. > > After discussing this with Christian, it seemed this was caused by the > way this transformation relies on split if: having custom code that > wouldn't create Regions at body3 and body4 that are then optimized out > would solve the problem. Anyway, after looking at the split if code, > trying to figure out how to tease it apart in smaller steps and > reusing some of them to build a new transformation, it seemed too > complicated. So instead, I propose reusing split if in a slightly > different way: > > skip step (2) but perform split if anyway to obtain: > > if (some_condition) { > // body 1 > if (some_condition) { > body3: // Region1 > // CastPP is here, control dependent on Region1 > // body3 > } else { > goto body4; > } > } else { > // body 2 > if (some_condition) { > goto body3; > } else { > body4: // Region2 > // body4; > } > } > > - A CastPP node would still be behind a Region. So next step, is to push > control dependent nodes through Region1 and Region2: > > if (some_condition) { > // body 1 > if (some_condition) { > // A CastPP here > body3: // Region1 > // body3 > } else { > goto body4; > } > } else { > // body 2 > if (some_condition) { > // A CastPP here > goto body3; > } else { > body4: // Region2 > // body4; > } > } > > - And then call dominated_by() to optimize the dominated: > > if (some_condition) { > > in both branches of the dominating if (some_condition) {. That also > causes the CastPP to become dependent on the dominating if. 
I pushed a tweak to PhaseIdealLoop::push_pinned_nodes_thru_region() so it's applied only to nodes that will actually be rewired by dominated_by() and not all pinned nodes. ------------- PR: https://git.openjdk.java.net/jdk/pull/6882 From iveresov at openjdk.java.net Fri Dec 17 16:31:26 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 17 Dec 2021 16:31:26 GMT Subject: [jdk18] RFR: 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 22:47:23 GMT, Igor Veresov wrote: > There are a bunch of problems with `BlockListBuilder::mark_loops()` and how it handles irreducible loops. It doesn't really seem to be explicitly designed to handle those, however, it does handle most. One shape emitted by the Kotlin compiler in this particular case gives it trouble. > > The proper fix is to rewrite loop detection, detect irreducible loops, and switch off `SelectivePhiFunctions` is any are present. But given that we're close to the release, I'd like to add a bailout during phi insertion, and file an RFE to do the proper fix later. > > I wrote a minimal test to demonstrate the issue. > > Testing with hs-tier{1-7} is squeaky clean. Thanks, guys! ------------- PR: https://git.openjdk.java.net/jdk18/pull/40 From iveresov at openjdk.java.net Fri Dec 17 16:34:38 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 17 Dec 2021 16:34:38 GMT Subject: [jdk18] Integrated: 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop In-Reply-To: References: Message-ID: On Thu, 16 Dec 2021 22:47:23 GMT, Igor Veresov wrote: > There are a bunch of problems with `BlockListBuilder::mark_loops()` and how it handles irreducible loops. It doesn't really seem to be explicitly designed to handle those, however, it does handle most. One shape emitted by the Kotlin compiler in this particular case gives it trouble. 
> > The proper fix is to rewrite loop detection, detect irreducible loops, and switch off `SelectivePhiFunctions` is any are present. But given that we're close to the release, I'd like to add a bailout during phi insertion, and file an RFE to do the proper fix later. > > I wrote a minimal test to demonstrate the issue. > > Testing with hs-tier{1-7} is squeaky clean. This pull request has now been integrated. Changeset: b46f0b0b Author: Igor Veresov URL: https://git.openjdk.java.net/jdk18/commit/b46f0b0b1f2ada705f8b5aac9b7d8423699437a1 Stats: 121 lines in 3 files changed: 121 ins; 0 del; 0 mod 8277447: Hotspot C1 compiler crashes on Kotlin suspend fun with loop Reviewed-by: kvn, neliasso ------------- PR: https://git.openjdk.java.net/jdk18/pull/40 From duke at openjdk.java.net Fri Dec 17 17:52:57 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Fri, 17 Dec 2021 17:52:57 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v4] In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: <_meFbLFhJ-kWbm1HYROPXSKT5ou0e0oNkbv7MF_dvpk=.1482788a-3e4e-4e3c-9e9e-5cf85b1f2e38@github.com> > A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". > > > // Convert "x + x" into "x << 1" > if (in1 == in2) { > return new LShiftINode(in1, phase->intcon(1)); > } Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: - update mircobenchmark to reflect the new optimization. - include ir tests. - remove test for the removed optimization. - replace (x + x) -> x >> 1 with (x + x) >> c -> x >> (c + 1). - Merge master. - Merge master. - enrich tests and do the same transformation for "long". 
- include a new optimization for ideal in addnode: Convert "x + x" into "x << 1", and associated tests. ------------- Changes: https://git.openjdk.java.net/jdk/pull/6675/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=03 Stats: 309 lines in 6 files changed: 302 ins; 0 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/6675.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675 PR: https://git.openjdk.java.net/jdk/pull/6675 From svkamath at openjdk.java.net Fri Dec 17 18:10:00 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Fri, 17 Dec 2021 18:10:00 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v2] In-Reply-To: References: Message-ID: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Fix to allocate 48 additonal htbl entries in the stub. 
------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/19/files - new: https://git.openjdk.java.net/jdk18/pull/19/files/c9c73d1f..f727ba16 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=00-01 Stats: 58 lines in 11 files changed: 3 ins; 45 del; 10 mod Patch: https://git.openjdk.java.net/jdk18/pull/19.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/19/head:pull/19 PR: https://git.openjdk.java.net/jdk18/pull/19 From duke at openjdk.java.net Fri Dec 17 18:22:03 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Fri, 17 Dec 2021 18:22:03 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v5] In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: > A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". > > > // Convert "x + x" into "x << 1" > if (in1 == in2) { > return new LShiftINode(in1, phase->intcon(1)); > } Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: clean. 
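
As a quick sanity check of the arithmetic identity behind this transformation, the following Java sketch compares `x + x` with `x << 1` for int and long, including values where the addition overflows. The class and method names are hypothetical and not part of the PR. It says nothing about whether the rewrite is profitable in C2, only that the two expressions compute the same value.

```java
// Illustration only: names are made up for this sketch.
final class ShiftIdentity {
    // x + x wraps around in two's complement exactly like x << 1.
    static boolean addEqualsShift(int x) {
        return (x + x) == (x << 1);
    }

    static boolean addEqualsShift(long x) {
        return (x + x) == (x << 1);
    }

    // (x + x) << c versus x << (c + 1). For int this only holds while
    // c + 1 <= 31, because Java masks int shift counts to 5 bits.
    static boolean foldedShiftAgrees(int x, int c) {
        return ((x + x) << c) == (x << (c + 1));
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + ": " + addEqualsShift(x) + " " + foldedShiftAgrees(x, 3));
        }
    }
}
```

The shift-count restriction is worth noting: with c = 31, `x << (c + 1)` becomes `x << 32`, which Java evaluates as `x << 0`, so a folded form would need to guard against that case.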
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6675/files - new: https://git.openjdk.java.net/jdk/pull/6675/files/257a38ab..1ecff961 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=03-04 Stats: 15 lines in 3 files changed: 0 ins; 15 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6675.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675 PR: https://git.openjdk.java.net/jdk/pull/6675 From duke at openjdk.java.net Fri Dec 17 18:28:20 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Fri, 17 Dec 2021 18:28:20 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3] In-Reply-To: <6UEIgb07IvoDvmtD17RTJXY5lvMaWtb8XeW6Tu4JYAU=.f08a9246-3c05-4970-9fae-5a2fa955e4db@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com> <6UEIgb07IvoDvmtD17RTJXY5lvMaWtb8XeW6Tu4JYAU=.f08a9246-3c05-4970-9fae-5a2fa955e4db@github.com> Message-ID: On Thu, 16 Dec 2021 19:10:35 GMT, Claes Redestad wrote: > > Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`. Cheers. > > I agree this looks to be of dubious value on the face of it. A microbenchmark to prove it's beneficial in some scenario feels like a requirement here. > > A targeted microbenchmark could explore if we already do or could better allow constant folding of expressions that does subsequent shifts, e.g. by turning `(x + x) << 3` into `x << 4`. Thank you both for commenting and suggestions! 
@merykitty @cl4es I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) with the microbenchmark I wrote. I am not sure how much improvement we usually expect from a transformation. Is this one beneficial as it stands? Thank you! ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From kvn at openjdk.java.net Fri Dec 17 19:30:32 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 19:30:32 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v2] In-Reply-To: References: Message-ID: <-hoeuB7TNw0hDUF1Uy2YyYGPEIpfZ7VckcXDUiWed10=.d6c9d532-a1c6-454a-9918-39c743a3d82d@github.com> On Fri, 17 Dec 2021 18:10:00 GMT, Smita Kamath wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Fix to allocate 48 additonal htbl entries in the stub. I have a few comments. I will start testing. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4444: > 4442: __ movptr(state, state_mem); > 4443: #endif > 4444: __ subptr(rsp, 96 * longSize); // Create space on the stack for htbl entries Is this aligned correctly? Or does alignment not matter? src/hotspot/share/opto/library_call.cpp line 6766: > 6764: Node* ghash_object = argument(8); > 6765: > 6766: // (1) in, ct and out are arrays. You need to restore the indent.
------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From phh at openjdk.java.net Fri Dec 17 19:44:27 2021 From: phh at openjdk.java.net (Paul Hohensee) Date: Fri, 17 Dec 2021 19:44:27 GMT Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute' In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 04:21:51 GMT, Guoxiong Li wrote: > Hi all, > > Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. > > The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) > And the pre-submit tests passed before submitting the PR. > > Thanks for taking the time to review. > > Best Regards, > -- Guoxiong Lgtm. ------------- Marked as reviewed by phh (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6807 From kvn at openjdk.java.net Fri Dec 17 19:53:28 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 19:53:28 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: References: Message-ID: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> On Fri, 17 Dec 2021 15:42:09 GMT, Ludvig Janiuk wrote: >> IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length. In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify. >> >> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. 
As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions. >> >> Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality). > > Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: > > flip ifs So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. But is it used in the **optimized** build or only in debug? I suggest changing `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and then building an optimized VM to see whether the code is used there. I see only prints and asserts. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From kvn at openjdk.java.net Fri Dec 17 19:53:29 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 19:53:29 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 16:43:10 GMT, Christian Hagedorn wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> flip ifs > > src/hotspot/share/c1/c1_Optimizer.cpp line 183: > >> 181: } >> 182: >> 183: #ifdef ASSERT > > `expand_with_neighborhood()` is guarded with `ifndef PRODUCT`. You should use the same here. No, `#ifdef ASSERT` is good here. `blocks_to_verify_later` was previously processed by `IR::verify()` only in debug builds (the method's code was under `#ifdef ASSERT`). Now it is passed to `IR::verify_local()`, which is also called only in debug builds.
------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From kvn at openjdk.java.net Fri Dec 17 19:55:24 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 19:55:24 GMT Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute' In-Reply-To: References: Message-ID: On Sun, 12 Dec 2021 04:21:51 GMT, Guoxiong Li wrote: > Hi all, > > Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. > > The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) > And the pre-submit tests passed before submitting the PR. > > Thanks for taking the time to review. > > Best Regards, > -- Guoxiong okay. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6807 From kvn at openjdk.java.net Fri Dec 17 19:57:25 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 17 Dec 2021 19:57:25 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: <018cnxyPDqYVFzuDj1O7ZkCRYdvuotDpBGbH1Zp-iTE=.e2ed467f-69be-4ac2-9818-b767e7b57e00@github.com> On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote: > On the fallthrough path from an AllocateArray, the length of the > allocated array is casted (with a CastII) to [0, max_size] with > max_size some number that depends on the array type and can be less > than max_jint. > > Allocating an array of a length that's not in [0, max_size] causes the > CastII to become top. 
> The fallthrough path must be killed as well in
> that case, otherwise this can lead to a broken graph. Currently c2 has
> logic to protect against an allocation of an array of negative size in
> AllocateArrayNode::Ideal(). That call replaces the fallthrough path
> with a Halt node. But if the size is too big, then the fallthrough
> path is left as is.
>
> This patch fixes that issue. It also reworks the negative length
> case. I added a Bool/CmpU input to the AllocateArray that tests for a
> valid length. If that input becomes false, CatchNode::Value() kills
> the fallthrough path. That logic is similar to that for a virtual call
> with a null receiver. I also removed AllocateArrayNode::Ideal() now
> that CatchNode::Value() takes care of the same corner case. The code
> in AllocateArrayNode::Ideal() was added by Vladimir and he told me he
> tried extending CatchNode::Value() at the time but that caused test
> failures. I had no issues in my testing so I assume doing it that way
> is ok now.
>
> The new input to AllocateArray is moved to the CallStaticJava runtime
> call for array allocation on macro expansion as a precedence edge. The
> reason for that is that final graph reshape needs a way to tell
> whether the missing path out of the allocation is legal or not. final
> graph reshape then removes the then useless precedence edge.

Looks good. Thank you for fixing it.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk18/pull/30

From ayang at openjdk.java.net Fri Dec 17 22:10:25 2021
From: ayang at openjdk.java.net (Albert Mingkun Yang)
Date: Fri, 17 Dec 2021 22:10:25 GMT
Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5]
In-Reply-To:
References:
Message-ID:

On Fri, 17 Dec 2021 15:42:09 GMT, Ludvig Janiuk wrote:

>> IR::verify iterates the whole object graph. This proves costly when used in e.g. BlockMerger inside of iterations over BlockLists, leading to quadratic or worse complexities as a function of bytecode length.
>> In several cases, only a few Blocks were changed, and there was no need to go over the whole graph, but until now there was no less blunt tool for verification than IR::verify.
>>
>> This PR introduces IR::verify_local, intended to be used when only a defined set of blocks have been modified. As a complement, expand_with_neighbors provides a way to also capture the neighbors of the "modified set" ahead of modification, so that afterwards the appropriate asserts can be made on all blocks which might possibly have been changed. All this should let us remove the expensive IR::verify calls, while still performing equivalent (or stricter) assertions.
>>
>> Some changes have been made in the verifiers along the way. Some amount of refactoring, and even added invariants (see validate_edge_mutiality).
>
> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision:
>
>   flip ifs

I wonder if "map fusion" works here, as a way of improving the performance of verification. Using `IR::verify_local` as an example:

    {
      VerifyClosure cl;
      blocks.iterate_forward(&cl);
    }

    class VerifyClosure : public BlockClosure {
     public:
      void block_do(BlockBegin* block) override {
        verify_end_not_null(block);
        verify_edge_mutuality(block);
        verify_block_begin_field(block);
        // more verifiers
      }
    };

This should cut the number of iterations over the blocks down to 1.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6850

From duke at openjdk.java.net Fri Dec 17 22:18:06 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Fri, 17 Dec 2021 22:18:06 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v6]
In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
Message-ID:

> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1".
>
>
>     // Convert "x + x" into "x << 1"
>     if (in1 == in2) {
>       return new LShiftINode(in1, phase->intcon(1));
>     }

Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:

  rename the ir test.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6675/files
  - new: https://git.openjdk.java.net/jdk/pull/6675/files/1ecff961..290e9f20

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=05
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=04-05

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6675.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675

PR: https://git.openjdk.java.net/jdk/pull/6675

From dlong at openjdk.java.net Fri Dec 17 22:36:33 2021
From: dlong at openjdk.java.net (Dean Long)
Date: Fri, 17 Dec 2021 22:36:33 GMT
Subject: RFR: 8278518: String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination
In-Reply-To: <2IcCYyqxBU_oi_i9n1LZzTivcLD7QWxAlZga6kiGOPg=.a09fd12b-d1aa-4124-9bb2-31f86da2f706@github.com>
References: <2IcCYyqxBU_oi_i9n1LZzTivcLD7QWxAlZga6kiGOPg=.a09fd12b-d1aa-4124-9bb2-31f86da2f706@github.com>
Message-ID:

On Mon, 13 Dec 2021 09:39:55 GMT, Sergey Tsypanov
wrote:

> Originally this was spotted by Amir Hadadi in https://stackoverflow.com/questions/70272651/missing-bounds-checking-elimination-in-string-constructor
>
> It looks like in the following code in `String(byte[], int, int, Charset)`
>
>     while (offset < sl) {
>         int b1 = bytes[offset];
>         if (b1 >= 0) {
>             dst[dp++] = (byte)b1;
>             offset++;                     // <---
>             continue;
>         }
>         if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
>                 offset + 1 < sl) {
>             int b2 = bytes[offset + 1];
>             if (!isNotContinuation(b2)) {
>                 dst[dp++] = (byte)decode2(b1, b2);
>                 offset += 2;
>                 continue;
>             }
>         }
>         // anything not a latin1, including the repl
>         // we have to go with the utf16
>         break;
>     }
>
> bounds check elimination is not performed when accessing the byte array via `bytes[offset]`.
>
> The reason, I guess, is that the offset variable is modified within the loop (marked with the arrow).
>
> A possible fix for this could be changing:
>
> `while (offset < sl)` ---> `while (offset >= 0 && offset < sl)`
>
> However, the best option is to invest in a C2 optimization that handles all such cases.
>
> The following benchmark demonstrates a good improvement:
>
>     @State(Scope.Thread)
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>     public class StringConstructorBenchmark {
>       private byte[] array;
>       private String str;
>
>       @Setup
>       public void setup() {
>         str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. ?";//Latin1 ending with Russian
>         array = str.getBytes(StandardCharsets.UTF_8);
>       }
>
>       @Benchmark
>       public String newString() {
>         return new String(array, 0, array.length, StandardCharsets.UTF_8);
>       }
>
>       @Benchmark
>       public String translateEscapes() {
>         return str.translateEscapes();
>       }
>     }
>
> Results:
>
>     //baseline
>     Benchmark                             Mode  Cnt    Score   Error  Units
>     StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op
>
>     //patched
>     Benchmark                             Mode  Cnt    Score   Error  Units
>     StringConstructorBenchmark.newString  avgt   50  126,908 ±
2,355  ns/op
>
> The same is observed in String.translateEscapes() for the same String as in the benchmark above:
>
>     //baseline
>     Benchmark                                    Mode  Cnt   Score   Error  Units
>     StringConstructorBenchmark.translateEscapes  avgt  100  53,627 ± 0,850  ns/op
>
>     //patched
>     Benchmark                                    Mode  Cnt   Score   Error  Units
>     StringConstructorBenchmark.translateEscapes  avgt  100  48,087 ± 1,129  ns/op
>
> Also I've looked into this with `LinuxPerfAsmProfiler`, full output for baseline is available here https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and for patched code here https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7.
>
> Here's the part of baseline assembly responsible for `while` loop:
>
>   3.62%  0x00007fed70eb4c1c: mov %ebx,%ecx
>   2.29%  0x00007fed70eb4c1e: mov %edx,%r9d
>   2.22%  0x00007fed70eb4c21: mov (%rsp),%r8  ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
>                                              ; - java.lang.String::<init>@107 (line 537)
>   2.32%  0x00007fed70eb4c25: cmp %r13d,%ecx
>          0x00007fed70eb4c28: jge 0x00007fed70eb5388  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
>                                                      ; - java.lang.String::<init>@110 (line 537)
>   3.05%  0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx
>          0x00007fed70eb4c32: jae 0x00007fed70eb5319
>   2.38%  0x00007fed70eb4c38: mov %r8,(%rsp)
>   2.64%  0x00007fed70eb4c3c: movslq %ecx,%r8
>   2.46%  0x00007fed70eb4c3f: mov %rax,%rbx
>   3.44%  0x00007fed70eb4c42: sub %r8,%rbx
>   2.62%  0x00007fed70eb4c45: add $0x1,%rbx
>   2.64%  0x00007fed70eb4c49: and $0xfffffffffffffffe,%rbx
>   2.30%  0x00007fed70eb4c4d: mov %ebx,%r8d
>   3.08%  0x00007fed70eb4c50: add %ecx,%r8d
>   2.55%  0x00007fed70eb4c53: movslq %r8d,%r8
>   2.45%  0x00007fed70eb4c56: add $0xfffffffffffffffe,%r8
>   2.13%  0x00007fed70eb4c5a: cmp (%rsp),%r8
>          0x00007fed70eb4c5e: jae 0x00007fed70eb5319
>   3.36%  0x00007fed70eb4c64: mov %ecx,%edi  ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
>                                             ; - java.lang.String::<init>@113 (line 538)
>   2.86%
0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d  ;*baload {reexecute=0 rethrow=0 return_oop=0}
>                                                            ; - java.lang.String::<init>@115 (line 538)
>   2.48%  0x00007fed70eb4c6c: mov %r9d,%edx
>   2.26%  0x00007fed70eb4c6f: inc %edx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                        ; - java.lang.String::<init>@127 (line 540)
>   3.28%  0x00007fed70eb4c71: mov %edi,%ebx
>   2.44%  0x00007fed70eb4c73: inc %ebx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                        ; - java.lang.String::<init>@134 (line 541)
>   2.35%  0x00007fed70eb4c75: test %r8d,%r8d
>          0x00007fed70eb4c78: jge 0x00007fed70eb4c04  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
>                                                      ; - java.lang.String::<init>@120 (line 539)
>
> and this one is for patched code:
>
>  17.28%  0x00007f6b88eb6061: mov %edx,%r10d  ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
>                                              ; - java.lang.String::<init>@107 (line 537)
>   0.11%  0x00007f6b88eb6064: test %r10d,%r10d
>          0x00007f6b88eb6067: jl 0x00007f6b88eb669c  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
>                                                     ; - java.lang.String::<init>@108 (line 537)
>   0.39%  0x00007f6b88eb606d: cmp %r13d,%r10d
>          0x00007f6b88eb6070: jge 0x00007f6b88eb66d0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
>                                                      ; - java.lang.String::<init>@114 (line 537)
>   0.66%  0x00007f6b88eb6076: mov %ebx,%r9d
>  13.70%  0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d
>   0.01%  0x00007f6b88eb607e: jae 0x00007f6b88eb6671
>   0.14%  0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi  ;*baload {reexecute=0 rethrow=0 return_oop=0}
>                                                             ; - java.lang.String::<init>@119 (line 538)
>   0.37%  0x00007f6b88eb608a: mov %r9d,%ebx
>   0.99%  0x00007f6b88eb608d: inc %ebx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                        ; - java.lang.String::<init>@131 (line 540)
>  12.88%  0x00007f6b88eb608f: movslq %r9d,%rsi  ;*bastore {reexecute=0 rethrow=0 return_oop=0}
>                                                ; - java.lang.String::<init>@196 (line 548)
>   0.17%  0x00007f6b88eb6092: mov %r10d,%edx
>   0.39%
0x00007f6b88eb6095: inc %edx  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                      ; - java.lang.String::<init>@138 (line 541)
>   0.96%  0x00007f6b88eb6097: test %edi,%edi
>   0.02%  0x00007f6b88eb6099: jl 0x00007f6b88eb60dc  ;*iflt {reexecute=0 rethrow=0 return_oop=0}
>
> Between the instructions mapped to `if_icmpge` and `aload_1` in the baseline we have a bounds check which is missing from the patched code.

This does look like a HotSpot JIT compiler issue to me. My guess is that it is related to how checkBoundsOffCount() checks for offset < 0:

    if ((length | fromIndex | size) < 0 || size > length - fromIndex)

using | to combine three values.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6812

From dean.long at oracle.com Fri Dec 17 22:37:37 2021
From: dean.long at oracle.com (dean.long at oracle.com)
Date: Fri, 17 Dec 2021 14:37:37 -0800
Subject: C2 produces redundant (?) assembly for while-loop in certain cases (JDK-8278518)
In-Reply-To: <22646071639729645@iva3-49b9eb45691c.qloud-c.yandex.net>
References: <22646071639729645@iva3-49b9eb45691c.qloud-c.yandex.net>
Message-ID:

Hi, Sergey

On 12/17/21 12:27 AM, Sergey Tsypanov wrote:
> 1) Is my original theory about bounds check wrong and if it indeed is,
> then what is that disappearing assembly part about?
>
> 2) Should this be fixed on JVM or Java side?
> If we choose JVM then the issue needs to be reassigned
> to someone else, because I'm not able
> to elaborate and test the proper fix.

This does look like a JIT issue. I've changed the bug to hotspot/compiler.
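
For reference, the combined sign test quoted from checkBoundsOffCount() in the PR comment can be compared with a spelled-out version. What follows is only a minimal sketch: the class and method names are hypothetical, it is not the JDK source, and it says nothing about what C2 actually does with either form. It only shows that OR-ing the three ints and testing the sign once is logically the same as three separate `< 0` tests.

```java
// Hypothetical illustration; names are made up and this is not JDK code.
final class BoundsCheckSketch {

    // Combined form, same shape as the condition quoted in the thread:
    // OR-ing the three ints sets the sign bit if any one of them is negative.
    static boolean outOfBoundsCombined(int fromIndex, int size, int length) {
        return (length | fromIndex | size) < 0 || size > length - fromIndex;
    }

    // Spelled-out form with one sign test per value.
    static boolean outOfBoundsExplicit(int fromIndex, int size, int length) {
        return length < 0 || fromIndex < 0 || size < 0 || size > length - fromIndex;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 7, 8, Integer.MAX_VALUE};
        for (int from : samples) {
            for (int size : samples) {
                for (int len : samples) {
                    boolean combined = outOfBoundsCombined(from, size, len);
                    boolean explicit = outOfBoundsExplicit(from, size, len);
                    if (combined != explicit) {
                        throw new AssertionError(from + ", " + size + ", " + len);
                    }
                }
            }
        }
        System.out.println("both forms agree on all sampled inputs");
    }
}
```

The combined form trades three compares for two ORs and one compare, which may be why it is harder for range analysis to see that `fromIndex >= 0` holds afterwards; that last part is a guess in line with the comment above, not something the sketch demonstrates.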

thanks,

dl

From duke at openjdk.java.net Fri Dec 17 23:46:35 2021
From: duke at openjdk.java.net (Mai Đặng Quân Anh)
Date: Fri, 17 Dec 2021 23:46:35 GMT
Subject: Integrated: 8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610
In-Reply-To:
References:
Message-ID:

On Wed, 15 Dec 2021 16:37:19 GMT, Mai Đặng Quân Anh wrote:

> The problem is that loading a vector from a byte array requires the vector shape to support byte vectors before the reinterpretation to the correct type. The failure to intrinsify seems to stop the compilation of the method, leading to an IR verification failure. The patch simply changes the argument of the vector cast methods to the correct types.
>
> Thank you very much.

This pull request has now been integrated.

Changeset: cc44e137
Author:    merykitty
Committer: Vladimir Kozlov
URL:       https://git.openjdk.java.net/jdk/commit/cc44e137973808436311aaaa50916d051759f705
Stats:     437 lines in 10 files changed: 123 ins; 68 del; 246 mod

8278623: compiler/vectorapi/reshape/TestVectorCastAVX512.java after JDK-8259610

Reviewed-by: kvn, chagedorn, psandoz

-------------

PR: https://git.openjdk.java.net/jdk/pull/6852

From redestad at openjdk.java.net Fri Dec 17 23:55:27 2021
From: redestad at openjdk.java.net (Claes Redestad)
Date: Fri, 17 Dec 2021 23:55:27 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v6]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
Message-ID: <15396tnbmp4ipCUSryMfj-i9gmFcsNY1N8AFjdikbbA=.6a72ae12-2aa7-4902-bf3b-a58a9e4b33a1@github.com>

On Fri, 17 Dec 2021 22:18:06 GMT, Zhiqiang Zang wrote:

>> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1".
>>
>>
>>     // Convert "x + x" into "x << 1"
>>     if (in1 == in2) {
>>       return new LShiftINode(in1, phase->intcon(1));
>>     }
>
> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>
>   rename the ir test.

test/micro/org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC.java line 49:

> 47: @Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
> 48: @Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
> 49: @Fork(value = 3 , jvmArgsAppend = {"-XX:-TieredCompilation"})

Is `-TieredCompilation` necessary to demonstrate an effect? Settings that are not strictly necessary - such as tuning - should generally be avoided (someone might want to set up separate runs with `TieredCompilation` enabled and disabled on a higher level, for example)

-------------

PR: https://git.openjdk.java.net/jdk/pull/6675

From kvn at openjdk.java.net Sat Dec 18 00:14:29 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 18 Dec 2021 00:14:29 GMT
Subject: RFR: 8278949: Cleanups for 8277850
In-Reply-To:
References:
Message-ID:

On Fri, 17 Dec 2021 08:03:19 GMT, Roland Westrelin wrote:

> When 8277850 (C2: optimize mask checks in counted loops) was reviewed,
> John made a number of comments and suggestions after the change was
> integrated. This change includes all of his comments, extra tests to
> cover all cases. I also moved the AndIL_add_shift_and_mask() call in
> AndXNode::Ideal() up so the expression with a non constant mask can be
> optimized as well.

Good.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6876

From sviswanathan at openjdk.java.net Sat Dec 18 00:31:34 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Sat, 18 Dec 2021 00:31:34 GMT
Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.
 [v2]
In-Reply-To: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com>
References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com>
Message-ID:

On Thu, 16 Dec 2021 17:46:35 GMT, Jatin Bhateja wrote:

>> - Vector.maskAll was accelerated for AVX-512 targets, but the existing x86 backend implementation does not enable the maskAll instruction patterns for the 32 bit JVM, so operations fall back to the replicateB operation, which broadcasts the mask value in a vector.
>> - In some cases, after the unboxing-boxing optimization, this vector eventually reaches XorVMask, which then has mismatched operands: one held in an opmask register and the other in a vector.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8278508: Review comments resolution.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4281:

> 4279:     if (mask_len > 32) {
> 4280:       kmovql(dst, src);
> 4281:       kshiftrql(dst, dst, 64 - mask_len);

Here masklen is 64, so kshiftrql is not needed here?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4284:

> 4282:     } else {
> 4283:       kmovdl(dst, src);
> 4284:       kshiftrdl(dst, dst, 32 - mask_len);

If masklen is 32 then kshiftrdl is not needed. Only needed if masklen < 32.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4289:

> 4287:     assert(mask_len <= 16, "");
> 4288:     kmovwl(dst, src);
> 4289:     kshiftrwl(dst, dst, 16 - mask_len);

If masklen == 16 then kshiftrwl is not needed?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4299:

> 4297:     kshiftlql(dst, tmp, 32);
> 4298:     korql(dst, dst, tmp);
> 4299:     kshiftrql(dst, dst, 64 - mask_len);

Do we need the kshiftrql here? The masklen is 64 here.
You could alternatively use:

    kmovdl dst, src
    kunpckdq dst, dst, dst

src/hotspot/cpu/x86/x86.ad line 9457:

> 9455:   predicate(Matcher::vector_length(n) <= 32);
> 9456:   match(Set dst (MaskAll cnt));
> 9457:   effect(TEMP dst, TEMP tmp);

TEMP dst is not needed here.

src/hotspot/cpu/x86/x86.ad line 9470:

> 9468:   predicate(Matcher::vector_length(n) <= 32);
> 9469:   match(Set dst (MaskAll src));
> 9470:   effect(TEMP dst);

TEMP dst is not needed.

src/hotspot/cpu/x86/x86_32.ad line 13853:

> 13851:   predicate(Matcher::vector_length(n) <= 32);
> 13852:   match(Set dst (MaskAll src));
> 13853:   effect(TEMP dst);

TEMP dst is not needed.

src/hotspot/cpu/x86/x86_32.ad line 13897:

> 13895:   %}
> 13896:   ins_pipe( pipe_slow );
> 13897: %}

This can be removed. The more generic mask_all_evexI_GT32 rule will take care of this.

src/hotspot/cpu/x86/x86_64.ad line 13016:

> 13014: instruct mask_all_evexL(kReg dst, rRegL src) %{
> 13015:   match(Set dst (MaskAll src));
> 13016:   effect(TEMP dst);

TEMP dst not needed.

src/hotspot/cpu/x86/x86_64.ad line 13028:

> 13026:   predicate(Matcher::vector_length(n) > 32);
> 13027:   match(Set dst (MaskAll src));
> 13028:   effect(TEMP dst, TEMP tmp);

TEMP dst not needed.

src/hotspot/cpu/x86/x86_64.ad line 13041:

> 13039:   predicate(Matcher::vector_length(n) > 32);
> 13040:   match(Set dst (MaskAll cnt));
> 13041:   effect(TEMP dst, TEMP tmp);

TEMP dst not needed.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From duke at openjdk.java.net Sat Dec 18 02:21:49 2021
From: duke at openjdk.java.net (Mai Đặng Quân Anh)
Date: Sat, 18 Dec 2021 02:21:49 GMT
Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler
Message-ID:

This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline.
The reason for the failure is that the incorrect vector encoding of the integer promotion operation leads to the unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that.

-------------

Commit messages:
 - fix wrong vlen_enc

Changes: https://git.openjdk.java.net/jdk18/pull/46/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=46&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8278948
  Stats: 17 lines in 1 file changed: 3 ins; 8 del; 6 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/46.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/46/head:pull/46

PR: https://git.openjdk.java.net/jdk18/pull/46

From redestad at openjdk.java.net Sat Dec 18 03:09:27 2021
From: redestad at openjdk.java.net (Claes Redestad)
Date: Sat, 18 Dec 2021 03:09:27 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
In-Reply-To: <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com>
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com>
Message-ID:

On Thu, 16 Dec 2021 18:36:31 GMT, Quan Anh Mai wrote:

>> Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
>>
>>  - Merge master.
>>  - enrich tests and do the same transformation for "long".
>>  - include a new optimization for ideal in addnode: Convert "x + x" into "x << 1", and associated tests.
>
> Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
> Cheers.

> Thank you both for commenting and suggestions!
> @merykitty @cl4es
>
> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I wrote. I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you!

For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.

Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:

      4.40%  0x00007f2b0d6d720c: mov %esi,%eax
             0x00007f2b0d6d720e: and $0x1fffff,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

i.e., the `((x + x) << 10) >>> 11` is fully transformed down into `x & 0x1fffff`, which seems like it could be a pretty optimal result for this code (zeroes out the top 11 bits).

Baseline build generates this:

      4.46%  0x00007f93e16d988c: add %esi,%esi
             0x00007f93e16d988e: shl $0xa,%esi
             0x00007f93e16d9891: mov %esi,%eax
             0x00007f93e16d9893: shr $0xb,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
                                                ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)

Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!

Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent.
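
The identity behind that `and $0x1fffff` can be checked directly in Java. The snippet below is a minimal sketch under assumptions: `ShiftIdentity`, `helper` and `masked` are hypothetical names, and `helper` only mirrors the expression shape discussed in the thread (an unsigned `>>>`, as the `iushr` bytecode in the listing suggests), not the actual benchmark source.

```java
// Hypothetical illustration of the arithmetic identity; not the benchmark's code.
final class ShiftIdentity {

    // Shape of the expression discussed above: x + x is x << 1 (mod 2^32),
    // so the whole thing is (x << 11) >>> 11.
    static int helper(int x) {
        return ((x + x) << 10) >>> 11;
    }

    // What the patched build's assembly computes: keep the low 21 bits.
    static int masked(int x) {
        return x & 0x1fffff;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 4711, 0x12345678, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (helper(x) != masked(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("identity holds for all sampled inputs");
    }
}
```

Shifting left by 11 discards the top 11 bits and the unsigned shift right by 11 brings zeros back in, which is exactly a mask with `0x1fffff` (2^21 - 1). With a signed `>>` the sign bit would be smeared instead and the identity would not hold for inputs with bit 20 set.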
A narrowed down benchmark that only tests a single value seems to better demonstrate the low-level effect:

        @Benchmark
        public int testInt() {
            return helper(ints_a[4711]);
        }

        @Benchmark
        public long testLong() {
            return helper(longs_a[4711]);
        }

Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed.

Baseline:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op

Patch:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op

A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6675

From svkamath at openjdk.java.net Sat Dec 18 03:23:56 2021
From: svkamath at openjdk.java.net (Smita Kamath)
Date: Sat, 18 Dec 2021 03:23:56 GMT
Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v3]
In-Reply-To:
References:
Message-ID:

> The failure happens with the `-XX:+DeoptimizeALot` option. I've set the reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs.

Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:

  Added alignment to stack allocation, resolved indentation issue

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk18/pull/19/files
  - new: https://git.openjdk.java.net/jdk18/pull/19/files/f727ba16..12b16ac9

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=01-02

  Stats: 82 lines in 2 files changed: 19 ins; 13 del; 50 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/19.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/19/head:pull/19

PR: https://git.openjdk.java.net/jdk18/pull/19

From svkamath at openjdk.java.net Sat Dec 18 03:23:59 2021
From: svkamath at openjdk.java.net (Smita Kamath)
Date: Sat, 18 Dec 2021 03:23:59 GMT
Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v2]
In-Reply-To: <-hoeuB7TNw0hDUF1Uy2YyYGPEIpfZ7VckcXDUiWed10=.d6c9d532-a1c6-454a-9918-39c743a3d82d@github.com>
References: <-hoeuB7TNw0hDUF1Uy2YyYGPEIpfZ7VckcXDUiWed10=.d6c9d532-a1c6-454a-9918-39c743a3d82d@github.com>
Message-ID: <-0MVFIxx7NtWWWNL4dxIhEPS-bmZAbNps7DIJyfFBv4=.cc08892b-6200-4d86-a205-e407445e3919@github.com>

On Fri, 17 Dec 2021 19:24:22 GMT, Vladimir Kozlov wrote:

>> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix to allocate 48 additional htbl entries in the stub.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4444:
>
>> 4442:     __ movptr(state, state_mem);
>> 4443: #endif
>> 4444:     __ subptr(rsp, 96 * longSize); // Create space on the stack for htbl entries
>
> Is this aligned correctly? Or does alignment not matter?

Thanks, Vladimir. I've addressed both your comments in the latest update.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/19

From duke at openjdk.java.net Sat Dec 18 03:28:21 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Sat, 18 Dec 2021 03:28:21 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com>
Message-ID: <-ADzApE8Ccj5AL_wYZFyPCkgcdPomJytdky_YtwTKhI=.9bb51417-4211-4454-ab20-b28e2f95c239@github.com>

On Sat, 18 Dec 2021 02:59:48 GMT, Claes Redestad wrote:

>> Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
>> Cheers.
>
>> Thank you both for commenting and suggestions! @merykitty @cl4es
>>
>> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I wrote. I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you!
>
> For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
>
> Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
>
>
>       4.40%  0x00007f2b0d6d720c: mov %esi,%eax
>              0x00007f2b0d6d720e: and $0x1fffff,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
>                                                      ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)
>
>
> i.e., the `((x + x) << 10) >>> 11` is fully transformed down into `x & 0x1fffff`, which seems like it could be a pretty optimal result for this code (zeroes out the top 11 bits).
>
> Baseline build generates this:
>
>       4.46%  0x00007f93e16d988c: add %esi,%esi
>              0x00007f93e16d988e: shl $0xa,%esi
>              0x00007f93e16d9891: mov %esi,%eax
>              0x00007f93e16d9893: shr $0xb,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)
>
>
> Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!
>
> Only a few percent of the benchmark time seems to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent. A narrowed down benchmark that only tests a single value seems to better demonstrate the low-level effect:
>
>
>         @Benchmark
>         public int testInt() {
>             return helper(ints_a[4711]);
>         }
>
>         @Benchmark
>         public long testLong() {
>             return helper(longs_a[4711]);
>         }
>
>
> Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed.
>
> Baseline:
>
>     Benchmark                         Mode  Cnt  Score   Error  Units
>     AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
>     AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op
>
> Patch:
>
>     Benchmark                         Mode  Cnt  Score   Error  Units
>     AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
>     AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op
>
>
> A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible.
> I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.

Thank you @cl4es so much for taking the time to review my pull request and investigate the microbenchmark in depth! I learnt a lot about writing better JMH microbenchmarks by reading your comment.

> > Thank you both for commenting and suggestions! @merykitty @cl4es
> > I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I wrote. I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you!
>
> For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code.
>

Great to know we can print assembly! Thanks.

> Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this:
>
> ```
>   4.40%  0x00007f2b0d6d720c: mov %esi,%eax
>          0x00007f2b0d6d720e: and $0x1fffff,%eax  ;*iushr {reexecute=0 rethrow=0 return_oop=0}
>                                                  ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper@8 (line 88)
> ```
>
> i.e., the `((x + x) << 10) >>> 11` is fully transformed down into `x & 0x1fffff`, which seems like it could be a pretty optimal result for this code (zeroes out the top 11 bits).
>
> Baseline build generates this:
>
> ```
>   4.46%  0x00007f93e16d988c:   add    %esi,%esi
>          0x00007f93e16d988e:   shl    $0xa,%esi
>          0x00007f93e16d9891:   mov    %esi,%eax
>          0x00007f93e16d9893:   shr    $0xb,%eax       ;*iushr {reexecute=0 rethrow=0 return_oop=0}
>                                                       ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88)
> ```
>
> Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations!
>
> Only a few percent of the benchmark time seem to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent. A narrowed down benchmark that only tests a single value seem to better demonstrate the low-level effect:
>
> ```
> @Benchmark
> public int testInt() {
>     return helper(ints_a[4711]);
> }
>
> @Benchmark
> public long testLong() {
>     return helper(longs_a[4711]);
> }
> ```
>

Thanks for pointing out this new feature. I did use `Blackhole` but found it brought too much overhead from the infrastructure, so I referred to the JMH examples for a sink method. I'll start using `Blackhole` with `-Djmh.blackhole.autoDetect=true` from now on.

> Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed.
>
> Baseline:
>
> ```
> Benchmark                         Mode  Cnt  Score   Error  Units
> AddIdeal_XPlusX_LShiftC.testInt   avgt   15  8.925 ± 0.019  ns/op
> AddIdeal_XPlusX_LShiftC.testLong  avgt   15  8.969 ± 0.107  ns/op
> ```
>
> Patch:
>
> ```
> Benchmark                         Mode  Cnt  Score   Error  Units
> AddIdeal_XPlusX_LShiftC.testInt   avgt   15  7.650 ± 0.024  ns/op
> AddIdeal_XPlusX_LShiftC.testLong  avgt   15  7.655 ± 0.087  ns/op
> ```
>

If you get a chance, could you please share the modifications you made to the microbenchmark (if you made changes other than removing `-TieredCompilation` and testing only a single value), so I can test on my end for more data points? Also, the examples in the JMH repository seem out of date (e.g., regarding the `Blackhole` autodetect feature), so I think your example will be good study material for me. Thank you.

> A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6675

From redestad at openjdk.java.net Sat Dec 18 03:46:21 2021
From: redestad at openjdk.java.net (Claes Redestad)
Date: Sat, 18 Dec 2021 03:46:21 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com>
Message-ID:

On Sat, 18 Dec 2021 02:59:48 GMT, Claes Redestad wrote:

>> Tbh I don't find this transformation necessary, `x + x` is a very cheap operation and is generally easier for the optimiser to work with than `x << 1`.
>> Cheers.
>
>> Thank you both for commenting and suggestions! @merykitty @cl4es
>>
>> I moved the transformation into LShiftNode, i.e., to convert `(x + x) << 3` into `x << 4`, and I observed a 5% performance improvement (percentage of average time reduced) via the microbenchmark I write.
I am not sure how much improvement we usually expect for a transformation, is this one currently a beneficial transformation? Thank you! > > For such a small win you might be looking at noise and should inspect the generated assembly to see that there's any real difference in the emitted code. > > Running your `test` microbenchmark using `-prof perfasm` on my Linux workstation the relevant code generated for the `helper` method with your patch looks like this: > > > 4.40% 0x00007f2b0d6d720c: mov %esi,%eax > 0x00007f2b0d6d720e: and $0x1fffff,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0} > ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88) > > > i.e., the `((x + x) << 10) >> 11` is fully transformed down into `x & 0x1fffff`, which seem like it could be a pretty optimal result for this code (zeroes out the top 11 bits). > > Baseline build generates this: > > 4.46% 0x00007f93e16d988c: add %esi,%esi > 0x00007f93e16d988e: shl $0xa,%esi > 0x00007f93e16d9891: mov %esi,%eax > 0x00007f93e16d9893: shr $0xb,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0} > ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 88) > > > Which is a pretty direct transformation of the Java code into assembly. So it seems your patch _does_ enable more optimal transformations! > > Only a few percent of the benchmark time seem to be spent in the relevant code, though. As written I can't establish a significant result between a baseline VM and a patched one as the variance from run to run is more than a few percent. A narrowed down benchmark that only tests a single value seem to better demonstrate the low-level effect: > > > @Benchmark > public int testInt() { > return helper(ints_a[4711]); > } > > @Benchmark > public long testLong() { > return helper(longs_a[4711]); > } > > > Not using the hand-rolled `sink` blackhole means we rely on JMH blackholes. 
Those traditionally had a bit more overhead than what you get with `sink`, but can since recently cooperate with the VM to remove overhead almost completely if you run with `-Djmh.blackhole.autoDetect=true` (soon enabled by default). I also removed `-TieredCompilation` from the `@Fork` and get slightly better and stable scores overall, though a bit longer warmup is needed. > > Baseline: > > Benchmark Mode Cnt Score Error Units > AddIdeal_XPlusX_LShiftC.testInt avgt 15 8.925 ? 0.019 ns/op > AddIdeal_XPlusX_LShiftC.testLong avgt 15 8.969 ? 0.107 ns/op > > Patch: > > Benchmark Mode Cnt Score Error Units > AddIdeal_XPlusX_LShiftC.testInt avgt 15 7.650 ? 0.024 ns/op > AddIdeal_XPlusX_LShiftC.testLong avgt 15 7.655 ? 0.087 ns/op > > > A ~1.2ns improvement - or 1.15x speed-up after removing as much infrastructural overhead as possible. I think this looks in line with the observable change in the generated assembly (while we get rid of more than one instruction, they are likely pretty nicely pipelined on my Intel CPU). I think this looks like a reasonable improvement for such a straightforward change to C2. > Thank you @cl4es so much for taking time to review my pull request and deeply investigate the microbenchmark! I learnt a lot on writing better JMH microbenchmarks by reading your comment. My pleasure. I was intrigued since you picked up on my suggestion to enable folding of constant shift transforms so fast and wanted to see for myself that it ended up compiling down to something more compact/optimal. > If you get a chance, could you please share the modifications you made to the microbenchmark, if you made other changes than removing `-TieredCompilation` and testing only a single value, so I can test on my end, for more data points? Also, the examples in JMH repository seem out-of-date (e.g., the `Blackhole` autodetect feature) I think your example will be a good study material for me. Thank you. 
No other code changes, just added in those two simplified benchmarks and commented out the existing ones. I did tweak the command line to print using `ns` (you can change the annotation to get the same result) and picked the smallest number of iterations that still gave stable results on my machine while trying things out: `-i 15 -wi 15 -tu ns`. ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From gli at openjdk.java.net Sat Dec 18 04:19:25 2021 From: gli at openjdk.java.net (Guoxiong Li) Date: Sat, 18 Dec 2021 04:19:25 GMT Subject: RFR: 8278104: C1 should support the compiler directive 'BreakAtExecute' In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 19:41:45 GMT, Paul Hohensee wrote: >> Hi all, >> >> Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. >> >> The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) >> And the pre-submit tests passed before submitting the PR. >> >> Thanks for taking the time to review. >> >> Best Regards, >> -- Guoxiong > > Lgtm. @phohensee @vnkozlov Thanks for the review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6807 From gli at openjdk.java.net Sat Dec 18 04:19:25 2021 From: gli at openjdk.java.net (Guoxiong Li) Date: Sat, 18 Dec 2021 04:19:25 GMT Subject: Integrated: 8278104: C1 should support the compiler directive 'BreakAtExecute' In-Reply-To: References: Message-ID: <4RmSHTgkXs6SkKKvbY4AwF7095PnsgeG69mV-DDObnQ=.e6c0993f-4d26-4329-846d-114f2387c462@github.com> On Sun, 12 Dec 2021 04:21:51 GMT, Guoxiong Li wrote: > Hi all, > > Currently, the directive `BreakAtExecute` is not effective at C1. And the `CompileCommand=break` doesn't break the compiled method, too. 
This patch unifies the `BreakAtExecute` and `CompileCommand=break` to the directive 'BreakAtExecute' and uses the directive 'BreakAtExecute' to identify whether a breakpoint should be added. > > The test group `hotspot_compiler` passed locally.(linux x86_64 fastdebug) > And the pre-submit tests passed before submitting the PR. > > Thanks for taking the time to review. > > Best Regards, > -- Guoxiong This pull request has now been integrated. Changeset: 3c10b5db Author: Guoxiong Li URL: https://git.openjdk.java.net/jdk/commit/3c10b5db38455b8aed88599f5743fd846bd0913e Stats: 26 lines in 8 files changed: 12 ins; 0 del; 14 mod 8278104: C1 should support the compiler directive 'BreakAtExecute' Reviewed-by: xliu, phh, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/6807 From kvn at openjdk.java.net Sat Dec 18 05:37:25 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 18 Dec 2021 05:37:25 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v3] In-Reply-To: References: Message-ID: <8_Y6CvYYStYChu9yHY9wKwXHiIao6ZbZ2gdZiYAnaDA=.f7238bd8-e556-4515-b7cf-9c141197eea7@github.com> On Sat, 18 Dec 2021 03:23:56 GMT, Smita Kamath wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Added alignment to stack allocation, resolved indentation issue src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4456: > 4454: __ aesgcm_encrypt(in, len, ct, out, key, state, subkeyHtbl, avx512_subkeyHtbl, counter); > 4455: > 4456: __ addptr(rsp, 96 * longSize); I don't think you need this instruction since you restore `RSP` in the next. Otherwise looks good. Testing passed fine. 
-------------

PR: https://git.openjdk.java.net/jdk18/pull/19

From duke at openjdk.java.net Sat Dec 18 07:23:20 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Sat, 18 Dec 2021 07:23:20 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v3]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <_MskcNL92fUY822sDIaFRMiZ55gDq3JpMKYKa58jjYA=.5f5376cc-8dd7-484f-81ed-21111e1252f1@github.com>
Message-ID: <0HEbkCi3sHJyzDMxm-yGYxFqYduAQW1xoqX5Dnxt0dg=.168f7474-36ea-4cb1-95ac-71c2ae93d930@github.com>

On Sat, 18 Dec 2021 03:43:08 GMT, Claes Redestad wrote:

> No other code changes, just added in those two simplified benchmarks and commented out the existing ones. I did tweak the command line to print using `ns` (you can change the annotation to get the same result) and picked the smallest number of iterations that still gave stable results on my machine while trying things out: `-i 15 -wi 15 -tu ns`.

Is there any special configuration you were using? I was not able to reproduce similar results. This is what I got with 20 iterations.

Baseline:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   60  3.846 ± 0.164  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   60  3.332 ± 0.008  ns/op

Patch:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt   60  3.743 ± 0.245  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt   60  3.664 ± 0.224  ns/op

Then I tried 100 iterations to reduce noise.

Baseline:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt  300  3.566 ± 0.068  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt  300  3.310 ± 0.004  ns/op

Patch:

    Benchmark                         Mode  Cnt  Score   Error  Units
    AddIdeal_XPlusX_LShiftC.testInt   avgt  300  3.724 ± 0.103  ns/op
    AddIdeal_XPlusX_LShiftC.testLong  avgt  300  3.605 ± 0.086  ns/op

I was not able to print assembly with `-prof perfasm` on my end because of some errors from `perf`, which I do not have permission to configure, but I checked native coverage and it showed that this transformation did happen while the microbenchmark was running.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6675

From duke at openjdk.java.net Sat Dec 18 12:17:25 2021
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Sat, 18 Dec 2021 12:17:25 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v6]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
Message-ID:

On Fri, 17 Dec 2021 22:18:06 GMT, Zhiqiang Zang wrote:

>> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1".
>>
>>
>> // Convert "x + x" into "x << 1"
>> if (in1 == in2) {
>>     return new LShiftINode(in1, phase->intcon(1));
>> }
>
> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>
> rename the ir test.

Alternatively, you could use `-XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.testInt`, which will print every detail about the compilation, including opto assembly. In this case, you could use `-XX:-TieredCompilation` to avoid receiving details about the C1-compiled method. Note that without `hsdis` the JVM will only output machine code, which is very hard to analyse.
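
When reading printed assembly for this benchmark, it helps to know what the folded code should look like. Earlier in the thread the `helper` expression `((x + x) << 10) >> 11` is reported to compile down to `x & 0x1fffff` with the patch. The following is only a minimal plain-Java sketch of that arithmetic identity, not of what C2 does internally. The class and method names are made up for illustration, and the use of an unsigned shift (`>>>`) is an assumption based on the `iushr` annotation in the quoted perfasm output.

```java
public class ShiftFoldCheck {
    // Assumed shape of the benchmark's int helper: add, shift left by 10,
    // then unsigned shift right by 11.
    static int viaShifts(int x) {
        return ((x + x) << 10) >>> 11;
    }

    // The folded form reported in the thread: keep the low 21 bits,
    // i.e. zero out the top 11 bits.
    static int viaMask(int x) {
        return x & 0x1fffff;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 4711, 0x1fffff, 0x200000,
                         Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (viaShifts(x) != viaMask(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("shift form and mask form agree on all samples");
    }
}
```

The reasoning behind it: `x + x` equals `x << 1` in 32-bit arithmetic, so the left shifts combine into `x << 11`, and shifting that back right by 11 without sign extension leaves only the low 21 bits of `x`. That is why a fully folded compilation would contain an `and` and no shift instruction at all.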
------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From jiefu at openjdk.java.net Sat Dec 18 15:05:23 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Sat, 18 Dec 2021 15:05:23 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6] In-Reply-To: References: Message-ID: <7f-CqEaprhPTYL6iSBehcZiKgzn2n9BqF19fWEKVWhU=.a113c51f-ec24-4ff2-94c9-17cec9d64f5f@github.com> On Fri, 17 Dec 2021 11:29:56 GMT, Fei Gao wrote: >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. >> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. >> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. 
>> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Use uabs() to calculate the absolute value of constant > > Change-Id: Ie6f37ab159fb7092e1443b9af8d620562a45ae47 I discussed this opt with @theRealAph offline. To clarify from my point of view: 1. I have no objection to this PR. 2. I'd like to see a benchmark which people would write in real programs. 3. But if the OpenJDK experts think it's already good enough, please go ahead. Thanks. 
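
As a side note on the rewrites listed in the quoted PR description, `abs(0 - x) => abs(x)`, `abs(abs(x)) => abs(x)`, and dropping `abs` on `char` values, the sketch below checks at the Java level that they do not change results. It is only an illustration under stated assumptions: the class and method names are hypothetical, and it says nothing about how C2 implements the transformations. It does include `Integer.MIN_VALUE`, where `Math.abs` overflows and returns the input unchanged.

```java
public class AbsIdentities {
    // abs(0 - x) == abs(x); 0 - x wraps for Integer.MIN_VALUE, but both
    // sides then evaluate to Integer.MIN_VALUE.
    static boolean negationFolds(int x) {
        return Math.abs(0 - x) == Math.abs(x);
    }

    // abs(abs(x)) == abs(x)
    static boolean nestedFolds(int x) {
        return Math.abs(Math.abs(x)) == Math.abs(x);
    }

    // A char is promoted to a non-negative int, so abs leaves it unchanged.
    static boolean charIsNoOp(char c) {
        return Math.abs(c) == c;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (!negationFolds(x) || !nestedFolds(x)) {
                throw new AssertionError("identity failed for x = " + x);
            }
        }
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            if (!charIsNoOp((char) c)) {
                throw new AssertionError("abs changed char value " + c);
            }
        }
        System.out.println("all abs identities hold on the tested values");
    }
}
```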
-------------

PR: https://git.openjdk.java.net/jdk/pull/6755

From duke at openjdk.java.net Sat Dec 18 18:19:25 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Sat, 18 Dec 2021 18:19:25 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v6]
In-Reply-To:
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
Message-ID:

On Sat, 18 Dec 2021 12:14:06 GMT, Quan Anh Mai wrote:

> Alternatively, you could use `-XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.testInt` which will print every detail about compilation including opto assembly. In this case, you could use `-XX:-TieredCompilation` to not receive details regarding C1 compiled method. Note that without `hsdis` the JVM will only output machine code which is very hard to analyse.

It seems `make install-hsdis` does not install `hsdis` into the correct place. In my case, I have to manually copy it from `build/cov/jdk/lib` to the right place, which is `build/cov/images/jdk/lib/server`, so that it works. I don't know if this is a bug in the makefile.

This is the output I got, given `@Fork(value = 1, jvmArgsAppend = {"-XX:-TieredCompilation", "-XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.helper"})`.

I have two questions:
- The assembly looks weird to me because I did not see any shift instruction?
- Why did it print at, and **only** at, the first warmup iteration? I did not expect C2 compilation to happen so early.

Thanks in advance.
Running test 'micro:AddIdeal_XPlusX_LShiftC' # JMH version: 1.33 # VM version: JDK 19-internal, OpenJDK 64-Bit Server VM, 19-internal+0-adhoc..my-jdk # VM invoker: /home/zzq/jog/testing/my-jdk/build/cov/images/jdk/bin/java # VM options: -Djava.library.path=/home/zzq/jog/testing/my-jdk/build/cov/images/test/micro/native -XX:-TieredCompilation -XX:CompileCommand=print,*AddIdeal_XPlusX_LShiftC.helper # Blackhole mode: full + dont-inline hint (default, use -Djmh.blackhole.autoDetect=true to auto-detect) # Warmup: 20 iterations, 1 s each # Measurement: 20 iterations, 1 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Average time, time/op # Benchmark: org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC.testInt # Run progress: 0.00% complete, ETA 00:00:40 # Fork: 1 of 1 CompileCommand: print *AddIdeal_XPlusX_LShiftC.helper bool print = true # Warmup Iteration 1: ============================= C2-compiled nmethod ============================== ----------------------- MetaData before Compile_id = 188 ------------------------ {method} - this oop: 0x00007f977c476548 - method holder: 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' - constants: 0x00007f977c475f18 constant pool [76] {0x00007f977c475f18} for 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' cache=0x00007f977c4ab6e0 - access: 0x8100000a private static - name: 'helper' - signature: '(I)I' - max stack: 3 - max locals: 1 - size of params: 1 - method size: 13 - vtable index: -2 - i2i entry: 0x00007f97b92f9c00 - adapters: AHE at 0x00007f97c41125b0: 0xa i2c: 0x00007f97b9406660 c2i: 0x00007f97b9406719 c2iUV: 0x00007f97b94066e3 c2iNCI: 0x00007f97b9406756 - compiled entry 0x00007f97b9406719 - code size: 10 - code start: 0x00007f977c476520 - code end (excl): 0x00007f977c47652a - method data: 0x00007f977c4b4418 - checked ex length: 0 - linenumber start: 0x00007f977c47652a - localvar length: 1 - localvar start: 0x00007f977c476532 ------------------------ 
OptoAssembly for Compile_id = 188 ----------------------- # # int ( int ) # #r018 rsi : parm 0: int # -- Old rsp -- Framesize: 32 -- #r591 rsp+28: in_preserve #r590 rsp+24: return address #r589 rsp+20: in_preserve #r588 rsp+16: saved fp register #r587 rsp+12: pad2, stack alignment #r586 rsp+ 8: pad2, stack alignment #r585 rsp+ 4: Fixed slot 1 #r584 rsp+ 0: Fixed slot 0 # 000 N1: # out( B1 ) <- in( B1 ) Freq: 1 000 B1: # out( N1 ) <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subq rsp, #16 # Create frame 00c movl RAX, RSI # spill 00e andl RAX, #2097151 # int 014 addq rsp, 16 # Destroy frame popq rbp cmpq rsp, poll_offset[r15_thread] ja #safepoint_stub # Safepoint: poll for GC 026 ret -------------------------------------------------------------------------------- ----------------------------------- Assembly ----------------------------------- Compiled method (c2) 1974 188 org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper (10 bytes) total in heap [0x00007f97b9418c10,0x00007f97b9418e60] = 592 relocation [0x00007f97b9418d80,0x00007f97b9418d90] = 16 main code [0x00007f97b9418da0,0x00007f97b9418de0] = 64 stub code [0x00007f97b9418de0,0x00007f97b9418df8] = 24 oops [0x00007f97b9418df8,0x00007f97b9418e00] = 8 metadata [0x00007f97b9418e00,0x00007f97b9418e08] = 8 scopes data [0x00007f97b9418e08,0x00007f97b9418e18] = 16 scopes pcs [0x00007f97b9418e18,0x00007f97b9418e58] = 64 dependencies [0x00007f97b9418e58,0x00007f97b9418e60] = 8 [Disassembly] -------------------------------------------------------------------------------- [Constant Pool (empty)] -------------------------------------------------------------------------------- [Verified Entry Point] # {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' # parm0: rsi = int # [sp+0x20] (sp of caller) ;; N1: # out( B1 ) <- in( B1 ) Freq: 1 ;; B1: # out( N1 ) <- BLOCK HEAD IS JUNK Freq: 1 0x00007f97b9418da0: mov %eax,-0x16000(%rsp) 
0x00007f97b9418da7: push %rbp 0x00007f97b9418da8: sub $0x10,%rsp ;*synchronization entry ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at -1 (line 98) 0x00007f97b9418dac: mov %esi,%eax 0x00007f97b9418dae: and $0x1fffff,%eax ;*iushr {reexecute=0 rethrow=0 return_oop=0} ; - org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 98) 0x00007f97b9418db4: add $0x10,%rsp 0x00007f97b9418db8: pop %rbp 0x00007f97b9418db9: cmp 0x388(%r15),%rsp ; {poll_return} 0x00007f97b9418dc0: ja 0x00007f97b9418dc7 0x00007f97b9418dc6: ret 0x00007f97b9418dc7: movabs $0x7f97b9418db9,%r10 ; {internal_word} 0x00007f97b9418dd1: mov %r10,0x3a0(%r15) 0x00007f97b9418dd8: jmp 0x00007f97b9408120 ; {runtime_call SafepointBlob} 0x00007f97b9418ddd: hlt 0x00007f97b9418dde: hlt 0x00007f97b9418ddf: hlt [Exception Handler] 0x00007f97b9418de0: jmp 0x00007f97b936a6a0 ; {no_reloc} [Deopt Handler Code] 0x00007f97b9418de5: call 0x00007f97b9418dea 0x00007f97b9418dea: subq $0x5,(%rsp) 0x00007f97b9418def: jmp 0x00007f97b9407240 ; {runtime_call DeoptimizationBlob} 0x00007f97b9418df4: hlt 0x00007f97b9418df5: hlt 0x00007f97b9418df6: hlt 0x00007f97b9418df7: hlt -------------------------------------------------------------------------------- [/Disassembly] - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Oops: 0x00007f97b9418df8: 0x00000007ff755d38 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000007ff755d38} - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Metadata: 0x00007f97b9418e00: 0x00007f977c476548 {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - pc-bytecode offsets: PcDesc(pc=0x00007f97b9418d9f offset=ffffffff bits=0): PcDesc(pc=0x00007f97b9418dac offset=c bits=0): org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at -1 (line 98) PcDesc(pc=0x00007f97b9418db4 
offset=14 bits=0): org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 98) PcDesc(pc=0x00007f97b9418df9 offset=59 bits=0): - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - oop maps:ImmutableOopMapSet contains 0 OopMaps - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - scopes: ScopeDesc(pc=0x00007f97b9418dac offset=c): org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at -1 (line 98) ScopeDesc(pc=0x00007f97b9418db4 offset=14): org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper at 8 (line 98) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - relocations: @0x00007f97b9418d80: b019 relocInfo at 0x00007f97b9418d80 [type=11(poll_return) addr=0x00007f97b9418db9 offset=25] @0x00007f97b9418d82: f00e800e relocInfo at 0x00007f97b9418d84 [type=8(internal_word) addr=0x00007f97b9418dc7 offset=14 data=14] | [target=0x00007f97b9418db9] @0x00007f97b9418d86: 6411 relocInfo at 0x00007f97b9418d86 [type=6(runtime_call) addr=0x00007f97b9418dd8 offset=17 format=1] | [destination=0x00007f97b9408120] @0x00007f97b9418d88: 0008 relocInfo at 0x00007f97b9418d88 [type=0(none) addr=0x00007f97b9418de0 offset=8] @0x00007f97b9418d8a: 6400 relocInfo at 0x00007f97b9418d8a [type=6(runtime_call) addr=0x00007f97b9418de0 offset=0 format=1] | [destination=0x00007f97b936a6a0] @0x00007f97b9418d8c: 640f relocInfo at 0x00007f97b9418d8c [type=6(runtime_call) addr=0x00007f97b9418def offset=15 format=1] | [destination=0x00007f97b9407240] @0x00007f97b9418d8e: 0000 relocInfo at 0x00007f97b9418d8e [type=0(none) addr=0x00007f97b9418def offset=0] @0x00007f97b9418d90: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Dependencies: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ExceptionHandlerTable (size = 0 bytes) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ImplicitExceptionTable is 
empty - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Recorded oops: #0: 0x0000000000000000 NULL-oop #1: 0x00000007ff755d38 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000007ff755d38} - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Recorded metadata: #0: 0x0000000000000000 NULL-oop #1: 0x00007f977c476548 {method} {0x00007f977c476548} 'helper' '(I)I' in 'org/openjdk/bench/vm/compiler/AddIdeal_XPlusX_LShiftC' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 3.711 ns/op # Warmup Iteration 2: 3.515 ns/op # Warmup Iteration 3: 3.313 ns/op # Warmup Iteration 4: 3.433 ns/op # Warmup Iteration 5: 3.526 ns/op # Warmup Iteration 6: 3.458 ns/op # Warmup Iteration 7: 3.446 ns/op # Warmup Iteration 8: 3.497 ns/op # Warmup Iteration 9: 3.471 ns/op # Warmup Iteration 10: 3.461 ns/op # Warmup Iteration 11: 3.429 ns/op # Warmup Iteration 12: 3.533 ns/op # Warmup Iteration 13: 3.545 ns/op # Warmup Iteration 14: 3.613 ns/op # Warmup Iteration 15: 3.534 ns/op # Warmup Iteration 16: 3.495 ns/op # Warmup Iteration 17: 3.567 ns/op # Warmup Iteration 18: 3.466 ns/op # Warmup Iteration 19: 3.310 ns/op # Warmup Iteration 20: 3.305 ns/op Iteration 1: 3.285 ns/op Iteration 2: 3.368 ns/op Iteration 3: 3.315 ns/op Iteration 4: 3.304 ns/op Iteration 5: 3.359 ns/op Iteration 6: 3.340 ns/op Iteration 7: 3.330 ns/op Iteration 8: 3.331 ns/op Iteration 9: 3.447 ns/op Iteration 10: 3.370 ns/op Iteration 11: 3.389 ns/op Iteration 12: 3.342 ns/op Iteration 13: 3.332 ns/op Iteration 14: 3.349 ns/op Iteration 15: 3.344 ns/op Iteration 16: 3.330 ns/op Iteration 17: 3.327 ns/op Iteration 18: 3.419 ns/op Iteration 19: 3.341 ns/op Iteration 20: 3.346 ns/op ------------------------------------------------------------------------ static org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC::helper(I)I interpreter_invocation_count: 93898 invocation_counter: 93898 backedge_counter: 0 decompile_count: 0 mdo 
size: 392 bytes 0 iload_0 1 iload_0 2 iadd 3 bipush 10 5 ishl 6 bipush 11 8 iushr 9 ireturn ------------------------------------------------------------------------ Total MDO size: 392 bytes Result "org.openjdk.bench.vm.compiler.AddIdeal_XPlusX_LShiftC.testInt": 3.348 ?(99.9%) 0.032 ns/op [Average] (min, avg, max) = (3.285, 3.348, 3.447), stdev = 0.037 CI (99.9%): [3.316, 3.381] (assumes normal distribution) # Run complete. Total time: 00:00:42 REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell. Benchmark Mode Cnt Score Error Units AddIdeal_XPlusX_LShiftC.testInt avgt 20 3.348 ? 0.032 ns/op Finished running test 'micro:AddIdeal_XPlusX_LShiftC' ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From iveresov at openjdk.java.net Sat Dec 18 20:04:24 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Sat, 18 Dec 2021 20:04:24 GMT Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty [v2] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 23:25:03 GMT, Martin Doerr wrote: >> I have written a checker which detects usage of the illegal phi function. In case of the reproducer provided in the JBS bug ("Reduced.java"), it finds the following and bails out: >> >> invalidating local 8 because of type mismatch (new_value is NULL) >> Bailing out because StoreIndexed (id 98) uses illegal phi (id 68) >> >> I haven't checked why that node uses the illegal phi. That still seems to be a bug. Maybe there's a better solution to the underlying problem, but I hope my checker is useful to analyze bugs and to make C1 more resilient. 
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add test.

I took a deeper look and added a comment to the bug report. The root cause seems to be that, because of the irreducible loops (and therefore an unusual block traversal order when inserting phis), the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk/files/7739876/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bail out in move_to_phi().

Suggested fix:

diff --git a/src/hotspot/share/c1/c1_LIRGenerator.cpp b/src/hotspot/share/c1/c1_LIRGenerator.cpp
index c064558b458..b386b541f89 100644
--- a/src/hotspot/share/c1/c1_LIRGenerator.cpp
+++ b/src/hotspot/share/c1/c1_LIRGenerator.cpp
@@ -963,6 +963,14 @@ void LIRGenerator::move_to_phi(PhiResolver* resolver, Value cur_val, Value sux_v
   Phi* phi = sux_val->as_Phi();
   // cur_val can be null without phi being null in conjunction with inlining
   if (phi != NULL && cur_val != NULL && cur_val != phi && !phi->is_illegal()) {
+    if (phi->is_local()) {
+      for (int i = 0; i < phi->operand_count(); i++) {
+        Value op = phi->operand_at(i);
+        if (op != NULL && op->type()->is_illegal()) {
+          bailout("illegal phi operand");
+        }
+      }
+    }
     Phi* cur_phi = cur_val->as_Phi();
     if (cur_phi != NULL && cur_phi->is_illegal()) {
       // Phi and local would need to get invalidated

-------------

PR: https://git.openjdk.java.net/jdk/pull/6683

From duke at openjdk.java.net Sat Dec 18 21:07:53 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Sat, 18 Dec 2021 21:07:53 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v7]
In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com>
References:
<3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: > A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". > > > // Convert "x + x" into "x << 1" > if (in1 == in2) { > return new LShiftINode(in1, phase->intcon(1)); > } Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: slightly update microbenchmark ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6675/files - new: https://git.openjdk.java.net/jdk/pull/6675/files/290e9f20..16580b27 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=05-06 Stats: 15 lines in 1 file changed: 10 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6675.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675 PR: https://git.openjdk.java.net/jdk/pull/6675 From duke at openjdk.java.net Sat Dec 18 21:51:28 2021 From: duke at openjdk.java.net (amirhadadi) Date: Sat, 18 Dec 2021 21:51:28 GMT Subject: RFR: 8278518: String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination In-Reply-To: References: <2IcCYyqxBU_oi_i9n1LZzTivcLD7QWxAlZga6kiGOPg=.a09fd12b-d1aa-4124-9bb2-31f86da2f706@github.com> Message-ID: On Fri, 17 Dec 2021 22:33:30 GMT, Dean Long wrote: >> Originally this was spotted by by Amir Hadadi in https://stackoverflow.com/questions/70272651/missing-bounds-checking-elimination-in-string-constructor >> >> It looks like in the following code in `String(byte[], int, int, Charset)` >> >> while (offset < sl) { >> int b1 = bytes[offset]; >> if (b1 >= 0) { >> dst[dp++] = (byte)b1; >> offset++; // <--- >> continue; >> } >> if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) && >> offset + 1 < sl) { >> int b2 = bytes[offset + 1]; >> if (!isNotContinuation(b2)) { >> dst[dp++] = (byte)decode2(b1, b2); >> offset += 2; >> 
continue;
>>         }
>>     }
>>     // anything not a latin1, including the repl
>>     // we have to go with the utf16
>>     break;
>> }
>>
>> bounds check elimination is not executed when accessing the byte array via `bytes[offset]`.
>>
>> The reason, I guess, is that the offset variable is modified within the loop (marked with an arrow).
>>
>> A possible fix for this could be changing:
>>
>> `while (offset < sl)` ---> `while (offset >= 0 && offset < sl)`
>>
>> However, the best approach is to invest in a C2 optimization that handles all such cases.
>>
>> The following benchmark demonstrates a good improvement:
>>
>> @State(Scope.Thread)
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class StringConstructorBenchmark {
>>   private byte[] array;
>>   private String str;
>>
>>   @Setup
>>   public void setup() {
>>     str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. ?";//Latin1 ending with Russian
>>     array = str.getBytes(StandardCharsets.UTF_8);
>>   }
>>
>>   @Benchmark
>>   public String newString() {
>>     return new String(array, 0, array.length, StandardCharsets.UTF_8);
>>   }
>>
>>   @Benchmark
>>   public String translateEscapes() {
>>     return str.translateEscapes();
>>   }
>> }
>>
>> Results:
>>
>> //baseline
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op
>>
>> //patched
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> StringConstructorBenchmark.newString  avgt   50  126,908 ± 2,355  ns/op
>>
>> The same is observed in String.translateEscapes() for the same String as in the benchmark above:
>>
>> //baseline
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> StringConstructorBenchmark.translateEscapes  avgt  100  53,627 ± 0,850  ns/op
>>
>> //patched
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> StringConstructorBenchmark.translateEscapes  avgt  100  48,087 ±
1,129 ns/op >> >> Also I've looked into this with `LinuxPerfAsmProfiler`, full output for baseline is available here https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and for patched code here https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7. >> >> Here's the part of baseline assembly responsible for `while` loop: >> >> 3.62% ?? ? 0x00007fed70eb4c1c: mov %ebx,%ecx >> 2.29% ?? ? 0x00007fed70eb4c1e: mov %edx,%r9d >> 2.22% ?? ? 0x00007fed70eb4c21: mov (%rsp),%r8 ;*iload_2 {reexecute=0 rethrow=0 return_oop=0} >> ?? ? ; - java.lang.String::<init>@107 (line 537) >> 2.32% ?? ? 0x00007fed70eb4c25: cmp %r13d,%ecx >> ? ? 0x00007fed70eb4c28: jge 0x00007fed70eb5388 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} >> ? ? ; - java.lang.String::<init>@110 (line 537) >> 3.05% ? ? 0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx >> ? ? 0x00007fed70eb4c32: jae 0x00007fed70eb5319 >> 2.38% ? ? 0x00007fed70eb4c38: mov %r8,(%rsp) >> 2.64% ? ? 0x00007fed70eb4c3c: movslq %ecx,%r8 >> 2.46% ? ? 0x00007fed70eb4c3f: mov %rax,%rbx >> 3.44% ? ? 0x00007fed70eb4c42: sub %r8,%rbx >> 2.62% ? ? 0x00007fed70eb4c45: add $0x1,%rbx >> 2.64% ? ? 0x00007fed70eb4c49: and $0xfffffffffffffffe,%rbx >> 2.30% ? ? 0x00007fed70eb4c4d: mov %ebx,%r8d >> 3.08% ? ? 0x00007fed70eb4c50: add %ecx,%r8d >> 2.55% ? ? 0x00007fed70eb4c53: movslq %r8d,%r8 >> 2.45% ? ? 0x00007fed70eb4c56: add $0xfffffffffffffffe,%r8 >> 2.13% ? ? 0x00007fed70eb4c5a: cmp (%rsp),%r8 >> ? ? 0x00007fed70eb4c5e: jae 0x00007fed70eb5319 >> 3.36% ? ? 0x00007fed70eb4c64: mov %ecx,%edi ;*aload_1 {reexecute=0 rethrow=0 return_oop=0} >> ? ? ; - java.lang.String::<init>@113 (line 538) >> 2.86% ? ?? 0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0} >> ? ?? ; - java.lang.String::<init>@115 (line 538) >> 2.48% ? ?? 0x00007fed70eb4c6c: mov %r9d,%edx >> 2.26% ? ?? 0x00007fed70eb4c6f: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0} >> ? ?? 
; - java.lang.String::<init>@127 (line 540) >> 3.28% ? ?? 0x00007fed70eb4c71: mov %edi,%ebx >> 2.44% ? ?? 0x00007fed70eb4c73: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} >> ? ?? ; - java.lang.String::<init>@134 (line 541) >> 2.35% ? ?? 0x00007fed70eb4c75: test %r8d,%r8d >> ? ?? 0x00007fed70eb4c78: jge 0x00007fed70eb4c04 ;*iflt {reexecute=0 rethrow=0 return_oop=0} >> ?? ; - java.lang.String::<init>@120 (line 539) >> >> and this one is for patched code: >> >> 17.28% ?? 0x00007f6b88eb6061: mov %edx,%r10d ;*iload_2 {reexecute=0 rethrow=0 return_oop=0} >> ?? ; - java.lang.String::<init>@107 (line 537) >> 0.11% ?? 0x00007f6b88eb6064: test %r10d,%r10d >> ? 0x00007f6b88eb6067: jl 0x00007f6b88eb669c ;*iflt {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@108 (line 537) >> 0.39% ? 0x00007f6b88eb606d: cmp %r13d,%r10d >> ? 0x00007f6b88eb6070: jge 0x00007f6b88eb66d0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@114 (line 537) >> 0.66% ? 0x00007f6b88eb6076: mov %ebx,%r9d >> 13.70% ? 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d >> 0.01% ? 0x00007f6b88eb607e: jae 0x00007f6b88eb6671 >> 0.14% ? 0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi ;*baload {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@119 (line 538) >> 0.37% ? 0x00007f6b88eb608a: mov %r9d,%ebx >> 0.99% ? 0x00007f6b88eb608d: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@131 (line 540) >> 12.88% ? 0x00007f6b88eb608f: movslq %r9d,%rsi ;*bastore {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@196 (line 548) >> 0.17% ? 0x00007f6b88eb6092: mov %r10d,%edx >> 0.39% ? 0x00007f6b88eb6095: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0} >> ? ; - java.lang.String::<init>@138 (line 541) >> 0.96% ? 0x00007f6b88eb6097: test %edi,%edi >> 0.02% ? 
0x00007f6b88eb6099: jl 0x00007f6b88eb60dc ;*iflt {reexecute=0 rethrow=0 return_oop=0} >> >> Between instructions mapped to `if_icmpge` and `aload_1` in baseline we have bounds check which is missing from patched code. > > This does look like a HotSpot JIT compiler issue to me. My guess is that it is related to how checkBoundsOffCount() checks for offset < 0: > > 396 if ((length | fromIndex | size) < 0 || size > length - fromIndex) > > using | to combine three values. @dean-long actually the issue reproduces with Java 17 where `checkBoundsOffCount` was implemented in a more straight forward way: static void checkBoundsOffCount(int offset, int count, int length) { if (offset < 0 || count < 0 || offset > length - count) { throw new StringIndexOutOfBoundsException( "offset " + offset + ", count " + count + ", length " + length); } } Here's a [gist](https://gist.github.com/amirhadadi/9505c3f5d9ad68cad2fbfd1b9e01f0b8) with a benchmark you can run. In this benchmark I'm comparing safe and unsafe reads from the byte array (I didn't modify the code to add the offset >= 0 condition). Here are the results: Benchmark Mode Cnt Score Error Units StringBenchmark.safeDecoding avgt 20 120.312 ? 11.674 ns/op StringBenchmark.unsafeDecoding avgt 20 72.628 ? 0.479 ns/op ------------- PR: https://git.openjdk.java.net/jdk/pull/6812 From duke at openjdk.java.net Sun Dec 19 01:46:06 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Sun, 19 Dec 2021 01:46:06 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v8] In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: <4wKAoNo9T2VxvsU_N4oZFB89buBl2_d5BqAWu9J5pbQ=.17768c8c-90b5-4c87-b51c-1249bc9048bd@github.com> > A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". 
> > > // Convert "x + x" into "x << 1" > if (in1 == in2) { > return new LShiftINode(in1, phase->intcon(1)); > } Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: clean microbenchmark. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6675/files - new: https://git.openjdk.java.net/jdk/pull/6675/files/16580b27..2bb824d6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=06-07 Stats: 34 lines in 1 file changed: 0 ins; 24 del; 10 mod Patch: https://git.openjdk.java.net/jdk/pull/6675.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675 PR: https://git.openjdk.java.net/jdk/pull/6675 From duke at openjdk.java.net Sun Dec 19 11:09:24 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sun, 19 Dec 2021 11:09:24 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v8] In-Reply-To: <4wKAoNo9T2VxvsU_N4oZFB89buBl2_d5BqAWu9J5pbQ=.17768c8c-90b5-4c87-b51c-1249bc9048bd@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <4wKAoNo9T2VxvsU_N4oZFB89buBl2_d5BqAWu9J5pbQ=.17768c8c-90b5-4c87-b51c-1249bc9048bd@github.com> Message-ID: <2PNpmhlSlf4Zpa9w9Wyh4VwQ1vGyIPIIxf_mLNxcLIY=.403cd319-e1c8-4079-9dd2-afeddaa3d63c@github.com> On Sun, 19 Dec 2021 01:46:06 GMT, Zhiqiang Zang wrote: >> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". >> >> >> // Convert "x + x" into "x << 1" >> if (in1 == in2) { >> return new LShiftINode(in1, phase->intcon(1)); >> } > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > clean microbenchmark. 
The reason you don't see any shift instruction is that `x << 11 >>> 11` is transformed into `x & 0x1FFFFF` which is the instruction you see in `0x00007f97b9418dae: and $0x1fffff,%eax`. As you can see for these kinds of microbenchmarks the cost of calling functions dominates the cost of the actual transformation, you can mitigate this by pulling the helper method out to be the benchmark (I don't see the reason to have a separate helper method, either), and putting the operation in a loop that has the results being sunk in a compiler blackhole (full blackhole or a non-inlined sink method won't work as effective in this case simply because it is also a function call) in each iteration. ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From jrose at openjdk.java.net Sun Dec 19 20:06:20 2021 From: jrose at openjdk.java.net (John R Rose) Date: Sun, 19 Dec 2021 20:06:20 GMT Subject: RFR: 8278949: Cleanups for 8277850 In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 08:03:19 GMT, Roland Westrelin wrote: > When 8277850 (C2: optimize mask checks in counted loops) was reviewed, > John made a number of comments and suggestions after the change was > integrated. This change includes all of his comments, extra tests to > cover all cases. I also moved the AndIL_add_shift_and_mask() call in > AndXNode::Ideal() up so the expression with a non constant mask can be > optimized as well. Looks good. One naming suggestion about `AndIL_shift_and_mask` src/hotspot/share/opto/mulnode.cpp line 1730: > 1728: // Because the optimization might work for a non-constant > 1729: // mask M, we check the AndX for both operand orders. > 1730: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* shift, Node* mask, BasicType bt, bool check_reverse) { Since this is a boolean function, perhaps its name should indicate what question it is answering. Suggestion: `AndIL_shift_and_mask_is_always_zero` ------------- Marked as reviewed by jrose (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6876 From duke at openjdk.java.net Sun Dec 19 20:28:07 2021 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Sun, 19 Dec 2021 20:28:07 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v9] In-Reply-To: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> Message-ID: <7GTdrykHxvlcM_-gJENv_ZiyI-EQst9FV8ieJXOM-p8=.8f1f2335-8ef1-446e-addc-9869f5a70ce2@github.com> > A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". > > > // Convert "x + x" into "x << 1" > if (in1 == in2) { > return new LShiftINode(in1, phase->intcon(1)); > } Zhiqiang Zang has updated the pull request incrementally with three additional commits since the last revision: - refactor the ir test. - rename tests. - use compiler mode blackhole in microbenchmark to prevent function calling from dominating time. 
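The arithmetic behind this change can be checked in plain Java. The sketch below is not HotSpot code: the class and method names are hypothetical, and `helper` is only a reading of the `helper(I)I` bytecode posted earlier in the thread (`iadd`, `ishl 10`, `iushr 11`). Under those assumptions it checks that `x + x` equals `x << 1`, and that the whole expression collapses to the `x & 0x1FFFFF` mask mentioned in the review.

```java
class XPlusXIdentities {
    // Hypothetical reconstruction of the benchmarked expression:
    // ((x + x) << 10) >>> 11
    static int helper(int x) {
        return ((x + x) << 10) >>> 11;
    }

    // True when both identities hold for the given input.
    static boolean holdsFor(int x) {
        // int addition wraps modulo 2^32, exactly like a left shift by one
        boolean addIsShift = (x + x) == (x << 1);
        // x << 11 followed by >>> 11 keeps only the low 21 bits of x
        boolean collapsesToMask = helper(x) == (x & 0x1FFFFF);
        return addIsShift && collapsesToMask;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, 0x1FFFFF, 0x200000,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + " -> " + holdsFor(x));
        }
    }
}
```

A spot check like this says nothing about what C2 emits; it only shows that the rewritten forms compute the same values, including for inputs that overflow.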
-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6675/files
  - new: https://git.openjdk.java.net/jdk/pull/6675/files/2bb824d6..306c95a3

Webrevs:
  - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=08
  - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6675&range=07-08

  Stats: 499 lines in 4 files changed: 244 ins; 255 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6675.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6675/head:pull/6675

PR: https://git.openjdk.java.net/jdk/pull/6675

From duke at openjdk.java.net Sun Dec 19 20:30:24 2021
From: duke at openjdk.java.net (Zhiqiang Zang)
Date: Sun, 19 Dec 2021 20:30:24 GMT
Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v8]
In-Reply-To: <4wKAoNo9T2VxvsU_N4oZFB89buBl2_d5BqAWu9J5pbQ=.17768c8c-90b5-4c87-b51c-1249bc9048bd@github.com>
References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <4wKAoNo9T2VxvsU_N4oZFB89buBl2_d5BqAWu9J5pbQ=.17768c8c-90b5-4c87-b51c-1249bc9048bd@github.com>
Message-ID: <55Y6O9SotySkkbpXLOJ5B_HopkTckx-zJzt-9lAcopA=.b8e7289a-98e7-4fe5-92d3-b739c2d6035f@github.com>

On Sun, 19 Dec 2021 01:46:06 GMT, Zhiqiang Zang wrote:

>> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1".
>>
>>
>>     // Convert "x + x" into "x << 1"
>>     if (in1 == in2) {
>>       return new LShiftINode(in1, phase->intcon(1));
>>     }
>
> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>
>   clean microbenchmark.

Thank you very much! I adapted the microbenchmark according to your advice and I am able to observe a notable difference now! I ran `make test TEST="micro:LShiftIdeal_XPlusX_LShiftC" MICRO="JAVA_OPTIONS=-Djmh.blackhole.mode=COMPILER" CONF_NAME="cov"` and this is the result:

Baseline:
Benchmark                             Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.baselineInt   avgt   60  1.457 ± 0.006  ns/op
AddIdeal_XPlusX_LShiftC.baselineLong  avgt   60  1.453 ± 0.004  ns/op
AddIdeal_XPlusX_LShiftC.testInt       avgt   60  2.521 ± 0.010  ns/op
AddIdeal_XPlusX_LShiftC.testLong      avgt   60  2.518 ± 0.009  ns/op

Patch:
Benchmark                             Mode  Cnt  Score   Error  Units
AddIdeal_XPlusX_LShiftC.baselineInt   avgt   60  1.455 ± 0.005  ns/op
AddIdeal_XPlusX_LShiftC.baselineLong  avgt   60  1.453 ± 0.005  ns/op
AddIdeal_XPlusX_LShiftC.testInt       avgt   60  1.455 ± 0.006  ns/op
AddIdeal_XPlusX_LShiftC.testLong      avgt   60  1.677 ± 0.005  ns/op

> As you can see for these kinds of microbenchmarks the cost of calling functions dominates the cost of the actual transformation, you can mitigate this by pulling the helper method out to be the benchmark (I don't see the reason to have a separate helper method, either), and putting the operation in a loop that has the results being sunk in a compiler blackhole (full blackhole or a non-inlined sink method won't work as effective in this case simply because it is also a function call) in each iteration.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6675

From fgao at openjdk.java.net Mon Dec 20 04:39:20 2021
From: fgao at openjdk.java.net (Fei Gao)
Date: Mon, 20 Dec 2021 04:39:20 GMT
Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6]
In-Reply-To: <7f-CqEaprhPTYL6iSBehcZiKgzn2n9BqF19fWEKVWhU=.a113c51f-ec24-4ff2-94c9-17cec9d64f5f@github.com>
References: <7f-CqEaprhPTYL6iSBehcZiKgzn2n9BqF19fWEKVWhU=.a113c51f-ec24-4ff2-94c9-17cec9d64f5f@github.com>
Message-ID:

On Sat, 18 Dec 2021 15:01:53 GMT, Jie Fu wrote:

> I discussed this opt with @theRealAph offline.
>
> To clarify from my point of view:
>
> 1. I have no objection to this PR.
> 2. I'd like to see a benchmark which people would write in real programs.
> 3. But if the OpenJDK experts think it's already good enough, please go ahead.
>
> Thanks.

Hi, @DamonFool

I ran the jtreg test internally with some logging info to verify if the optimization works in real Java programs.
The results show that these patterns are hit in the following cases:

- java/lang/StackWalker/LocalsAndOperands.java#id0
- java/lang/StackWalker/LocalsAndOperands.java#id1
- java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java
- java/util/concurrent/tck/JSR166TestCase.java
- javax/management/timer/MissingNotificationTest.java
- jdk/incubator/vector/Double128VectorTests.java
- jdk/incubator/vector/Double256VectorTests.java
- jdk/incubator/vector/Double512VectorTests.java
- jdk/incubator/vector/Double64VectorTests.java
- jdk/incubator/vector/DoubleMaxVectorTests.java
- jdk/incubator/vector/Float128VectorTests.java
- jdk/incubator/vector/Float256VectorTests.java
- jdk/incubator/vector/Float512VectorTests.java
- jdk/incubator/vector/Float64VectorTests.java
- jdk/incubator/vector/FloatMaxVectorTests.java
- jdk/incubator/vector/Vector128ConversionTests.java
- jdk/incubator/vector/Vector256ConversionTests.java
- jdk/incubator/vector/Vector64ConversionTests.java#id0
- jdk/incubator/vector/VectorMaxConversionTests.java

It's not easy to identify these patterns in the original Java code by eye. Since the added code lines are hit, the patterns must occur after many rounds of optimization. Definitely, it benefits all platforms, whether x86 or aarch64.

As for the current benchmark, it's not meant to show the real performance gain but to illustrate that the opto benefits x86 as well, in case you wonder. If you need a real Java program, the case won't be lightweight or straightforward. Maybe I can't provide you with a satisfying microbenchmark.

Thanks.
-------------

PR: https://git.openjdk.java.net/jdk/pull/6755

From jiefu at openjdk.java.net Mon Dec 20 06:16:23 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Mon, 20 Dec 2021 06:16:23 GMT
Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6]
In-Reply-To:
References: <7f-CqEaprhPTYL6iSBehcZiKgzn2n9BqF19fWEKVWhU=.a113c51f-ec24-4ff2-94c9-17cec9d64f5f@github.com>
Message-ID:

On Mon, 20 Dec 2021 04:35:37 GMT, Fei Gao wrote:

> > I discussed this opt with @theRealAph offline.
> > To clarify from my point of view:
> >
> > 1. I have no objection to this PR.
> > 2. I'd like to see a benchmark which people would write in real programs.
> > 3. But if the OpenJDK experts think it's already good enough, please go ahead.
> >
> > Thanks.
>
> Hi, @DamonFool
>
> I ran the jtreg test internally with some logging info to verify if the optimization works in real Java programs. The results show that these patterns are hit in the following cases:
>
> - java/lang/StackWalker/LocalsAndOperands.java#id0
> - java/lang/StackWalker/LocalsAndOperands.java#id1
> - java/lang/invoke/LFCaching/LFSingleThreadCachingTest.java
> - java/util/concurrent/tck/JSR166TestCase.java
> - javax/management/timer/MissingNotificationTest.java
> - jdk/incubator/vector/Double128VectorTests.java
> - jdk/incubator/vector/Double256VectorTests.java
> - jdk/incubator/vector/Double512VectorTests.java
> - jdk/incubator/vector/Double64VectorTests.java
> - jdk/incubator/vector/DoubleMaxVectorTests.java
> - jdk/incubator/vector/Float128VectorTests.java
> - jdk/incubator/vector/Float256VectorTests.java
> - jdk/incubator/vector/Float512VectorTests.java
> - jdk/incubator/vector/Float64VectorTests.java
> - jdk/incubator/vector/FloatMaxVectorTests.java
> - jdk/incubator/vector/Vector128ConversionTests.java
> - jdk/incubator/vector/Vector256ConversionTests.java
> - jdk/incubator/vector/Vector64ConversionTests.java#id0
> - jdk/incubator/vector/VectorMaxConversionTests.java
>
> It's not easy to identify these patterns in the original Java code by eye. Since the added code lines are hit, the patterns must occur after many rounds of optimization. Definitely, it benefits all platforms, whether x86 or aarch64.
>
> As for the current benchmark, it's not meant to show the real performance gain but to illustrate that the opto benefits x86 as well, in case you wonder. If you need a real Java program, the case won't be lightweight or straightforward. Maybe I can't provide you with a satisfying microbenchmark.
>
> Thanks.

Good news! But can you show us an example with a more detailed analysis of which pattern is applied in the test? Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6755

From duke at openjdk.java.net Mon Dec 20 07:53:22 2021
From: duke at openjdk.java.net (Danil Bubnov)
Date: Mon, 20 Dec 2021 07:53:22 GMT
Subject: RFR: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10>
In-Reply-To: <-Wzb3RXPFeEY0E7HDT-7lWw1_2mxReE-J8FPeI3UKl8=.4211598e-f032-46b8-8aa4-6f5254785193@github.com>
References: <-Wzb3RXPFeEY0E7HDT-7lWw1_2mxReE-J8FPeI3UKl8=.4211598e-f032-46b8-8aa4-6f5254785193@github.com>
Message-ID:

On Wed, 1 Dec 2021 18:17:41 GMT, Andrew Haley wrote:

>> This is the fix of aarch64 jvmci calling convention.
>>
>> On MacOS/aarch64 "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only wordsize or bigger slots, which is incorrect (that is why tests were failing [4]). Now arguments consume the right amount of bytes.
>>
>> Another problem is that the current approach doesn't keep the Stack Pointer 16-byte aligned [1][2][3]. However, the tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, so 16-byte alignment comes automatically. But if you try to add or delete one argument, the tests will fail with SIGBUS.
>> >> I've tested this patch on MacOS/aarch64 and Linux/aarch64, all tests have passed. >> >> Also I don't understand, why current tests (NativeCallTest) use only int, long, float and double as arguments types. Is it possible to add functions with another types like byte or short? I tried, but it fails on every platform. >> >> [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms >> [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack >> [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack >> [4] https://bugs.openjdk.java.net/browse/JDK-8262901 > > Please add some test with an odd number of arguments. @theRealAph Can you review changes? ------------- PR: https://git.openjdk.java.net/jdk/pull/6641 From duke at openjdk.java.net Mon Dec 20 08:54:29 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Mon, 20 Dec 2021 08:54:29 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> References: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> Message-ID: On Fri, 17 Dec 2021 19:50:05 GMT, Vladimir Kozlov wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> flip ifs > > So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. > But is it used in **optimized** build or only in debug? > > I suggest to change `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and try to build optimized VM to see if it is used in it. > I see only prints and asserts. @vnkozlov I don't know what an optimized build is, you'll have to point me to some instructions if you want me to build one. I think changing assertions around BlockPrinter is outside the scope of this PR. 
Perhaps another PR to clean up optimized builds is in order for the future? ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From duke at openjdk.java.net Mon Dec 20 08:59:26 2021 From: duke at openjdk.java.net (Ludvig Janiuk) Date: Mon, 20 Dec 2021 08:59:26 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 22:07:43 GMT, Albert Mingkun Yang wrote: >> Ludvig Janiuk has updated the pull request incrementally with one additional commit since the last revision: >> >> flip ifs > > I wonder if "map fusion" works here, as in improving the perf of verification. > > Using `IR::verify_local` as an example: > > > { > VerifyClosure cl; > blocks.iterate_forward(&cl); > } > > class VerifyClosure : public BlockClosure { > > void block_do(BlockBegin* block) override { > verify_end_not_null(block); > > verify_edge_mutuality(block); > > verify_block_begin_field(block); > > // more verifier > } > } > > > This should cut down #iteration to 1. @albertnetymk It would work for almost all of the verifiers, but there is a tradeoff with code flexibility. I'd opt to leave as-is. ------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From jiefu at openjdk.java.net Mon Dec 20 09:18:25 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 20 Dec 2021 09:18:25 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6] In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 11:29:56 GMT, Fei Gao wrote: >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. 
Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. >> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. >> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. >> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. 
Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Use uabs() to calculate the absolute value of constant > > Change-Id: Ie6f37ab159fb7092e1443b9af8d620562a45ae47 test/hotspot/jtreg/compiler/c2/TestAbs.java line 39: > 37: > 38: public class TestAbs { > 39: private static int SIZE = 500; Not used? test/hotspot/jtreg/compiler/c2/TestAbs.java line 114: > 112: > 113: // Test abs(constant) optimization for float > 114: Asserts.assertEquals(Float.NaN, Math.abs(Float.NaN)); I would suggest something like: assertTrue(Float.isNaN(Math.abs(Float.NaN))) test/hotspot/jtreg/compiler/c2/TestAbs.java line 136: > 134: } > 135: > 136: private static void testAbsTransformInt(int[] a) { If you want to verify C2's transformation, probably we should use C2's IR test framework. 
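The rewrites discussed in this review can be sanity-checked at the Java level. What follows is a rough sketch, not a jtreg or IR-framework test: the class and method names are made up for illustration. It only checks that `abs(0 - x)` and `abs(abs(x))` agree with `abs(x)`, that `Math.abs` leaves a `char` unchanged, and that the NaN case is tested with `Float.isNaN`, as suggested above, rather than with an equality assertion.

```java
class AbsIdentities {
    // abs(0 - x) == abs(x); for Integer.MIN_VALUE both sides wrap to MIN_VALUE
    static boolean negationIsRedundant(int x) {
        return Math.abs(0 - x) == Math.abs(x);
    }

    // abs(abs(x)) == abs(x)
    static boolean absIsIdempotent(int x) {
        return Math.abs(Math.abs(x)) == Math.abs(x);
    }

    // A char widens to a non-negative int, so abs cannot change it
    static boolean absOfCharIsIdentity(char c) {
        return Math.abs(c) == c;
    }

    // NaN never compares equal to itself with ==, so check it with isNaN
    static boolean absOfNaNIsNaN() {
        return Float.isNaN(Math.abs(Float.NaN));
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + ": " + negationIsRedundant(x) + " " + absIsIdempotent(x));
        }
        System.out.println("char: " + absOfCharIsIdentity(Character.MAX_VALUE));
        System.out.println("NaN: " + absOfNaNIsNaN());
    }
}
```

These checks confirm only that the rewritten forms are value-preserving. Whether C2 actually applies the transformation is a separate question, which is what the IR test framework mentioned above is for.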
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From chagedorn at openjdk.java.net Mon Dec 20 09:31:28 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 20 Dec 2021 09:31:28 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> References: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> Message-ID: <-8KQmMm4qVNnktitnRmmmVKKDdm8vBlRoGxLJiaqzKw=.bc118c49-aee2-4cdf-8551-f33c134b65ae@github.com> On Fri, 17 Dec 2021 19:50:05 GMT, Vladimir Kozlov wrote: > So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. But is it used in **optimized** build or only in debug? > > I suggest to change `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and try to build optimized VM to see if it is used in it. I see only prints and asserts. Shouldn't we leave `BlockPrinter` and `IR::print()` under `#ifndef PRODUCT`? These seem to be about printing things only which is probably good to have in optimized builds as well. We could instead just add an `#ifdef ASSERT` on L1263 for the validators and the IR verification and change the places using this code accordingly to `#ifdef ASSERT/DEBUG_ONLY()`. Then we would have everything under `ASSERT` instead of `!PRODUCT`. This would be in line with the original code which used `#ifdef ASSERT` in `IR::verify()`. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From fgao at openjdk.java.net Mon Dec 20 09:43:25 2021 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 20 Dec 2021 09:43:25 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6] In-Reply-To: References: <7f-CqEaprhPTYL6iSBehcZiKgzn2n9BqF19fWEKVWhU=.a113c51f-ec24-4ff2-94c9-17cec9d64f5f@github.com> Message-ID: On Mon, 20 Dec 2021 06:13:34 GMT, Jie Fu wrote: > But can you show us an example with more detailed analysis which pattern is applied in the test? Thanks. Hi, @DamonFool For example, `jdk/incubator/vector/Float512VectorTests.java` calls `java.lang.FdLibm$Hypot::compute`. You can check the math classes in `Fdlibm.java`, like `Hypot` or a common one, `Pow`, which call `Math.abs()`. After inlining and many other optimizations, such as constant propagation, the input value of `Math.abs()` is probably a constant or `(0-x)`. We can optimize it using this patch. I learnt the optimization technique from my colleague's patch, https://github.com/openjdk/jdk/pull/2776#issuecomment-789756226 A similar question was answered by Tobias in that conversation, and you can refer to it. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From jiefu at openjdk.java.net Mon Dec 20 09:43:25 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 20 Dec 2021 09:43:25 GMT Subject: RFR: 8276673: Optimize abs operations in C2 compiler [v6] In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 11:29:56 GMT, Fei Gao wrote: >> The patch aims to help optimize Math.abs() mainly from these three parts: >> 1) Remove redundant instructions for abs with constant values >> 2) Remove redundant instructions for abs with char type >> 3) Convert some common abs operations to ideal forms >> >> 1. 
Remove redundant instructions for abs with constant values >> >> If we can decide the value of the input node for function Math.abs() >> at compile-time, we can substitute the Abs node with the absolute >> value of the constant and don't have to calculate it at runtime. >> >> For example, >> int[] a >> for (int i = 0; i < SIZE; i++) { >> a[i] = Math.abs(-38); >> } >> >> Before the patch, the generated code for the testcase above is: >> ... >> mov w10, #0xffffffda >> cmp w10, wzr >> cneg w17, w10, lt >> dup v16.8h, w17 >> ... >> After the patch, the generated code for the testcase above is : >> ... >> movi v16.4s, #0x26 >> ... >> >> 2. Remove redundant instructions for abs with char type >> >> In Java semantics, as the char type is always non-negative, we >> could actually remove the absI node in the C2 middle end. >> >> As for vectorization part, in current SLP, the vectorization of >> Math.abs() with char type is intentionally disabled after >> JDK-8261022 because it generates incorrect result before. After >> removing the AbsI node in the middle end, Math.abs(char) can be >> vectorized naturally. >> >> For example, >> >> char[] a; >> char[] b; >> for (int i = 0; i < SIZE; i++) { >> b[i] = (char) Math.abs(a[i]); >> } >> >> Before the patch, the generated assembly code for the testcase >> above is: >> >> B15: >> add x13, x21, w20, sxtw #1 >> ldrh w11, [x13, #16] >> cmp w11, wzr >> cneg w10, w11, lt >> strh w10, [x13, #16] >> ldrh w10, [x13, #18] >> cmp w10, wzr >> cneg w10, w10, lt >> strh w10, [x13, #18] >> ... >> add w20, w20, #0x1 >> cmp w20, w17 >> b.lt B15 >> >> After the patch, the generated assembly code is: >> B15: >> sbfiz x18, x19, #1, #32 >> add x0, x14, x18 >> ldr q16, [x0, #16] >> add x18, x21, x18 >> str q16, [x18, #16] >> ldr q16, [x0, #32] >> str q16, [x18, #32] >> ... >> add w19, w19, #0x40 >> cmp w19, w17 >> b.lt B15 >> >> 3. 
Convert some common abs operations to ideal forms >> >> The patch overrides some virtual support functions for AbsNode >> so that optimization of gvn can work on it. Here are the optimizable >> forms: >> >> a) abs(0 - x) => abs(x) >> >> Before the patch: >> ... >> ldr w13, [x13, #16] >> neg w13, w13 >> cmp w13, wzr >> cneg w14, w13, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... >> >> b) abs(abs(x)) => abs(x) >> >> Before the patch: >> ... >> ldr w12, [x12, #16] >> cmp w12, wzr >> cneg w12, w12, lt >> cmp w12, wzr >> cneg w12, w12, lt >> ... >> After the patch: >> ... >> ldr w13, [x13, #16] >> cmp w13, wzr >> cneg w13, w13, lt >> ... > > Fei Gao has updated the pull request incrementally with one additional commit since the last revision: > > Use uabs() to calculate the absolute value of constant > > Change-Id: Ie6f37ab159fb7092e1443b9af8d620562a45ae47 > > But can you show us an example with more detailed analysis which pattern is applied in the test? Thanks. > > Hi, @DamonFool > > For example, `jdk/incubator/vector/Float512VectorTests.java` calls `java.lang.FdLibm$Hypot::compute`. You can check math classes in `Fdlibm.java`, like `Hypot` or a common one `Pow`, which calls `Math.abs()`. After inline and many optimizations, such as constant propagation, the input value of `Math.abs()` is probably constant or `(0-x)`. We can optimize it using this patch. > > I learnt the optimization technique from the patch of my colleague, [#2776 (comment)](https://github.com/openjdk/jdk/pull/2776#issuecomment-789756226) The similar question was answered by Tobias in the conversation, and you can refer to it. > > Thanks. Very good! You had proved that these patterns do exist in C2's opt passes, so this patch makes sense to me. Thanks. 
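The identities discussed in this thread can be checked at the Java level with a small sketch. This only demonstrates that the rewrites preserve results (including for `Integer.MIN_VALUE`, where negation overflows); it says nothing about what C2 actually emits. The class and method names are hypothetical.

```java
public class AbsIdentities {
    // abs(0 - x), the form the patch rewrites to abs(x)
    static int absOfNegation(int x) {
        return Math.abs(0 - x);
    }

    // abs(abs(x)), the form the patch rewrites to abs(x)
    static int absOfAbs(int x) {
        return Math.abs(Math.abs(x));
    }

    // abs on a char: the char is promoted to a non-negative int, so abs is a no-op
    static char absOfChar(char c) {
        return (char) Math.abs(c);
    }

    // True when both int identities hold for the given input
    static boolean identitiesHold(int x) {
        return absOfNegation(x) == Math.abs(x) && absOfAbs(x) == Math.abs(x);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 38, -38, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            System.out.println(x + " -> identities hold: " + identitiesHold(x));
        }
        System.out.println("abs((char) 0xFFFF) unchanged: " + (absOfChar('\uFFFF') == '\uFFFF'));
    }
}
```

`Integer.MIN_VALUE` is the interesting sample: both `0 - x` and `Math.abs(x)` overflow back to `Integer.MIN_VALUE`, so the identities still hold there.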
------------- PR: https://git.openjdk.java.net/jdk/pull/6755 From roland at openjdk.java.net Mon Dec 20 09:44:49 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 09:44:49 GMT Subject: [jdk18] RFR: 8278413: C2 crash when allocating array of size too large In-Reply-To: References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: On Fri, 17 Dec 2021 10:42:48 GMT, Nils Eliasson wrote: >> On the fallthrough path from an AllocateArray, the length of the >> allocated array is cast (with a CastII) to [0, max_size] with >> max_size some number that depends on the array type and can be less >> than max_jint. >> >> Allocating an array of a length that's not in [0, max_size] causes the >> CastII to become top. The fallthrough path must be killed as well in >> that case, otherwise this can lead to a broken graph. Currently c2 has >> logic to protect against an allocation of an array of negative size in >> AllocateArrayNode::Ideal(). That call replaces the fallthrough path >> with a Halt node. But if the size is too big, then the fallthrough >> path is left as is. >> >> This patch fixes that issue. It also reworks the negative-length >> case. I added a Bool/CmpU input to the AllocateArray that tests for a >> valid length. If that input becomes false, CatchNode::Value() kills >> the fallthrough path. That logic is similar to that for a virtual call >> with a null receiver. I also removed AllocateArrayNode::Ideal() now >> that CatchNode::Value() takes care of the same corner case. The code >> in AllocateArrayNode::Ideal() was added by Vladimir and he told me he >> tried extending CatchNode::Value() at the time but that caused test >> failures. I had no issues in my testing so I assume doing it that way >> is ok now. >> >> The new input to AllocateArray is moved to the CallStaticJava runtime >> call for array allocation on macro expansion as a precedence edge. 
The >> reason for that is that final graph reshape needs a way to tell >> whether the missing path out of the allocation is legal or not. Final >> graph reshape then removes the now-useless precedence edge. > > Passes testing tier1-3 @neliasso @vnkozlov thanks for the reviews. ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From roland at openjdk.java.net Mon Dec 20 09:44:52 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 09:44:52 GMT Subject: [jdk18] Integrated: 8278413: C2 crash when allocating array of size too large In-Reply-To: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> References: <8R8EXB3nE4Y9EKNp954R45IwY7LiQ2YC56tFXnVGI_E=.8734a69e-4ed8-43cc-ba78-689bea43dc35@github.com> Message-ID: On Wed, 15 Dec 2021 12:36:17 GMT, Roland Westrelin wrote: > On the fallthrough path from an AllocateArray, the length of the > allocated array is cast (with a CastII) to [0, max_size] with > max_size some number that depends on the array type and can be less > than max_jint. > > Allocating an array of a length that's not in [0, max_size] causes the > CastII to become top. The fallthrough path must be killed as well in > that case, otherwise this can lead to a broken graph. Currently c2 has > logic to protect against an allocation of an array of negative size in > AllocateArrayNode::Ideal(). That call replaces the fallthrough path > with a Halt node. But if the size is too big, then the fallthrough > path is left as is. > > This patch fixes that issue. It also reworks the negative-length > case. I added a Bool/CmpU input to the AllocateArray that tests for a > valid length. If that input becomes false, CatchNode::Value() kills > the fallthrough path. That logic is similar to that for a virtual call > with a null receiver. I also removed AllocateArrayNode::Ideal() now > that CatchNode::Value() takes care of the same corner case. 
The code > in AllocateArrayNode::Ideal() was added by Vladimir and he told me he > tried extending CatchNode::Value() at the time but that caused test > failures. I had no issues in my testing so I assume doing it that way > is ok now. > > The new input to AllocateArray is moved to the CallStaticJava runtime > call for array allocation on macro expansion as a precedence edge. The > reason for that is that final graph reshape needs a way to tell > whether the missing path out of the allocation is legal or not. final > graph reshape then removes the then useless precedence edge. This pull request has now been integrated. Changeset: deaf75a5 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk18/commit/deaf75a58587f80046204de7559ff50b3b770bed Stats: 200 lines in 9 files changed: 125 ins; 53 del; 22 mod 8278413: C2 crash when allocating array of size too large Reviewed-by: neliasso, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/30 From roland at openjdk.java.net Mon Dec 20 09:57:26 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 09:57:26 GMT Subject: RFR: 8278949: Cleanups for 8277850 In-Reply-To: References: Message-ID: On Sun, 19 Dec 2021 20:03:27 GMT, John R Rose wrote: >> When 8277850 (C2: optimize mask checks in counted loops) was reviewed, >> John made a number of comments and suggestions after the change was >> integrated. This change includes all of his comments, extra tests to >> cover all cases. I also moved the AndIL_add_shift_and_mask() call in >> AndXNode::Ideal() up so the expression with a non constant mask can be >> optimized as well. > > Looks good. One naming suggestion about `AndIL_shift_and_mask` @rose00 @vnkozlov thanks for the reviews. > src/hotspot/share/opto/mulnode.cpp line 1730: > >> 1728: // Because the optimization might work for a non-constant >> 1729: // mask M, we check the AndX for both operand orders. 
>> 1730: bool MulNode::AndIL_shift_and_mask(PhaseGVN* phase, Node* shift, Node* mask, BasicType bt, bool check_reverse) { > > Since this is a boolean function, perhaps its name should indicate what question it is answering. > > Suggestion: `AndIL_shift_and_mask_is_always_zero` I will make that change before I push it. ------------- PR: https://git.openjdk.java.net/jdk/pull/6876 From roland at openjdk.java.net Mon Dec 20 10:04:11 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 10:04:11 GMT Subject: RFR: 8278949: Cleanups for 8277850 [v2] In-Reply-To: References: Message-ID: > When 8277850 (C2: optimize mask checks in counted loops) was reviewed, > John made a number of comments and suggestions after the change was > integrated. This change includes all of his comments, extra tests to > cover all cases. I also moved the AndIL_add_shift_and_mask() call in > AndXNode::Ideal() up so the expression with a non constant mask can be > optimized as well. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - renaming AndIL_shift_and_mask -> AndIL_shift_and_mask_is_always_zero - Merge branch 'master' into JDK-8278949 - whitespaces - John's comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6876/files - new: https://git.openjdk.java.net/jdk/pull/6876/files/45dab951..e2933fac Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6876&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6876&range=00-01 Stats: 6073 lines in 233 files changed: 2747 ins; 2337 del; 989 mod Patch: https://git.openjdk.java.net/jdk/pull/6876.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6876/head:pull/6876 PR: https://git.openjdk.java.net/jdk/pull/6876 From roland at openjdk.java.net Mon Dec 20 10:04:13 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 10:04:13 GMT Subject: Integrated: 8278949: Cleanups for 8277850 In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 08:03:19 GMT, Roland Westrelin wrote: > When 8277850 (C2: optimize mask checks in counted loops) was reviewed, > John made a number of comments and suggestions after the change was > integrated. This change includes all of his comments, extra tests to > cover all cases. I also moved the AndIL_add_shift_and_mask() call in > AndXNode::Ideal() up so the expression with a non constant mask can be > optimized as well. This pull request has now been integrated. 
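As a hedged illustration of the question that the renamed `AndIL_shift_and_mask_is_always_zero` appears to answer: `(x << shift) & mask` is zero for every `x` when all set bits of `mask` lie below bit position `shift`. The sketch below is a plain Java model with hypothetical names, limited to `int` and constant inputs; it is not the C2 implementation.

```java
public class ShiftMaskZero {
    // True when every set bit of mask lies below bit position (shift & 31),
    // which guarantees that (x << shift) & mask == 0 for all x.
    static boolean maskFitsBelowShift(int shift, int mask) {
        int s = shift & 31;              // Java uses only the low 5 bits of an int shift count
        return (mask & (-1 << s)) == 0;  // no mask bit at position s or above
    }

    // The expression being reasoned about.
    static int shiftAndMask(int x, int shift, int mask) {
        return (x << shift) & mask;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            System.out.println("(" + x + " << 3) & 7 = " + shiftAndMask(x, 3, 7));
        }
        System.out.println("mask 7 fits below shift 3: " + maskFitsBelowShift(3, 7));
        System.out.println("mask 7 fits below shift 2: " + maskFitsBelowShift(2, 7));
    }
}
```

When the check returns false, as for shift 2 and mask 7, the expression can be non-zero (for example `(1 << 2) & 7` is 4), so no folding to zero is possible.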
Changeset: 06206c71 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/06206c7199e9b49382d5f489ed5733525a95a535 Stats: 333 lines in 3 files changed: 291 ins; 16 del; 26 mod 8278949: Cleanups for 8277850 Co-authored-by: John R Rose Reviewed-by: kvn, jrose ------------- PR: https://git.openjdk.java.net/jdk/pull/6876 From roland at openjdk.java.net Mon Dec 20 10:09:59 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 20 Dec 2021 10:09:59 GMT Subject: RFR: 8278784: C2: Refactor PhaseIdealLoop::remix_address_expressions() so it operates on longs Message-ID: The logic in PhaseIdealLoop::remix_address_expressions() that's specific to int nodes apply equally to long nodes. This change refactors that code so it applies to both integer types. This improves performance on some of panama's micro-benchmarks. ------------- Commit messages: - test - remix address Changes: https://git.openjdk.java.net/jdk/pull/6892/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6892&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278784 Stats: 264 lines in 5 files changed: 178 ins; 34 del; 52 mod Patch: https://git.openjdk.java.net/jdk/pull/6892.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6892/head:pull/6892 PR: https://git.openjdk.java.net/jdk/pull/6892 From shade at openjdk.java.net Mon Dec 20 11:59:05 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 20 Dec 2021 11:59:05 GMT Subject: RFR: 8277893: Arraycopy stress tests [v5] In-Reply-To: References: Message-ID: > I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
> > A brief tour of these tests: > > - Tests all data types; > - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; > - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; > - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; > - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; > > My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. > > Test times: > > > # x86_64 (TR 3970X) > real 4m6.192s > user 52m50.523s > sys 0m13.755s > > # x86_64 (TR 3970X) -XX:+UseZGC > real 6m2.573s > user 72m43.541s > sys 0m25.697s > > # x86_32 (TR 3970X) > real 6m56.405s > user 92m56.377s > sys 0m6.677s > > # x86_64 (i5-11500) > real 29m19.024s > user 103m52.925s > sys 1m7.175s > > # AArch64 (ThunderX2) > real 2m59.623s > user 26m14.624s > sys 0m9.771s > > > Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. > > Additional testing: > - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` > - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` > - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 14 additional commits since the last revision: - Merge branch 'master' into JDK-8277893-arraycopy-tests - Bump timeout to 7200 - Merge branch 'master' into JDK-8277893-arraycopy-tests - Package declarations - Add safety check for small systems - Renames - Single driver for all the tests - Safer timeout settings - Post-merge TEST.groups cleanup - Merge branch 'master' into JDK-8277893-arraycopy-tests - ... and 4 more: https://git.openjdk.java.net/jdk/compare/07d279a8...6789eb8b ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6594/files - new: https://git.openjdk.java.net/jdk/pull/6594/files/b749c367..6789eb8b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=03-04 Stats: 20094 lines in 535 files changed: 14663 ins; 3580 del; 1851 mod Patch: https://git.openjdk.java.net/jdk/pull/6594.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6594/head:pull/6594 PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Mon Dec 20 11:59:14 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 20 Dec 2021 11:59:14 GMT Subject: RFR: 8277893: Arraycopy stress tests [v4] In-Reply-To: References: Message-ID: On Thu, 9 Dec 2021 07:12:47 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
>> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 13 additional commits since the last revision: > > - Bump timeout to 7200 > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Package declarations > - Add safety check for small systems > - Renames > - Single driver for all the tests > - Safer timeout settings > - Post-merge TEST.groups cleanup > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - ... and 3 more: https://git.openjdk.java.net/jdk/compare/fbdbb5b1...b749c367 Rebased to current master. Tests still pass. I think I need a second (R)eviewer to push this, please. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From jbhateja at openjdk.java.net Mon Dec 20 13:39:45 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 20 Dec 2021 13:39:45 GMT Subject: RFR: 8273322: Enhance macro logic optimization for masked logic operations. Message-ID: This patch extends the existing macro logic inference algorithm to handle masked logic operations. Existing algorithm: 1. Identify logic cone roots. 2. Pack parent and child logic nodes into a MacroLogic node in a bottom-up traversal if the input constraint is met, i.e. the maximum number of inputs that a macro logic node can have is not exceeded. 3. Perform symbolic evaluation of the logic expression tree by assigning to each input a value corresponding to a truth table column. 4. The inputs together with the encoded function represent a macro logic node, which mimics a truth table. Modification: Extended the packing algorithm to operate on both predicated and non-predicated logic nodes. The following rules define the criteria under which nodes get packed into a macro logic node: 1. Parent and both child nodes are all unmasked or all masked with the same predicate. 2. A masked parent can be packed with its left child if the child is predicated and both have the same predicate. 3. A masked parent can be packed with its right child if the child is un-predicated or has a matching predication condition. 4. 
An unmasked parent can be packed with an unmasked child. The new jtreg test case added with the patch exhaustively covers all the different combinations of predication of parent and child nodes. The following are the performance numbers for the JMH benchmarks included with the patch. Machine Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)

Benchmark | ARRAYLEN | Baseline (ops/s) | Withopt (ops/s) | Gain (withopt/baseline)
-- | -- | -- | -- | --
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 64 | 2365.421 | 5136.283 | 2.171403315
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 128 | 2034.1 | 4073.381 | 2.002547072
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 256 | 1568.694 | 2811.975 | 1.792558013
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 512 | 883.261 | 1662.771 | 1.882536419
o.o.b.vm.compiler.MacroLogicOpt.workload1_caller | 1024 | 469.513 | 732.81 | 1.560787454
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 64 | 273.049 | 552.106 | 2.022003377
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 128 | 219.624 | 359.775 | 1.63814064
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 256 | 131.649 | 182.23 | 1.384211046
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 512 | 71.452 | 81.522 | 1.140933774
o.o.b.vm.compiler.MacroLogicOpt.workload2_caller | 1024 | 37.427 | 41.966 | 1.121276084
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 64 | 2805.759 | 3383.16 | 1.205791374
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 128 | 2069.012 | 2250.37 | 1.087654397
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 256 | 1098.766 | 1101.996 | 1.002939661
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 512 | 470.035 | 484.732 | 1.031267884
o.o.b.vm.compiler.MacroLogicOpt.workload3_caller | 1024 | 202.827 | 209.073 | 1.030794717
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 256 | 3435.989 | 4418.09 | 1.285827749
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 512 | 1524.803 | 1678.201 | 1.100601848
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt128 | 1024 | 972.501 | 1166.734 | 1.199725244
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 256 | 5980.85 | 7584.17 | 1.268075608
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 512 | 3258.108 | 3939.23 | 1.209054457
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt256 | 1024 | 1475.365 | 1511.159 | 1.024261115
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 256 | 4208.766 | 4220.678 | 1.002830283
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 512 | 2056.651 | 2049.489 | 0.99651764
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationInt512 | 1024 | 1110.461 | 1116.448 | 1.005391455
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 256 | 3259.348 | 3947.94 | 1.211266793
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 512 | 1515.147 | 1536.647 | 1.014190042
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong256 | 1024 | 911.58 | 1030.54 | 1.130498695
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 256 | 2034.611 | 2073.764 | 1.019243482
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 512 | 1110.659 | 1116.093 | 1.004892591
o.o.b.jdk.incubator.vector.MaskedLogicOpts.bitwiseBlendOperationLong512 | 1024 | 559.269 | 559.651 | 1.000683034
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 256 | 3636.141 | 4446.505 | 1.222863745
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 512 | 1433.145 | 1681.261 | 1.173126934
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt128 | 1024 | 1000.107 | 1172.866 | 1.172740517
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 256 | 5568.313 | 7670.259 | 1.37748345
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 512 | 3350.108 | 3927.803 | 1.172440709
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt256 | 1024 | 1495.966 | 1541.56 | 1.030477965
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 256 | 4230.379 | 4282.154 | 1.012238856
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 512 | 2029.801 | 2049.638 | 1.009772879
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsInt512 | 1024 | 1108.738 | 1118.897 | 1.00916267
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 256 | 3802.801 | 3783.537 | 0.99493426
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 512 | 1546.244 | 1552.691 | 1.004169458
o.o.b.jdk.incubator.vector.MaskedLogicOpts.maskedLogicOperationsLong256 | 1024 | 1017.512 | 1020.075 | 1.002518889
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128 | 256 | 4159.835 | 4527.676 | 1.088426825
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128 | 512 | 1665.335 | 1733.04 | 1.040655484
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt128 | 1024 | 1150.319 | 1181.935 | 1.02748455
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256 | 256 | 6989.791 | 7382.883 | 1.056238019
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256 | 512 | 3711.362 | 3911.921 | 1.054039191
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt256 | 1024 | 1540.341 | 1554.175 | 1.008981128
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512 | 256 | 4164.559 | 4213.546 | 1.01176283
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512 | 512 | 2072.91 | 2079.105 | 1.002988552
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsInt512 | 1024 | 1112.678 | 1116.675 | 1.003592234
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256 | 256 | 3702.998 | 3906.093 | 1.0548461
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256 | 512 | 1536.571 | 1546.043 | 1.006164375
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong256 | 1024 | 996.906 | 1013.649 | 1.016794964
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512 | 256 | 2045.594 | 2048.966 | 1.001648421
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512 | 512 | 1111.933 | 1117.689 | 1.005176571
o.o.b.jdk.incubator.vector.MaskedLogicOpts.partiallyMaskedLogicOperationsLong512 | 1024 | 559.971 | 561.144 | 1.002094751

Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8273322: Enhance macro logic optimization for masked logic operations. Changes: https://git.openjdk.java.net/jdk/pull/6893/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6893&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8273322 Stats: 1413 lines in 12 files changed: 1370 ins; 6 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/6893.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6893/head:pull/6893 PR: https://git.openjdk.java.net/jdk/pull/6893 From chagedorn at openjdk.java.net Mon Dec 20 15:18:27 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 20 Dec 2021 15:18:27 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 00:38:07 GMT, Hao Sun wrote: > In ARM32, the "VSHL (register)" instruction [1] is shared by vector left > shift and vector right shift, and the condition to distinguish them is > whether the shift count value is positive or negative. Hence, a negation > operation is needed before conducting vector right shift.
> > For vector right shift, the shift count can be a RShiftCntV or a normal > vector node. Take test case Byte64VectorTests.java [2][3] as an example. > Note that RShiftCntV is already negated via rules "vsrcntD" and > "vsrcntX" whereas the normal vector node is NOT, since we don't know > whether a normal vector node is used as a vector shift count or not. > This is the root cause for these vector test failures. > > The fix is simple, moving the negation from "vsrcntD|X" to the > corresponding vector right shift rules. > > Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift > related rules are in form of "vsAABB_CC", where > 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic > right shift). > 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) > and 2L (long type). > 3) CC can be reg (register case) and immI (immediate case). > > Minor updates: > 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules > conduct the same duplication operation now. > 2) Update the "match" primitive for vsraBB_immI rules. > 3) Style issue: remove the surrounding space for "ins_pipe" primitive. > > Tests: > We ran tier 1~3 tests on ARM32 platform. With this patch, previously > failed vector test cases can pass now without introducing test > regression. > > [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en > [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 > [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 @nick-arm could you or someone else at ARM help to review this? That would be great. 
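For readers following the quoted description, the sign convention of "VSHL (register)" can be modelled in plain Java. This is only a sketch of one 8-bit lane, assuming shift counts in the range -7..7 and an arithmetic right shift; the class and method names are hypothetical and none of this is HotSpot code. It shows why a right shift whose count has not been negated ends up as a left shift.

```java
// Simplified, hypothetical model of one 8-bit lane of ARM32 "VSHL (register)".
// Assumption: counts stay within -7..7; names are illustrative, not HotSpot's.
final class VshlLaneModel {
    // A non-negative count shifts left; a negative count shifts right by -count.
    // The right shift is arithmetic here, matching a signed (ASHR-style) shift.
    static byte vshl(byte value, byte count) {
        if (count >= 0) {
            return (byte) (value << count);
        }
        return (byte) (value >> -count);
    }

    // A vector right shift therefore has to negate the count before using VSHL.
    static byte shiftRightViaVshl(byte value, byte count) {
        return vshl(value, (byte) -count);
    }

    public static void main(String[] args) {
        // Count negated first: behaves as a right shift.
        System.out.println(shiftRightViaVshl((byte) 16, (byte) 3));
        // Count passed through unchanged: VSHL performs a left shift instead.
        System.out.println(vshl((byte) 16, (byte) 3));
    }
}
```

In this model the missing negation is exactly the difference between the two calls in `main`, which mirrors the root cause given in the description for shift counts that are ordinary vector nodes.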
------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From shade at openjdk.java.net Mon Dec 20 15:32:28 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 20 Dec 2021 15:32:28 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors [v2] In-Reply-To: References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Message-ID: On Tue, 14 Dec 2021 10:46:54 GMT, Roman Kennke wrote: >> Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call. >> >> Testing: >> - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) >> - [x] tier1 (x86_32, x86_64 x -UseHeavyMonitors, +UseHeavyMonitors) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > Fix typos and intendation The same thing affects `sharedRuntime_aarch64.cpp`, is it not? Otherwise the change makes sense to me. ------------- Changes requested by shade (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/16 From eliu at openjdk.java.net Mon Dec 20 15:43:56 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 20 Dec 2021 15:43:56 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail Message-ID: This bug appears intermittently and it's caused by vmaskAll_immI[1] when the vector mask size is smaller than max predicate size of running machine. It generates an all-true predicate without considering those inactive bits. That may result in the wrong result of VectorMask.toLong. 
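The inactive-bits problem can be illustrated with a small plain-Java model of an SVE predicate register. This is a sketch under stated assumptions, not the patch itself: the predicate is modelled as one bit per vector byte, with the bit at the first byte of each active element set, and the class and method names are hypothetical. With 2-byte elements it reproduces the two `$p0` dumps shown in this message for a 64-byte machine.

```java
// Hypothetical model of an SVE predicate register (one predicate bit per vector byte).
// Assumption: an active element sets only the bit for its first byte.
final class SvePredicateModel {
    // activeVectorBytes: bytes covered by the vector species in use
    // elementBytes: element size in bytes (2 for short lanes)
    // machineVectorBytes: hardware vector length in bytes
    static int[] predicateBytes(int activeVectorBytes, int elementBytes, int machineVectorBytes) {
        int[] predicate = new int[machineVectorBytes / 8];
        for (int byteIndex = 0; byteIndex < activeVectorBytes; byteIndex += elementBytes) {
            predicate[byteIndex / 8] |= 1 << (byteIndex % 8);
        }
        return predicate;
    }

    static String toHex(int[] predicate) {
        StringBuilder sb = new StringBuilder();
        for (int b : predicate) {
            sb.append(String.format("0x%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // All lanes of the full 64-byte register set: what an unrestricted ptrue yields.
        System.out.println(toHex(predicateBytes(64, 2, 64)));
        // Only the 8 bytes of a 64-bit species set: what the mask should contain.
        System.out.println(toHex(predicateBytes(8, 2, 64)));
    }
}
```

Any later step that reads the whole predicate, such as the `toLong` sequence, sees the extra set bits in the first case, which is consistent with the wrong result described here.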
The problematic code is as below: ShortVector.SPECIES_64.MaskAll(true).toLong() assembly: ptrue p0.h <= MaskAll(true) mov z16.h, p0/z, #1 mov z17.h, #0 uzp1 z16.b, z16.b, z17.b fmov x10, d16 orr x10, x10, x10, lsr #7 orr x10, x10, x10, lsr #14 orr x10, x10, x10, lsr #28 and x10, x10, #0xff (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} Expected: (gdb) p/x $p0 $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} Considering MaskAll is used in VectorMask.fromLong() only for a special case and relies on the mechanism of inline and intrinsification, even it could be optimized out, this patch also adds test cases for MaskAll to reproduce this issue stably. Also fix a small issue on register utilization for sve_reduce_[max|min][D|F]. [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled system. Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb ------------- Commit messages: - 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail Changes: https://git.openjdk.java.net/jdk18/pull/49/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=49&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278889 Stats: 391 lines in 37 files changed: 300 ins; 0 del; 91 mod Patch: https://git.openjdk.java.net/jdk18/pull/49.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/49/head:pull/49 PR: https://git.openjdk.java.net/jdk18/pull/49 From svkamath at openjdk.java.net Mon Dec 20 18:40:20 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Mon, 20 Dec 2021 18:40:20 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v4] In-Reply-To: References: Message-ID: > The failure happens with XX:+DeoptimizeAlot option. 
I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: Removed __addptr instruction as it was not needed ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/19/files - new: https://git.openjdk.java.net/jdk18/pull/19/files/12b16ac9..0140f672 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=19&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/19.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/19/head:pull/19 PR: https://git.openjdk.java.net/jdk18/pull/19 From sviswanathan at openjdk.java.net Mon Dec 20 18:40:22 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 20 Dec 2021 18:40:22 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v4] In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 18:36:46 GMT, Smita Kamath wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Removed __addptr instruction as it was not needed Changes look good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/19 From kvn at openjdk.java.net Mon Dec 20 18:43:20 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 20 Dec 2021 18:43:20 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 [v4] In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 18:40:20 GMT, Smita Kamath wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Smita Kamath has updated the pull request incrementally with one additional commit since the last revision: > > Removed __addptr instruction as it was not needed I tested without `addptr` instructions and it passed. It is good for integration. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/19 From svkamath at openjdk.java.net Mon Dec 20 18:48:09 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Mon, 20 Dec 2021 18:48:09 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 03:14:07 GMT, Vladimir Kozlov wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Note, we can't change code in `PredicatedIntrinsicGenerator` because it is used by all intrinsics with predicate and none do re-execution. @vnkozlov Thanks a lot for approving this PR. I appreciate your help.
------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From duke at openjdk.java.net Mon Dec 20 19:49:09 2021 From: duke at openjdk.java.net (Vamsi Parasa) Date: Mon, 20 Dec 2021 19:49:09 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() [v2] In-Reply-To: References: Message-ID: > Vectorization support of Integer.bitCount() already exists but currently the same support is lacking for Long.bitCount(). Similar to the C2 PopCountVI node, we created a C2 PopCountVL node and used vpopcntq x86 instruction to enable vectorized Long.bitCount(). This patch shows 2.57x improvement in performance on a JMH micro benchmark due to x86 vectorization. Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: Add JMH micro benchmark to measure performance ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6857/files - new: https://git.openjdk.java.net/jdk/pull/6857/files/60d976f3..4567eab8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6857&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6857&range=00-01 Stats: 87 lines in 1 file changed: 87 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6857.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6857/head:pull/6857 PR: https://git.openjdk.java.net/jdk/pull/6857 From duke at openjdk.java.net Mon Dec 20 19:49:10 2021 From: duke at openjdk.java.net (Vamsi Parasa) Date: Mon, 20 Dec 2021 19:49:10 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() In-Reply-To: References: Message-ID: On Wed, 15 Dec 2021 23:51:19 GMT, Vamsi Parasa wrote: > Vectorization support of Integer.bitCount() already exists but currently the same support is lacking for Long.bitCount(). Similar to the C2 PopCountVI node, we created a C2 PopCountVL node and used vpopcntq x86 instruction to enable vectorized Long.bitCount(). 
This patch shows 2.57x improvement in performance on a JMH micro benchmark due to x86 vectorization. This patch shows 2.57x improvement in performance on a JMH micro benchmark due to x86 vectorization. ------------- PR: https://git.openjdk.java.net/jdk/pull/6857 From sviswanathan at openjdk.java.net Mon Dec 20 20:12:42 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 20 Dec 2021 20:12:42 GMT Subject: [jdk18] RFR: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 03:14:07 GMT, Vladimir Kozlov wrote: >> The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. > > Note, we can't change code in `PredicatedIntrinsicGenerator` because it is used by all intrinsics with predicate and none do re-execution. @vnkozlov @dean-long Thanks a lot for guiding Smita through this fix. ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From svkamath at openjdk.java.net Mon Dec 20 20:12:43 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Mon, 20 Dec 2021 20:12:43 GMT Subject: [jdk18] Integrated: 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 In-Reply-To: References: Message-ID: On Tue, 14 Dec 2021 06:16:23 GMT, Smita Kamath wrote: > The failure happens with XX:+DeoptimizeAlot option. I've set reexecute bit and reset the appropriate state for the interpreter to execute the code when deoptimization occurs. This pull request has now been integrated.
Changeset: 819f9bd0 Author: Smita Kamath Committer: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk18/commit/819f9bd084fa49222a4310fbcf4933005e9f0ca4 Stats: 63 lines in 12 files changed: 11 ins; 43 del; 9 mod 8274323: compiler/codegen/aes/TestAESMain.java failed with "Error: invalid offset: -1434443640" after 8273297 Reviewed-by: sviswanathan, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/19 From coleenp at openjdk.java.net Mon Dec 20 23:05:58 2021 From: coleenp at openjdk.java.net (Coleen Phillimore) Date: Mon, 20 Dec 2021 23:05:58 GMT Subject: RFR: 8278239: vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine failed with EXCEPTION_ACCESS_VIOLATION at 0x000000000000000d Message-ID: Thanks to @stefank and @fisk for the diagnosis. ZGC is looking at metadata in unloaded nmethods, but redefinition doesn't keep this metadata from being deallocated. Change the iterators that mark metadata in use to walk unloaded nmethods as well as other alive nmethods. The test case reproduces the crash on windows if run in an 100 iteration loop. This fix does not crash in the same test. Also ran tier1-6. 
------------- Commit messages: - 8278239: vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine failed with EXCEPTION_ACCESS_VIOLATION at 0x000000000000000d Changes: https://git.openjdk.java.net/jdk/pull/6900/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6900&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278239 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/6900.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6900/head:pull/6900 PR: https://git.openjdk.java.net/jdk/pull/6900 From svkamath at openjdk.java.net Tue Dec 21 01:00:42 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 21 Dec 2021 01:00:42 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction Message-ID: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Adding vzeroupper instruction to aes and shift intrinsics. ------------- Commit messages: - Adding vzeroupper instruction to aes and shift intrinsics Changes: https://git.openjdk.java.net/jdk18/pull/52/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=52&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279045 Stats: 7 lines in 1 file changed: 7 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/52.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/52/head:pull/52 PR: https://git.openjdk.java.net/jdk18/pull/52 From jbhateja at openjdk.java.net Tue Dec 21 05:18:16 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 05:18:16 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() [v2] In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 19:49:09 GMT, Vamsi Parasa wrote: >> Vectorization support of Integer.bitCount() already exists but currently the same support is lacking for Long.bitCount(). 
Similar to the C2 PopCountVI node, we created a C2 PopCountVL node and used vpopcntq x86 instruction to enable vectorized Long.bitCount(). This patch shows 2.57x improvement in performance on a JMH micro benchmark due to x86 vectorization. > > Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Add JMH micro benchmark to measure performance src/hotspot/share/opto/superword.cpp line 2951: > 2949: if (VectorNode::is_vpopcntq(use)) { > 2950: // VPOPCNTQ takes longs and produces ints - hence the special checks > 2951: // on alignment and size. Use IR node reference instead of target specific instruction. src/hotspot/share/opto/vectornode.hpp line 517: > 515: }; > 516: > 517: //------------------------------SqrtVFNode-------------------------------------- I think we can remove "I" specialization from existing PopCountVI and make the IR node generic. It already has a type which should be sufficient to emit type specific instruction. There are many vector node which are common across types. test/hotspot/jtreg/compiler/vectorization/TestPopCountVectorLong.java line 65: > 63: } > 64: > 65: public void vectorizeBitCount() { We can add check based on new IR framework here. ------------- PR: https://git.openjdk.java.net/jdk/pull/6857 From duke at openjdk.java.net Tue Dec 21 05:48:16 2021 From: duke at openjdk.java.net (Vamsi Parasa) Date: Tue, 21 Dec 2021 05:48:16 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 05:06:26 GMT, Jatin Bhateja wrote: >> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Add JMH micro benchmark to measure performance > > src/hotspot/share/opto/superword.cpp line 2951: > >> 2949: if (VectorNode::is_vpopcntq(use)) { >> 2950: // VPOPCNTQ takes longs and produces ints - hence the special checks >> 2951: // on alignment and size. 
> > Use IR node reference instead of target specific instruction. Thanks Jatin for noticing that! Will rename the functions to have generic names instead of target specific names. Will also modify the comment to be generic... ------------- PR: https://git.openjdk.java.net/jdk/pull/6857 From njian at openjdk.java.net Tue Dec 21 06:44:24 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 21 Dec 2021 06:44:24 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR In-Reply-To: References: Message-ID: On Fri, 17 Dec 2021 00:38:07 GMT, Hao Sun wrote: > In ARM32, "VSHL (register)" instruction [1] is shared by vector left > shift and vector right shift, and the condition to distinguish them is > whether the shift count value is positive or negative. Hence, negation > operation is needed before conducting vector right shift. > > For vector right shift, the shift count can be a RShiftCntV or a normal > vector node. Take test case Byte64VectorTests.java [2][3] as an example. > Note that RShiftCntV is already negated via rules "vsrcntD" and > "vsrcntX" whereas the normal vector node is NOT, since we don't know > whether a normal vector node is used as a vector shift count or not. > This is the root cause for these vector test failures. > > The fix is simple, moving the negation from "vsrcntD|X" to the > corresponding vector right shift rules. > > Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift > related rules are in form of "vsAABB_CC", where > 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic > right shift). > 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) > and 2L (long type). > 3) CC can be reg (register case) and immI (immediate case). > > Minor updates: > 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules > conduct the same duplication operation now. > 2) Update the "match" primitive for vsraBB_immI rules.
> 3) Style issue: remove the surrounding space for "ins_pipe" primitive. > > Tests: > We ran tier 1~3 tests on ARM32 platform. With this patch, previously > failed vector test cases can pass now without introducing test > regression. > > [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en > [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 > [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 The fix looks good to me (not a Reviewer). ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.java.net/jdk18/pull/41 From jbhateja at openjdk.java.net Tue Dec 21 06:50:15 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 06:50:15 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() [v2] In-Reply-To: References: Message-ID: <5IK_9IPsVuhhBDHnJWZ6xa58QpTf0ydnLH7nlo97imw=.6b51dff4-cf21-4f91-b6f1-38175404a74a@github.com> On Tue, 21 Dec 2021 04:53:29 GMT, Jatin Bhateja wrote: >> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> Add JMH micro benchmark to measure performance > > src/hotspot/share/opto/vectornode.hpp line 517: > >> 515: }; >> 516: >> 517: //------------------------------SqrtVFNode-------------------------------------- > > I think we can remove "I" specialization from existing PopCountVI and make the IR node generic. It already has a type which should be sufficient to emit type specific instruction. There are many vector nodes which are common across types. Since it's already supported by AArch64/PPC, we can keep your existing changes as is.
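As context for this review thread, the kind of loop under discussion reads longs and writes ints, which is the mixed-size shape the quoted superword comment mentions. The sketch below is a hypothetical example, not the test or benchmark from the PR; whether C2 actually vectorizes it depends on the CPU and JVM in use.

```java
// Hypothetical loop of the shape discussed in this thread: long inputs, int outputs.
// Names are illustrative; this is not the PR's test or JMH benchmark.
final class LongBitCountLoop {
    static void bitCounts(long[] input, int[] output) {
        for (int i = 0; i < input.length; i++) {
            // Long.bitCount returns an int, so each 64-bit input produces a 32-bit result.
            output[i] = Long.bitCount(input[i]);
        }
    }

    public static void main(String[] args) {
        long[] input = {0L, 1L, -1L, 0xFFL};
        int[] output = new int[input.length];
        bitCounts(input, output);
        System.out.println(java.util.Arrays.toString(output));
    }
}
```

The difference in element width between input and output is why the quoted code needs special checks on alignment and size.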
------------- PR: https://git.openjdk.java.net/jdk/pull/6857 From njian at openjdk.java.net Tue Dec 21 07:01:25 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 21 Dec 2021 07:01:25 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 15:35:47 GMT, Eric Liu wrote: > This bug appears intermittently and it's caused by vmaskAll_immI[1] > when the vector mask size is smaller than max predicate size of running > machine. It generates an all-true predicate without considering those > inactive bits. That may result in the wrong result of VectorMask.toLong. > The problematic code is as below: > > > ShortVector.SPECIES_64.MaskAll(true).toLong() > > assembly: > > ptrue p0.h <= MaskAll(true) > mov z16.h, p0/z, #1 > mov z17.h, #0 > uzp1 z16.b, z16.b, z17.b > fmov x10, d16 > orr x10, x10, x10, lsr #7 > orr x10, x10, x10, lsr #14 > orr x10, x10, x10, lsr #28 > and x10, x10, #0xff > > (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes > $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} > > Expected: > (gdb) p/x $p0 > $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} > > > > Considering MaskAll is used in VectorMask.fromLong() only for a special > case and relies on the mechanism of inline and intrinsification, even it > could be optimized out, this patch also adds test cases for MaskAll to > reproduce this issue stably. > > Also fix a small issue on register utilization for > sve_reduce_[max|min][D|F]. > > [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 > > hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled > system. > > Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb Thanks for the fix! Overall looks good to me. Just one enhancement suggestion. 
src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 389: > 387: Assembler::SIMD_RegVariant size = __ elemType_to_regVariant(bt); > 388: __ sve_dup(as_FloatRegister($tmp$$reg), size, as_Register($src$$reg)); > 389: __ sve_ptrue_lanecnt(as_PRegister($dst$$reg), size, Matcher::vector_length(this)); I think you can generate this insn conditionally, only when current vector size is not MaxVectorSize. ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.java.net/jdk18/pull/49 From mli at openjdk.java.net Tue Dec 21 07:35:26 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Tue, 21 Dec 2021 07:35:26 GMT Subject: RFR: 8277893: Arraycopy stress tests [v5] In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 11:59:05 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. 
>> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: > > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Bump timeout to 7200 > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - Package declarations > - Add safety check for small systems > - Renames > - Single driver for all the tests > - Safer timeout settings > - Post-merge TEST.groups cleanup > - Merge branch 'master' into JDK-8277893-arraycopy-tests > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/88e52bed...6789eb8b Lgtm. Just some minor comment in test code. test/hotspot/jtreg/compiler/arraycopy/stress/AbstractStressArrayCopy.java line 125: > 123: > 124: testWith(size, l, r, len); > 125: testWith(size, r, l, len); The checks and testWith invocations of disjoint and conjoint are almost same except of "Not disjoint" and "Not conjoint" assert, it could be consolidated. ------------- Marked as reviewed by mli (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6594 From eosterlund at openjdk.java.net Tue Dec 21 08:12:15 2021 From: eosterlund at openjdk.java.net (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 21 Dec 2021 08:12:15 GMT Subject: RFR: 8278239: vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine failed with EXCEPTION_ACCESS_VIOLATION at 0x000000000000000d In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 22:59:39 GMT, Coleen Phillimore wrote: > Thanks to @stefank and @fisk for the diagnosis. ZGC is looking at metadata in unloaded nmethods, but redefinition doesn't keep this metadata from being deallocated. Change the iterators that mark metadata in use to walk unloaded nmethods as well as other alive nmethods. > > The test case reproduces the crash on windows if run in an 100 iteration loop. This fix does not crash in the same test. Also ran tier1-6. Looks good, thanks for fixing this! ------------- Marked as reviewed by eosterlund (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6900 From shade at openjdk.java.net Tue Dec 21 09:15:46 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 21 Dec 2021 09:15:46 GMT Subject: RFR: 8277893: Arraycopy stress tests [v6] In-Reply-To: References: Message-ID: > I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
> > A brief tour of these tests: > > - Tests all data types; > - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; > - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; > - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; > - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; > > My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. > > Test times: > > > # x86_64 (TR 3970X) > real 4m6.192s > user 52m50.523s > sys 0m13.755s > > # x86_64 (TR 3970X) -XX:+UseZGC > real 6m2.573s > user 72m43.541s > sys 0m25.697s > > # x86_32 (TR 3970X) > real 6m56.405s > user 92m56.377s > sys 0m6.677s > > # x86_64 (i5-11500) > real 29m19.024s > user 103m52.925s > sys 1m7.175s > > # AArch64 (ThunderX2) > real 2m59.623s > user 26m14.624s > sys 0m9.771s > > > Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. 
> > Additional testing: > - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` > - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` > - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Peel out check{Bounds,Conjoint,Disjoint} ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6594/files - new: https://git.openjdk.java.net/jdk/pull/6594/files/6789eb8b..bb823dbd Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6594&range=04-05 Stats: 31 lines in 1 file changed: 20 ins; 6 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/6594.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6594/head:pull/6594 PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Tue Dec 21 09:15:54 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 21 Dec 2021 09:15:54 GMT Subject: RFR: 8277893: Arraycopy stress tests [v5] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 07:31:26 GMT, Hamlin Li wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 14 additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8277893-arraycopy-tests >> - Bump timeout to 7200 >> - Merge branch 'master' into JDK-8277893-arraycopy-tests >> - Package declarations >> - Add safety check for small systems >> - Renames >> - Single driver for all the tests >> - Safer timeout settings >> - Post-merge TEST.groups cleanup >> - Merge branch 'master' into JDK-8277893-arraycopy-tests >> - ... 
and 4 more: https://git.openjdk.java.net/jdk/compare/2b18a990...6789eb8b > > test/hotspot/jtreg/compiler/arraycopy/stress/AbstractStressArrayCopy.java line 125: > >> 123: >> 124: testWith(size, l, r, len); >> 125: testWith(size, r, l, len); > > The checks and testWith invocations of disjoint and conjoint are almost same except of "Not disjoint" and "Not conjoint" assert, it could be consolidated. Yes, good suggestion. See new commit. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From mli at openjdk.java.net Tue Dec 21 09:54:17 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Tue, 21 Dec 2021 09:54:17 GMT Subject: RFR: 8277893: Arraycopy stress tests [v6] In-Reply-To: References: Message-ID: <7_iDLPnePED0yvcfLbN0-Q17YxYuLsloQvQeygo9yo0=.288dd122-064d-4318-a437-0d69e6912a58@github.com> On Tue, 21 Dec 2021 09:15:46 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
>> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. >> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Peel out check{Bounds,Conjoint,Disjoint} Thanks for updating, looks good. 
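The consolidation discussed in this review can be sketched roughly as follows. This is a minimal illustration, not the code in `AbstractStressArrayCopy.java`: the class name `ArrayCopyCheckSketch` and the helpers `overlaps` and `exhaustive` are hypothetical, and only the `testWith(size, l, r, len)` call shape is taken from the quoted diff. It assumes a single `int[]` copied onto itself, with one shared loop covering both the conjoint and the disjoint case.

```java
import java.util.Arrays;

public class ArrayCopyCheckSketch {
    /** True when [l, l+len) and [r, r+len) share at least one index (conjoint). */
    static boolean overlaps(int l, int r, int len) {
        return len > 0 && Math.abs(l - r) < len;
    }

    /**
     * Copies len elements from index src to index dst inside one array and
     * compares the result with an element-wise copy taken from a snapshot.
     */
    static boolean testWith(int size, int src, int dst, int len) {
        int[] test = new int[size];
        for (int i = 0; i < size; i++) {
            test[i] = i * 31 + 7;
        }
        int[] snapshot = test.clone();
        int[] expected = test.clone();
        for (int i = 0; i < len; i++) {
            expected[dst + i] = snapshot[src + i];
        }
        System.arraycopy(test, src, test, dst, len);
        return Arrays.equals(test, expected);
    }

    /** One shared loop for both cases; returns {conjoint count, disjoint count}. */
    static int[] exhaustive(int maxSize) {
        int[] counts = new int[2];
        for (int size = 1; size <= maxSize; size++) {
            for (int l = 0; l < size; l++) {
                for (int r = 0; r < size; r++) {
                    int maxLen = size - Math.max(l, r);
                    for (int len = 0; len <= maxLen; len++) {
                        // Both directions, as in the quoted testWith calls.
                        if (!testWith(size, l, r, len) || !testWith(size, r, l, len)) {
                            throw new AssertionError("copy mismatch");
                        }
                        counts[overlaps(l, r, len) ? 0 : 1]++;
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = exhaustive(16);
        System.out.println("conjoint=" + counts[0] + " disjoint=" + counts[1]);
    }
}
```

The only difference between the two cases is the overlap predicate, which is why a single checker parameterized on it is enough.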
-------------

PR: https://git.openjdk.java.net/jdk/pull/6594

From dlong at openjdk.java.net Tue Dec 21 09:56:18 2021
From: dlong at openjdk.java.net (Dean Long)
Date: Tue, 21 Dec 2021 09:56:18 GMT
Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR
In-Reply-To:
References:
Message-ID:

On Fri, 17 Dec 2021 00:38:07 GMT, Hao Sun wrote:

> In ARM32, "VSHL (register)" instruction [1] is shared by vector left
> shift and vector right shift, and the condition to distinguish them is
> whether the shift count value is positive or negative. Hence, negation
> operation is needed before conducting vector right shift.
>
> For vector right shift, the shift count can be a RShiftCntV or a normal
> vector node. Take test case Byte64VectorTests.java [2][3] as an example.
> Note that RShiftCntV is already negated via rules "vsrcntD" and
> "vsrcntX" whereas the normal vector node is NOT, since we don't know
> whether a normal vector node is used as a vector shift count or not.
> This is the root cause for these vector test failures.
>
> The fix is simple, moving the negation from "vsrcntD|X" to the
> corresponding vector right shift rules.
>
> Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift
> related rules are in form of "vsAABB_CC", where
> 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic
> right shift).
> 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type)
> and 2L (long type).
> 3) CC can be reg (register case) and immI (immediate case).
>
> Minor updates:
> 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules
> conduct the same duplication operation now.
> 2) Update the "match" primitive for vsraBB_immI rules.
> 3) Style issue: remove the surrounding space for "ins_pipe" primitive.
>
> Tests:
> We ran tier 1~3 tests on ARM32 platform. With this patch, previously
> failed vector test cases can pass now without introducing test
> regression.
> > [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en > [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 > [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 There seems to be an interesting history here. Method 1: negate in RShiftCntV rule Method 2: negate in RShiftV* rules For aarch64, the negate was moved into the shift instruction in JDK-8213134 (Method 1 --> Method 2). Then JDK-8262916 proposed to move it back out of the shift instruction again. In that PR, the opinion that Method 1 (arm32) generated better code than Method 2 (aarch64) was expressed. Now it looks like this PR proposes for arm32 to move to Method 2 like aarch64, so I suspect that there will be a performance impact. I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same? ------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From njian at openjdk.java.net Tue Dec 21 10:29:25 2021 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 21 Dec 2021 10:29:25 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 09:53:07 GMT, Dean Long wrote: > > I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same? Hmm, I didn't notice JDK-8277239 before. That looks like a good solution. If it works fine on arm32, we could also do it on aarch64. 
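The reason a missing negation breaks right shifts can be sketched with a scalar model. This is a simplified illustration of one signed 32-bit lane, assuming only what the thread states (a single shift-by-register operation where the sign of the count selects the direction); the real instruction reads the count from the low byte of each lane, and the names `VshlSketch`, `shiftBySignedCount` and `arithmeticShiftRight` are hypothetical.

```java
public class VshlSketch {
    /**
     * Signed-count shift of one int lane: a non-negative count shifts left,
     * a negative count shifts right (arithmetic, sign-preserving).
     */
    static int shiftBySignedCount(int value, int count) {
        if (count >= 32) {
            return 0;               // everything shifted out to the left
        }
        if (count <= -32) {
            return value >> 31;     // only the sign remains
        }
        return count >= 0 ? value << count : value >> -count;
    }

    /** A right shift by `amount` must negate the count before the shared shift. */
    static int arithmeticShiftRight(int value, int amount) {
        return shiftBySignedCount(value, -amount);
    }

    public static void main(String[] args) {
        // With the negation the lane is shifted right; without it, left.
        System.out.println(arithmeticShiftRight(-16, 2));
        System.out.println(shiftBySignedCount(-16, 2));
    }
}
```

Whether the negation lives in the shift-count rule or in the shift rule itself (Method 1 versus Method 2 above) does not change this model; it only changes where the extra instruction is emitted.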
------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From duke at openjdk.java.net Tue Dec 21 11:33:38 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 21 Dec 2021 11:33:38 GMT Subject: RFR: 8278947: Support for array constants in constant table Message-ID: <2cu-iUFYs-hPu5QX_9Y9LwulgCczVK82oyDUj5ovI5c=.2a3c7529-dec1-4b8b-aaac-ff4368fb0d8f@github.com> Hi, This patch adds support for arrays in compiled code constant tables and uses it for various vector replicate operations on x86. Test: GHA, linux x64 tier 1-3 Thank you very much. ------------- Commit messages: - missing type in array_constant - Merge branch 'master' into constantVector - use constant table for remaining types - Merge branch 'master' into constantVector - refactor - replicate using constant - Merge branch 'master' into constantVector - initial commit Changes: https://git.openjdk.java.net/jdk/pull/6908/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6908&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278947 Stats: 292 lines in 6 files changed: 123 ins; 95 del; 74 mod Patch: https://git.openjdk.java.net/jdk/pull/6908.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6908/head:pull/6908 PR: https://git.openjdk.java.net/jdk/pull/6908 From roland at openjdk.java.net Tue Dec 21 12:54:16 2021 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 21 Dec 2021 12:54:16 GMT Subject: RFR: 8278518: String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination In-Reply-To: References: <2IcCYyqxBU_oi_i9n1LZzTivcLD7QWxAlZga6kiGOPg=.a09fd12b-d1aa-4124-9bb2-31f86da2f706@github.com> Message-ID: On Sat, 18 Dec 2021 21:48:33 GMT, amirhadadi wrote: >> This does look like a HotSpot JIT compiler issue to me. My guess is that it is related to how checkBoundsOffCount() checks for offset < 0: >> >> 396 if ((length | fromIndex | size) < 0 || size > length - fromIndex) >> >> using | to combine three values. 
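For readers comparing the two forms of the bounds check mentioned in this thread, here is a small sketch that puts them side by side. The class and method names are hypothetical; the two conditions are copied from the expressions quoted in the thread. The sketch only illustrates that both forms accept and reject the same inputs when `length` is non-negative (as it is for an array length); it says nothing about how the JIT compiles either of them.

```java
public class BoundsCheckSketch {
    /** Three separate comparisons, as in the Java 17 version quoted in the thread. */
    static boolean outOfBoundsPlain(int offset, int count, int length) {
        return offset < 0 || count < 0 || offset > length - count;
    }

    /** Sign test folded into one comparison by OR-ing the three values. */
    static boolean outOfBoundsCombined(int fromIndex, int size, int length) {
        return (length | fromIndex | size) < 0 || size > length - fromIndex;
    }

    public static void main(String[] args) {
        // The OR of several ints is negative exactly when at least one of them is.
        System.out.println(outOfBoundsPlain(2, 3, 4) + " " + outOfBoundsCombined(2, 3, 4));
        System.out.println(outOfBoundsPlain(1, 3, 4) + " " + outOfBoundsCombined(1, 3, 4));
    }
}
```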
>
> @dean-long actually the issue reproduces with Java 17, where `checkBoundsOffCount` was implemented in a more straightforward manner:
>
>     static void checkBoundsOffCount(int offset, int count, int length) {
>         if (offset < 0 || count < 0 || offset > length - count) {
>             throw new StringIndexOutOfBoundsException(
>                 "offset " + offset + ", count " + count + ", length " + length);
>         }
>     }
>
> Here's a [gist](https://gist.github.com/amirhadadi/9505c3f5d9ad68cad2fbfd1b9e01f0b8) with a benchmark you can run. This benchmark compares safe and unsafe reads from the byte array (in this gist I didn't modify the code to add the offset >= 0 condition).
>
> Here are the results:
>
>     OpenJDK 17.0.1+12
>     OSX with 2.9 GHz Quad-Core Intel Core i7
>
>     Benchmark                       Mode  Cnt    Score    Error  Units
>     StringBenchmark.safeDecoding    avgt   20  120.312 ± 11.674  ns/op
>     StringBenchmark.unsafeDecoding  avgt   20   72.628 ±  0.479  ns/op

@amirhadadi unsafeDecode() is buggy, I think. Offsets in the array when read with unsafe should be computed as `offset * unsafe.ARRAY_BYTE_INDEX_SCALE + unsafe.ARRAY_BYTE_BASE_OFFSET`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6812

From eliu at openjdk.java.net Tue Dec 21 13:14:59 2021
From: eliu at openjdk.java.net (Eric Liu)
Date: Tue, 21 Dec 2021 13:14:59 GMT
Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2]
In-Reply-To:
References:
Message-ID:

> This bug appears intermittently and it's caused by vmaskAll_immI[1]
> when the vector mask size is smaller than the max predicate size of the
> running machine. It generates an all-true predicate without considering
> those inactive bits. That may produce a wrong result from VectorMask.toLong.
> The problematic code is as below: > > > ShortVector.SPECIES_64.MaskAll(true).toLong() > > assembly: > > ptrue p0.h <= MaskAll(true) > mov z16.h, p0/z, #1 > mov z17.h, #0 > uzp1 z16.b, z16.b, z17.b > fmov x10, d16 > orr x10, x10, x10, lsr #7 > orr x10, x10, x10, lsr #14 > orr x10, x10, x10, lsr #28 > and x10, x10, #0xff > > (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes > $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} > > Expected: > (gdb) p/x $p0 > $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} > > > > Considering MaskAll is used in VectorMask.fromLong() only for a special > case and relies on the mechanism of inline and intrinsification, even it > could be optimized out, this patch also adds test cases for MaskAll to > reproduce this issue stably. > > Also fix a small issue on register utilization for > sve_reduce_[max|min][D|F]. > > [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 > > hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled > system. 
> > Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb Eric Liu has updated the pull request incrementally with one additional commit since the last revision: small fix Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/49/files - new: https://git.openjdk.java.net/jdk18/pull/49/files/109d8838..24fb4fc9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=49&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=49&range=00-01 Stats: 24 lines in 2 files changed: 18 ins; 0 del; 6 mod Patch: https://git.openjdk.java.net/jdk18/pull/49.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/49/head:pull/49 PR: https://git.openjdk.java.net/jdk18/pull/49 From eliu at openjdk.java.net Tue Dec 21 13:15:04 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 21 Dec 2021 13:15:04 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: <5mewgyi7avtXqIuC8TzTJnSQWhaTYBIfpYgSiaiHNzI=.b0718e86-b195-47d3-a39b-4779c6125a43@github.com> On Tue, 21 Dec 2021 06:54:50 GMT, Ningsheng Jian wrote: >> Eric Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> small fix >> >> Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 > > src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 389: > >> 387: Assembler::SIMD_RegVariant size = __ elemType_to_regVariant(bt); >> 388: __ sve_dup(as_FloatRegister($tmp$$reg), size, as_Register($src$$reg)); >> 389: __ sve_ptrue_lanecnt(as_PRegister($dst$$reg), size, Matcher::vector_length(this)); > > I think you can generate this insn conditionally, only when current vector size is not MaxVectorSize. Fixed. Thanks for your review. 
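The predicate values shown in the gdb output of this thread can be reproduced with a small model. This is a sketch under stated assumptions, not HotSpot or hardware code: it assumes one predicate bit per vector byte, with each lane governed by the lowest bit of its group, and the names `PredicateSketch` and `ptrue` (with its three parameters) are hypothetical.

```java
import java.util.Arrays;

public class PredicateSketch {
    /**
     * Builds the predicate bytes for an "all lanes true" mask.
     *
     * @param elemBytes         size of one lane in bytes (2 for short lanes)
     * @param activeVectorBytes bytes covered by the species in use
     * @param maxVectorBytes    bytes in the machine's full vector register
     */
    static int[] ptrue(int elemBytes, int activeVectorBytes, int maxVectorBytes) {
        int[] predicate = new int[maxVectorBytes / 8];
        for (int b = 0; b < activeVectorBytes; b += elemBytes) {
            predicate[b / 8] |= 1 << (b % 8);
        }
        return predicate;
    }

    static String hex(int[] bytes) {
        return Arrays.stream(bytes)
                .mapToObj(v -> String.format("0x%02x", v))
                .reduce((a, b) -> a + ", " + b)
                .orElse("");
    }

    public static void main(String[] args) {
        // All 64 bytes active: the unconstrained pattern reported as the bug.
        System.out.println(hex(ptrue(2, 64, 64)));
        // Only the 8 bytes of a 64-bit short species active: the expected pattern.
        System.out.println(hex(ptrue(2, 8, 64)));
    }
}
```

In this model the fix corresponds to passing the species size rather than the machine's maximum as the active length, which is also why it is only needed when the two differ, as suggested in the review.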
------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From shade at openjdk.java.net Tue Dec 21 14:05:24 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 21 Dec 2021 14:05:24 GMT Subject: RFR: 8277893: Arraycopy stress tests [v6] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 09:15:46 GMT, Aleksey Shipilev wrote: >> I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. >> >> A brief tour of these tests: >> >> - Tests all data types; >> - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; >> - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; >> - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; >> - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; >> >> My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. 
>> >> Test times: >> >> >> # x86_64 (TR 3970X) >> real 4m6.192s >> user 52m50.523s >> sys 0m13.755s >> >> # x86_64 (TR 3970X) -XX:+UseZGC >> real 6m2.573s >> user 72m43.541s >> sys 0m25.697s >> >> # x86_32 (TR 3970X) >> real 6m56.405s >> user 92m56.377s >> sys 0m6.677s >> >> # x86_64 (i5-11500) >> real 29m19.024s >> user 103m52.925s >> sys 1m7.175s >> >> # AArch64 (ThunderX2) >> real 2m59.623s >> user 26m14.624s >> sys 0m9.771s >> >> >> Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` >> - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` >> - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Peel out check{Bounds,Conjoint,Disjoint} GHA are clean for the latest revision. This is test-only PR for JDK 19, so any future problems with it can be resolved next year. Meanwhile, we can get more testing over the end-of-the-year break. I am integrating. ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From shade at openjdk.java.net Tue Dec 21 14:05:24 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 21 Dec 2021 14:05:24 GMT Subject: Integrated: 8277893: Arraycopy stress tests In-Reply-To: References: Message-ID: On Mon, 29 Nov 2021 13:28:33 GMT, Aleksey Shipilev wrote: > I would like to fork the new tests off the JDK-8150730. These tests were instrumental in capturing many bugs in my arraycopy work, and I think they are good on their own merit, because they provide a test for the current baseline and on-going minor improvements in arraycopy on all platforms, not only x86_64, and they might be cleanly backportable. 
> > A brief tour of these tests: > > - Tests all data types; > - Tests small arrays exhaustively, which captures conjoint/disjoint cases, errors near the edges, etc; > - Tests large arrays with fuzzing around powers of two and powers of ten, both conjoint and disjoint cases; > - Tests all available compilation modes for arraycopy stubs; for example, running on AVX-512 enabled machine runs all versions down to `-XX:UseAVX=0 -XX:UseSSE=0` cases; > - Tests with/without compressed oops mode -- theoretically only needed for `Object` copies, but Hotspot cobbles together int+coops and long+no-coops loops, so I decided to alternate coops mode for all data types; > > My previous version used individual `@run` clauses for all configurations, but I think the Java driver is cleaner and easier to maintain. > > Test times: > > > # x86_64 (TR 3970X) > real 4m6.192s > user 52m50.523s > sys 0m13.755s > > # x86_64 (TR 3970X) -XX:+UseZGC > real 6m2.573s > user 72m43.541s > sys 0m25.697s > > # x86_32 (TR 3970X) > real 6m56.405s > user 92m56.377s > sys 0m6.677s > > # x86_64 (i5-11500) > real 29m19.024s > user 103m52.925s > sys 1m7.175s > > # AArch64 (ThunderX2) > real 2m59.623s > user 26m14.624s > sys 0m9.771s > > > Since these tests are quite long, especially on small machines, I hooked them up to `hotspot:tier3`. > > Additional testing: > - [x] Linux x86_64 fastdebug `compiler/stress/arraycopy` > - [x] Linux x86_32 fastdebug `compiler/stress/arraycopy` > - [x] Linux AArch64 fastdebug `compiler/stress/arraycopy` This pull request has now been integrated. 
Changeset: 29bd7363 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/29bd73638a22d341767a1266723a7d7263e17093 Stats: 1154 lines in 12 files changed: 1153 ins; 0 del; 1 mod 8277893: Arraycopy stress tests Reviewed-by: kvn, mli ------------- PR: https://git.openjdk.java.net/jdk/pull/6594 From neliasso at openjdk.java.net Tue Dec 21 14:32:25 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Tue, 21 Dec 2021 14:32:25 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 00:51:51 GMT, Smita Kamath wrote: > Adding vzeroupper instruction to aes and shift intrinsics. Looks good. ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/52 From rkennke at openjdk.java.net Tue Dec 21 16:13:04 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 21 Dec 2021 16:13:04 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors [v3] In-Reply-To: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Message-ID: > Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call. 
> > Testing: > - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64, aarch64 x -UseHeavyMonitors, +UseHeavyMonitors) > - [x] tier1 (x86_32, x86_64, aarch64 x -UseHeavyMonitors, +UseHeavyMonitors) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: AArch64 port ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/16/files - new: https://git.openjdk.java.net/jdk18/pull/16/files/c77c877d..7a44dcc7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=16&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=16&range=01-02 Stats: 7 lines in 1 file changed: 2 ins; 1 del; 4 mod Patch: https://git.openjdk.java.net/jdk18/pull/16.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/16/head:pull/16 PR: https://git.openjdk.java.net/jdk18/pull/16 From rkennke at openjdk.java.net Tue Dec 21 16:13:06 2021 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 21 Dec 2021 16:13:06 GMT Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors [v2] In-Reply-To: References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com> Message-ID: On Mon, 20 Dec 2021 15:29:31 GMT, Aleksey Shipilev wrote: > The same thing affects `sharedRuntime_aarch64.cpp`, is it not? Otherwise the change makes sense to me. Right. I did the same fix there. ------------- PR: https://git.openjdk.java.net/jdk18/pull/16 From jbhateja at openjdk.java.net Tue Dec 21 16:28:53 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 16:28:53 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. 
[v3]
In-Reply-To:
References:
Message-ID:

> - Vector.maskAll was accelerated for AVX-512 targets, but the existing x86 backend implementation does not enable maskAll instruction patterns for the 32-bit JVM, so the operation falls back to a replicateB operation, which broadcasts the mask value into a vector.
> - In some cases, after unboxing-boxing optimization, this vector eventually reaches XorVMask, which then has mismatched operands: one held in an opmask register and the other in a vector.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8278508: Review comments resolved.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk18/pull/24/files
  - new: https://git.openjdk.java.net/jdk18/pull/24/files/4fb1ea1b..611b943a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=01-02

  Stats: 60 lines in 6 files changed: 16 ins; 40 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/24.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24

PR: https://git.openjdk.java.net/jdk18/pull/24

From jbhateja at openjdk.java.net Tue Dec 21 16:28:59 2021
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Tue, 21 Dec 2021 16:28:59 GMT
Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2]
In-Reply-To:
References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com>
Message-ID:

On Sat, 18 Dec 2021 00:16:35 GMT, Sandhya Viswanathan wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> 8278508: Review comments resolution.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4281:
>
>> 4279:     if (mask_len > 32) {
>> 4280:       kmovql(dst, src);
>> 4281:       kshiftrql(dst, dst, 64 - mask_len);
>
> Here masklen is 64, so kshiftrql is not needed here?

The src operand carries either 0 (false mask) or 0xFFFFFFFFFFFFFFFF (true mask). A right shift here therefore ensures that only the low mask_len bits of the destination are set. I have introduced special-case handling for mask_len of 64, 32 and 16.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4299:
>
>> 4297:       kshiftlql(dst, tmp, 32);
>> 4298:       korql(dst, dst, tmp);
>> 4299:       kshiftrql(dst, dst, 64 - mask_len);
>
> Do we need the kshiftrql here? The masklen is 64 here.
> You could alternatively use:
> kmovdl dst, src
> kunpckdq dst, dst, dst

DONE

> src/hotspot/cpu/x86/x86.ad line 9457:
>
>> 9455:   predicate(Matcher::vector_length(n) <= 32);
>> 9456:   match(Set dst (MaskAll cnt));
>> 9457:   effect(TEMP dst, TEMP tmp);
>
> TEMP dst is not needed here.

dst is both read and written to in macro assembly routine.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From shade at openjdk.java.net Tue Dec 21 16:40:11 2021
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Tue, 21 Dec 2021 16:40:11 GMT
Subject: [jdk18] RFR: 8278489: Preserve result in native wrapper with +UseHeavyMonitors [v3]
In-Reply-To:
References: <7DC3fQUgJZu-PrJPhcuTvHzSpRuuV4vQfe5pe6p3aqs=.9a0a3a5d-3c02-4f01-a239-8542790bd281@github.com>
Message-ID: <2kiS1aIrrXC-rS8j1kMYZNipBtFQtb6MzquYrqBdsQE=.406cb82a-b3c6-4a46-bd54-5a48a685a6dc@github.com>

On Tue, 21 Dec 2021 16:13:04 GMT, Roman Kennke wrote:

>> Testing observed a few failures after JDK-8276901. The reason for the failures is in the native-wrappers, in the +UseHeavyMonitors paths, we don't preserve the result register after the native call.
>> >> Testing: >> - [x] java/awt/color, sun/java2d/cmm tests (x86_32, x86_64, aarch64 x -UseHeavyMonitors, +UseHeavyMonitors) >> - [x] tier1 (x86_32, x86_64, aarch64 x -UseHeavyMonitors, +UseHeavyMonitors) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > AArch64 port Looks fine to me, but someone else should take a look too. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/16 From duke at openjdk.java.net Tue Dec 21 17:04:13 2021 From: duke at openjdk.java.net (amirhadadi) Date: Tue, 21 Dec 2021 17:04:13 GMT Subject: RFR: 8278518: String(byte[], int, int, Charset) constructor and String.translateEscapes() miss bounds check elimination In-Reply-To: References: <2IcCYyqxBU_oi_i9n1LZzTivcLD7QWxAlZga6kiGOPg=.a09fd12b-d1aa-4124-9bb2-31f86da2f706@github.com> Message-ID: On Tue, 21 Dec 2021 12:50:36 GMT, Roland Westrelin wrote: >> @dean-long actually the issue reproduces with Java 17 where `checkBoundsOffCount` was implemented in a more straight forward manner: >> >> >> static void checkBoundsOffCount(int offset, int count, int length) { >> if (offset < 0 || count < 0 || offset > length - count) { >> throw new StringIndexOutOfBoundsException( >> "offset " + offset + ", count " + count + ", length " + length); >> } >> } >> >> >> >> Here's a [gist](https://gist.github.com/amirhadadi/9505c3f5d9ad68cad2fbfd1b9e01f0b8) with a benchmark you can run. This benchmark compares safe and unsafe reads from the byte array (In this gist I didn't modify the code to add the offset >= 0 condition). >> >> Here are the results: >> >> >> OpenJDK 17.0.1+12 >> OSX with 2.9 GHz Quad-Core Intel Core i7 >> >> >> >> Benchmark Mode Cnt Score Error Units >> StringBenchmark.safeDecoding avgt 20 120.312 ? 11.674 ns/op >> StringBenchmark.unsafeDecoding avgt 20 72.628 ? 0.479 ns/op > > @amirhadadi unsafeDecode() is buggy I think. 
Offsets in the array when read with unsafe should be computed as `offset * unsafe.ARRAY_BYTE_INDEX_SCALE + unsafe.ARRAY_BYTE_BASE_OFFSET`.

@rwestrel thanks for the correction!
Here are the updated results:

    Benchmark                       Mode  Cnt    Score   Error  Units
    StringBenchmark.safeDecoding    avgt   20  113.849 ± 1.609  ns/op
    StringBenchmark.unsafeDecoding  avgt   20   85.272 ± 1.462  ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/6812

From iveresov at openjdk.java.net Tue Dec 21 17:17:10 2021
From: iveresov at openjdk.java.net (Igor Veresov)
Date: Tue, 21 Dec 2021 17:17:10 GMT
Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 7 Dec 2021 23:25:03 GMT, Martin Doerr wrote:

>> I have written a checker which detects usage of the illegal phi function. In case of the reproducer provided in the JBS bug ("Reduced.java"), it finds the following and bails out:
>>
>>     invalidating local 8 because of type mismatch (new_value is NULL)
>>     Bailing out because StoreIndexed (id 98) uses illegal phi (id 68)
>>
>> I haven't checked why that node uses the illegal phi. That still seems to be a bug. Maybe there's a better solution to the underlying problem, but I hope my checker is useful to analyze bugs and to make C1 more resilient.
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
> Add test.

Do you want me to take over this issue?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6683

From dcubed at openjdk.java.net Tue Dec 21 17:29:39 2021
From: dcubed at openjdk.java.net (Daniel D.Daugherty)
Date: Tue, 21 Dec 2021 17:29:39 GMT
Subject: [jdk18] RFR: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64
Message-ID:

A trivial fix to ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64.
-------------

Commit messages:
 - 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64

Changes: https://git.openjdk.java.net/jdk18/pull/59/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=59&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8279074
  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk18/pull/59.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/59/head:pull/59

PR: https://git.openjdk.java.net/jdk18/pull/59

From sviswanathan at openjdk.java.net Tue Dec 21 17:30:22 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Tue, 21 Dec 2021 17:30:22 GMT
Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 21 Dec 2021 16:28:53 GMT, Jatin Bhateja wrote:

>> - Vector.maskAll was accelerated for AVX-512 targets, but the existing x86 backend implementation does not enable maskAll instruction patterns for the 32-bit JVM, so the operation falls back to a replicateB operation, which broadcasts the mask value into a vector.
>> - In some cases, after unboxing-boxing optimization, this vector eventually reaches XorVMask, which then has mismatched operands: one held in an opmask register and the other in a vector.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8278508: Review comments resolved.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4283:

> 4281:       if (mask_len != 64) {
> 4282:         kshiftrql(dst, dst, 64 - mask_len);
> 4283:       }

We only support power-of-2 vector lengths on x86 (8, 16, 32, 64) for byte vectors. The only case that reaches here is mask_len == 64.
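The shift under discussion can be modeled on a plain 64-bit value. This is only a scalar sketch of the arithmetic, assuming (as stated earlier in the thread) that the source is either all zeros or all ones and that `maskLen` is between 1 and 64; the class and method names are hypothetical and no k-register semantics beyond the logical right shift are modeled.

```java
public class MaskAllShiftSketch {
    /**
     * Keeps only the low maskLen bits of src, where src is 0 or -1L.
     * Models a logical right shift by (64 - maskLen); valid for maskLen in 1..64.
     */
    static long maskAll(long src, int maskLen) {
        return src >>> (64 - maskLen);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(maskAll(-1L, 8)));
        System.out.println(Long.toHexString(maskAll(-1L, 32)));
        // For maskLen == 64 the shift amount is 0, so the value is unchanged.
        System.out.println(Long.toHexString(maskAll(-1L, 64)));
    }
}
```

When the only mask length reaching a code path is 64, the shift amount is zero and the instruction can be dropped, which is the point being made in the review.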
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4304:

> 4302:       kmovdl(tmp, src);
> 4303:       kunpckdql(dst, tmp, tmp);
> 4304:       kshiftrql(dst, dst, 64 - mask_len);

We only support power-of-2 vector lengths on x86 (8, 16, 32, 64) for byte vectors. The only case that reaches here is mask_len == 64, so we don't need kshiftrql.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From kvn at openjdk.java.net Tue Dec 21 17:32:15 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 21 Dec 2021 17:32:15 GMT
Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler
In-Reply-To:
References:
Message-ID:

On Sat, 18 Dec 2021 02:15:00 GMT, Quan Anh Mai wrote:

> This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that.

@merykitty please add a comment in the bug report on how to reproduce it and what the failure looks like. We do run AVX1 testing with these tests and I did not see a failure.

@jatin-bhateja or @sviswa7 please look at it.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/46

From sviswanathan at openjdk.java.net Tue Dec 21 17:33:22 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Tue, 21 Dec 2021 17:33:22 GMT
Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM.
[v2]
In-Reply-To:
References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com>
Message-ID:

On Tue, 21 Dec 2021 16:24:01 GMT, Jatin Bhateja wrote:

>> src/hotspot/cpu/x86/x86.ad line 9457:
>>
>>> 9455:   predicate(Matcher::vector_length(n) <= 32);
>>> 9456:   match(Set dst (MaskAll cnt));
>>> 9457:   effect(TEMP dst, TEMP tmp);
>>
>> TEMP dst is not needed here.
>
> dst is both read and written to in macro assembly routine.

That dst is both written and read is not the criterion for a TEMP dst effect. TEMP dst is needed only if src is used after dst is written in the encoding for the instruct.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/24

From sviswanathan at openjdk.java.net Tue Dec 21 17:39:25 2021
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Tue, 21 Dec 2021 17:39:25 GMT
Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction
In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com>
References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com>
Message-ID:

On Tue, 21 Dec 2021 00:51:51 GMT, Smita Kamath wrote:

> Adding vzeroupper instruction to aes and shift intrinsics.

The patch looks good to me.

-------------

Marked as reviewed by sviswanathan (Reviewer).

PR: https://git.openjdk.java.net/jdk18/pull/52

From ccheung at openjdk.java.net Tue Dec 21 17:42:17 2021
From: ccheung at openjdk.java.net (Calvin Cheung)
Date: Tue, 21 Dec 2021 17:42:17 GMT
Subject: [jdk18] RFR: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64
In-Reply-To:
References:
Message-ID:

On Tue, 21 Dec 2021 17:14:50 GMT, Daniel D. Daugherty wrote:

> A trivial fix to ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64.

LGTM.

-------------

Marked as reviewed by ccheung (Reviewer).
PR: https://git.openjdk.java.net/jdk18/pull/59 From dcubed at openjdk.java.net Tue Dec 21 17:42:17 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 21 Dec 2021 17:42:17 GMT Subject: [jdk18] RFR: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64 In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 17:39:00 GMT, Calvin Cheung wrote: >> A trivial fix to ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64. > > LGTM. @calvinccheung - Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk18/pull/59 From dcubed at openjdk.java.net Tue Dec 21 17:46:18 2021 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 21 Dec 2021 17:46:18 GMT Subject: [jdk18] Integrated: 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64 In-Reply-To: References: Message-ID: <5WVYUKB0NSQx1wq_Tk_giapnvx_6KmXvqDk9qwGjZoQ=.b6a4b574-34e7-438d-83a5-79e97fa78775@github.com> On Tue, 21 Dec 2021 17:14:50 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64. This pull request has now been integrated. Changeset: 54517fa3 Author: Daniel D. 
Daugherty URL: https://git.openjdk.java.net/jdk18/commit/54517fa3d80b50bfa8a4f6b7937b95e379a1dfeb Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8279074: ProblemList compiler/codecache/jmx/PoolsIndependenceTest.java on macosx-aarch64 Reviewed-by: ccheung ------------- PR: https://git.openjdk.java.net/jdk18/pull/59 From duke at openjdk.java.net Tue Dec 21 17:49:14 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 21 Dec 2021 17:49:14 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler In-Reply-To: References: Message-ID: On Sat, 18 Dec 2021 02:15:00 GMT, Quan Anh Mai wrote: > This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. Hi, I have commented in the bug report, the crash can be observed by uncommenting line 51 in `compiler/vectorapi/reshape/utils/TestCastMethods.java` which tells `TestVectorCastAVX1.java` to perform a cast from `Short64Vector` to `Double256Vector`. https://github.com/openjdk/jdk/blob/f7309060ded0edb1e614663572f876d83b77c28e/test/hotspot/jtreg/compiler/vectorapi/reshape/utils/TestCastMethods.java#L51 The crash happens at `void Assembler::vpmovsxwd(XMMRegister dst, XMMRegister src, int vector_len)` due to `assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : vector_len == AVX_256bit ? VM_Version::supports_avx2() : VM_Version::supports_evex(), "");` Thank you very much. 
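The assert quoted above can be read as a mapping from vector length to the minimum CPU feature level the encoding needs. The following is only a minimal sketch of that reading: `CpuFeatures` and `can_encode_vpmovsxwd` are hypothetical stand-ins for the `VM_Version` queries, not HotSpot code.

```cpp
#include <cassert>

// Hypothetical stand-in for HotSpot's vector-length constants; the names
// mirror the assert quoted above, but this enum is not HotSpot's definition.
enum VectorLen { AVX_128bit, AVX_256bit, AVX_512bit };

// Hypothetical stand-in for the VM_Version::supports_*() queries.
struct CpuFeatures {
  bool avx;
  bool avx2;
  bool evex;
};

// Mirrors the quoted condition: a 128-bit encoding needs AVX, a 256-bit
// encoding needs AVX2, and anything wider needs EVEX.
bool can_encode_vpmovsxwd(VectorLen vector_len, const CpuFeatures& cpu) {
  return vector_len == AVX_128bit ? cpu.avx
       : vector_len == AVX_256bit ? cpu.avx2
       : cpu.evex;
}
```

Under this model an AVX1-only machine accepts the 128-bit form and rejects the 256-bit form, which matches the crash described above when the promotion is emitted with the wrong vector length.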
------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From kvn at openjdk.java.net Tue Dec 21 18:09:11 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 21 Dec 2021 18:09:11 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler In-Reply-To: References: Message-ID: On Sat, 18 Dec 2021 02:15:00 GMT, Quan Anh Mai wrote: > This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. Thank you for explaining. These tests are not present in JDK18 and that confused me. I suggest to add new simple regression test with your changes for JDK 18 so we can verify the fix. And file a separate RFE for JDK 19 to uncomment lines in TestCastMethods.java when the fix is auto-pushed into JDK 19. And what about commented line 42 `makePair(BSPEC64, DSPEC256)1` ? ------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From duke at openjdk.java.net Tue Dec 21 18:17:13 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 21 Dec 2021 18:17:13 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2] In-Reply-To: References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> Message-ID: On Tue, 21 Dec 2021 17:30:31 GMT, Sandhya Viswanathan wrote: >> dst is both read and written to in macro assembly routine. > > dst both written and read is not the criteria for TEMP dst effect. Only if the src is used after dest is written in the encoding for instruct, TEMP dst is needed. 
Hi, do we need `TEMP dst` if some `tmp` is used after `dst` is written, as without it `tmp` and `dst` are implied to have non-overlapping lifetime and can lead to conflict. This is a pure question unrelated to the PR. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From dlong at openjdk.java.net Tue Dec 21 18:20:27 2021 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 21 Dec 2021 18:20:27 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. 
>> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. >> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1264: > 1262: break; > 1263: case 256: > 1264: sve_ptrue(dst, size, /* VL256 */ 0b01101); Why not use an enum for these magic constants? ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From svkamath at openjdk.java.net Tue Dec 21 18:44:20 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 21 Dec 2021 18:44:20 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 00:51:51 GMT, Smita Kamath wrote: > Adding vzeroupper instruction to aes and shift intrinsics. @vnkozlov Could this code change be integrated? Please do advice. If it needs testing, would it be possible for you to help me out? Thank you. ------------- PR: https://git.openjdk.java.net/jdk18/pull/52 From sviswanathan at openjdk.java.net Tue Dec 21 19:15:12 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 21 Dec 2021 19:15:12 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2] In-Reply-To: References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> Message-ID: On Tue, 21 Dec 2021 18:13:44 GMT, Quan Anh Mai wrote: >> dst both written and read is not the criteria for TEMP dst effect. 
Only if the src is used after dest is written in the encoding for instruct, TEMP dst is needed. > > Hi, do we need `TEMP dst` if some `tmp` is used after `dst` is written, as without it `tmp` and `dst` are implied to have non-overlapping lifetime and can lead to conflict. This is a pure question unrelated to the PR. Thank you very much. Hi @merykitty, yes we would need TEMP dst in that case as well. ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From jbhateja at openjdk.java.net Tue Dec 21 19:31:18 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 19:31:18 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v2] In-Reply-To: References: <7xWcxp4p6wK4ccQdTb95R-vVlZtWixADO9ouDhRu0fY=.8b30e77c-4cc6-4f86-a255-7f6aa24a8122@github.com> Message-ID: On Tue, 21 Dec 2021 19:12:17 GMT, Sandhya Viswanathan wrote: >> Hi, do we need `TEMP dst` if some `tmp` is used after `dst` is written, as without it `tmp` and `dst` are implied to have non-overlapping lifetime and can lead to conflict. This is a pure question unrelated to the PR. Thank you very much. > > Hi @merykitty, yes we would need TEMP dst in that case as well. The TEMP attribute creates a temporary machine operand that interferes with the source operands if their register classes overlap. A TEMP + DST data flow attribute therefore ensures that DST is not accidentally allocated a SRC register, even if SRC is not live beyond that instruction. This prevents the src from being overwritten before its value is used in the instruction encoding block.
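To make the hazard discussed in this thread concrete, here is a toy C++ model, offered only as a sketch. `RegFile` and `write_dst_then_read_src` are hypothetical names; the function stands in for any encoding that writes `dst` before it reads `src`, and has nothing to do with the real register allocator.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy register file; the indices stand in for physical registers.
using RegFile = std::array<uint64_t, 4>;

// Hypothetical encoding that writes dst before it reads src.
// The intended result is "dst = 1 + src".
uint64_t write_dst_then_read_src(RegFile& regs, int dst, int src) {
  regs[dst] = 1;           // dst is written first ...
  regs[dst] += regs[src];  // ... and src is read afterwards
  return regs[dst];
}
```

With `src` holding 5, distinct registers give the intended 6. If the allocator places `dst` and `src` in the same register, the first write clobbers `src` and the result is 2. In this model, `TEMP dst` corresponds to forbidding that shared assignment.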
------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From kvn at openjdk.java.net Tue Dec 21 19:39:17 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 21 Dec 2021 19:39:17 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. 
>> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 I ran tier1, and 3 tests in `test/hotspot/jtreg/gc/stress/gcbasher/` failed on macosx-aarch64: java.lang.IllegalStateException at gc.stress.gcbasher.ByteCursor.readUtf8(ByteCursor.java:110) at gc.stress.gcbasher.Decompiler.decodeConstantPool(Decompiler.java:310) at gc.stress.gcbasher.Decompiler.<init>(Decompiler.java:42) at gc.stress.gcbasher.TestGCBasher.parseClassFiles(TestGCBasher.java:46) at gc.stress.gcbasher.TestGCBasher.main(TestGCBasher.java:63) at gc.stress.gcbasher.TestGCBasherWithG1.main(TestGCBasherWithG1.java:40) This seems to be a new failure. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/49 From sviswanathan at openjdk.java.net Tue Dec 21 19:45:24 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 21 Dec 2021 19:45:24 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler In-Reply-To: References: Message-ID: On Sat, 18 Dec 2021 02:15:00 GMT, Quan Anh Mai wrote: > This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. src/hotspot/cpu/x86/x86.ad line 1778: > 1776: case Op_VectorCastS2X: > 1777: case Op_VectorCastI2X: > 1778: if (bt != T_DOUBLE && size_in_bits == 256 && UseAVX < 2) { CastI2X should work with (UseAVX == 1) for (bt == T_FLOAT) so the prior code was correct for CastI2X. The fix is only needed for CastS2X and CastB2X.
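One possible reading of this review comment, written as a standalone C++ sketch. This is not the actual `x86.ad` code: `CastOp`, `BasicType`, and `cast_to_256_supported` are hypothetical stand-ins, the function models only 256-bit results, and the treatment of `CastB2X` and of integral targets for `CastI2X` is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for the C2 opcodes and basic types named in the review.
enum CastOp { CastB2X, CastS2X, CastI2X };
enum BasicType { T_BYTE, T_SHORT, T_INT, T_LONG, T_FLOAT, T_DOUBLE };

bool is_integral(BasicType bt) { return bt != T_FLOAT && bt != T_DOUBLE; }

// Models only casts whose result vector is 256 bits wide.
// bt is the element type of the result; use_avx plays the role of UseAVX.
bool cast_to_256_supported(CastOp op, BasicType bt, int use_avx) {
  if (use_avx >= 2) return true;  // AVX2 and above: nothing to restrict here
  switch (op) {
    case CastB2X:
    case CastS2X:
      // Follows the quoted condition: on AVX1 only a double result is allowed.
      return bt == T_DOUBLE;
    case CastI2X:
      // Per the review, int -> float works on AVX1; rejecting integral
      // results here is an assumption of this sketch.
      return !is_integral(bt);
  }
  return false;
}
```

In this sketch `CastI2X` to a 256-bit float vector stays supported with `use_avx == 1`, while `CastS2X` and `CastB2X` to a 256-bit float vector are rejected, which is the distinction the review asks for.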
------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From shade at openjdk.java.net Tue Dec 21 19:48:42 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 21 Dec 2021 19:48:42 GMT Subject: [jdk18] RFR: 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 Message-ID: See the reproducer and the analysis in the bug. The fix is simple: `Matcher::match_rule_supported` should handle the predicates for current `SqrtF` and `SqrtD` match rules. Additional testing: - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, new test now passes - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, new test passes - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, new test passes - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, `jdk/incubator/vector/` now passes - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, `jdk/incubator/vector/` passes - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, `jdk/incubator/vector/` passes (some unrelated failures) - [x] Linux x86_64, new test passes ------------- Commit messages: - Minor adjustment in test - Fix Changes: https://git.openjdk.java.net/jdk18/pull/60/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=60&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279076 Stats: 64 lines in 2 files changed: 64 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/60.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/60/head:pull/60 PR: https://git.openjdk.java.net/jdk18/pull/60 From kvn at openjdk.java.net Tue Dec 21 19:51:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 21 Dec 2021 19:51:16 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 00:51:51 GMT, Smita Kamath wrote: > Adding vzeroupper instruction to aes 
and shift intrinsics. Please, update to latest changes in JDK18 (there was changes pushed to stubs #19). I started testing. ------------- PR: https://git.openjdk.java.net/jdk18/pull/52 From svkamath at openjdk.java.net Tue Dec 21 19:57:56 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 21 Dec 2021 19:57:56 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction [v2] In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: > Adding vzeroupper instruction to aes and shift intrinsics. Smita Kamath has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' - Adding vzeroupper instruction to aes and shift intrinsics ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/52/files - new: https://git.openjdk.java.net/jdk18/pull/52/files/f8021390..1b131eec Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=52&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=52&range=00-01 Stats: 1689 lines in 115 files changed: 1120 ins; 135 del; 434 mod Patch: https://git.openjdk.java.net/jdk18/pull/52.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/52/head:pull/52 PR: https://git.openjdk.java.net/jdk18/pull/52 From kvn at openjdk.java.net Tue Dec 21 20:05:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 21 Dec 2021 20:05:16 GMT Subject: [jdk18] RFR: 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:32:18 GMT, Aleksey Shipilev wrote: > See the reproducer and the analysis in the bug. 
> > The fix is simple: `Matcher::match_rule_supported` should handle the predicates for current `SqrtF` and `SqrtD` match rules. > > Additional testing: > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, new test now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, `jdk/incubator/vector/` now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, `jdk/incubator/vector/` passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, `jdk/incubator/vector/` passes (some unrelated failures) > - [x] Linux x86_64, new test passes Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/60 From jbhateja at openjdk.java.net Tue Dec 21 20:08:58 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 20:08:58 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v4] In-Reply-To: References: Message-ID: > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8278508: Review comments addressed. 
------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/24/files - new: https://git.openjdk.java.net/jdk18/pull/24/files/611b943a..583062da Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=24&range=02-03 Stats: 10 lines in 4 files changed: 0 ins; 7 del; 3 mod Patch: https://git.openjdk.java.net/jdk18/pull/24.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/24/head:pull/24 PR: https://git.openjdk.java.net/jdk18/pull/24 From jbhateja at openjdk.java.net Tue Dec 21 20:09:00 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 21 Dec 2021 20:09:00 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v3] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 16:28:53 GMT, Jatin Bhateja wrote: >> - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. >> - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278508: Review comments resolved. Hi @sviswa7, your comments addressed. TEMP def was an artifact of some other change I was doing initially. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From coleenp at openjdk.java.net Tue Dec 21 20:58:32 2021 From: coleenp at openjdk.java.net (Coleen Phillimore) Date: Tue, 21 Dec 2021 20:58:32 GMT Subject: [jdk18] RFR: 8278239: vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine failed with EXCEPTION_ACCESS_VIOLATION at 0x000000000000000d Message-ID: This is the fix for https://github.com/openjdk/jdk/pull/6900 retargeted to JDK 18. Thanks to @stefank and @fisk for the diagnosis. ZGC is looking at metadata in unloaded nmethods, but redefinition doesn't keep this metadata from being deallocated. Change the iterators that mark metadata in use to walk unloaded nmethods as well as other alive nmethods. The test case reproduces the crash on Windows when run in a 100-iteration loop. With this fix, the same test does not crash. Also ran tier1-6. Above testing in progress. ------------- Commit messages: - 8278239: vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine failed with EXCEPTION_ACCESS_VIOLATION at 0x000000000000000d Changes: https://git.openjdk.java.net/jdk18/pull/63/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=63&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8278239 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk18/pull/63.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/63/head:pull/63 PR: https://git.openjdk.java.net/jdk18/pull/63 From kvn at openjdk.java.net Tue Dec 21 21:31:22 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 21 Dec 2021 21:31:22 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction [v2] In-Reply-To: References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 19:57:56 GMT, Smita Kamath wrote: >> Adding vzeroupper instruction to aes and shift intrinsics.
> > Smita Kamath has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' > - Adding vzeroupper instruction to aes and shift intrinsics Testing passed. Good to integrate. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/52 From sviswanathan at openjdk.java.net Tue Dec 21 21:47:23 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 21 Dec 2021 21:47:23 GMT Subject: [jdk18] RFR: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. [v4] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 20:08:58 GMT, Jatin Bhateja wrote: >> - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. >> - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8278508: Review comments addressed. Patch looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/24 From svkamath at openjdk.java.net Tue Dec 21 22:04:12 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 21 Dec 2021 22:04:12 GMT Subject: [jdk18] RFR: 8279045: Intrinsics missing vzeroupper instruction In-Reply-To: References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 19:48:07 GMT, Vladimir Kozlov wrote: >> Adding vzeroupper instruction to aes and shift intrinsics. > > Please, update to latest changes in JDK18 (there was changes pushed to stubs #19). > I started testing. @vnkozlov Thank you. ------------- PR: https://git.openjdk.java.net/jdk18/pull/52 From svkamath at openjdk.java.net Tue Dec 21 22:13:16 2021 From: svkamath at openjdk.java.net (Smita Kamath) Date: Tue, 21 Dec 2021 22:13:16 GMT Subject: [jdk18] Integrated: 8279045: Intrinsics missing vzeroupper instruction In-Reply-To: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> References: <27V2F0jiu0-jjxsrjhA4U0idx3yY0d2zeMaN2_TDFcY=.92b56e97-d251-4f70-b3d2-a290913a8d59@github.com> Message-ID: On Tue, 21 Dec 2021 00:51:51 GMT, Smita Kamath wrote: > Adding vzeroupper instruction to aes and shift intrinsics. This pull request has now been integrated. 
Changeset: 9ee3ccfe Author: Smita Kamath Committer: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk18/commit/9ee3ccfee2c9cc54ac7dca49fbf35135e627ef18 Stats: 7 lines in 1 file changed: 7 ins; 0 del; 0 mod 8279045: Intrinsics missing vzeroupper instruction Reviewed-by: neliasso, sviswanathan, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/52 From sviswanathan at openjdk.java.net Tue Dec 21 22:28:17 2021 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 21 Dec 2021 22:28:17 GMT Subject: [jdk18] RFR: 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:32:18 GMT, Aleksey Shipilev wrote: > See the reproducer and the analysis in the bug. > > The fix is simple: `Matcher::match_rule_supported` should handle the predicates for current `SqrtF` and `SqrtD` match rules. > > Additional testing: > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, new test now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, `jdk/incubator/vector/` now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, `jdk/incubator/vector/` passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, `jdk/incubator/vector/` passes (some unrelated failures) > - [x] Linux x86_64, new test passes Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/60 From eliu at openjdk.java.net Wed Dec 22 01:06:20 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 22 Dec 2021 01:06:20 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 18:17:30 GMT, Dean Long wrote: >> Eric Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> small fix >> >> Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1264: > >> 1262: break; >> 1263: case 256: >> 1264: sve_ptrue(dst, size, /* VL256 */ 0b01101); > > Why not use an enum for these magic constants? I think these magic numbers are only used by sve_ptrue, and probably no one will need to write them again. If they were commonly used, I would agree with you that an enum is better. ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From eliu at openjdk.java.net Wed Dec 22 01:41:20 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 22 Dec 2021 01:41:20 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:35:42 GMT, Vladimir Kozlov wrote: > I ran tier1, and 3 tests in `test/hotspot/jtreg/gc/stress/gcbasher/` failed on macosx-aarch64: > > ``` > java.lang.IllegalStateException > at gc.stress.gcbasher.ByteCursor.readUtf8(ByteCursor.java:110) > at gc.stress.gcbasher.Decompiler.decodeConstantPool(Decompiler.java:310) > at gc.stress.gcbasher.Decompiler.<init>(Decompiler.java:42) > at gc.stress.gcbasher.TestGCBasher.parseClassFiles(TestGCBasher.java:46) > at gc.stress.gcbasher.TestGCBasher.main(TestGCBasher.java:63) > at gc.stress.gcbasher.TestGCBasherWithG1.main(TestGCBasherWithG1.java:40) > ``` > > This seems to be a new failure. Thanks for your testing!
Looks like it's the same as https://bugs.openjdk.java.net/browse/JDK-8275263. ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From kvn at openjdk.java.net Wed Dec 22 02:19:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 22 Dec 2021 02:19:16 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. 
>> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 It is actually [8275316](https://bugs.openjdk.java.net/browse/JDK-8275316), which was closed as a duplicate of [8275263](https://bugs.openjdk.java.net/browse/JDK-8275263). But, as I understand it, 8275263 was also closed as a duplicate of [BACKOUT] [8275262](https://bugs.openjdk.java.net/browse/JDK-8275262). So there should not be any code in JDK 18 that causes it. Note that I ran additional testing on the same local repo build but without the **8278889** changes, and I don't see `gcbasher` failures. ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From jbhateja at openjdk.java.net Wed Dec 22 03:20:14 2021 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 22 Dec 2021 03:20:14 GMT Subject: [jdk18] Integrated: 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. In-Reply-To: References: Message-ID: <3hNL4nsu7Hx0JURTTrPH7zWKoFnSD0TZAkg8rrNtiok=.ac0aaf30-1419-4399-abba-31823909244f@github.com> On Tue, 14 Dec 2021 19:18:47 GMT, Jatin Bhateja wrote: > - Vector.maskAll was accelerated for AVX-512 target, but x86 existing backend implementation does not enable maskAll instruction patterns for 32 bit JVM, due to which operations fall backs over replicateB operation which broadcasts the mask value in a vector. > - In some cases after unboxing-boxing optimization this vector eventually reaches to XorVMask which has different operands one held in opmask register and other in vector. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated.
Changeset: 97c5cd7f Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk18/commit/97c5cd7facf1d3565038c078d5688c7da15ad14e Stats: 166 lines in 10 files changed: 110 ins; 47 del; 9 mod 8278508: Enable X86 maskAll instruction pattern for 32 bit JVM. Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.java.net/jdk18/pull/24 From eliu at openjdk.java.net Wed Dec 22 05:11:25 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 22 Dec 2021 05:11:25 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Wed, 22 Dec 2021 02:16:06 GMT, Vladimir Kozlov wrote: > It is actually [8275316](https://bugs.openjdk.java.net/browse/JDK-8275316) which was closed as duplicate [8275263 ](https://bugs.openjdk.java.net/browse/JDK-8275263). But, as I understand, 8275263 was also closed as duplicate of [BACKOUT] [8275262](https://bugs.openjdk.java.net/browse/JDK-8275262). So there should not be code in JDK 18 which cause it. > > Note, I ran additional testing of the same local repo build but without this **8278889** changes and I don't see `gcbasher` failures. Thanks for your feedback. The failure is really bizarre. Could it be caused by the specific toolchain? This patch doesn't change any common code except the test cases; all the other changes are SVE related, and I don't think macOS can reach SVE-related code at the moment. Moreover, the log is the same as in [8275316](https://bugs.openjdk.java.net/browse/JDK-8275316), which suggests that other patches may trigger this failure as well until the root cause is fixed. So exposing this failure and filing a new JBS issue to involve some Mac experts may not be a bad idea. Otherwise, patches that look fine could unfortunately be blocked.
------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From kvn at openjdk.java.net Wed Dec 22 08:17:27 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 22 Dec 2021 08:17:27 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. 
>> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 This is indeed bizarre. I submitted new build in our infra only for macosx-aarch64-debug and test passed with it (20 times). And it immediately failed with old build. I checked: OS version is the same and clang (and xcode) exactly the same for both builds. My only suspicion is we may have some uninitialized value[s] in VM somewhere and depending on JVM's code and data layout we can get these strange failures. I am repeating whole tier1 build and testing on all platforms. ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From shade at openjdk.java.net Wed Dec 22 14:02:15 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 22 Dec 2021 14:02:15 GMT Subject: [jdk18] RFR: 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:32:18 GMT, Aleksey Shipilev wrote: > See the reproducer and the analysis in the bug. > > The fix is simple: `Matcher::match_rule_supported` should handle the predicates for current `SqrtF` and `SqrtD` match rules. > > Additional testing: > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, new test now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, `jdk/incubator/vector/` now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, `jdk/incubator/vector/` passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, `jdk/incubator/vector/` passes (some unrelated failures) > - [x] Linux x86_64, new test passes Thanks for reviews! I see all GHA failures are known, not related to this patch, and reported as bugs already. I will integrate this after 24 hrs expire. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/60 From duke at openjdk.java.net Wed Dec 22 18:23:50 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 22 Dec 2021 18:23:50 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: > This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add test ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/46/files - new: https://git.openjdk.java.net/jdk18/pull/46/files/46dc2068..5afabe5b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=46&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=46&range=00-01 Stats: 82 lines in 1 file changed: 82 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/46.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/46/head:pull/46 PR: https://git.openjdk.java.net/jdk18/pull/46 From duke at openjdk.java.net Wed Dec 22 18:31:09 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 22 Dec 2021 18:31:09 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 18:06:01 GMT, Vladimir Kozlov wrote: >> This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. 
For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. > > Thank you for explaining. These tests are not present in JDK18 and that confused me. I suggest to add new simple regression test with your changes for JDK 18 so we can verify the fix. And file a separate RFE for JDK 19 to uncomment lines in TestCastMethods.java when the fix is auto-pushed into JDK 19. > > And what about commented line 42 `makePair(BSPEC64, DSPEC256)1` ? Hi @vnkozlov , I have added tests for this regression. The cast from byte to double also suffers this miscalculation, it is also mistakenly rejected from intrinsification before and I also fix it here. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From duke at openjdk.java.net Wed Dec 22 18:31:10 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 22 Dec 2021 18:31:10 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:42:27 GMT, Sandhya Viswanathan wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> add test > > src/hotspot/cpu/x86/x86.ad line 1778: > >> 1776: case Op_VectorCastS2X: >> 1777: case Op_VectorCastI2X: >> 1778: if (bt != T_DOUBLE && size_in_bits == 256 && UseAVX < 2) { > > CastI2X should work with (UseAVX == 1) for (bt == T_FLOAT) so the prior code was correct for CastI2X. > The fix is only needed for CastS2X and CastB2X. AFAIK int vectors of size 256 bit are not supported on AVX1 so we cannot cast from int vectors to float vectors of size 256 anyway. So it is essentially the same. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From kvn at openjdk.java.net Wed Dec 22 18:49:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 22 Dec 2021 18:49:16 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: <5hMt9bxK8gwlJb64pC911_gpYj8pmGb_bInGMMTBqC0=.fa03f80a-7327-40de-8586-25ac4b9be7f1@github.com> On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. 
>> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 I filed https://bugs.openjdk.java.net/browse/JDK-8279177 tier1 re-testing passed. Running tier2 and tier3 now. ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From kvn at openjdk.java.net Wed Dec 22 20:16:15 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 22 Dec 2021 20:16:15 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Wed, 22 Dec 2021 18:28:34 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86.ad line 1778: >> >>> 1776: case Op_VectorCastS2X: >>> 1777: case Op_VectorCastI2X: >>> 1778: if (bt != T_DOUBLE && size_in_bits == 256 && UseAVX < 2) { >> >> CastI2X should work with (UseAVX == 1) for (bt == T_FLOAT) so the prior code was correct for CastI2X. >> The fix is only needed for CastS2X and CastB2X. > > AFAIK int vectors of size 256 bit are not supported on AVX1 so we cannot cast from int vectors to float vectors of size 256 anyway. So it is essentially the same. x86 docs say that `vcvtdq2ps` is AVX1 instruction: ![image](https://user-images.githubusercontent.com/5215794/147149423-10988f8d-9bcd-41ca-b728-bc2c34e3bd35.png) ------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From shade at openjdk.java.net Wed Dec 22 20:20:17 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 22 Dec 2021 20:20:17 GMT Subject: [jdk18] Integrated: 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 19:32:18 GMT, Aleksey Shipilev wrote: > See the reproducer and the analysis in the bug. > > The fix is simple: `Matcher::match_rule_supported` should handle the predicates for current `SqrtF` and `SqrtD` match rules. 
> > Additional testing: > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, new test now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, new test passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=0`, `jdk/incubator/vector/` now passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=1`, `jdk/incubator/vector/` passes > - [x] Linux x86_32 `-XX:UseAVX=0 -XX:UseSSE=2`, `jdk/incubator/vector/` passes (some unrelated failures) > - [x] Linux x86_64, new test passes This pull request has now been integrated. Changeset: 9d5ae2e3 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk18/commit/9d5ae2e38074c3df354aeab19ebbab7d4872165a Stats: 64 lines in 2 files changed: 64 ins; 0 del; 0 mod 8279076: C2: Bad AD file when matching SqrtF with UseSSE=0 Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.java.net/jdk18/pull/60 From kvn at openjdk.java.net Wed Dec 22 20:26:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 22 Dec 2021 20:26:16 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: <88zz4puk7yxoJcjumWVpPoXc7v03cnJNxVC3quFTj3k=.72fad719-e616-4804-8daa-20da718d23ae@github.com> On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. 
>> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. >> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 Re-testing passed. ------------- Marked as reviewed by kvn (Reviewer). 
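For readers following the quoted description: a plain-Java sketch (no Vector API, no SVE) of why an all-true predicate gives a wrong `toLong` for a 4-lane short mask. It assumes the low 64 bits after the `mov`/`uzp1` steps hold one byte per halfword lane, set to 1 when that lane is active; the class and method names are hypothetical and only model the quoted assembly, they are not HotSpot code.

```java
public class MaskAllSketch {
    // Packs eight byte lanes (each 0 or 1) of a 64-bit word into an 8-bit mask,
    // mirroring the quoted "orr ..., lsr #7 / #14 / #28" and "and ..., #0xff" steps.
    public static long packBytesToBits(long x) {
        x |= x >>> 7;
        x |= x >>> 14;
        x |= x >>> 28;
        return x & 0xff;
    }

    // Assumed model of the low 64 bits after "mov z.h, p/z, #1" and "uzp1 z.b":
    // byte k is 1 when halfword lane k is active in the predicate.
    public static long lowBytesForActiveLanes(int activeLanes) {
        long x = 0;
        for (int lane = 0; lane < Math.min(activeLanes, 8); lane++) {
            x |= 1L << (8 * lane);
        }
        return x;
    }

    public static void main(String[] args) {
        int hardwareLanes = 64 / 2; // 64-byte SVE vector, 2-byte short lanes
        int speciesLanes = 8 / 2;   // a 64-bit short species has 4 lanes

        // ptrue over the whole hardware vector: lanes beyond the species are active too.
        long observed = packBytesToBits(lowBytesForActiveLanes(hardwareLanes));
        // Predicate limited to the species length: only 4 lanes active.
        long expected = packBytesToBits(lowBytesForActiveLanes(speciesLanes));

        System.out.println("all-true predicate: 0x" + Long.toHexString(observed)); // 0xff
        System.out.println("species-limited:    0x" + Long.toHexString(expected)); // 0xf
    }
}
```

In this model the extra active lanes show up as the upper four bits of the result, which matches the difference between the two `p0` dumps in the description.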
PR: https://git.openjdk.java.net/jdk18/pull/49 From kvn at openjdk.java.net Thu Dec 23 01:18:15 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 01:18:15 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: <-8KQmMm4qVNnktitnRmmmVKKDdm8vBlRoGxLJiaqzKw=.bc118c49-aee2-4cdf-8551-f33c134b65ae@github.com> References: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> <-8KQmMm4qVNnktitnRmmmVKKDdm8vBlRoGxLJiaqzKw=.bc118c49-aee2-4cdf-8551-f33c134b65ae@github.com> Message-ID: On Mon, 20 Dec 2021 09:28:30 GMT, Christian Hagedorn wrote: > > So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. But is it used in **optimized** build or only in debug? > > I suggest to change `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and try to build optimized VM to see if it is used in it. I see only prints and asserts. > > Shouldn't we leave `BlockPrinter` and `IR::print()` under `#ifndef PRODUCT`? These seem to be about printing things only which is probably good to have in optimized builds as well. We could instead just add an `#ifdef ASSERT` on L1263 for the validators and the IR verification and change the places using this code accordingly to `#ifdef ASSERT/DEBUG_ONLY()`. Then we would have everything under `ASSERT` instead of `!PRODUCT`. This would be in line with the original code which used `#ifdef ASSERT` in `IR::verify()`. I agree with christian's suggestion. 
------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From mli at openjdk.java.net Thu Dec 23 01:18:39 2021 From: mli at openjdk.java.net (Hamlin Li) Date: Thu, 23 Dec 2021 01:18:39 GMT Subject: RFR: 8279057: Consolidate InstructionPrinter::do_BlockBegin and remove extra lines in logs of blocks with phi functions Message-ID: This PR does the following: - First, the code related to printing phis in InstructionPrinter::do_BlockBegin is a bit redundant and could be consolidated; - Second, only locals and stack values related to phi functions are now printed; - Third, the printed log of blocks with phi functions contains extra blank lines, which are better removed. ------------- Commit messages: - Refine comments and messages - Initial commit Changes: https://git.openjdk.java.net/jdk/pull/6921/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6921&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279057 Stats: 59 lines in 1 file changed: 7 ins; 25 del; 27 mod Patch: https://git.openjdk.java.net/jdk/pull/6921.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6921/head:pull/6921 PR: https://git.openjdk.java.net/jdk/pull/6921 From kvn at openjdk.java.net Thu Dec 23 01:29:11 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 01:29:11 GMT Subject: RFR: JDK-8258603 c1 IR::verify is expensive [v5] In-Reply-To: References: <8Vz4DDiP2vqHU5nfppPyKjmFkDY8RwIWweapHTCx4vo=.e21f9de5-ed2b-4792-954b-a0936036e4ea@github.com> <-8KQmMm4qVNnktitnRmmmVKKDdm8vBlRoGxLJiaqzKw=.bc118c49-aee2-4cdf-8551-f33c134b65ae@github.com> Message-ID: On Thu, 23 Dec 2021 01:14:48 GMT, Vladimir Kozlov wrote: >>> So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. But is it used in **optimized** build or only in debug? >>> >>> I suggest to change `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and try to build optimized VM to see if it is used in it. I see only prints and asserts.
>> >> Shouldn't we leave `BlockPrinter` and `IR::print()` under `#ifndef PRODUCT`? These seem to be about printing things only which is probably good to have in optimized builds as well. We could instead just add an `#ifdef ASSERT` on L1263 for the validators and the IR verification and change the places using this code accordingly to `#ifdef ASSERT/DEBUG_ONLY()`. Then we would have everything under `ASSERT` instead of `!PRODUCT`. This would be in line with the original code which used `#ifdef ASSERT` in `IR::verify()`. > >> > So the whole block of code in `c1_IR.cpp` is under `#ifndef PRODUCT`. But is it used in **optimized** build or only in debug? >> > I suggest to change `#ifndef PRODUCT` at line 1224 to `#ifdef ASSERT` and try to build optimized VM to see if it is used in it. I see only prints and asserts. >> >> Shouldn't we leave `BlockPrinter` and `IR::print()` under `#ifndef PRODUCT`? These seem to be about printing things only which is probably good to have in optimized builds as well. We could instead just add an `#ifdef ASSERT` on L1263 for the validators and the IR verification and change the places using this code accordingly to `#ifdef ASSERT/DEBUG_ONLY()`. Then we would have everything under `ASSERT` instead of `!PRODUCT`. This would be in line with the original code which used `#ifdef ASSERT` in `IR::verify()`. > > I agree with christian's suggestion. > @vnkozlov I don't know what an optimized build is, you'll have to point me to some instructions if you want me to build one. 
bash configure --with-conf-name=optimized --with-debug-level=optimized
make jdk-image CONF=optimized
./build/optimized/images/jdk/bin/java -XX:+PrintCFG -version
------------- PR: https://git.openjdk.java.net/jdk/pull/6850 From kvn at openjdk.java.net Thu Dec 23 02:34:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 02:34:14 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 18:06:01 GMT, Vladimir Kozlov wrote: >> This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. > > Thank you for explaining. These tests are not present in JDK18 and that confused me. I suggest to add new simple regression test with your changes for JDK 18 so we can verify the fix. And file a separate RFE for JDK 19 to uncomment lines in TestCastMethods.java when the fix is auto-pushed into JDK 19. > > And what about commented line 42 `makePair(BSPEC64, DSPEC256)1` ? > Hi @vnkozlov , I have added tests for this regression. > The cast from byte to double also suffers this miscalculation, it is also mistakenly rejected from intrinsification before and I also fix it here. > Thank you very much. Thank you for adding the test. Please address Sandhya's comment about 256-bit CastI2F support with AVX1. I added a snapshot from the x86 docs to prove it.
------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From shade at openjdk.java.net Thu Dec 23 08:09:31 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 23 Dec 2021 08:09:31 GMT Subject: [jdk18] RFR: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large Message-ID: Summary: This reverts commit deaf75a58587f80046204de7559ff50b3b770bed. Roland is on vacation now, so we would not get the fix any time soon. Meanwhile, there are lots of test failures and GHA uncleanliness due to this patch. Additional testing: - [x] Test failures from recorded bugs are gone - [ ] GHA are clean ------------- Commit messages: - Revert "8278413: C2 crash when allocating array of size too large" Changes: https://git.openjdk.java.net/jdk18/pull/69/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=69&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279204 Stats: 200 lines in 9 files changed: 53 ins; 125 del; 22 mod Patch: https://git.openjdk.java.net/jdk18/pull/69.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/69/head:pull/69 PR: https://git.openjdk.java.net/jdk18/pull/69 From duke at openjdk.java.net Thu Dec 23 11:40:15 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 23 Dec 2021 11:40:15 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Wed, 22 Dec 2021 20:13:05 GMT, Vladimir Kozlov wrote: >> AFAIK int vectors of size 256 bit are not supported on AVX1 so we cannot cast from int vectors to float vectors of size 256 anyway. So it is essentially the same. > > x86 docs say that `vcvtdq2ps` is AVX1 instruction: > ![image](https://user-images.githubusercontent.com/5215794/147149423-10988f8d-9bcd-41ca-b728-bc2c34e3bd35.png) Hi, while the instruction is supported in hardware, the VM does not support the vector shape. 
As a result, there should not be a situation where we have a vector node with that shape on AVX1. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From yyang at openjdk.java.net Thu Dec 23 11:58:16 2021 From: yyang at openjdk.java.net (Yi Yang) Date: Thu, 23 Dec 2021 11:58:16 GMT Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty [v2] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 23:25:03 GMT, Martin Doerr wrote: >> I have written a checker which detects usage of the illegal phi function. In case of the reproducer provided in the JBS bug ("Reduced.java"), it finds the following and bails out: >> >> invalidating local 8 because of type mismatch (new_value is NULL) >> Bailing out because StoreIndexed (id 98) uses illegal phi (id 68) >> >> I haven't checked why that node uses the illegal phi. That still seems to be a bug. Maybe there's a better solution to the underlying problem, but I hope my checker is useful to analyze bugs and to make C1 more resilient. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add test. I think we should at least fix/find such illegal use in state merging rather than LIR generation, that's far beyond where problem occurs src/hotspot/share/c1/c1_Instruction.cpp line 877: > 875: index, new_value == NULL ? " (new_value is NULL)" : "")); > 876: // Check if illegal phi gets used. > 877: SearchUsageClosure search(existing_phi); should we add something like "FIXME" comment to indicate this is a workaround and we need further investigation to solve it? 
------------- PR: https://git.openjdk.java.net/jdk/pull/6683 From chagedorn at openjdk.java.net Thu Dec 23 12:06:16 2021 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 23 Dec 2021 12:06:16 GMT Subject: [jdk18] RFR: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 08:01:52 GMT, Aleksey Shipilev wrote: > Summary: This reverts commit deaf75a58587f80046204de7559ff50b3b770bed. > > Roland is on vacation now, so we would not get the fix any time soon. Meanwhile, there are lots of test failures and GHA uncleanliness due to this patch. > > Additional testing: > - [x] Test failures from recorded bugs are gone > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_64 fastdebug `tier3` > - [x] GHA are clean That seems reasonable. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/69 From yyang at openjdk.java.net Thu Dec 23 12:46:13 2021 From: yyang at openjdk.java.net (Yi Yang) Date: Thu, 23 Dec 2021 12:46:13 GMT Subject: RFR: 8279057: Consolidate InstructionPrinter::do_BlockBegin and remove extra lines in logs of blocks with phi functions In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 01:11:08 GMT, Hamlin Li wrote: > This pr does following things: > - First, the code related to printing phi in InstructionPrinter::do_BlockBegin is bit redudant, it could be consolidated; > - Second, only locals and stack values related to phi functions are printed out now. > - Third, there are extra blank lines in printed log of blocks with phi functions, these blank lines are better to be removed. src/hotspot/share/c1/c1_InstructionPrinter.cpp line 639: > 637: if (v && is_phi_of_block(v, x)) { > 638: if (!printed_phis_in_locals_header) { > 639: output()->cr(); output()->print("Values in locals related to phi functions:"); Can we keep this old style? i.e. 
"Locals:" and "Stacks:". ------------- PR: https://git.openjdk.java.net/jdk/pull/6921 From yyang at openjdk.java.net Thu Dec 23 12:55:10 2021 From: yyang at openjdk.java.net (Yi Yang) Date: Thu, 23 Dec 2021 12:55:10 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v9] In-Reply-To: <7GTdrykHxvlcM_-gJENv_ZiyI-EQst9FV8ieJXOM-p8=.8f1f2335-8ef1-446e-addc-9869f5a70ce2@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <7GTdrykHxvlcM_-gJENv_ZiyI-EQst9FV8ieJXOM-p8=.8f1f2335-8ef1-446e-addc-9869f5a70ce2@github.com> Message-ID: On Sun, 19 Dec 2021 20:28:07 GMT, Zhiqiang Zang wrote: >> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". >> >> >> // Convert "x + x" into "x << 1" >> if (in1 == in2) { >> return new LShiftINode(in1, phase->intcon(1)); >> } > > Zhiqiang Zang has updated the pull request incrementally with three additional commits since the last revision: > > - refactor the ir test. > - rename tests. > - use compiler mode blackhole in microbenchmark to prevent function calling from dominating time. One more thought: it would be worth checking what GCC -O2 does for the same code. If GCC performs this transformation, I think it is also acceptable for C2.
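As a quick sanity check of the proposed rewrite at the Java level, the following minimal sketch compares `x + x` with `x << 1` on a few `int` values, including the overflow edge cases. It only illustrates that the two expressions agree under Java's wrapping `int` arithmetic; it says nothing about what C2 emits, and the class and method names are hypothetical.

```java
public class AddToShiftSketch {
    public static int viaAdd(int x) {
        return x + x;   // wraps on overflow (two's complement)
    }

    public static int viaShift(int x) {
        return x << 1;  // drops the top bit, same result as the wrapping add
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, 0x40000000, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (viaAdd(x) != viaShift(x)) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("x + x == x << 1 for all samples");
    }
}
```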
------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From duke at openjdk.java.net Thu Dec 23 13:00:21 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 23 Dec 2021 13:00:21 GMT Subject: RFR: 8278114: New addnode ideal optimization: converting "x + x" into "x << 1" [v9] In-Reply-To: <7GTdrykHxvlcM_-gJENv_ZiyI-EQst9FV8ieJXOM-p8=.8f1f2335-8ef1-446e-addc-9869f5a70ce2@github.com> References: <3hK3dFC_SKVjyYQufC7boGpZsKPHywUD9GrcbcS4AyY=.e2af0d30-6a60-4870-8b54-855afb73fcf7@github.com> <7GTdrykHxvlcM_-gJENv_ZiyI-EQst9FV8ieJXOM-p8=.8f1f2335-8ef1-446e-addc-9869f5a70ce2@github.com> Message-ID: On Sun, 19 Dec 2021 20:28:07 GMT, Zhiqiang Zang wrote: >> A new ideal optimization can be introduced for addnode: converting "x + x" into "x << 1". >> >> >> // Convert "x + x" into "x << 1" >> if (in1 == in2) { >> return new LShiftINode(in1, phase->intcon(1)); >> } > > Zhiqiang Zang has updated the pull request incrementally with three additional commits since the last revision: > > - refactor the ir test. > - rename tests. > - use compiler mode blackhole in microbenchmark to prevent function calling from dominating time. All gcc, clang and msvc seem to perform this transformation, [godbolt](https://godbolt.org/z/sYYxrTjPY). Cheers. ------------- PR: https://git.openjdk.java.net/jdk/pull/6675 From shade at openjdk.java.net Thu Dec 23 15:14:15 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 23 Dec 2021 15:14:15 GMT Subject: [jdk18] RFR: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large In-Reply-To: References: Message-ID: <8tnYIUVOwZvrYHhq5NokqNvJfKbHcUGJEiMHoE5STtA=.d2429509-21e3-423d-8140-7d73da4ce99a@github.com> On Thu, 23 Dec 2021 12:03:25 GMT, Christian Hagedorn wrote: > That seems reasonable. Looks good! Thanks! Should I wait for 24 hours for this patch, or we consider clean backouts trivial? 
------------- PR: https://git.openjdk.java.net/jdk18/pull/69 From kvn at openjdk.java.net Thu Dec 23 16:12:14 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 16:12:14 GMT Subject: [jdk18] RFR: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 08:01:52 GMT, Aleksey Shipilev wrote: > Summary: This reverts commit deaf75a58587f80046204de7559ff50b3b770bed. > > Roland is on vacation now, so we would not get the fix any time soon. Meanwhile, there are lots of test failures and GHA uncleanliness due to this patch. > > Additional testing: > - [x] Test failures from recorded bugs are gone > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_64 fastdebug `tier3` > - [x] GHA are clean No need to wait. Approved. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/69 From shade at openjdk.java.net Thu Dec 23 16:25:15 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 23 Dec 2021 16:25:15 GMT Subject: [jdk18] RFR: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 08:01:52 GMT, Aleksey Shipilev wrote: > Summary: This reverts commit deaf75a58587f80046204de7559ff50b3b770bed. > > Roland is on vacation now, so we would not get the fix any time soon. Meanwhile, there are lots of test failures and GHA uncleanliness due to this patch. > > Additional testing: > - [x] Test failures from recorded bugs are gone > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_64 fastdebug `tier3` > - [x] GHA are clean Thanks. 
------------- PR: https://git.openjdk.java.net/jdk18/pull/69 From shade at openjdk.java.net Thu Dec 23 16:25:15 2021 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 23 Dec 2021 16:25:15 GMT Subject: [jdk18] Integrated: 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 08:01:52 GMT, Aleksey Shipilev wrote: > Summary: This reverts commit deaf75a58587f80046204de7559ff50b3b770bed. > > Roland is on vacation now, so we would not get the fix any time soon. Meanwhile, there are lots of test failures and GHA uncleanliness due to this patch. > > Additional testing: > - [x] Test failures from recorded bugs are gone > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_64 fastdebug `tier3` > - [x] GHA are clean This pull request has now been integrated. Changeset: 04ad6689 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk18/commit/04ad668921abbd71dfbc474eed6f1760f7a541b1 Stats: 200 lines in 9 files changed: 53 ins; 125 del; 22 mod 8279204: [BACKOUT] JDK-8278413: C2 crash when allocating array of size too large Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/69 From kvn at openjdk.java.net Thu Dec 23 16:48:16 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 16:48:16 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 11:37:06 GMT, Quan Anh Mai wrote: >> x86 docs say that `vcvtdq2ps` is AVX1 instruction: >> ![image](https://user-images.githubusercontent.com/5215794/147149423-10988f8d-9bcd-41ca-b728-bc2c34e3bd35.png) > > Hi, while the instruction is supported in hardware, the VM does not support the vector shape. As a result, there should not be a situation where we have a vector node with that shape on AVX1. Thank you very much. 
Why do you think that? What about this code when length is the same (and UseAVX could be 1): https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/x86/x86.ad#L7069 ------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From duke at openjdk.java.net Thu Dec 23 17:07:10 2021 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 23 Dec 2021 17:07:10 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 16:44:54 GMT, Vladimir Kozlov wrote: >> Hi, while the instruction is supported in hardware, the VM does not support the vector shape. As a result, there should not be a situation where we have a vector node with that shape on AVX1. Thank you very much. > > Why do you think that? What about this code when length is the same (and UseAVX could be 1): > https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/x86/x86.ad#L7069 Because the input of `VectorCastI2XNode` would be a vector node of type int and size 256, this shape is not supported on AVX1. So, we should not have this node because its input should not appear. 
-------------

PR: https://git.openjdk.java.net/jdk18/pull/46

From vlivanov at openjdk.java.net Thu Dec 23 18:35:21 2021
From: vlivanov at openjdk.java.net (Vladimir Ivanov)
Date: Thu, 23 Dec 2021 18:35:21 GMT
Subject: [jdk18] RFR: 8275638: GraphKit::combine_exception_states fails with "matching stack sizes" assert
In-Reply-To:
References:
Message-ID: <2QugcAUnrOmCCsX2G3CkxiXLx8sHQSDdK8ZkeFMS3AI=.3ead5313-f6cb-4a3e-a0c2-3e286fe029ab@github.com>

On Wed, 15 Dec 2021 10:24:59 GMT, Roland Westrelin wrote:

> The bug and fix were discussed in a previous PR:
>
> https://github.com/openjdk/jdk/pull/6572
>
> I pushed all commits from that PR on top of jdk 18 and added a couple
> extra tests as suggested in:
>
> https://github.com/openjdk/jdk/pull/6572#issuecomment-994086590

I'm late to the party, but still would like to clarify one thing.

It seems the root cause of the bug comes from the fact that the same JVM state is used by both `GraphKit::uncommon_trap()` and `GraphKit::builtin_throw()` while `GraphKit::null_check_receiver_before_call()` deliberately adjusts the state to please the former case.

If the original state (after the call) is used for `GraphKit::builtin_throw()`, it should fix the bug as well, shouldn't it?

  // Do a null check on the receiver as it would happen before the call to
  // callee (with all arguments still on the stack).
  Node* null_check_receiver_before_call(ciMethod* callee) {
    assert(!callee->is_static(), "must be a virtual method");
    // Callsite signature can be different from actual method being called (i.e _linkTo* sites).
    // Use callsite signature always.
    ciMethod* declared_method = method()->get_method_at_bci(bci());
    const int nargs = declared_method->arg_size();
    inc_sp(nargs);
    Node* n = null_check_receiver();
    dec_sp(nargs);
    return n;
  }

-------------

PR: https://git.openjdk.java.net/jdk18/pull/29

From kvn at openjdk.java.net Thu Dec 23 19:11:13 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 23 Dec 2021 19:11:13 GMT
Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 22 Dec 2021 18:23:50 GMT, Quan Anh Mai wrote:

>> This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   add test

All questions were answered. I will run testing before approval.

-------------

PR: https://git.openjdk.java.net/jdk18/pull/46

From kvn at openjdk.java.net Thu Dec 23 19:11:13 2021
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 23 Dec 2021 19:11:13 GMT
Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 23 Dec 2021 17:04:21 GMT, Quan Anh Mai wrote:

>> Why do you think that? What about this code when length is the same (and UseAVX could be 1):
>> https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/x86/x86.ad#L7069
>
> Because the input of `VectorCastI2XNode` would be a vector node of type int and size 256, this shape is not supported on AVX1. So, we should not have this node because its input should not appear.

Right.
------------- PR: https://git.openjdk.java.net/jdk18/pull/46 From iveresov at openjdk.java.net Thu Dec 23 19:28:15 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Thu, 23 Dec 2021 19:28:15 GMT Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty [v2] In-Reply-To: References: Message-ID: On Thu, 23 Dec 2021 11:52:57 GMT, Yi Yang wrote: > I think we should at least fix/find such illegal use in state merging rather than LIR generation, that's far beyond where problem occurs To find it you basically need another pass. I don't want to introduce another pass just to find an extremely rare situation. Piggybacking on the LIR generation pass is the most cache-friendly place I can think of. In fact, there is already a related bailout in `move_to_phi()`. I also don't think that this is necessarily even fixable. We artificially stretch the local liveness and that with conjunction with irreducible loops can create unmergable states. I think bailing out is appropriate. It is a very rare case. ------------- PR: https://git.openjdk.java.net/jdk/pull/6683 From kvn at openjdk.java.net Thu Dec 23 21:10:12 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 23 Dec 2021 21:10:12 GMT Subject: [jdk18] RFR: 8278948: compiler/vectorapi/reshape/TestVectorCastAVX1.java crashes in assembler [v2] In-Reply-To: References: Message-ID: On Wed, 22 Dec 2021 18:23:50 GMT, Quan Anh Mai wrote: >> This patch fixes a crash spotted in `compiler/vectorapi/reshape/TestVectorCastAVX1.java` in mainline. The reason for the failure is the incorrect vector encoding of integer promotion operation leads to unsupported instruction `vpmovsxbd/vpmovsxwd ymm, xmm` on AVX1. For the same reason we currently cannot cast a short or byte vector to a 256-bit float vector on AVX1, so I also fixed that. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add test Testing tier1-3 passed. 
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/46 From haosun at openjdk.java.net Fri Dec 24 01:34:50 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 24 Dec 2021 01:34:50 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR [v2] In-Reply-To: References: Message-ID: > In ARM32, "VSHL (register)" instruction [1] is shared by vector left > shift and vector right shift, and the condition to distinguish them is > whether the shift count value is positve or negative. Hence, negation > operation is needed before conducting vector right shift. > > For vector right shift, the shift count can be a RShiftCntV or a normal > vector node. Take test case Byte64VectorTests.java [2][3] as an example. > Note that RShiftCntV is already negated via rules "vsrcntD" and > "vsrcntX" whereas the normal vector node is NOT, since we don't know > whether a normal vector node is used as a vector shift count or not. > This is the root cause for these vector test failures. > > The fix is simple, moving the negation from "vsrcntD|X" to the > corresponding vector right shift rules. > > Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift > related rules are in form of "vsAABB_CC", where > 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic > right shift). > 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) > and 2L (long type). > 3) CC can be reg (register case) and immI (immediate case). > > Minor updates: > 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules > conduct the same duplication operation now. > 2) Update the "match" primitive for vsraBB_immI rules. > 3) Style issue: remove the surrounding space for "ins_pipe" primitive. > > Tests: > We ran tier 1~3 tests on ARM32 platform. With this patch, previously > failed vector test cases can pass now without introducing test > regression. 
> > [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en > [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 > [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 Hao Sun has updated the pull request incrementally with one additional commit since the last revision: Use is_var_shift() to determmine the location of negation use for right shifts Method is_var_shift() denotes that vector shift count is a variable shift: 1) for this case, vector shift count should be negated before conducting right shifts. E.g., vsrl4S_reg_var rule. 2) for the opposite case, vector shift count is generated via RShiftCntV rules and is already negated there. Hence, no negation is needed. E.g., vsrl4S_reg rule. Besides, it's safe to add "hash()" and "cmp()" methods for ShiftV node. ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/41/files - new: https://git.openjdk.java.net/jdk18/pull/41/files/3d29fb2c..05dfae3a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=41&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=41&range=00-01 Stats: 435 lines in 2 files changed: 335 ins; 34 del; 66 mod Patch: https://git.openjdk.java.net/jdk18/pull/41.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/41/head:pull/41 PR: https://git.openjdk.java.net/jdk18/pull/41 From haosun at openjdk.java.net Fri Dec 24 01:40:14 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 24 Dec 2021 01:40:14 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 09:53:07 GMT, Dean Long wrote: > There seems to be an interesting history here. 
> > Method 1: negate in RShiftCntV rule Method 2: negate in RShiftV* rules > > For aarch64, the negate was moved into the shift instruction in JDK-8213134 (Method 1 --> Method 2). Then JDK-8262916 proposed to move it back out of the shift instruction again. In that PR, the opinion that Method 1 (arm32) generated better code than Method 2 (aarch64) was expressed. Now it looks like this PR proposes for arm32 to move to Method 2 like aarch64, so I suspect that there will be a performance impact. > > I think there is a simpler fix that doesn't require moving where the negate happens. JDK-8277239 fixed a similar problem by adding a flag on vector shift nodes to indicate variable shift, then checking the flag in the predicate. Perhaps arm32 could do the same? Thanks a lot for your explanation. I do agree with your solution. Updated the code: 1) using is_var_shift() to determine to put "negation" in RShiftCntV or RShiftV* rules, and 2) adding `cmp` and `hash` for ShiftV node. Could you please take a look at the latest code when you have a chance? Maybe after the new year :) Thanks. ------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From haosun at openjdk.java.net Fri Dec 24 01:48:19 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Fri, 24 Dec 2021 01:48:19 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR [v2] In-Reply-To: References: Message-ID: <6oOEbjkVl2mzEsapmcSPL57TXwkunlkui6rZLZ88o_I=.738283d6-298f-4880-9005-d16216944065@github.com> On Fri, 24 Dec 2021 01:34:50 GMT, Hao Sun wrote: >> In ARM32, "VSHL (register)" instruction [1] is shared by vector left >> shift and vector right shift, and the condition to distinguish them is >> whether the shift count value is positve or negative. Hence, negation >> operation is needed before conducting vector right shift. >> >> For vector right shift, the shift count can be a RShiftCntV or a normal >> vector node. Take test case Byte64VectorTests.java [2][3] as an example. 
>> Note that RShiftCntV is already negated via rules "vsrcntD" and >> "vsrcntX" whereas the normal vector node is NOT, since we don't know >> whether a normal vector node is used as a vector shift count or not. >> This is the root cause for these vector test failures. >> >> The fix is simple, moving the negation from "vsrcntD|X" to the >> corresponding vector right shift rules. >> >> Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift >> related rules are in form of "vsAABB_CC", where >> 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic >> right shift). >> 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) >> and 2L (long type). >> 3) CC can be reg (register case) and immI (immediate case). >> >> Minor updates: >> 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules >> conduct the same duplication operation now. >> 2) Update the "match" primitive for vsraBB_immI rules. >> 3) Style issue: remove the surrounding space for "ins_pipe" primitive. >> >> Tests: >> We ran tier 1~3 tests on ARM32 platform. With this patch, previously >> failed vector test cases can pass now without introducing test >> regression. >> >> [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en >> [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 >> [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Use is_var_shift() to determmine the location of negation use for right shifts > > Method is_var_shift() denotes that vector shift count is a variable > shift: > 1) for this case, vector shift count should be negated before conducting > right shifts. E.g., vsrl4S_reg_var rule. 
> 2) for the opposite case, vector shift count is generated via RShiftCntV > rules and is already negated there. Hence, no negation is needed. > E.g., vsrl4S_reg rule. > > Besides, it's safe to add "hash()" and "cmp()" methods for ShiftV node. Tests on the latest code: We ran tier 1~3 tests on linux+ARM32 platform. With this patch, previously failed vector test cases can pass now without introducing test regression. ------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From eliu at openjdk.java.net Fri Dec 24 02:59:12 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Fri, 24 Dec 2021 02:59:12 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. 
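To make the role of the negation concrete, here is a minimal sketch in Java (not HotSpot or ARM code) that models a single signed 32-bit lane with the behaviour described above for "VSHL (register)": a non-negative count shifts left, a negative count shifts right. The class and method names are hypothetical, and the model assumes shift counts in the range 0..31.

```java
// Hypothetical single-lane model; names are illustrative only.
final class VshlLaneModel {
    // Models the shared shift: count >= 0 shifts left,
    // count < 0 shifts right (arithmetically) by -count.
    static int vshlSigned(int value, int count) {
        return count >= 0 ? value << count : value >> -count;
    }

    // An arithmetic right shift expressed through the shared shift.
    // The count has to be negated first; skipping this step for a
    // variable shift count turns the right shift into a left shift.
    static int shiftRightViaVshl(int value, int count) {
        return vshlSigned(value, -count);
    }

    public static void main(String[] args) {
        System.out.println(shiftRightViaVshl(-64, 3)); // prints -8
        System.out.println(vshlSigned(-64, 3));        // prints -512 (negation missing)
    }
}
```

In this model the second call shows the failure mode: without the negation, a requested right shift by 3 is carried out as a left shift by 3.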
>> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. >> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. >> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 Is there anyone can help to sponsor this if no more reviews? ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From eliu at openjdk.java.net Fri Dec 24 03:14:18 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Fri, 24 Dec 2021 03:14:18 GMT Subject: [jdk18] Integrated: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail In-Reply-To: References: Message-ID: On Mon, 20 Dec 2021 15:35:47 GMT, Eric Liu wrote: > This bug appears intermittently and it's caused by vmaskAll_immI[1] > when the vector mask size is smaller than max predicate size of running > machine. It generates an all-true predicate without considering those > inactive bits. That may result in the wrong result of VectorMask.toLong. 
> The problematic code is as below: > > > ShortVector.SPECIES_64.MaskAll(true).toLong() > > assembly: > > ptrue p0.h <= MaskAll(true) > mov z16.h, p0/z, #1 > mov z17.h, #0 > uzp1 z16.b, z16.b, z17.b > fmov x10, d16 > orr x10, x10, x10, lsr #7 > orr x10, x10, x10, lsr #14 > orr x10, x10, x10, lsr #28 > and x10, x10, #0xff > > (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes > $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} > > Expected: > (gdb) p/x $p0 > $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} > > > > Considering MaskAll is used in VectorMask.fromLong() only for a special > case and relies on the mechanism of inline and intrinsification, even it > could be optimized out, this patch also adds test cases for MaskAll to > reproduce this issue stably. > > Also fix a small issue on register utilization for > sve_reduce_[max|min][D|F]. > > [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 > > hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled > system. > > Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb This pull request has now been integrated. 
Changeset: 6588bedc Author: Eric Liu Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk18/commit/6588bedc19ab42cec9e5bb6f13be14fb4dc5a655 Stats: 409 lines in 37 files changed: 318 ins; 0 del; 91 mod 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail Reviewed-by: njian, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From eliu at openjdk.java.net Fri Dec 24 03:34:12 2021 From: eliu at openjdk.java.net (Eric Liu) Date: Fri, 24 Dec 2021 03:34:12 GMT Subject: [jdk18] RFR: 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail [v2] In-Reply-To: References: Message-ID: On Tue, 21 Dec 2021 13:14:59 GMT, Eric Liu wrote: >> This bug appears intermittently and it's caused by vmaskAll_immI[1] >> when the vector mask size is smaller than max predicate size of running >> machine. It generates an all-true predicate without considering those >> inactive bits. That may result in the wrong result of VectorMask.toLong. >> The problematic code is as below: >> >> >> ShortVector.SPECIES_64.MaskAll(true).toLong() >> >> assembly: >> >> ptrue p0.h <= MaskAll(true) >> mov z16.h, p0/z, #1 >> mov z17.h, #0 >> uzp1 z16.b, z16.b, z17.b >> fmov x10, d16 >> orr x10, x10, x10, lsr #7 >> orr x10, x10, x10, lsr #14 >> orr x10, x10, x10, lsr #28 >> and x10, x10, #0xff >> >> (gdb) p/x $p0 # on an SVE machine with vector length as 64 in bytes >> $1 = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55} >> >> Expected: >> (gdb) p/x $p0 >> $1 = {0x55, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00} >> >> >> >> Considering MaskAll is used in VectorMask.fromLong() only for a special >> case and relies on the mechanism of inline and intrinsification, even it >> could be optimized out, this patch also adds test cases for MaskAll to >> reproduce this issue stably. >> >> Also fix a small issue on register utilization for >> sve_reduce_[max|min][D|F]. 
>> >> [1] https://github.com/openjdk/jdk18/blob/master/src/hotspot/cpu/aarch64/aarch64_sve.ad#L416 >> >> hotspot/compiler/vectorapi, jdk/incubator/vector passed on SVE enabled >> system. >> >> Change-Id: I9631f26f9232ffe7a28b74f14062d945c32fa1fb > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Id71ebe5161fac08a689ee3ec538b485f6c172186 Thanks for all reviewers:P ------------- PR: https://git.openjdk.java.net/jdk18/pull/49 From duke at openjdk.java.net Fri Dec 24 07:12:03 2021 From: duke at openjdk.java.net (Vamsi Parasa) Date: Fri, 24 Dec 2021 07:12:03 GMT Subject: RFR: 8278868: Add x86 vectorization support for Long.bitCount() [v3] In-Reply-To: References: Message-ID: > Vectorization support of Integer.bitCount() already exists but currently the same support is lacking for Long.bitCount(). Similar to the C2 PopCountVI node, we created a C2 PopCountVL node and used vpopcntq x86 instruction to enable vectorized Long.bitCount(). This patch shows 2.57x improvement in performance on a JMH micro benchmark due to x86 vectorization. 
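For readers unfamiliar with the kind of code this targets, a minimal sketch of a counted loop over `Long.bitCount()` follows. The class and method names are hypothetical and this is not the JMH benchmark from the PR; whether C2 actually vectorizes this exact loop shape depends on the hardware and on the patch, which is an assumption here.

```java
// Hypothetical example loop; not taken from the PR or its benchmark.
final class BitCountLoop {
    // Stores the population count of each input element.
    static void bitCounts(long[] in, int[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Long.bitCount(in[i]);
        }
    }

    public static void main(String[] args) {
        long[] in = {0L, 1L, -1L, 0xF0L};
        int[] out = new int[in.length];
        bitCounts(in, out);
        System.out.println(java.util.Arrays.toString(out)); // prints [0, 1, 64, 4]
    }
}
```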
Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  Use generic vector node names

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6857/files
  - new: https://git.openjdk.java.net/jdk/pull/6857/files/4567eab8..67f2a71b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6857&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6857&range=01-02

  Stats: 4 lines in 3 files changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6857.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6857/head:pull/6857

PR: https://git.openjdk.java.net/jdk/pull/6857

From mli at openjdk.java.net Fri Dec 24 08:52:15 2021
From: mli at openjdk.java.net (Hamlin Li)
Date: Fri, 24 Dec 2021 08:52:15 GMT
Subject: RFR: 8279057: Consolidate InstructionPrinter::do_BlockBegin and remove extra lines in logs of blocks with phi functions
In-Reply-To:
References:
Message-ID:

On Thu, 23 Dec 2021 12:43:13 GMT, Yi Yang wrote:

>> This PR does the following things:
>> - First, the code related to printing phi in InstructionPrinter::do_BlockBegin is a bit redundant; it could be consolidated.
>> - Second, only locals and stack values related to phi functions are printed out now.
>> - Third, there are extra blank lines in the printed log of blocks with phi functions; these blank lines would be better removed.
>
> src/hotspot/share/c1/c1_InstructionPrinter.cpp line 639:
>
>> 637:       if (v && is_phi_of_block(v, x)) {
>> 638:         if (!printed_phis_in_locals_header) {
>> 639:           output()->cr(); output()->print("Values in locals related to phi functions:");
>
> Can we keep this old style? i.e. "Locals:" and "Stacks:".

Thanks @kelthuzadx
As this patch is modifying some output content, I'm not quite sure if the labels should be modified accordingly.
I do understand your concerns. Let's see what others think about it.
------------- PR: https://git.openjdk.java.net/jdk/pull/6921 From jiefu at openjdk.java.net Fri Dec 24 10:34:44 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 24 Dec 2021 10:34:44 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations Message-ID: Hi all, Happy Christmas Day! We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. And we have made an reproducer in the JBS. Now let's discuss the reproducer. The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. But unfortunately, it failed due to the loop IR is too complicated [3] like the following. SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce Loop: N0/N0 has_sfpt Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt PredicatesOff Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. 
And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. So the key idea is: slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. So C2 can try one more slp analysis instead of returning false immediately here [4]. We have observed up to 1.7x performance improvement by our micro benchmarks. ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) Testing: - tier1 ~ tier3 on Linux/x64, no regression. Thanks. 
Best regards, Jie [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 ------------- Commit messages: - 8279258: Auto-vectorization enhancement for two-dimensional array operations Changes: https://git.openjdk.java.net/jdk/pull/6933/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279258 Stats: 140 lines in 2 files changed: 136 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/6933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933 PR: https://git.openjdk.java.net/jdk/pull/6933 From bulasevich at openjdk.java.net Fri Dec 24 12:00:20 2021 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Fri, 24 Dec 2021 12:00:20 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR [v2] In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 01:34:50 GMT, Hao Sun wrote: >> In ARM32, "VSHL (register)" instruction [1] is shared by vector left >> shift and vector right shift, and the condition to distinguish them is >> whether the shift count value is positve or negative. Hence, negation >> operation is needed before conducting vector right shift. >> >> For vector right shift, the shift count can be a RShiftCntV or a normal >> vector node. Take test case Byte64VectorTests.java [2][3] as an example. >> Note that RShiftCntV is already negated via rules "vsrcntD" and >> "vsrcntX" whereas the normal vector node is NOT, since we don't know >> whether a normal vector node is used as a vector shift count or not. >> This is the root cause for these vector test failures. 
>> >> The fix is simple, moving the negation from "vsrcntD|X" to the >> corresponding vector right shift rules. >> >> Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift >> related rules are in form of "vsAABB_CC", where >> 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic >> right shift). >> 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) >> and 2L (long type). >> 3) CC can be reg (register case) and immI (immediate case). >> >> Minor updates: >> 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules >> conduct the same duplication operation now. >> 2) Update the "match" primitive for vsraBB_immI rules. >> 3) Style issue: remove the surrounding space for "ins_pipe" primitive. >> >> Tests: >> We ran tier 1~3 tests on ARM32 platform. With this patch, previously >> failed vector test cases can pass now without introducing test >> regression. >> >> [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en >> [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 >> [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 > > Hao Sun has updated the pull request incrementally with one additional commit since the last revision: > > Use is_var_shift() to determmine the location of negation use for right shifts > > Method is_var_shift() denotes that vector shift count is a variable > shift: > 1) for this case, vector shift count should be negated before conducting > right shifts. E.g., vsrl4S_reg_var rule. > 2) for the opposite case, vector shift count is generated via RShiftCntV > rules and is already negated there. Hence, no negation is needed. > E.g., vsrl4S_reg rule. > > Besides, it's safe to add "hash()" and "cmp()" methods for ShiftV node. The change is Ok for me (I am not a reviewer). 
Though I would avoid changing the style. If we want to change a style, I would suggest to change all the entries (there is a lot of surrounding spaces remaining in a file), and I think FIXME for ins_cost/ins_pipe are outdated as well. ------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From snazarki at openjdk.java.net Fri Dec 24 16:31:43 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Fri, 24 Dec 2021 16:31:43 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers Message-ID: Several regression tests are failed on arm32 CPU if tiered compilation is enabled. The list includes java/math/BigDecimal/DivideMcTests java/util/Arrays/Sorting.java java/util/Arrays/SortingNearlySortedPrimitive.java java/util/concurrent/tck/JSR166TestCase java/util/stream/SliceOpTest.java etc It appears C1 comp_op for long operands destroys arguments registers: void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { .... Register ylo = opr2->as_register_lo(); Register yhi = opr2->as_register_hi(); if (condition == lir_cond_equal || condition == lir_cond_notEqual) { __ teq(xhi, yhi); __ teq(xlo, ylo, eq); } else { __ subs(xlo, xlo, ylo); // <<< incorrect __ sbcs(xhi, xhi, yhi); // <<< incorrect } ... 
} Tested with hotspot_tier2, jdk_tier3 on linux_arm ------------- Commit messages: - 8279225: C1 longs comparison operation destroys argument registers Changes: https://git.openjdk.java.net/jdk/pull/6934/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6934&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279225 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/6934.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6934/head:pull/6934 PR: https://git.openjdk.java.net/jdk/pull/6934 From snazarki at openjdk.java.net Fri Dec 24 16:31:44 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Fri, 24 Dec 2021 16:31:44 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 16:25:37 GMT, Sergey Nazarkin wrote: > Several regression tests are failed on arm32 CPU if tiered compilation is enabled. > > The list includes > java/math/BigDecimal/DivideMcTests > java/util/Arrays/Sorting.java > java/util/Arrays/SortingNearlySortedPrimitive.java > java/util/concurrent/tck/JSR166TestCase > java/util/stream/SliceOpTest.java > etc > > It appears C1 comp_op for long operands destroys arguments registers: > void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { > .... > Register ylo = opr2->as_register_lo(); > Register yhi = opr2->as_register_hi(); > if (condition == lir_cond_equal || condition == lir_cond_notEqual) { > __ teq(xhi, yhi); > __ teq(xlo, ylo, eq); > } else { > __ subs(xlo, xlo, ylo); // <<< incorrect > __ sbcs(xhi, xhi, yhi); // <<< incorrect > } > ... > } > > Tested with hotspot_tier2, jdk_tier3 on linux_arm @bulasevich could you please check this fix? 
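The clobbering described in this message can be illustrated with a small standalone model. The sketch below is not HotSpot code: `Flags`, `Long64`, `subs`, `sbcs` and the two `less_than_*` functions are hypothetical names that only mimic how a subtract-with-flags pair compares two 64-bit values held in 32-bit halves. It assumes the usual ARM convention that the carry flag means "no borrow".

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the ARM32 condition flags; this is not HotSpot code.
struct Flags { bool n = false, z = false, c = false, v = false; };

// A 64-bit value held in two 32-bit "registers".
struct Long64 { uint32_t lo, hi; };

// SUBS: computes a - b and sets the flags (C means "no borrow").
inline uint32_t subs(uint32_t a, uint32_t b, Flags& f) {
  uint32_t r = a - b;
  f.n = (r >> 31) != 0;
  f.z = (r == 0);
  f.c = (a >= b);
  f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;
  return r;
}

// SBCS: computes a - b - (C ? 0 : 1) and updates the flags.
inline uint32_t sbcs(uint32_t a, uint32_t b, Flags& f) {
  uint32_t borrow = f.c ? 0u : 1u;
  uint32_t r = a - b - borrow;
  f.c = static_cast<uint64_t>(a) >= static_cast<uint64_t>(b) + borrow;
  f.n = (r >> 31) != 0;
  f.z = (r == 0);
  f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;
  return r;
}

// Signed x < y, writing the differences back into x
// (mirrors the two lines marked "incorrect" above).
inline bool less_than_clobbering(Long64& x, const Long64& y) {
  Flags f;
  x.lo = subs(x.lo, y.lo, f);
  x.hi = sbcs(x.hi, y.hi, f);
  return f.n != f.v;  // signed "lt" condition
}

// Same comparison, but the differences go to a scratch value, so x survives.
inline bool less_than_preserving(const Long64& x, const Long64& y) {
  Flags f;
  uint32_t scratch = subs(x.lo, y.lo, f);
  scratch = sbcs(x.hi, y.hi, f);
  (void)scratch;  // only the flags matter
  return f.n != f.v;
}
```

Both functions return the same answer, but after `less_than_clobbering` the first operand holds the difference instead of its original value, which is the kind of side effect a compare operation is not expected to have.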
------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From iveresov at openjdk.java.net Fri Dec 24 19:14:15 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 24 Dec 2021 19:14:15 GMT Subject: RFR: 8271202: C1: assert(false) failed: live_in set of first block must be empty [v2] In-Reply-To: References: Message-ID: On Tue, 7 Dec 2021 23:25:03 GMT, Martin Doerr wrote: >> I have written a checker which detects usage of the illegal phi function. In case of the reproducer provided in the JBS bug ("Reduced.java"), it finds the following and bails out: >> >> invalidating local 8 because of type mismatch (new_value is NULL) >> Bailing out because StoreIndexed (id 98) uses illegal phi (id 68) >> >> I haven't checked why that node uses the illegal phi. That still seems to be a bug. Maybe there's a better solution to the underlying problem, but I hope my checker is useful to analyze bugs and to make C1 more resilient. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add test. Alright, we're tight on time with P3 for 18. I'm going to put out a corresponding PR. ------------- PR: https://git.openjdk.java.net/jdk/pull/6683 From iveresov at openjdk.java.net Fri Dec 24 19:28:49 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 24 Dec 2021 19:28:49 GMT Subject: [jdk18] RFR: 8271202 C1: assert(false) failed: live_in set of first block must be empty Message-ID: The root cause seems to be because of the irreducible loops (and therefore an unusual block traversal order when inserting phis) the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk18/files/7775050/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bailout in move_phi(). 
This has also been discussed here: https://github.com/openjdk/jdk/pull/6683 Testing is clean. ------------- Commit messages: - Bailout in case live range extension produces invalid phi Changes: https://git.openjdk.java.net/jdk18/pull/73/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk18&pr=73&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8271202 Stats: 75 lines in 2 files changed: 75 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk18/pull/73.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/73/head:pull/73 PR: https://git.openjdk.java.net/jdk18/pull/73 From vlivanov at openjdk.java.net Fri Dec 24 19:51:13 2021 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 24 Dec 2021 19:51:13 GMT Subject: [jdk18] RFR: 8271202 C1: assert(false) failed: live_in set of first block must be empty In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 19:19:35 GMT, Igor Veresov wrote: > The root cause seems to be because of the irreducible loops (and therefore an unusual block traversal order when inserting phis) the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk18/files/7775050/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bailout in move_phi(). > > This has also been discussed here: https://github.com/openjdk/jdk/pull/6683 > > Testing is clean. Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.java.net/jdk18/pull/73 From bulasevich at openjdk.java.net Fri Dec 24 22:02:15 2021 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Fri, 24 Dec 2021 22:02:15 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers In-Reply-To: References: Message-ID: <1KIH3vW9JTvOXpCMNw4SUFJ06DKSxrvq1CufFOKe_iw=.e36c36cc-d3d0-46f9-a4e6-89b174b797f4@github.com> On Fri, 24 Dec 2021 16:25:37 GMT, Sergey Nazarkin wrote: > Several regression tests are failed on arm32 CPU if tiered compilation is enabled. > > The list includes > java/math/BigDecimal/DivideMcTests > java/util/Arrays/Sorting.java > java/util/Arrays/SortingNearlySortedPrimitive.java > java/util/concurrent/tck/JSR166TestCase > java/util/stream/SliceOpTest.java > etc > > It appears C1 comp_op for long operands destroys arguments registers: > void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { > .... > Register ylo = opr2->as_register_lo(); > Register yhi = opr2->as_register_hi(); > if (condition == lir_cond_equal || condition == lir_cond_notEqual) { > __ teq(xhi, yhi); > __ teq(xlo, ylo, eq); > } else { > __ subs(xlo, xlo, ylo); // <<< incorrect > __ sbcs(xhi, xhi, yhi); // <<< incorrect > } > ... > } > > Tested with hotspot_tier2, jdk_tier3 on linux_arm Good catch. The fix is Ok, thank you! ------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From kvn at openjdk.java.net Fri Dec 24 22:43:15 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 24 Dec 2021 22:43:15 GMT Subject: [jdk18] RFR: 8271202 C1: assert(false) failed: live_in set of first block must be empty In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 19:19:35 GMT, Igor Veresov wrote: > The root cause seems to be because of the irreducible loops (and therefore an unusual block traversal order when inserting phis) the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. 
I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk18/files/7775050/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bailout in move_phi(). > > This has also been discussed here: https://github.com/openjdk/jdk/pull/6683 > > Testing is clean. Agree. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk18/pull/73 From iveresov at openjdk.java.net Sat Dec 25 05:41:21 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Sat, 25 Dec 2021 05:41:21 GMT Subject: [jdk18] RFR: 8271202 C1: assert(false) failed: live_in set of first block must be empty In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 19:19:35 GMT, Igor Veresov wrote: > The root cause seems to be because of the irreducible loops (and therefore an unusual block traversal order when inserting phis) the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk18/files/7775050/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bailout in move_phi(). > > This has also been discussed here: https://github.com/openjdk/jdk/pull/6683 > > Testing is clean. Thanks, guys! 
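The kind of check proposed in this thread can be sketched on a toy value graph. This is only an illustration under stated assumptions: `Value` and `uses_illegal_input` are hypothetical names, not C1's `Phi` or `move_phi()` code, and the sketch looks only at direct inputs.

```cpp
#include <cassert>
#include <vector>

// Toy value graph; not C1's Instruction/Phi classes.
struct Value {
  bool illegal;             // true if the value has been invalidated
  std::vector<int> inputs;  // indices of operand values (empty for non-phis)
};

// Returns true if the value at index `id` is illegal itself
// or directly uses an illegal value as one of its inputs.
inline bool uses_illegal_input(const std::vector<Value>& values, int id) {
  if (values[id].illegal) return true;
  for (int in : values[id].inputs) {
    if (values[in].illegal) return true;
  }
  return false;
}
```

With index 0 standing in for i43 (legal), index 1 for 96 (illegal) and index 2 for i54 = phi(i43, 96), the check flags index 2 even though the phi itself was never marked illegal. A compiler could take that as the signal to bail out.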
------------- PR: https://git.openjdk.java.net/jdk18/pull/73 From iveresov at openjdk.java.net Sat Dec 25 05:41:23 2021 From: iveresov at openjdk.java.net (Igor Veresov) Date: Sat, 25 Dec 2021 05:41:23 GMT Subject: [jdk18] Integrated: 8271202 C1: assert(false) failed: live_in set of first block must be empty In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 19:19:35 GMT, Igor Veresov wrote: > The root cause seems to be because of the irreducible loops (and therefore an unusual block traversal order when inserting phis) the phi invalidation logic in try_merge() doesn't invalidate phis that have invalid locals as inputs. I've attached a drawing: [8271202.pdf](https://github.com/openjdk/jdk18/files/7775050/8271202.pdf). Notice that i54 = phi (i43, 96) is not invalidated even though 96 is illegal. Transitively, i43, it should be illegal too. I would propose that we add a check for that and bailout in move_phi(). > > This has also been discussed here: https://github.com/openjdk/jdk/pull/6683 > > Testing is clean. This pull request has now been integrated. 
Changeset: 54b800d5 Author: Igor Veresov URL: https://git.openjdk.java.net/jdk18/commit/54b800d56d6bc86676722ad96e87b8344606bcb7 Stats: 75 lines in 2 files changed: 75 ins; 0 del; 0 mod 8271202: C1: assert(false) failed: live_in set of first block must be empty Co-authored-by: Martin Doerr Reviewed-by: vlivanov, kvn ------------- PR: https://git.openjdk.java.net/jdk18/pull/73 From jwilhelm at openjdk.java.net Mon Dec 27 00:43:54 2021 From: jwilhelm at openjdk.java.net (Jesper Wilhelmsson) Date: Mon, 27 Dec 2021 00:43:54 GMT Subject: RFR: Merge jdk18 Message-ID: Forwardport JDK 18 -> JDK 19 ------------- Commit messages: - Merge remote-tracking branch 'jdk18/master' into Merge_jdk18 - 8271202: C1: assert(false) failed: live_in set of first block must be empty - 8279195: Document the -XX:+NeverActAsServerClassMachine flag - 8278889: AArch64: [vectorapi] VectorMaskLoadStoreTest.testMaskCast() test fail The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.java.net/?repo=jdk&pr=6936&range=00.0 - jdk18: https://webrevs.openjdk.java.net/?repo=jdk&pr=6936&range=00.1 Changes: https://git.openjdk.java.net/jdk/pull/6936/files Stats: 509 lines in 40 files changed: 418 ins; 0 del; 91 mod Patch: https://git.openjdk.java.net/jdk/pull/6936.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6936/head:pull/6936 PR: https://git.openjdk.java.net/jdk/pull/6936 From jwilhelm at openjdk.java.net Mon Dec 27 01:26:24 2021 From: jwilhelm at openjdk.java.net (Jesper Wilhelmsson) Date: Mon, 27 Dec 2021 01:26:24 GMT Subject: Integrated: Merge jdk18 In-Reply-To: References: Message-ID: On Mon, 27 Dec 2021 00:35:04 GMT, Jesper Wilhelmsson wrote: > Forwardport JDK 18 -> JDK 19 This pull request has now been integrated. 
Changeset: 4f607f2a Author: Jesper Wilhelmsson URL: https://git.openjdk.java.net/jdk/commit/4f607f2adac3798c16a62e902ba9ce0df3ab1add Stats: 509 lines in 40 files changed: 418 ins; 0 del; 91 mod Merge ------------- PR: https://git.openjdk.java.net/jdk/pull/6936 From haosun at openjdk.java.net Mon Dec 27 08:12:11 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Mon, 27 Dec 2021 08:12:11 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 16:25:37 GMT, Sergey Nazarkin wrote: > Several regression tests are failed on arm32 CPU if tiered compilation is enabled. > > The list includes > java/math/BigDecimal/DivideMcTests > java/util/Arrays/Sorting.java > java/util/Arrays/SortingNearlySortedPrimitive.java > java/util/concurrent/tck/JSR166TestCase > java/util/stream/SliceOpTest.java > etc > > It appears C1 comp_op for long operands destroys arguments registers: > void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { > .... > Register ylo = opr2->as_register_lo(); > Register yhi = opr2->as_register_hi(); > if (condition == lir_cond_equal || condition == lir_cond_notEqual) { > __ teq(xhi, yhi); > __ teq(xlo, ylo, eq); > } else { > __ subs(xlo, xlo, ylo); // <<< incorrect > __ sbcs(xhi, xhi, yhi); // <<< incorrect > } > ... > } > > Tested with hotspot_tier2, jdk_tier3 on linux_arm This fix looks good to me. (I'm not a Reviewer). One thing to remind is that the PR and JBS should use the same title. src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 1823: > 1821: __ teq(xlo, ylo, eq); > 1822: } else { > 1823: __ cmp(xlo, ylo); nit: we may want to use `__ subs(Rtemp, xlo, ylo);` here to align with the usage in match rules in arm.ad, e.g., `compL_reg_reg_LEGT`. It's okay to me if you use `cmp` anyway. ------------- Marked as reviewed by haosun (Author). 
PR: https://git.openjdk.java.net/jdk/pull/6934 From neliasso at openjdk.java.net Mon Dec 27 09:57:14 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Mon, 27 Dec 2021 09:57:14 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations In-Reply-To: References: Message-ID: <6HiL4ctmhBMv6YiiRVseQuYaqMeEIS_3fpEueaVbbFI=.92fc9898-0e64-4117-9c18-5583a7b374b8@github.com> On Fri, 24 Dec 2021 10:26:30 GMT, Jie Fu wrote: > Hi all, > > Happy Christmas Day! > > We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. > And we have made an reproducer in the JBS. > > Now let's discuss the reproducer. > The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. > > As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. > But unfortunately, it failed due to the loop IR is too complicated [3] like the following. 
> > SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head > cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] > cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) > lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) > Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Loop: N0/N0 has_sfpt > Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } > Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt > Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt > Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt > PredicatesOff > > > Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. > > And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. > C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. > But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. > > So the key idea is: > > slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. > But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. > So C2 can try one more slp analysis instead of returning false immediately here [4]. 
> > > We have observed up to 1.7x performance improvement by our micro benchmarks. > > ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) > > Testing: > - tier1 ~ tier3 on Linux/x64, no regression. > > Thanks. > Best regards, > Jie > > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 Hi Jie, I hope you have a good holiday too! Nice find, and straight forward fix too. I have one comment in the code. It's excellent that you have added a microbenchmark too. I would like to have a small regression test too, that quickly fails if this would break in the future. Perhaps something using the IR Testing framework. Best regards, Nils Eliasson src/hotspot/share/opto/loopTransform.cpp line 913: > 911: 1.2 * cl->node_count_before_unroll() < (double)_body.size()) { > 912: if (UseSuperWord && (cl->slp_max_unroll() == 0) && > 913: (cl->unrolled_count() - 1) * (100.0 / LoopPercentProfileLimit) <= cl->profile_trip_cnt()) { On line 911 and 913 this is repeated: "(X - 1) * (100.0 / LoopPercentProfileLimit) > cl->profile_trip_cnt()" Please replace that with a method. ------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From aph at openjdk.java.net Mon Dec 27 10:55:12 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 27 Dec 2021 10:55:12 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers In-Reply-To: References: Message-ID: On Mon, 27 Dec 2021 08:07:37 GMT, Hao Sun wrote: >> Several regression tests are failed on arm32 CPU if tiered compilation is enabled. 
>> >> The list includes >> java/math/BigDecimal/DivideMcTests >> java/util/Arrays/Sorting.java >> java/util/Arrays/SortingNearlySortedPrimitive.java >> java/util/concurrent/tck/JSR166TestCase >> java/util/stream/SliceOpTest.java >> etc >> >> It appears C1 comp_op for long operands destroys arguments registers: >> void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { >> .... >> Register ylo = opr2->as_register_lo(); >> Register yhi = opr2->as_register_hi(); >> if (condition == lir_cond_equal || condition == lir_cond_notEqual) { >> __ teq(xhi, yhi); >> __ teq(xlo, ylo, eq); >> } else { >> __ subs(xlo, xlo, ylo); // <<< incorrect >> __ sbcs(xhi, xhi, yhi); // <<< incorrect >> } >> ... >> } >> >> Tested with hotspot_tier2, jdk_tier3 on linux_arm > > src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 1823: > >> 1821: __ teq(xlo, ylo, eq); >> 1822: } else { >> 1823: __ cmp(xlo, ylo); > > nit: we may want to use `__ subs(Rtemp, xlo, ylo);` here to align with the usage in match rules in arm.ad, e.g., `compL_reg_reg_LEGT`. It's okay to me if you use `cmp` anyway. I agree. ------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From snazarki at openjdk.java.net Mon Dec 27 11:47:47 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Mon, 27 Dec 2021 11:47:47 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers [v2] In-Reply-To: References: Message-ID: > Several regression tests are failed on arm32 CPU if tiered compilation is enabled. > > The list includes > java/math/BigDecimal/DivideMcTests > java/util/Arrays/Sorting.java > java/util/Arrays/SortingNearlySortedPrimitive.java > java/util/concurrent/tck/JSR166TestCase > java/util/stream/SliceOpTest.java > etc > > It appears C1 comp_op for long operands destroys arguments registers: > void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { > .... 
> Register ylo = opr2->as_register_lo(); > Register yhi = opr2->as_register_hi(); > if (condition == lir_cond_equal || condition == lir_cond_notEqual) { > __ teq(xhi, yhi); > __ teq(xlo, ylo, eq); > } else { > __ subs(xlo, xlo, ylo); // <<< incorrect > __ sbcs(xhi, xhi, yhi); // <<< incorrect > } > ... > } > > Tested with hotspot_tier2, jdk_tier3 on linux_arm Sergey Nazarkin has updated the pull request incrementally with one additional commit since the last revision: Align C1 long cmp with match rules in arm.ad ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6934/files - new: https://git.openjdk.java.net/jdk/pull/6934/files/53916e49..3ee6f5c9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6934&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6934&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6934.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6934/head:pull/6934 PR: https://git.openjdk.java.net/jdk/pull/6934 From snazarki at openjdk.java.net Mon Dec 27 11:47:50 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Mon, 27 Dec 2021 11:47:50 GMT Subject: RFR: 8279225: C1 longs comparison operation destroys argument registers [v2] In-Reply-To: References: Message-ID: <2X8YP-xsQY23n23p--ZjhBoGdNGECmx1Dn9hYoiOC80=.4dcf2714-190b-41b4-8017-31add573f58d@github.com> On Mon, 27 Dec 2021 10:52:33 GMT, Andrew Haley wrote: >> src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 1823: >> >>> 1821: __ teq(xlo, ylo, eq); >>> 1822: } else { >>> 1823: __ cmp(xlo, ylo); >> >> nit: we may want to use `__ subs(Rtemp, xlo, ylo);` here to align with the usage in match rules in arm.ad, e.g., `compL_reg_reg_LEGT`. It's okay to me if you use `cmp` anyway. > > I agree. 
Aligned with C2 implementation ------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From jiefu at openjdk.java.net Mon Dec 27 14:41:58 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 27 Dec 2021 14:41:58 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: References: Message-ID: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> > Hi all, > > Happy Christmas Day! > > We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. > And we have made an reproducer in the JBS. > > Now let's discuss the reproducer. > The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. > > As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. > But unfortunately, it failed due to the loop IR is too complicated [3] like the following. > > SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head > cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] > cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) > lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) > Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Loop: N0/N0 has_sfpt > Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } > Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt > Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt > Loop: N857/N877 
counted [int,int),+1 (4 iters) post has_sfpt > PredicatesOff > > > Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. > > And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. > C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. > But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. > > So the key idea is: > > slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. > But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. > So C2 can try one more slp analysis instead of returning false immediately here [4]. > > > We have observed up to 1.7x performance improvement by our micro benchmarks. > > ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) > > Testing: > - tier1 ~ tier3 on Linux/x64, no regression. > > Thanks. > Best regards, > Jie > > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Address review comments - Merge branch 'master' into JDK-8279258 - 8279258: Auto-vectorization enhancement for two-dimensional array operations ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6933/files - new: https://git.openjdk.java.net/jdk/pull/6933/files/7f4fc92e..1a9b4c84 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=00-01 Stats: 598 lines in 58 files changed: 490 ins; 1 del; 107 mod Patch: https://git.openjdk.java.net/jdk/pull/6933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933 PR: https://git.openjdk.java.net/jdk/pull/6933 From jiefu at openjdk.java.net Mon Dec 27 14:52:10 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 27 Dec 2021 14:52:10 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: <6HiL4ctmhBMv6YiiRVseQuYaqMeEIS_3fpEueaVbbFI=.92fc9898-0e64-4117-9c18-5583a7b374b8@github.com> References: <6HiL4ctmhBMv6YiiRVseQuYaqMeEIS_3fpEueaVbbFI=.92fc9898-0e64-4117-9c18-5583a7b374b8@github.com> Message-ID: On Mon, 27 Dec 2021 09:54:30 GMT, Nils Eliasson wrote: > Hi Jie, > > I hope you have a good holiday too! > > Nice find, and straight forward fix too. I have one comment in the code. > > It's excellent that you have added a microbenchmark too. I would like to have a small regression test too, that quickly fails if this would break in the future. Perhaps something using the IR Testing framework. > Thanks @neliasso ! A regression test has been added and your comment has been addressed. Thanks. 
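For the review comment about the repeated expression `(X - 1) * (100.0 / LoopPercentProfileLimit) > cl->profile_trip_cnt()`, a helper along the following lines is one possible shape. The name `exceeds_profile_trip_count` and its parameter list are assumptions made for illustration; the actual patch may factor this differently.

```cpp
#include <cassert>

// Hypothetical helper (not taken from the patch): returns true when
// (unroll_count - 1) scaled by 100 / limit exceeds the profiled trip count.
// `loop_percent_profile_limit` stands in for the LoopPercentProfileLimit flag.
inline bool exceeds_profile_trip_count(int unroll_count,
                                       double loop_percent_profile_limit,
                                       double profile_trip_cnt) {
  return (unroll_count - 1) * (100.0 / loop_percent_profile_limit) > profile_trip_cnt;
}
```

With an arbitrary limit of 25 and a profiled trip count of 20, an unroll count of 8 gives 7 * 4 = 28, which exceeds 20, while an unroll count of 4 gives 3 * 4 = 12, which does not.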
------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From haosun at openjdk.java.net Tue Dec 28 00:38:12 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Tue, 28 Dec 2021 00:38:12 GMT Subject: RFR: 8279225: [arm32] C1 longs comparison operation destroys argument registers [v2] In-Reply-To: References: Message-ID: On Mon, 27 Dec 2021 11:47:47 GMT, Sergey Nazarkin wrote: >> Several regression tests are failed on arm32 CPU if tiered compilation is enabled. >> >> The list includes >> java/math/BigDecimal/DivideMcTests >> java/util/Arrays/Sorting.java >> java/util/Arrays/SortingNearlySortedPrimitive.java >> java/util/concurrent/tck/JSR166TestCase >> java/util/stream/SliceOpTest.java >> etc >> >> It appears C1 comp_op for long operands destroys arguments registers: >> void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { >> .... >> Register ylo = opr2->as_register_lo(); >> Register yhi = opr2->as_register_hi(); >> if (condition == lir_cond_equal || condition == lir_cond_notEqual) { >> __ teq(xhi, yhi); >> __ teq(xlo, ylo, eq); >> } else { >> __ subs(xlo, xlo, ylo); // <<< incorrect >> __ sbcs(xhi, xhi, yhi); // <<< incorrect >> } >> ... >> } >> >> Tested with hotspot_tier2, jdk_tier3 on linux_arm > > Sergey Nazarkin has updated the pull request incrementally with one additional commit since the last revision: > > Align C1 long cmp with match rules in arm.ad Thanks for your update. LGTM. (I'm not a Reviewer) Marked as reviewed by haosun (Author). 
------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From haosun at openjdk.java.net Tue Dec 28 02:32:38 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Tue, 28 Dec 2021 02:32:38 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR [v3] In-Reply-To: References: Message-ID: > In ARM32, "VSHL (register)" instruction [1] is shared by vector left > shift and vector right shift, and the condition to distinguish them is > whether the shift count value is positive or negative. Hence, negation > operation is needed before conducting vector right shift. > > For vector right shift, the shift count can be a RShiftCntV or a normal > vector node. Take test case Byte64VectorTests.java [2][3] as an example. > Note that RShiftCntV is already negated via rules "vsrcntD" and > "vsrcntX" whereas the normal vector node is NOT, since we don't know > whether a normal vector node is used as a vector shift count or not. > This is the root cause for these vector test failures. > > The fix is simple, moving the negation from "vsrcntD|X" to the > corresponding vector right shift rules. > > Affected rules are vsrlBB_reg and vsraBB_reg. Note that vector shift > related rules are in form of "vsAABB_CC", where > 1) AA can be l (left shift), rl (logical right shift) and ra (arithmetic > right shift). > 2) BB can be 8B/16B (byte type), 4S/8S (short type), 2I/4I (int type) > and 2L (long type). > 3) CC can be reg (register case) and immI (immediate case). > > Minor updates: > 1) Merge "vslcntD" and "vsrcntD" into rule "vscntD", as these two rules > conduct the same duplication operation now. > 2) Update the "match" primitive for vsraBB_immI rules. > 3) Style issue: remove the surrounding space for "ins_pipe" primitive. > > Tests: > We ran tier 1~3 tests on ARM32 platform. With this patch, previously > failed vector test cases can pass now without introducing test > regression. 
> > [1] https://developer.arm.com/documentation/ddi0406/c/Application-Level-Architecture/Instruction-Details/Alphabetical-list-of-instructions/VSHL--register-?lang=en > [2] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2237 > [3] https://github.com/openjdk/jdk/blame/master/test/jdk/jdk/incubator/vector/Byte64VectorTests.java#L2425 Hao Sun has updated the pull request incrementally with one additional commit since the last revision: Make minimal updates to existing rules 1. logical left shift rules a). add is_var_shift check for vslAA_immI rules. b). for vslAA_reg rules, remove the matching for URShiftV cases as we have the separate logical right shift rules now. 2. logical right shift rules a). add vsrlAA_reg and vsrlAA_reg_var rules. b). add is_var_shift check for vsrlAA_immI rules. 3. arithmetic right shift rules a). add is_var_shift check for vsraAA_reg rules. b). add vsraAA_reg_var rules c). for vsraAA_immI rules, add is_var_shift check and update the match primitive. Code style issues(FIXME and the surrounding space in ins_pipe): 1. for modified rules, keep it as it was 2. 
for newly added rules, update the style ------------- Changes: - all: https://git.openjdk.java.net/jdk18/pull/41/files - new: https://git.openjdk.java.net/jdk18/pull/41/files/05dfae3a..566efefe Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk18&pr=41&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk18&pr=41&range=01-02 Stats: 247 lines in 1 file changed: 111 ins; 36 del; 100 mod Patch: https://git.openjdk.java.net/jdk18/pull/41.diff Fetch: git fetch https://git.openjdk.java.net/jdk18 pull/41/head:pull/41 PR: https://git.openjdk.java.net/jdk18/pull/41 From haosun at openjdk.java.net Tue Dec 28 02:32:39 2021 From: haosun at openjdk.java.net (Hao Sun) Date: Tue, 28 Dec 2021 02:32:39 GMT Subject: [jdk18] RFR: 8278267: ARM32: several vector test failures for ASHR [v2] In-Reply-To: References: Message-ID: <6Bxl6VyhPdhRh4ZVn3elab4G_gVOA8bVeF60JGTCxNA=.88601406-0864-4ffc-9cde-684f57462a1c@github.com> On Fri, 24 Dec 2021 11:56:54 GMT, Boris Ulasevich wrote: > The change is Ok for me (I am not a reviewer). Though I would avoid changing the style. If we want to change a style, I would suggest to change all the entries (there is a lot of surrounding spaces remaining in a file), and I think FIXME for ins_cost/ins_pipe are outdated as well. Thanks for your review. Yes. We may want to use another RFE to change the style issue for the whole file. Hence, in the latest commit I made minimal updates to existing rules, and only updated the style in the newly added rules. 
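As a side note on the root cause described in this thread (one shift instruction serving both directions, selected by the sign of the count), here is a small Java model of a single signed 32-bit lane. It is a sketch under stated assumptions: the names are hypothetical, counts are limited to the range -31..31, and the behaviour of the real instruction for larger counts or other lane widths is not modelled.

```java
final class VshlSketch {
    /**
     * One signed 32-bit lane of a register-count shift:
     * a non-negative count shifts left, a negative count shifts right (arithmetic).
     */
    static int shiftByRegister(int lane, int count) {
        if (count < -31 || count > 31) {
            throw new IllegalArgumentException("sketch handles |count| <= 31 only");
        }
        return count >= 0 ? lane << count : lane >> -count;
    }

    /** Arithmetic right shift of every lane: negate each count, then use the shared shift. */
    static int[] shiftRightArithmetic(int[] lanes, int[] counts) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = shiftByRegister(lanes[i], -counts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] shifted = shiftRightArithmetic(new int[] {-16, 64}, new int[] {2, 3});
        System.out.println(shifted[0] + " " + shifted[1]);
        // Forgetting the negation turns the intended right shift into a left shift.
        System.out.println(shiftByRegister(-16, 2));
    }
}
```

The last call in `main` shows the failure mode discussed above: a count vector that was never negated produces a left shift where a right shift was intended.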
------------- PR: https://git.openjdk.java.net/jdk18/pull/41 From neliasso at openjdk.java.net Tue Dec 28 10:42:11 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Tue, 28 Dec 2021 10:42:11 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Mon, 27 Dec 2021 14:41:58 GMT, Jie Fu wrote: >> Hi all, >> >> Happy Christmas Day! >> >> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. >> And we have made an reproducer in the JBS. >> >> Now let's discuss the reproducer. >> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. >> >> As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. >> But unfortunately, it failed due to the loop IR is too complicated [3] like the following. 
>> >> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head >> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] >> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) >> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) >> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Loop: N0/N0 has_sfpt >> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } >> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt >> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt >> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt >> PredicatesOff >> >> >> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. >> >> And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. >> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. >> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. >> >> So the key idea is: >> >> slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. >> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. 
>> So C2 can try one more slp analysis instead of returning false immediately here [4]. >> >> >> We have observed up to 1.7x performance improvement by our micro benchmarks. >> >> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) >> >> Testing: >> - tier1 ~ tier3 on Linux/x64, no regression. >> >> Thanks. >> Best regards, >> Jie >> >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8279258 > - 8279258: Auto-vectorization enhancement for two-dimensional array operations Excellent! I will run some tests and if they pass I am ready to approve this. Regards, Nils ------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From aph at openjdk.java.net Tue Dec 28 11:02:10 2021 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 28 Dec 2021 11:02:10 GMT Subject: RFR: 8279225: [arm32] C1 longs comparison operation destroys argument registers [v2] In-Reply-To: References: Message-ID: On Mon, 27 Dec 2021 11:47:47 GMT, Sergey Nazarkin wrote: >> Several regression tests are failed on arm32 CPU if tiered compilation is enabled. 
>> >> The list includes >> java/math/BigDecimal/DivideMcTests >> java/util/Arrays/Sorting.java >> java/util/Arrays/SortingNearlySortedPrimitive.java >> java/util/concurrent/tck/JSR166TestCase >> java/util/stream/SliceOpTest.java >> etc >> >> It appears C1 comp_op for long operands destroys arguments registers: >> void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { >> .... >> Register ylo = opr2->as_register_lo(); >> Register yhi = opr2->as_register_hi(); >> if (condition == lir_cond_equal || condition == lir_cond_notEqual) { >> __ teq(xhi, yhi); >> __ teq(xlo, ylo, eq); >> } else { >> __ subs(xlo, xlo, ylo); // <<< incorrect >> __ sbcs(xhi, xhi, yhi); // <<< incorrect >> } >> ... >> } >> >> Tested with hotspot_tier2, jdk_tier3 on linux_arm > > Sergey Nazarkin has updated the pull request incrementally with one additional commit since the last revision: > > Align C1 long cmp with match rules in arm.ad Marked as reviewed by aph (Reviewer). OK. ------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From snazarki at openjdk.java.net Tue Dec 28 11:32:14 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Tue, 28 Dec 2021 11:32:14 GMT Subject: Integrated: 8279225: [arm32] C1 longs comparison operation destroys argument registers In-Reply-To: References: Message-ID: On Fri, 24 Dec 2021 16:25:37 GMT, Sergey Nazarkin wrote: > Several regression tests are failed on arm32 CPU if tiered compilation is enabled. > > The list includes > java/math/BigDecimal/DivideMcTests > java/util/Arrays/Sorting.java > java/util/Arrays/SortingNearlySortedPrimitive.java > java/util/concurrent/tck/JSR166TestCase > java/util/stream/SliceOpTest.java > etc > > It appears C1 comp_op for long operands destroys arguments registers: > void LIR_Assembler::comp_op(LIR_Condition condition, LIR_Opr opr1, LIR_Opr opr2, LIR_Op2* op) { > .... 
> Register ylo = opr2->as_register_lo(); > Register yhi = opr2->as_register_hi(); > if (condition == lir_cond_equal || condition == lir_cond_notEqual) { > __ teq(xhi, yhi); > __ teq(xlo, ylo, eq); > } else { > __ subs(xlo, xlo, ylo); // <<< incorrect > __ sbcs(xhi, xhi, yhi); // <<< incorrect > } > ... > } > > Tested with hotspot_tier2, jdk_tier3 on linux_arm This pull request has now been integrated. Changeset: 299022df Author: Sergey Nazarkin Committer: Alexey Bakhtin URL: https://git.openjdk.java.net/jdk/commit/299022dfacbcb49e3bc5beca8ff9b1fca1101493 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8279225: [arm32] C1 longs comparison operation destroys argument registers Reviewed-by: haosun, aph ------------- PR: https://git.openjdk.java.net/jdk/pull/6934 From snazarki at openjdk.java.net Tue Dec 28 21:00:24 2021 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Tue, 28 Dec 2021 21:00:24 GMT Subject: RFR: 8279300: [arm32] SIGILL when running GetObjectSizeIntrinsicsTest Message-ID: The fix resolves SIGILL crash (release JVM) and assert (debug JVM) during run of GetObjectSizeIntrinsicsTest. The change doesn't fix the cause (see JDK-8279301) but adds a guard for incoming constants. 
The test still fails due to the presence of the AbortVMOnCompilationFailure command-line flag; this should be resolved after JDK-8279301. ------------- Commit messages: - 8279300: [arm32] SIGILL when running GetObjectSizeIntrinsicsTest Changes: https://git.openjdk.java.net/jdk/pull/6937/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6937&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279300 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/6937.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6937/head:pull/6937 PR: https://git.openjdk.java.net/jdk/pull/6937 From neliasso at openjdk.java.net Wed Dec 29 09:25:20 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Wed, 29 Dec 2021 09:25:20 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Mon, 27 Dec 2021 14:41:58 GMT, Jie Fu wrote: >> Hi all, >> >> Happy Christmas Day! >> >> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. >> And we have made an reproducer in the JBS. >> >> Now let's discuss the reproducer. >> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. >> >> As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. >> But unfortunately, it failed due to the loop IR is too complicated [3] like the following. 
>> >> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head >> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] >> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) >> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) >> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Loop: N0/N0 has_sfpt >> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } >> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt >> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt >> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt >> PredicatesOff >> >> >> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. >> >> And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. >> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. >> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. >> >> So the key idea is: >> >> slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. >> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. 
>> So C2 can try one more slp analysis instead of returning false immediately here [4]. >> >> >> We have observed up to 1.7x performance improvement by our micro benchmarks. >> >> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) >> >> Testing: >> - tier1 ~ tier3 on Linux/x64, no regression. >> >> Thanks. >> Best regards, >> Jie >> >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8279258 > - 8279258: Auto-vectorization enhancement for two-dimensional array operations Testing tier 1 to 3 passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From home.josef.lehner at gmail.com Tue Dec 28 20:57:37 2021 From: home.josef.lehner at gmail.com (Josef Lehner) Date: Tue, 28 Dec 2021 21:57:37 +0100 Subject: JDK-8231460: Java update from 11.0.11 to 11.0.13 changes JVM code cache behavior and results in more process cpu usage and unexpected profiled nmethods memory usage Message-ID: Hi Tobias, Lutz, Thanks for your insights. 
I updated the Stackoverflow question (UPDATE 2021-12-28) with the requested data, running the jcmd command every 5 seconds: jcmd Compiler.CodeHeap_Analytics aggregate https://stackoverflow.com/questions/70086548/java-update-from-11-0-11-to-11-0-13-changes-jvm-code-cache-behavior-and-results You can access the zip file with the output right here too: https://drive.google.com/file/d/1PLCo1XnKBHaGdGpVgzHraTy1pJxs4Id9/view?usp=sharing Best regards and a happy new year, Josef From jiefu at openjdk.java.net Thu Dec 30 01:27:11 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 30 Dec 2021 01:27:11 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Wed, 29 Dec 2021 09:21:46 GMT, Nils Eliasson wrote: > Testing tier 1 to 3 passed. Thanks @neliasso for the review and testing. So are you fine with this change? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From neliasso at openjdk.java.net Thu Dec 30 09:43:22 2021 From: neliasso at openjdk.java.net (Nils Eliasson) Date: Thu, 30 Dec 2021 09:43:22 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Mon, 27 Dec 2021 14:41:58 GMT, Jie Fu wrote: >> Hi all, >> >> Happy Christmas Day! >> >> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. >> And we have made an reproducer in the JBS. >> >> Now let's discuss the reproducer. >> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. 
>> >> As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. >> But unfortunately, it failed due to the loop IR is too complicated [3] like the following. >> >> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head >> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] >> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) >> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) >> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Loop: N0/N0 has_sfpt >> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } >> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt >> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt >> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt >> PredicatesOff >> >> >> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. >> >> And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. >> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. >> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. >> >> So the key idea is: >> >> slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. 
>> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. >> So C2 can try one more slp analysis instead of returning false immediately here [4]. >> >> >> We have observed up to 1.7x performance improvement by our micro benchmarks. >> >> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) >> >> Testing: >> - tier1 ~ tier3 on Linux/x64, no regression. >> >> Thanks. >> Best regards, >> Jie >> >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8279258 > - 8279258: Auto-vectorization enhancement for two-dimensional array operations Yes. Looks good! ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6933 From kvn at openjdk.java.net Thu Dec 30 21:18:17 2021 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 30 Dec 2021 21:18:17 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Mon, 27 Dec 2021 14:41:58 GMT, Jie Fu wrote: >> Hi all, >> >> Happy Christmas Day! 
>> >> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. >> And we have made an reproducer in the JBS. >> >> Now let's discuss the reproducer. >> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. >> >> As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. >> But unfortunately, it failed due to the loop IR is too complicated [3] like the following. >> >> SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head >> cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] >> cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) >> lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) >> Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce >> Loop: N0/N0 has_sfpt >> Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } >> Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt >> Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt >> Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt >> PredicatesOff >> >> >> Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. >> >> And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. 
>> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. >> But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. >> >> So the key idea is: >> >> slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. >> But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. >> So C2 can try one more slp analysis instead of returning false immediately here [4]. >> >> >> We have observed up to 1.7x performance improvement by our micro benchmarks. >> >> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) >> >> Testing: >> - tier1 ~ tier3 on Linux/x64, no regression. >> >> Thanks. >> Best regards, >> Jie >> >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8279258 > - 8279258: Auto-vectorization enhancement for two-dimensional array operations Please wait until after the Christmas break. I am planning to look at it. You do need a second review. 
src/hotspot/share/opto/loopTransform.cpp line 912: > 910: is_residual_iters_large(future_unroll_cnt, cl) && > 911: 1.2 * cl->node_count_before_unroll() < (double)_body.size()) { > 912: if (UseSuperWord && (cl->slp_max_unroll() == 0) && UseSuperWord was already checked above. ------------- PR: https://git.openjdk.java.net/jdk/pull/6933 From jiefu at openjdk.java.net Thu Dec 30 23:25:46 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 30 Dec 2021 23:25:46 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v3] In-Reply-To: References: Message-ID: > Hi all, > > Happy Christmas Day! > > We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs. > And we have made an reproducer in the JBS. > > Now let's discuss the reproducer. > The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed. > > As for our example, C2 had tried its first slp analysis with `future_unroll_cnt=4` [2]. > But unfortunately, it failed due to the loop IR is too complicated [3] like the following. 
> > SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head > cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680] > cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10) > lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11) > Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce > Loop: N0/N0 has_sfpt > Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 } > Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt > Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt > Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt > PredicatesOff > > > Then, C2 unrolled the loop with `unroll-factor=4` and also did some other opts, which actually simplified the loop IR representation. > > And then, comes the next round of loop unrolling analysis, in which C2 would check if `future_unroll_cnt=8` [2] is OK for unrolling. > C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`. > But if we re-do the slp analysis with `future_unroll_cnt=4` before returning false, it would pass. > > So the key idea is: > > slp analysis may fail due to the loop IR is too complicated especially during the early stage of loop unrolling analysis. > But after several rounds of loop unrolling and other optimizations, it's possible that the loop IR becomes simple enough to pass the slp analysis. > So C2 can try one more slp analysis instead of returning false immediately here [4]. 
> > > We have observed up to 1.7x performance improvement by our micro benchmarks. > > ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png) > > Testing: > - tier1 ~ tier3 on Linux/x64, no regression. > > Thanks. > Best regards, > Jie > > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910 Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Remove redundant UseSuperWord check ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/6933/files - new: https://git.openjdk.java.net/jdk/pull/6933/files/1a9b4c84..3af74828 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6933&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/6933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6933/head:pull/6933 PR: https://git.openjdk.java.net/jdk/pull/6933 From jiefu at openjdk.java.net Thu Dec 30 23:25:48 2021 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 30 Dec 2021 23:25:48 GMT Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v2] In-Reply-To: References: <6Xo2LbTXntR1TIGyJ1eR7drZi0dtSY3GZ2Ya480c0ok=.6a8d0a96-4a91-4441-a10f-1bc9fc46c5f9@github.com> Message-ID: On Thu, 30 Dec 2021 21:14:55 GMT, Vladimir Kozlov wrote: > UseSuperWord was already checked above. Thanks @vnkozlov for your review. The redundant `UseSuperWord` check has been removed. 
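For context on what "two-dimensional array operations" means in the 8279258 thread, here is a minimal Java loop of that general shape. It is an illustrative assumption only: it is not the reproducer attached to the JBS issue (which is not included in this thread), the class and method names are hypothetical, and whether C2 vectorizes it depends on the JDK build and flags in use.

```java
final class TwoDimArraySketch {
    /** Element-wise c = a + b over rectangular 2D arrays; the inner loop is the vectorization candidate. */
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
    }

    public static void main(String[] args) {
        int rows = 4, cols = 65;
        double[][] a = new double[rows][cols];
        double[][] b = new double[rows][cols];
        double[][] c = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                a[i][j] = i;
                b[i][j] = j;
            }
        }
        // Repeat the call so a JIT compiler has a chance to compile the hot loop.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(c[3][64]);
    }
}
```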
-------------

PR: https://git.openjdk.java.net/jdk/pull/6933

From jiefu at openjdk.java.net Fri Dec 31 02:18:12 2021
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 31 Dec 2021 02:18:12 GMT
Subject: RFR: 8279258: Auto-vectorization enhancement for two-dimensional array operations [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 30 Dec 2021 23:25:46 GMT, Jie Fu wrote:

>> Hi all,
>>
>> Happy Christmas Day!
>>
>> We have observed that C2 fails to auto-vectorize two-dimensional array operations in our machine learning programs.
>> We have attached a reproducer to the JBS issue.
>>
>> Now let's discuss the reproducer.
>> The auto-vectorization fails due to `cl->slp_max_unroll() == 0` [1], which means the previous slp analysis never passed.
>>
>> In our example, C2 tried its first slp analysis with `future_unroll_cnt=4` [2].
>> Unfortunately, it failed because the loop IR was too complicated [3], as shown below.
>>
>>     SuperWord::transform_loop: loop too complicated, cl_exit->in(0) != lpt->_head
>>     cl_exit 823 823 CountedLoopEnd === 738 822 [[ 907 682 ]] [lt] P=0.999999, C=-1.000000 !orig=[680]
>>     cl_exit->in(0) 738 738 IfTrue === 735 [[ 823 ]] #1 !orig=[442] !jvms: DoubleArray2::test @ bci:17 (line 10)
>>     lpt->_head 1267 1267 CountedLoop === 1267 1224 682 [[ 1267 1278 1283 1284 1288 1254 1282 ]] inner stride: 2 main of N1267 !orig=[824],[748],[687] !jvms: DoubleArray2::test @ bci:30 (line 11)
>>     Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>>     RangeCheck Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>>     Unroll 4 Loop: N1267/N682 counted [int,int),+2 (65 iters) main rc has_sfpt rce
>>     Loop: N0/N0 has_sfpt
>>     Loop: N493/N463 limit_check profile_predicated predicated counted [0,int),+1 (65 iters) sfpts={ 453 }
>>     Loop: N946/N966 counted [0,int),+1 (4 iters) pre has_sfpt
>>     Loop: N1483/N682 counted [int,int),+4 (65 iters) main rc has_sfpt
>>     Loop: N857/N877 counted [int,int),+1 (4 iters) post has_sfpt
>>     PredicatesOff
>>
>>
>> Then C2 unrolled the loop with `unroll-factor=4` and also applied some other optimizations, which actually simplified the loop IR.
>>
>> Then comes the next round of loop unrolling analysis, in which C2 checks whether `future_unroll_cnt=8` [2] is OK for unrolling.
>> C2 rejected `future_unroll_cnt=8` for this example and returned false immediately [4] without doing a second slp analysis, leaving `cl->slp_max_unroll() == 0`.
>> But if we redo the slp analysis with `future_unroll_cnt=4` before returning false, it passes.
>>
>> So the key idea is:
>>
>> The slp analysis may fail because the loop IR is too complicated, especially during the early stage of loop unrolling analysis.
>> But after several rounds of loop unrolling and other optimizations, the loop IR may become simple enough to pass the slp analysis.
>> So C2 can try one more slp analysis instead of returning false immediately here [4].
>>
>>
>> We have observed up to 1.7x performance improvement in our micro benchmarks.
>>
>> ![image](https://user-images.githubusercontent.com/19923746/147344527-b4d9c0ae-c0d4-4cac-b17a-48474648b21a.png)
>>
>> Testing:
>> - tier1 ~ tier3 on Linux/x64, no regression.
>>
>> Thanks.
>> Best regards,
>> Jie
>>
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L129
>> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L908
>> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L137
>> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L910
>
> Jie Fu has updated the pull request incrementally with one additional commit since the last revision:
>
> Remove redundant UseSuperWord check

gc/metaspace/TestMetaspacePerfCounters.java#id3 failed on Linux/x86_32.
However, it passed in the previous pre-submit tests and in my local tests, so I don't think the failure was caused by this change.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6933
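
For readers without the JBS reproducer at hand, here is a minimal sketch of the kind of two-dimensional array loop this thread is about. It is not the actual reproducer: the class name `DoubleArray2Sketch`, the `add` kernel, the `filled` helper, and the array sizes are all assumptions chosen for illustration. The inner counted loop over `j` is the sort of loop C2 would be expected to consider for auto-vectorization once the slp analysis succeeds.

```java
public class DoubleArray2Sketch {
    // Hypothetical kernel (not the JBS reproducer): element-wise addition
    // over two-dimensional arrays. The inner counted loop over j is the
    // candidate for C2 auto-vectorization.
    static void add(double[][] a, double[][] b, double[][] c) {
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }
    }

    // Hypothetical helper: builds a rows x cols array where
    // element [i][j] = base + i * cols + j.
    static double[][] filled(int rows, int cols, double base) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                m[i][j] = base + i * cols + j;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        int rows = 64, cols = 64; // assumed sizes, for illustration only
        double[][] a = filled(rows, cols, 1.0);
        double[][] b = filled(rows, cols, 10.0);
        double[][] c = new double[rows][cols];
        // Call the kernel many times so the JIT has a chance to compile it with C2.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(c[rows - 1][cols - 1]);
    }
}
```

One way to look for an effect would be to run the sketch with and without `-XX:-UseSuperWord` and compare timings; whether any difference shows up depends on the JDK build and the hardware.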