From luhenry at openjdk.org Tue Nov 1 06:16:27 2022
From: luhenry at openjdk.org (Ludovic Henry)
Date: Tue, 1 Nov 2022 06:16:27 GMT
Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4]
In-Reply-To:
References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com>
Message-ID: <3Ri8tG00c3_Ks_AKRvTYyBxa6UINFqjX24aRsKr43-M=.1a793566-8645-4f75-8c45-0a57d6551791@github.com>

On Mon, 31 Oct 2022 22:06:20 GMT, Claes Redestad wrote:

>> No you don't need to, the vector loop can be calculated as:
>>
>> IntVector accumulation = IntVector.zero(INT_SPECIES);
>> for (int i = 0; i < bound; i += INT_SPECIES.length()) {
>>     IntVector current = IntVector.load(INT_SPECIES, array, i);
>>     accumulation = accumulation.mul(31**(INT_SPECIES.length())).add(current);
>> }
>> return accumulation.mul(IntVector.of(31**(INT_SPECIES.length() - 1), ..., 31**2, 31, 1)).reduce(ADD);
>>
>> Each iteration only requires a multiplication and an addition. The weight of lanes can be calculated just before the reduction operation.
>
> Ok, I can try rewriting as @merykitty suggests and compare. I'm running out of time to spend on this right now, though, so I sort of hope we can do this experiment as a follow-up RFE.

You're right, we can go forward indeed.
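For reference, the quoted lane-wise algorithm (note its `31**n` power pseudo-notation) can be checked against the scalar loop in plain Java, without the Vector API. The lane count, helper names, and `& 0xff` Latin-1 widening below are illustrative, not the PR's actual code: each "lane" accumulator is multiplied by 31^L per step, and the per-lane weights 31^(L-1) .. 31, 1 are applied once before the horizontal sum.

```java
// Scalar polynomial hash (same recurrence as StringLatin1.hashCode) vs. a
// lane-wise version modeling the quoted vector algorithm. Names are invented.
public class LaneHash {
    static int pow31(int n) {
        int p = 1;
        for (int i = 0; i < n; i++) p *= 31; // int overflow wraps, as in hashCode
        return p;
    }

    static int scalarHash(byte[] a) {
        int h = 0;
        for (byte b : a) h = 31 * h + (b & 0xff);
        return h;
    }

    // Simulates `lanes` vector lanes: every step multiplies each accumulator
    // by 31^lanes and adds the next element; the weights 31^(lanes-1) .. 1 are
    // applied just before the reduction, as merykitty described.
    static int laneHash(byte[] a, int lanes) {
        int[] acc = new int[lanes];
        int step = pow31(lanes);
        int bound = a.length - (a.length % lanes);
        for (int i = 0; i < bound; i += lanes) {
            for (int l = 0; l < lanes; l++) {
                acc[l] = acc[l] * step + (a[i + l] & 0xff);
            }
        }
        int h = 0;
        for (int l = 0; l < lanes; l++) {
            h += acc[l] * pow31(lanes - 1 - l); // lane weights
        }
        for (int i = bound; i < a.length; i++) { // scalar tail
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = "unroll the polynomial hash".getBytes();
        System.out.println(scalarHash(data) == laneHash(data, 8)); // true
    }
}
```

The equivalence holds for any lane count, including lengths that leave a scalar tail, which is why the weighting can indeed be deferred to a single step before the reduction.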
------------- PR: https://git.openjdk.org/jdk/pull/10847 From luhenry at openjdk.org Tue Nov 1 06:19:25 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Tue, 1 Nov 2022 06:19:25 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 22:06:20 GMT, Claes Redestad wrote: >> No you don't need to, the vector loop can be calculated as: >> >> IntVector accumulation = IntVector.zero(INT_SPECIES); >> for (int i = 0; i < bound; i += INT_SPECIES.length()) { >> IntVector current = IntVector.load(INT_SPECIES, array, i); >> accumulation = accumulation.mul(31**(INT_SPECIES.length())).add(current); >> } >> return accumulation.mul(IntVector.of(31**INT_SPECIES.length() - 1, ..., 31**2, 31, 1).reduce(ADD); >> >> Each iteration only requires a multiplication and an addition. The weight of lanes can be calculated just before the reduction operation. > > Ok, I can try rewriting as @merykitty suggests and compare. I'm running out of time to spend on this right now, though, so I sort of hope we can do this experiment as a follow-up RFE. @cl4es i can write the assembly and send it your way if you want ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Tue Nov 1 09:01:30 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 1 Nov 2022 09:01:30 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Tue, 1 Nov 2022 06:17:16 GMT, Ludovic Henry wrote: >> Ok, I can try rewriting as @merykitty suggests and compare. I'm running out of time to spend on this right now, though, so I sort of hope we can do this experiment as a follow-up RFE. 
> > @cl4es i can write the assembly and send it your way if you want @luhenry if you have some time to translate that snippet then feel free! ------------- PR: https://git.openjdk.org/jdk/pull/10847 From kdnilsen at openjdk.org Tue Nov 1 16:28:28 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 1 Nov 2022 16:28:28 GMT Subject: RFR: Fix assertion error with advance promotion budgeting [v2] In-Reply-To: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> References: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> Message-ID: <6HPlrnxFvHmK75wkv_mHkTkjCPF8P-kI0sD0FFr1sHk=.2f529d88-0dc6-444a-abf8-eff853eb01d4@github.com> > Round-off errors were resulting in an assertion error. Budgeting calculations are "complicated" because only regions that are fully empty may be loaned from old-gen to young-gen. This change recalculates certain values during budgeting adjustments that follow collection set selection rather than endeavoring to make changes to the values computed before collection set selection. The API is simpler as a result. 
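The round-off issue described above can be illustrated with a small sketch. The region size, method names, and tolerance here are assumptions for illustration, not the actual Shenandoah code: because only fully empty regions may be loaned, byte budgets are usable only in whole-region units, so two independently derived budgets can legitimately differ by up to a region.

```java
// Hypothetical model of whole-region loan accounting and a round-off-tolerant
// assert, sketching the fix described above. Not the actual HotSpot C++ code.
public class PromotionBudget {
    static final long REGION_BYTES = 2L * 1024 * 1024; // assumed region size

    // A byte budget converts to loanable regions by rounding down, "losing"
    // up to REGION_BYTES - 1 bytes in each such computation.
    static long regionsFor(long budgetBytes) {
        return budgetBytes / REGION_BYTES;
    }

    // An assert comparing two independently computed budgets therefore needs
    // a region-sized tolerance rather than exact equality.
    static boolean withinRoundOff(long expectedBytes, long actualBytes) {
        return Math.abs(expectedBytes - actualBytes) < REGION_BYTES;
    }

    public static void main(String[] args) {
        long budget = 5 * REGION_BYTES + 123;            // not region-aligned
        long usable = regionsFor(budget) * REGION_BYTES; // rounded down
        System.out.println(regionsFor(budget));             // 5
        System.out.println(withinRoundOff(budget, usable)); // true: off by 123
    }
}
```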
Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Allow round-off errors to impact assert ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/165/files - new: https://git.openjdk.org/shenandoah/pull/165/files/55eb39a4..e9430135 Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=165&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=165&range=00-01 Stats: 5 lines in 1 file changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/shenandoah/pull/165.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/165/head:pull/165 PR: https://git.openjdk.org/shenandoah/pull/165 From kdnilsen at openjdk.org Tue Nov 1 16:32:08 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Tue, 1 Nov 2022 16:32:08 GMT Subject: Integrated: Fix assertion error with advance promotion budgeting In-Reply-To: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> References: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> Message-ID: On Mon, 31 Oct 2022 14:31:49 GMT, Kelvin Nilsen wrote: > Round-off errors were resulting in an assertion error. Budgeting calculations are "complicated" because only regions that are fully empty may be loaned from old-gen to young-gen. This change recalculates certain values during budgeting adjustments that follow collection set selection rather than endeavoring to make changes to the values computed before collection set selection. The API is simpler as a result. This pull request has now been integrated. 
Changeset: 22ff4b91 Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/22ff4b919f7b8af0fd880f56a3a23edd701bb322 Stats: 282 lines in 2 files changed: 173 ins; 53 del; 56 mod Fix assertion error with advance promotion budgeting Reviewed-by: rkennke ------------- PR: https://git.openjdk.org/shenandoah/pull/165 From coleenp at openjdk.org Wed Nov 2 16:47:01 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 2 Nov 2022 16:47:01 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing Message-ID: Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. Tested with tier1-6. ------------- Commit messages: - 8256072: Eliminate JVMTI tagmap rehashing Changes: https://git.openjdk.org/jdk/pull/10938/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8256072 Stats: 108 lines in 12 files changed: 10 ins; 93 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10938.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10938/head:pull/10938 PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Wed Nov 2 20:44:58 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 2 Nov 2022 20:44:58 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v2] In-Reply-To: References: Message-ID: > Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. > Tested with tier1-6. Coleen Phillimore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'master' into jvmti - 8256072: Eliminate JVMTI tagmap rehashing ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10938/files - new: https://git.openjdk.org/jdk/pull/10938/files/29fb0c2f..e549dcb5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=00-01 Stats: 32521 lines in 114 files changed: 2966 ins; 29135 del; 420 mod Patch: https://git.openjdk.org/jdk/pull/10938.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10938/head:pull/10938 PR: https://git.openjdk.org/jdk/pull/10938 From kbarrett at openjdk.org Wed Nov 2 20:55:26 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Nov 2022 20:55:26 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 20:44:58 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into jvmti > - 8256072: Eliminate JVMTI tagmap rehashing Looks good. Yay code deletion! I was particularly happy to see `set_needs_rehashing` removed from all the GCs. As a followup, I think `CollectedHeap::hash_code` is unused after this change. ------------- Marked as reviewed by kbarrett (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Wed Nov 2 22:23:57 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 2 Nov 2022 22:23:57 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: > Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. > Tested with tier1-6. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Remove now-unused function that I missed. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10938/files - new: https://git.openjdk.org/jdk/pull/10938/files/e549dcb5..f214791d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=01-02 Stats: 20 lines in 6 files changed: 0 ins; 19 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10938.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10938/head:pull/10938 PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Wed Nov 2 22:25:21 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 2 Nov 2022 22:25:21 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v2] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 20:44:58 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into jvmti > - 8256072: Eliminate JVMTI tagmap rehashing Thanks for the code review Kim. I removed the function that you noticed is now unused. 
------------- PR: https://git.openjdk.org/jdk/pull/10938 From kdnilsen at openjdk.org Wed Nov 2 23:28:37 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Nov 2022 23:28:37 GMT Subject: RFR: Fix preemption of coalesce and fill Message-ID: It is necessary to check the return status of entry_coalesce_and_fill() during ShenandoahOldGeneration::prepare_gc(). Otherwise, when coalesce-and-fill is preempted before it is completed, we will reset the old-gen mark bitmap and invoke old_generation->set_mark_incomplete() before the coalesce-and-fill effort is resumed. This results in an assertion failure with debug builds, or incomplete coalesce-and-fill results with release builds. ------------- Commit messages: - Merge remote-tracking branch 'GitFarmBranch/fix-preemption-of-coalesce-and-fill' into fix-preemption-of-coalesce-and-fill - Check for preemption of coalesce-and-fill during prep for old gc Changes: https://git.openjdk.org/shenandoah/pull/166/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=166&range=00 Stats: 7 lines in 1 file changed: 3 ins; 2 del; 2 mod Patch: https://git.openjdk.org/shenandoah/pull/166.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/166/head:pull/166 PR: https://git.openjdk.org/shenandoah/pull/166 From wkemper at openjdk.org Wed Nov 2 23:28:37 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 2 Nov 2022 23:28:37 GMT Subject: RFR: Fix preemption of coalesce and fill In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 23:22:20 GMT, Kelvin Nilsen wrote: > It is necessary to check the return status of entry_coalesce_and_fill() during ShenandoahOldGeneration::prepare_gc(). > > Otherwise, when coalesce-and-fill is preempted before it is completed, we will reset the old-gen mark bitmap and invoke old_generation->set_mark_incomplete() before the coalesce-and-fill effort is resumed. This results in an assertion failure with debug builds, or incomplete coalesce-and-fill results with release builds. 
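The control flow of the fix can be sketched as a simplified Java model; the real code is HotSpot C++ in shenandoahOldGeneration.cpp, and all names and flags below are invented for illustration.

```java
// Hypothetical model: check the return status of the coalesce-and-fill entry
// point before resetting mark state, so a preempted pass can resume intact.
public class PrepareGc {
    static boolean bitmapReset = false;
    static boolean markIncomplete = false;

    // Models entry_coalesce_and_fill(): false means the work was preempted.
    static boolean entryCoalesceAndFill(boolean preempted) {
        return !preempted;
    }

    static boolean prepareGc(boolean preempted) {
        if (!entryCoalesceAndFill(preempted)) {
            return false; // preempted: resume later, bitmap state untouched
        }
        bitmapReset = true;    // models resetting the old-gen mark bitmap
        markIncomplete = true; // models old_generation->set_mark_incomplete()
        return true;
    }

    public static void main(String[] args) {
        System.out.println(prepareGc(true));  // false: preempted
        System.out.println(bitmapReset);      // false: state preserved
        System.out.println(prepareGc(false)); // true: completed
        System.out.println(bitmapReset);      // true
    }
}
```

Without the early return, the bitmap reset would run even on the preempted path, which is exactly the debug-build assertion failure the description mentions.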
Good catch! Thank you! ------------- Marked as reviewed by wkemper (Committer). PR: https://git.openjdk.org/shenandoah/pull/166 From kdnilsen at openjdk.org Wed Nov 2 23:30:38 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Nov 2022 23:30:38 GMT Subject: git: openjdk/shenandoah: master: Fix preemption of coalesce and fill Message-ID: <5bc48bea-aaea-4f72-977c-ea8a30ca3f44@openjdk.org> Changeset: 50c54581 Author: Kelvin Nilsen Date: 2022-11-02 23:30:07 +0000 URL: https://git.openjdk.org/shenandoah/commit/50c54581270be169eb5c931dc16dfb4f4c8552bc Fix preemption of coalesce and fill Reviewed-by: wkemper ! src/hotspot/share/gc/shenandoah/shenandoahOldGeneration.cpp From kdnilsen at openjdk.org Wed Nov 2 23:33:34 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 2 Nov 2022 23:33:34 GMT Subject: Integrated: Fix preemption of coalesce and fill In-Reply-To: References: Message-ID: <_k52P9tPdQO-2mQiSu8GTGW1xCohZPbpl0Lxh5L1nF0=.afddd749-9950-433e-9fd5-6cc6b41fe52a@github.com> On Wed, 2 Nov 2022 23:22:20 GMT, Kelvin Nilsen wrote: > It is necessary to check the return status of entry_coalesce_and_fill() during ShenandoahOldGeneration::prepare_gc(). > > Otherwise, when coalesce-and-fill is preempted before it is completed, we will reset the old-gen mark bitmap and invoke old_generation->set_mark_incomplete() before the coalesce-and-fill effort is resumed. This results in an assertion failure with debug builds, or incomplete coalesce-and-fill results with release builds. This pull request has now been integrated. 
Changeset: 50c54581 Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/50c54581270be169eb5c931dc16dfb4f4c8552bc Stats: 7 lines in 1 file changed: 3 ins; 2 del; 2 mod Fix preemption of coalesce and fill Reviewed-by: wkemper ------------- PR: https://git.openjdk.org/shenandoah/pull/166 From kbarrett at openjdk.org Thu Nov 3 03:44:47 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 3 Nov 2022 03:44:47 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 22:23:57 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Remove now-unused function that I missed. Looks good. Maybe just a little more deletion. src/hotspot/share/gc/z/zHeap.inline.hpp line 48: > 46: inline uint32_t ZHeap::hash_oop(uintptr_t addr) const { > 47: const uintptr_t offset = ZAddress::offset(addr); > 48: return ZHash::address_to_uint32(offset); I think removal of the call to `ZHash::address_to_uint32` means this file no longer needs to include zHash.inline.hpp. ------------- Marked as reviewed by kbarrett (Reviewer). PR: https://git.openjdk.org/jdk/pull/10938 From eosterlund at openjdk.org Thu Nov 3 05:19:13 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 3 Nov 2022 05:19:13 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 03:41:26 GMT, Kim Barrett wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove now-unused function that I missed. 
> > src/hotspot/share/gc/z/zHeap.inline.hpp line 48: > >> 46: inline uint32_t ZHeap::hash_oop(uintptr_t addr) const { >> 47: const uintptr_t offset = ZAddress::offset(addr); >> 48: return ZHash::address_to_uint32(offset); > > I think removal of the call to `ZHash::address_to_uint32` means this file no longer needs to include zHash.inline.hpp. I think you are right Kim. ------------- PR: https://git.openjdk.org/jdk/pull/10938 From eosterlund at openjdk.org Thu Nov 3 05:31:41 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 3 Nov 2022 05:31:41 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: On Wed, 2 Nov 2022 22:23:57 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Remove now-unused function that I missed. I love the use of identity hash code instead of address bits. There might be an issue with displaced markWords though where we need to be careful. src/hotspot/share/prims/jvmtiTagMapTable.cpp line 116: > 114: > 115: JvmtiTagMapEntry* JvmtiTagMapTable::find(oop obj) { > 116: if (obj->has_no_hash()) { This new function you added checks if the markWord has a hashCode. If there is a displaced markWord, then it very well might be that there is a hashCode, but it is in the displaced markWord - either in a stack lock or an ObjectMonitor. Bailing here does not seem correct, as it might actually be in the table even if there is no hashCode in the markWord. Is this an optimization? ------------- Changes requested by eosterlund (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10938

From coleenp at openjdk.org Thu Nov 3 11:50:56 2022
From: coleenp at openjdk.org (Coleen Phillimore)
Date: Thu, 3 Nov 2022 11:50:56 GMT
Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 3 Nov 2022 05:28:23 GMT, Erik Österlund wrote:

>> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Remove now-unused function that I missed.
>
> src/hotspot/share/prims/jvmtiTagMapTable.cpp line 116:
>
>> 114:
>> 115: JvmtiTagMapEntry* JvmtiTagMapTable::find(oop obj) {
>> 116:   if (obj->has_no_hash()) {
>
> This new function you added checks if the markWord has a hashCode. If there is a displaced markWord, then it very well might be that there is a hashCode, but it is in the displaced markWord - either in a stack lock or an ObjectMonitor. Bailing here does not seem correct, as it might actually be in the table even if there is no hashCode in the markWord. Is this an optimization?

It is an optimization. I don't think we want to create an identity hash for all oops just for lookup. Is there a better way to find out if an oop has a hashCode?

-------------

PR: https://git.openjdk.org/jdk/pull/10938

From coleenp at openjdk.org Thu Nov 3 12:46:00 2022
From: coleenp at openjdk.org (Coleen Phillimore)
Date: Thu, 3 Nov 2022 12:46:00 GMT
Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v4]
In-Reply-To:
References:
Message-ID:

> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table.
> Tested with tier1-6.

Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision:

  Fix has_no_hash into fast_no_hash_check().
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10938/files - new: https://git.openjdk.org/jdk/pull/10938/files/f214791d..8770d38c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=02-03 Stats: 8 lines in 4 files changed: 3 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10938.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10938/head:pull/10938 PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 12:46:03 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 12:46:03 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 11:39:51 GMT, Coleen Phillimore wrote: >> src/hotspot/share/prims/jvmtiTagMapTable.cpp line 116: >> >>> 114: >>> 115: JvmtiTagMapEntry* JvmtiTagMapTable::find(oop obj) { >>> 116: if (obj->has_no_hash()) { >> >> This new function you added checks if the markWord has a hashCode. If there is a displaced markWord, then it very well might be that there is a hashCode, but it is in the displaced markWord - either in a stack lock or an ObjectMonitor. Bailing here does not seem correct, as it might actually be in the table even if there is no hashCode in the markWord. Is this an optimization? > > It is an optimization. I don't think we want to create an identity hash for all oops just for lookup. Is there a better way to find if an oop hashCode? I was hoping there was just a bit but you're right. I renamed it to fast_no_hash_check() and only return true if the object is unlocked and added a comment. 
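The idea behind the renamed check can be modeled in Java. In HotSpot the identity hash lives in the object's markWord, which may be displaced while the object is locked; the nullable cached hash and `locked` flag below are an invented stand-in for that, showing why the fast negative answer is only trustworthy for an unlocked object.

```java
// Hypothetical Java model of fast_no_hash_check(), not the HotSpot C++ code.
public class FastNoHash {
    Integer cachedHash;  // null models "no identity hash in the markWord yet"
    boolean locked;      // models a displaced markWord (stack lock / monitor)

    int identityHash() {
        if (cachedHash == null) {
            cachedHash = System.identityHashCode(this); // computed on demand
        }
        return cachedHash;
    }

    // "Definitely has no hash, so it cannot be in the tag map." Must answer
    // false ("don't know") while locked, because the real hash bits may live
    // in the displaced header.
    boolean fastNoHashCheck() {
        return !locked && cachedHash == null;
    }

    public static void main(String[] args) {
        FastNoHash o = new FastNoHash();
        System.out.println(o.fastNoHashCheck()); // true: unlocked, never hashed
        o.locked = true;
        System.out.println(o.fastNoHashCheck()); // false: can't tell while locked
        o.locked = false;
        o.identityHash();
        System.out.println(o.fastNoHashCheck()); // false: hash now installed
    }
}
```

A table keyed by identity hash can then bail out of `find` early only when this check returns true, which is the optimization discussed above.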
------------- PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 12:46:03 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 12:46:03 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v3] In-Reply-To: References: Message-ID: <3nvtr_Oq3qlti6PwenIt2Vykd4bugYyBAufKPgWS0JI=.cfda77dd-18eb-4690-a11f-f73c0a62448f@github.com> On Thu, 3 Nov 2022 12:39:28 GMT, Coleen Phillimore wrote: >> It is an optimization. I don't think we want to create an identity hash for all oops just for lookup. Is there a better way to find if an oop hashCode? > > I was hoping there was just a bit but you're right. I renamed it to fast_no_hash_check() and only return true if the object is unlocked and added a comment. I'm rerunning jvmti and jdi tests locally. ------------- PR: https://git.openjdk.org/jdk/pull/10938 From eosterlund at openjdk.org Thu Nov 3 13:31:30 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 3 Nov 2022 13:31:30 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v4] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 12:46:00 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix has_no_hash into fast_no_hash_check(). Looks good now to me! ------------- Marked as reviewed by eosterlund (Reviewer). PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 13:53:31 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 13:53:31 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v4] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 12:46:00 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. 
If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix has_no_hash into fast_no_hash_check(). Thanks Erik and Kim for reviewing! ------------- PR: https://git.openjdk.org/jdk/pull/10938 From kbarrett at openjdk.org Thu Nov 3 17:20:37 2022 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 3 Nov 2022 17:20:37 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v4] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 12:46:00 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix has_no_hash into fast_no_hash_check(). Good thing Erik caught the displaced markword issue. Looks even better now. src/hotspot/share/oops/oop.hpp line 294: > 292: > 293: // identity hash; returns the identity hash key (computes it if necessary) > 294: inline bool fast_no_hash_check(); Seems like `fast_no_hash_check` ought to be later in this grouping. The preceding comment is about `identity_hash`. ------------- Marked as reviewed by kbarrett (Reviewer). PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 17:24:59 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 17:24:59 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v5] In-Reply-To: References: Message-ID: > Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. > Tested with tier1-6. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Fix has_no_hash into fast_no_hash_check(). 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10938/files - new: https://git.openjdk.org/jdk/pull/10938/files/8770d38c..7bda861f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10938&range=03-04 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10938.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10938/head:pull/10938 PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 17:25:02 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 17:25:02 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v4] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 17:14:39 GMT, Kim Barrett wrote: >> Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix has_no_hash into fast_no_hash_check(). > > src/hotspot/share/oops/oop.hpp line 294: > >> 292: >> 293: // identity hash; returns the identity hash key (computes it if necessary) >> 294: inline bool fast_no_hash_check(); > > Seems like `fast_no_hash_check` ought to be later in this grouping. The preceding comment is about `identity_hash`. Ok, this makes sense. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 17:28:12 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 17:28:12 GMT Subject: RFR: 8256072: Eliminate JVMTI tagmap rehashing [v5] In-Reply-To: References: Message-ID: On Thu, 3 Nov 2022 17:24:59 GMT, Coleen Phillimore wrote: >> Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. >> Tested with tier1-6. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Fix has_no_hash into fast_no_hash_check(). Thanks for the re-review Kim. 
Recompiled with the trivial change. ------------- PR: https://git.openjdk.org/jdk/pull/10938 From coleenp at openjdk.org Thu Nov 3 17:30:33 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 3 Nov 2022 17:30:33 GMT Subject: Integrated: 8256072: Eliminate JVMTI tagmap rehashing In-Reply-To: References: Message-ID: <1O2rw3n-DpaBg_p13z3wUDImRCGKpKN2tKaLfFoZP5E=.c35305bc-1a06-42a0-8c8d-b3392a22a2ce@github.com> On Tue, 1 Nov 2022 22:31:03 GMT, Coleen Phillimore wrote: > Use identity_hash for objects in the JVMTI TagMap table. If the object has no hashcode, it's not in the table. > Tested with tier1-6. This pull request has now been integrated. Changeset: 94eb25a4 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/94eb25a4f1ffb0f8c834a03101d98fbff5dd0c5c Stats: 132 lines in 18 files changed: 13 ins; 113 del; 6 mod 8256072: Eliminate JVMTI tagmap rehashing Reviewed-by: kbarrett, eosterlund ------------- PR: https://git.openjdk.org/jdk/pull/10938 From mcimadamore at openjdk.org Fri Nov 4 18:23:17 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Fri, 4 Nov 2022 18:23:17 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 17 additional commits since the last revision: - Merge branch 'master' into PR_20 - Merge branch 'master' into PR_20 - Merge pull request #14 from minborg/small-javadoc Update some javadocs - Update some javadocs - Revert some javadoc changes - Merge branch 'master' into PR_20 - Fix benchmark and test failure - Merge pull request #13 from minborg/revert-factories Revert MemorySegment factories - Update javadocs after comments - Revert MemorySegment factories - ... and 7 more: https://git.openjdk.org/jdk/compare/7eb59e41...3d933028 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/ac7733da..3d933028 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=00-01 Stats: 51371 lines in 672 files changed: 16181 ins; 32391 del; 2799 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Sat Nov 5 19:48:32 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Sat, 5 Nov 2022 19:48:32 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: References: Message-ID: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> On Fri, 4 Nov 2022 18:23:17 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 17 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Merge branch 'master' into PR_20 > - Merge pull request #14 from minborg/small-javadoc > > Update some javadocs > - Update some javadocs > - Revert some javadoc changes > - Merge branch 'master' into PR_20 > - Fix benchmark and test failure > - Merge pull request #13 from minborg/revert-factories > > Revert MemorySegment factories > - Update javadocs after comments > - Revert MemorySegment factories > - ... and 7 more: https://git.openjdk.org/jdk/compare/e1e4e45b...3d933028 Some preliminary comments about some changes I think are missing from this PR (noticed while I was making a patch for the VM changes) I will do a more thorough review after the changes from https://github.com/openjdk/panama-foreign/pull/750 are included as well. src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 474: > 472: long bbAddress = NIO_ACCESS.getBufferAddress(bb); > 473: Object base = NIO_ACCESS.getBufferBase(bb); > 474: UnmapperProxy unmapper = NIO_ACCESS.unmapper(bb); Looks like here is also missing the fix that rejects StringCharBuffer: https://github.com/openjdk/panama-foreign/pull/741 I think that is good to include as well. src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java line 477: > 475: case UNBOX_ADDRESS -> emitUnboxAddress(); > 476: case DUP -> emitDupBinding(); > 477: case CAST -> emitCast((Binding.Cast) binding); This contains the CAST binding, but not the accompanying VM changes from: https://github.com/openjdk/panama-foreign/pull/720 which removes the now dead code. 
Preferably both changes go together (and the code removal is pretty trivial, so I suggest including it here) src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java line 491: > 489: emitLoad(highLevelType, paramIndex2ParamSlot[paramIndex]); > 490: > 491: if (shouldAcquire(paramIndex)) { I can't comment on the actual line below, but this is also missing the fix from: https://github.com/openjdk/panama-foreign/pull/739 (that is a Java-only change as well). I suggest adding that as well. src/java.base/share/classes/jdk/internal/foreign/abi/x64/windows/CallArranger.java line 165: > 163: assert forArguments : "no stack returns"; > 164: // stack > 165: long alignment = Math.max(layout.byteAlignment(), STACK_SLOT_SIZE); This is also missing part of the changes from: https://github.com/openjdk/panama-foreign/pull/728/ but other changes to the shared code are present. The `layout` parameter is not needed here. (see the changes to this file in the original PR) ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 09:24:04 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 09:24:04 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> References: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> Message-ID: On Sat, 5 Nov 2022 18:02:33 GMT, Jorn Vernee wrote: >> Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 17 additional commits since the last revision: >> >> - Merge branch 'master' into PR_20 >> - Merge branch 'master' into PR_20 >> - Merge pull request #14 from minborg/small-javadoc >> >> Update some javadocs >> - Update some javadocs >> - Revert some javadoc changes >> - Merge branch 'master' into PR_20 >> - Fix benchmark and test failure >> - Merge pull request #13 from minborg/revert-factories >> >> Revert MemorySegment factories >> - Update javadocs after comments >> - Revert MemorySegment factories >> - ... and 7 more: https://git.openjdk.org/jdk/compare/d314527d...3d933028 > > src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java line 477: > >> 475: case UNBOX_ADDRESS -> emitUnboxAddress(); >> 476: case DUP -> emitDupBinding(); >> 477: case CAST -> emitCast((Binding.Cast) binding); > > This contains the CAST binding, but not the accompanying VM changes from: https://github.com/openjdk/panama-foreign/pull/720 which removes the now dead code. Preferably both changes go together (and the code removal is pretty trivial, so I suggest including it here) Why did the normalization test pass even w/o the VM changes? Is that because the VM code changes are just removing what is now dead code? ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 09:30:19 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 09:30:19 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> References: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> Message-ID: On Sat, 5 Nov 2022 18:04:38 GMT, Jorn Vernee wrote: >> Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - Merge branch 'master' into PR_20 >> - Merge branch 'master' into PR_20 >> - Merge pull request #14 from minborg/small-javadoc >> >> Update some javadocs >> - Update some javadocs >> - Revert some javadoc changes >> - Merge branch 'master' into PR_20 >> - Fix benchmark and test failure >> - Merge pull request #13 from minborg/revert-factories >> >> Revert MemorySegment factories >> - Update javadocs after comments >> - Revert MemorySegment factories >> - ... and 7 more: https://git.openjdk.org/jdk/compare/1fd35b8a...3d933028 > > src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java line 491: > >> 489: emitLoad(highLevelType, paramIndex2ParamSlot[paramIndex]); >> 490: >> 491: if (shouldAcquire(paramIndex)) { > > I can't comment on the actual line below, but this is also missing the fix from: https://github.com/openjdk/panama-foreign/pull/739 (that is a Java-only change as well). I suggest adding that as well. A few changes were missing - but the `dontrelease` test was passing... odd ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 09:42:25 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 09:42:25 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> References: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> Message-ID: <09WcJa8UMXNVYkW_eI7w-tW7fnZ6KXt1gt_4cu5Y2KE=.8be58947-1169-42e0-8c6f-45b66ff63c10@github.com> On Sat, 5 Nov 2022 18:40:56 GMT, Jorn Vernee wrote: >> Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - Merge branch 'master' into PR_20 >> - Merge branch 'master' into PR_20 >> - Merge pull request #14 from minborg/small-javadoc >> >> Update some javadocs >> - Update some javadocs >> - Revert some javadoc changes >> - Merge branch 'master' into PR_20 >> - Fix benchmark and test failure >> - Merge pull request #13 from minborg/revert-factories >> >> Revert MemorySegment factories >> - Update javadocs after comments >> - Revert MemorySegment factories >> - ... and 7 more: https://git.openjdk.org/jdk/compare/d8bb7119...3d933028 > > src/java.base/share/classes/jdk/internal/foreign/abi/x64/windows/CallArranger.java line 165: > >> 163: assert forArguments : "no stack returns"; >> 164: // stack >> 165: long alignment = Math.max(layout.byteAlignment(), STACK_SLOT_SIZE); > > This is also missing part of the changes from: https://github.com/openjdk/panama-foreign/pull/728/ but other changes to the shared code are present. The `layout` parameter is not needed here. (see the changes to this file in the original PR) Actually, this patch is missing most of the stuff in PR 728. I was under the impression that, in order to fully support that, some VM changes were needed (e.g. to have better granularity in call shuffling - as per https://github.com/openjdk/panama-foreign/pull/699). As a result, this PR only contains changes to SharedUtil (to remove unused alignment functions) - but nothing else. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 09:47:34 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 09:47:34 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v3] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with three additional commits since the last revision: - Fix mismmatched acquire/release in BindingSpecializer - Fix MemorySegment.ofBuffer when applied to StringCharBuffer - Remove VM dead code after implementation of Binding.Cast ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/3d933028..e8b95f83 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=01-02 Stats: 51 lines in 6 files changed: 12 ins; 34 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 12:29:37 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 12:29:37 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v4] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Add missing tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/e8b95f83..0c70da2c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=02-03 Stats: 162 lines in 3 files changed: 162 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Mon Nov 7 13:45:33 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 7 Nov 2022 13:45:33 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: <09WcJa8UMXNVYkW_eI7w-tW7fnZ6KXt1gt_4cu5Y2KE=.8be58947-1169-42e0-8c6f-45b66ff63c10@github.com> References: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> <09WcJa8UMXNVYkW_eI7w-tW7fnZ6KXt1gt_4cu5Y2KE=.8be58947-1169-42e0-8c6f-45b66ff63c10@github.com> Message-ID: On Mon, 7 Nov 2022 09:40:03 GMT, Maurizio Cimadamore wrote: >> src/java.base/share/classes/jdk/internal/foreign/abi/x64/windows/CallArranger.java line 165: >> >>> 163: assert forArguments : "no stack returns"; >>> 164: // stack >>> 165: long alignment = Math.max(layout.byteAlignment(), STACK_SLOT_SIZE); >> >> This is also missing part of the changes from: https://github.com/openjdk/panama-foreign/pull/728/ but other changes to the shared code are present. The `layout` parameter is not needed here. (see the changes to this file in the original PR) > > Actually, this patch is missing most of the stuff in PR 728. I was under the impression that, in order to fully support that, some VM changes were needed (e.g. 
to have better granularity in call shuffling - as per https://github.com/openjdk/panama-foreign/pull/699). As a result, this PR only contains changes to SharedUtil (to remove unused alignment functions) - but nothing else. 699 is not needed for this. 728 is a pure Java change that simply rejects layouts that don't have their natural alignment (so it will reject packed structs, for instance, since the implementation doesn't support them on all platforms). All the other changes from 728 are here (most notably the code in AbstractLinker that checks the alignment), except the change that ignores the `layout` here and turns the code around the line above into an `assert`. The mac stack spilling patch requires 699 though (https://github.com/openjdk/panama-foreign/pull/746). I will put that in the PR with the VM changes. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Mon Nov 7 13:58:14 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 7 Nov 2022 13:58:14 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v5] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. 
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 
0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 43 commits: - Merge branch 'master' into 8282664-polyhash - Merge branch 'master' into 8282664-polyhash - Change scalar unroll to 2 element stride, minding dependency chain - Require UseSSE >= 3 due transitive use of sse3 instructions from ReduceI - Reorder loops and some other suggestions from @merykitty - ws - Add ArraysHashCode microbenchmarks - Fixed vector loops for int and char arrays - Split up Arrays/HashCode tests - Fixes, optimized short inputs, temporarily disabled vector loop for Arrays.hashCode cases, added and improved tests - ... and 33 more: https://git.openjdk.org/jdk/compare/d634ddef...95c10b5f ------------- Changes: https://git.openjdk.org/jdk/pull/10847/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=04 Stats: 1151 lines in 32 files changed: 1093 ins; 32 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From mcimadamore at openjdk.org Mon Nov 7 14:06:39 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 14:06:39 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v2] In-Reply-To: References: <5pJEOFSd5uXPZ5W8_SC3dPizt8x83yjMPbtt2gmFwfA=.38b93d3f-13f3-4455-ac0a-33dbca8f44bd@github.com> <09WcJa8UMXNVYkW_eI7w-tW7fnZ6KXt1gt_4cu5Y2KE=.8be58947-1169-42e0-8c6f-45b66ff63c10@github.com> Message-ID: On Mon, 7 Nov 2022 13:43:17 GMT, Jorn Vernee wrote: >> Actually, this patch is missing most of the stuff in PR 728. I was under the impression that, in order to fully support that, some VM changes were needed (e.g. to have better granularity in call shuffling - as per https://github.com/openjdk/panama-foreign/pull/699). As a result, this PR only contains changes to SharedUtil (to remove unused alignment functions) - but nothing else. > > 699 is not needed for this. 
728 is a pure Java change that simply rejects layouts that don't have their natural alignment (so it will reject packed structs, for instance, since the implementation doesn't support them on all platforms). All the other changes from 728 are here (most notably the code in AbstractLinker that checks the alignment), except the change that ignores the `layout` here and turns the code around the line above into an `assert`. > > The mac stack spilling patch requires 699 though (https://github.com/openjdk/panama-foreign/pull/746). I will put that in the PR with the VM changes. Thanks for the clarification - I will incorporate those changes as well then. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 14:17:40 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 14:17:40 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v5] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Bring windows CallArranger in sync with panama repo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/0c70da2c..b98febff Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=03-04 Stats: 18 lines in 2 files changed: 0 ins; 1 del; 17 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Mon Nov 7 14:23:44 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 7 Nov 2022 14:23:44 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v6] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
> > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 
50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
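For reference, the forward-iterating lane-wise scheme discussed earlier in this thread can be modelled in plain Java, with an `int[]` of length `L` standing in for one vector register. This is only a sketch to illustrate the math (the name `laneWiseHash` and the scalar formulation are illustrative, not the patch's actual code, which uses a C2 intrinsic, while the snippet in the review uses the incubating Vector API): each step computes `acc = acc * 31^L + chunk`, and a single weighted reduction at the end recovers the standard polynomial hash.

```java
// Scalar model of the lane-wise polynomial hash (start value 0, as for String).
// Assumes a.length is a multiple of the lane count L; int overflow wraps the
// same way as in the scalar hashCode loop, so the results match exactly.
static int laneWiseHash(int[] a, int L) {
    int[] acc = new int[L];                  // stands in for IntVector.zero(SPECIES)
    int mulL = 1;
    for (int i = 0; i < L; i++) {
        mulL *= 31;                          // 31^L
    }
    for (int i = 0; i < a.length; i += L) {  // one "vector" step per L-element chunk
        for (int j = 0; j < L; j++) {
            acc[j] = acc[j] * mulL + a[i + j];
        }
    }
    int result = 0;
    int weight = 1;
    for (int j = L - 1; j >= 0; j--) {       // lane weights 31^(L-1), ..., 31, 1
        result += acc[j] * weight;
        weight *= 31;
    }
    return result;
}
```

This makes the cost argument from the review concrete: the loop body needs only one multiply and one add per lane, and the per-lane weights are applied once, just before the final reduction.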
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Remove UseSSE >= 3 precondition now that UseAVX > 0 implies UseSSE=4 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/95c10b5f..cdf276de Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Nov 7 14:25:21 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 7 Nov 2022 14:25:21 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: <9a0_6b99boFwuVxfQcswe4tQ3UYd7U1Yvv4EzePgH3o=.e8dcffc0-6bd5-4c56-909c-a21acda9a3de@github.com> On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. 
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 
0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes) I've started working on an aarch64 port while @luhenry is working on a forward-iterating variant of his vector algorithm (based on @merykitty's suggestions). However, I'd request this PR be accepted as-is and do ports and such enhancements as follow-ups. 
That'd simplify work since we can continue work in parallel with less coordination and merges. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From mcimadamore at openjdk.org Mon Nov 7 15:00:02 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 15:00:02 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: References: Message-ID: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Make memory session a pure lifetime abstraction ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/b98febff..f04be0da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=04-05 Stats: 2492 lines in 139 files changed: 600 ins; 771 del; 1121 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 7 15:01:39 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 7 Nov 2022 15:01:39 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v5] In-Reply-To: References: Message-ID: <7ZPsmtKqqOeItVnPztGyLhwuHi5Q9WsGI8SYzGkyL8Q=.33d63f63-9cbb-42bd-8d81-6555ed3d67d2@github.com> On Mon, 7 Nov 2022 14:17:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Bring windows CallArranger in sync with panama repo I have incorporated additional API changes, described in this document: http://cr.openjdk.java.net/~mcimadamore/panama/session_arenas.html The main change is that `MemorySession` is now a pure lifetime abstraction and no longer implements `AutoCloseable`/`SegmentAllocator`. Instead, a new abstraction, called `Arena`, should be used for deterministic deallocation use cases. This change allows several simplifications on the `MemorySession` API, as there's no more need to support non-closeable views. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Mon Nov 7 15:53:26 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 7 Nov 2022 15:53:26 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v7] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic.
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 
0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
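For readers following along, the identity behind the hand-unrolling can be sketched in plain Java (an illustration only, not the actual patch): four steps of the sequential recurrence `h = 31*h + e` fuse into a single polynomial step using precomputed powers of 31.

```java
import java.util.Arrays;

public class UnrolledHash {
    // Stride-4 unrolling: ((((h*31+a0)*31+a1)*31+a2)*31+a3) equals
    // h*31^4 + a0*31^3 + a1*31^2 + a2*31 + a3, so four elements are
    // folded in per iteration with one multiply per term.
    static int unrolledHashCode(int[] a) {
        int h = 1; // Arrays.hashCode seeds with 1; String.hashCode seeds with 0
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            h = h * (31 * 31 * 31 * 31)
                + a[i]     * (31 * 31 * 31)
                + a[i + 1] * (31 * 31)
                + a[i + 2] * 31
                + a[i + 3];
        }
        for (; i < a.length; i++) { // scalar tail for the remaining 0-3 elements
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {7, -3, 42, 0, 11, 5, 9};
        System.out.println(unrolledHashCode(a) == Arrays.hashCode(a)); // true
    }
}
```

Since `int` arithmetic wraps identically in both forms, the unrolled loop produces bit-for-bit the same hash as the sequential one.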
Claes Redestad has updated the pull request incrementally with four additional commits since the last revision: - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash Switch to forward approach for vectorization - Fix vector loop - fix indexing - Switch to forward approach for vectorization ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/cdf276de..6f49b5aa Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=05-06 Stats: 241 lines in 4 files changed: 64 ins; 138 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Nov 7 16:08:31 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 7 Nov 2022 16:08:31 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. 
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 
0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes) Scratch that. I've merged in the forward-iterating vector loop changes @luhenry has worked on, which give a 1.33x speed-up and simplifies the vector loop a lot. Also moves the coefficient array to shared memory. 
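The forward-iterating vector loop can be simulated lane-by-lane in plain scalar Java (a sketch only: a lane count of 4 is assumed here, whereas the intrinsic uses the hardware vector width, and the real coefficients live in a shared table):

```java
import java.util.Arrays;

public class ForwardVectorHash {
    static final int LANES = 4; // stands in for INT_SPECIES.length()

    // Per step: every lane does acc = acc * 31^LANES + current element,
    // and the seed is carried forward the same way. At the end, lane j
    // is weighted by 31^(LANES-1-j) and the lanes are summed.
    static int vectorishHashCode(int[] a) {
        int h = 1;                      // Arrays.hashCode seed
        int[] acc = new int[LANES];     // the "vector" accumulator
        int mul = pow31(LANES);         // 31^LANES
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            for (int j = 0; j < LANES; j++) {
                acc[j] = acc[j] * mul + a[i + j];
            }
            h = h * mul;                // carry the seed forward too
        }
        for (int j = 0; j < LANES; j++) {
            h += acc[j] * pow31(LANES - 1 - j); // weighted reduction
        }
        for (; i < a.length; i++) {     // scalar tail
            h = 31 * h + a[i];
        }
        return h;
    }

    static int pow31(int n) {
        int p = 1;
        for (int k = 0; k < n; k++) p *= 31;
        return p;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
        System.out.println(vectorishHashCode(a) == Arrays.hashCode(a)); // true
    }
}
```

Expanding the lane accumulators shows each element ends up multiplied by exactly the power of 31 the sequential loop would give it, so the result matches `Arrays.hashCode` while each step needs only one multiply and one add per lane.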
Benchmark (size) Mode Cnt Score Error Units StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1195.828 ? 10.956 ns/op StringHashCode.Algorithm.defaultUTF16 10000 avgt 5 1197.123 ? 10.007 ns/op Some micro-optimizations for smaller arrays were disabled for this, but we'll work on getting that back in place before calling it a day. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From wkemper at openjdk.org Mon Nov 7 17:16:47 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 7 Nov 2022 17:16:47 GMT Subject: RFR: Improve evacuation instrumentation Message-ID: This change adds instrumentation to reveal the following: * How many bytes were evacuated by gc workers and mutators * How many bytes were abandoned due to evacuation races * The age of survivors * The break down of targets for evacuation in the collection set The implementation is based on a new thread local structure to track and aggregate these statistics. ------------- Commit messages: - Fix whitespace - Use initial tenuring threshold when printing region ages - Factor promotion failure reporting into its own method - Merge branch 'shenandoah-master' into evacuation-instrumentation - Clarify log messages and add more detail - Fix confusing heuristic log message - Message formatting fix - WIP: Include age table for evacuated objects and regions - WIP: Finish thread local model - WIP: Adapt to thread local model - ... 
and 5 more: https://git.openjdk.org/shenandoah/compare/50c54581...201e8610 Changes: https://git.openjdk.org/shenandoah/pull/167/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=167&range=00 Stats: 423 lines in 14 files changed: 323 ins; 71 del; 29 mod Patch: https://git.openjdk.org/shenandoah/pull/167.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/167/head:pull/167 PR: https://git.openjdk.org/shenandoah/pull/167 From kdnilsen at openjdk.org Mon Nov 7 17:16:48 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Mon, 7 Nov 2022 17:16:48 GMT Subject: RFR: Improve evacuation instrumentation In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 21:24:21 GMT, William Kemper wrote: > This change adds instrumentation to reveal the following: > * How many bytes were evacuated by gc workers and mutators > * How many bytes were abandoned due to evacuation races > * The age of survivors > * The break down of targets for evacuation in the collection set > > The implementation is based on a new thread local structure to track and aggregate these statistics. Marked as reviewed by kdnilsen (Committer). src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 858: > 856: // We squelch excessive reports to reduce noise in logs. Squelch enforcement is not "perfect" because > 857: // this same code can be in-lined in multiple contexts, and each context will have its own copy of the static > 858: // last_report_epoch and this_epoch_report_count variables. Is this comment still relevant? Maybe not, since the code is no longer in-lined in multiple contexts. Thanks for this improvement to the code structure. This should be relatively rare, so we don't want to in-line it everywhere.
------------- PR: https://git.openjdk.org/shenandoah/pull/167 From wkemper at openjdk.org Mon Nov 7 17:16:49 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 7 Nov 2022 17:16:49 GMT Subject: RFR: Improve evacuation instrumentation In-Reply-To: References: Message-ID: On Mon, 7 Nov 2022 16:16:33 GMT, Kelvin Nilsen wrote: >> This change adds instrumentation to reveal the following: >> * How many bytes were evacuated by gc workers and mutators >> * How many bytes were abandoned due to evacuation races >> * The age of survivors >> * The break down of targets for evacuation in the collection set >> >> The implementation is based on a new thread local structure to track and aggregate these statistics. > > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 858: > >> 856: // We squelch excessive reports to reduce noise in logs. Squelch enforcement is not "perfect" because >> 857: // this same code can be in-lined in multiple contexts, and each context will have its own copy of the static >> 858: // last_report_epoch and this_epoch_report_count variables. > > Is this comment still relevant. Maybe not since the code is no longer in-lined in multiple contexts. > > Thanks for this improvement to the code structure. This should be relatively rare, so we don't want to in-line it everywhere. Yes, I think the comment is still relevant. The main change here was to get the gc cycle id from the control thread, rather than the `GCId::current()` (which can apparently only be used from a named thread). 
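The squelching scheme under discussion — emit at most a few reports per GC cycle, resetting the counter when the cycle id changes — can be sketched as follows (Java for illustration only; the actual HotSpot code is C++, and the cap, names and synchronization here are invented):

```java
public class ReportSquelcher {
    static final int MAX_REPORTS_PER_EPOCH = 5; // hypothetical cap

    private long lastReportEpoch = -1;
    private int thisEpochReportCount = 0;

    // Returns true if a report should be emitted for the given gc cycle id.
    // A new epoch resets the count; within an epoch, reports beyond the
    // cap are squelched to keep logs readable.
    synchronized boolean shouldReport(long epoch) {
        if (epoch != lastReportEpoch) {
            lastReportEpoch = epoch;
            thisEpochReportCount = 0;
        }
        return ++thisEpochReportCount <= MAX_REPORTS_PER_EPOCH;
    }

    public static void main(String[] args) {
        ReportSquelcher s = new ReportSquelcher();
        int emitted = 0;
        for (int i = 0; i < 100; i++) {
            if (s.shouldReport(1)) emitted++;  // same epoch: capped
        }
        System.out.println(emitted);           // 5
        System.out.println(s.shouldReport(2)); // new epoch resets: true
    }
}
```

Keeping the state in one shared object (rather than function-local statics that get duplicated when the code is inlined into several contexts) is what makes the cap actually hold, which is the point raised in the review above.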
------------- PR: https://git.openjdk.org/shenandoah/pull/167 From wkemper at openjdk.org Mon Nov 7 17:42:19 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 7 Nov 2022 17:42:19 GMT Subject: Integrated: Improve evacuation instrumentation In-Reply-To: References: Message-ID: On Fri, 4 Nov 2022 21:24:21 GMT, William Kemper wrote: > This change adds instrumentation to reveal the following: > * How many bytes were evacuated by gc workers and mutators > * How many bytes were abandoned due to evacuation races > * The age of survivors > * The break down of targets for evacuation in the collection set > > The implementation is based on a new thread local structure to track and aggregate these statistics. This pull request has now been integrated. Changeset: 998f68b2 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/998f68b26b8d2a5178a30a6c5b596194961e3821 Stats: 423 lines in 14 files changed: 323 ins; 71 del; 29 mod Improve evacuation instrumentation Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/167 From psandoz at openjdk.org Mon Nov 7 19:11:15 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 19:11:15 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: <6pOPurF5NgGOYP73lLadj7jzY6FVrCUl9Fkh7MkBfi0=.7ecd3ff9-a2ac-474f-8270-a30bb0f56c92@github.com> On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/Arena.java line 37: > 35: * This session is created with the arena, and is closed when the arena is {@linkplain #close() closed}. > 36: * Furthermore, all the native segments {@linkplain #allocate(long, long) allocated} by the arena are associated > 37: * with that session. I think we can simplify the wording by saying an arena has a session: Suggestion: * An arena is a {@linkplain AutoCloseable closeable} segment allocator that has a {@link #session() memory session}. * The arena's session is created when the arena is created, and is closed when the arena is {@linkplain #close() closed}. * All native segments {@linkplain #allocate(long, long) allocated} by the arena are associated * with its session. src/java.base/share/classes/java/lang/foreign/Arena.java line 65: > 63: * The {@link MemorySegment#address()} of the returned memory segment is the starting address of the > 64: * allocated off-heap memory region backing the segment. Moreover, the {@linkplain MemorySegment#address() address} > 65: * of the returned segment is aligned according the provided alignment constraint. Suggestion: /** * Creates a native memory segment with the given size (in bytes) and alignment constraint (in bytes). * The returned segment is associated with the arena's memory session. * The segment's {@link MemorySegment#address() address} is the starting address of the * allocated off-heap memory region backing the segment, and the address is * aligned according the provided alignment constraint. 
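For context, the intended usage pattern for the arena being documented here is roughly the following pseudocode sketch (based on the design document linked earlier in the thread; the method names reflect the proposal under review and may differ in the final API):

```
// open a closeable arena; its memory session ends when the arena is closed
try (Arena arena = Arena.openConfined()) {
    MemorySegment segment = arena.allocate(100, 8); // 100 bytes, 8-byte aligned
    // ... use segment; it is backed by the arena's memory session ...
} // arena closed here: the session is no longer alive, further access fails
```

This is the deterministic-deallocation shape that the javadoc wording above is describing: every segment allocated by the arena shares the arena's session and lifetime.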
------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 19:23:28 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 19:23:28 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/MemoryLayout.java line 357: > 355: > 356: /** > 357: * Creates an access var handle that can be used to access a memory segment at the layout selected by the given layout path, Suggestion: * Creates a var handle that can be used to access a memory segment at the layout selected by the given layout path, ------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 19:34:26 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 19:34:26 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: <_3OEoupwOnEbpnA1ApYUiH3GhzDlHyOxearctDT85a0=.493601cf-85d0-4a5d-959d-d01da03e4e83@github.com> On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> 
This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 104: > 102: * Every memory segment is associated with a {@linkplain MemorySession memory session}. This ensures that access operations > 103: * on a memory segment cannot occur when the region of memory which backs the memory segment is no longer available > 104: * (e.g. after the memory session associated with the accessed memory segment is no longer {@linkplain MemorySession#isAlive() alive}. Missing close brace: Suggestion: * (e.g., after the memory session associated with the accessed memory segment is no longer {@linkplain MemorySession#isAlive() alive}). ------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 19:47:42 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 19:47:42 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: <4Sj58ZiAdyXGqIagMDMMD3dTWwjKt8pTudl1JHnwp4Q=.cf41a458-e19d-42b7-b465-b3c40db144ac@github.com> On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 312: > 310: * > 311: * > 312: * Heap segment can only be accessed using a layout whose alignment is smaller or equal to the Suggestion: * Heap segments can only be accessed using a layout whose alignment is smaller or equal to the ------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 20:00:39 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 20:00:39 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/MemorySession.java line 83: > 81: * MemorySegment segment = MemorySegment.allocateNative(100, MemorySession.implicit()); > 82: * ... 
> 83: * segment = null; // the segment session is unreacheable here and becomes available for implicit close Typo: Suggestion: * segment = null; // the segment session is unreachable here and becomes available for implicit close ------------- PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Mon Nov 7 20:09:27 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 7 Nov 2022 20:09:27 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v5] In-Reply-To: <7ZPsmtKqqOeItVnPztGyLhwuHi5Q9WsGI8SYzGkyL8Q=.33d63f63-9cbb-42bd-8d81-6555ed3d67d2@github.com> References: <7ZPsmtKqqOeItVnPztGyLhwuHi5Q9WsGI8SYzGkyL8Q=.33d63f63-9cbb-42bd-8d81-6555ed3d67d2@github.com> Message-ID: On Mon, 7 Nov 2022 14:59:27 GMT, Maurizio Cimadamore wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Bring windows CallArranger in sync with panama repo > > I have incorporated additional API changes, described in this document: > http://cr.openjdk.java.net/~mcimadamore/panama/session_arenas.html > > The main change is that `MemorySession` is now a pure lifetime abstraction and no longer implements `AutoCloseable`/`SegementAllocator`. Instead a new abstraction, called `Arena` should be used for deterministic deallocation use cases. This change allows several simplifications on the `MemorySession` API, as there's no more need to support non-closeable views. 
@mcimadamore looks like your latest merge also undid the changes from your `b98febf` commit again ------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 20:42:16 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 20:42:16 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/ValueLayout.java line 329: > 327: /** > 328: * Returns an unbounded address layout with the same carrier, alignment constraint, name and order as this address layout, > 329: * but with the specified pointee layout. An unbounded address layouts allow raw addresses to be accessed Suggestion: * but with the specified pointee layout.
An unbounded address layout allows raw addresses to be accessed ------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 7 20:46:37 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 7 Nov 2022 20:46:37 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/foreign/package-info.java line 103: > 101: * the memory session associated with the segment being accessed has not been closed prematurely. > 102: * We call this guarantee temporal safety. Together, spatial and temporal safety ensure that each memory access > 103: * operation either succeeds - and accesses a valid location of the region of memory backing the memory segment - or fails. Suggestion: * operation either succeeds - and accesses a valid location within the region of memory backing the memory segment - or fails.
------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 8 10:36:49 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 10:36:49 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v7] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with seven additional commits since the last revision: - Update src/java.base/share/classes/java/lang/foreign/MemorySession.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/MemorySegment.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/MemorySegment.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/MemoryLayout.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/Arena.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/Arena.java Co-authored-by: Paul Sandoz - Bring windows CallArranger in sync with panama repo (again) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/f04be0da..cc4ff582 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=05-06 Stats: 31 lines in 6 files changed: 0 ins; 1 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Tue Nov 8 10:36:49 2022 From: alanb at openjdk.org (Alan Bateman) Date: Tue, 8 Nov 2022 10:36:49 GMT Subject: RFR: 
8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/java/lang/ModuleLayer.java line 331: > 329: "enableNativeAccess"); > 330: target.implAddEnableNativeAccess(); > 331: return this; ModuleLayer.enableNativeAccess looks fine, we iterated on that in panama-foreign/pull/729. I assume you'll add @since 20. Also you might want to check the alignment, it looks like the method is indented by 5 instead of the usual 4 spaces.
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with three additional commits since the last revision: - Fix bad indent on ModuleLayer.Controller - Update src/java.base/share/classes/java/lang/foreign/package-info.java Co-authored-by: Paul Sandoz - Update src/java.base/share/classes/java/lang/foreign/ValueLayout.java Co-authored-by: Paul Sandoz ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/cc4ff582..afb36a95 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=06-07 Stats: 11 lines in 3 files changed: 0 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Tue Nov 8 11:27:03 2022 From: alanb at openjdk.org (Alan Bateman) Date: Tue, 8 Nov 2022 11:27:03 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v8] In-Reply-To: <1i9WM9QJ9WZsXv_3ZCsHRBDpG65nehSUjtww0HHTYcw=.c53b9cc2-45c9-49d4-bedf-1de71bf86f99@github.com> References: <1i9WM9QJ9WZsXv_3ZCsHRBDpG65nehSUjtww0HHTYcw=.c53b9cc2-45c9-49d4-bedf-1de71bf86f99@github.com> Message-ID: On Tue, 8 Nov 2022 10:42:13 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with three additional commits since the last revision: > > - Fix bad indent on ModuleLayer.Controller > - Update src/java.base/share/classes/java/lang/foreign/package-info.java > > Co-authored-by: Paul Sandoz > - Update src/java.base/share/classes/java/lang/foreign/ValueLayout.java > > Co-authored-by: Paul Sandoz src/java.base/share/classes/java/lang/foreign/Arena.java line 34: > 32: * An arena allocates and manages the lifecycle of native segments. > 33: *

> 34: * An arena is a {@linkplain AutoCloseable closeable} segment allocator that has a {@link #session() memory session}. Should this link MemorySession, or linkplain memory session? src/java.base/share/classes/java/lang/foreign/Arena.java line 98: > 96: * that memory session are also released. > 97: * @throws IllegalStateException if the session associated with this arena is not {@linkplain MemorySession#isAlive() alive}. > 98: * @throws WrongThreadException if this method is called from a thread other than the thread Should this be qualified to say this applies when the session is confined, and the method is called from a thread other than the owner? ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 8 13:28:58 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 13:28:58 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v9] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1].
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Rework package-level javadoc for restricted methods ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/afb36a95..e2840232 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=07-08 Stats: 17 lines in 1 file changed: 9 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Tue Nov 8 16:20:51 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Tue, 8 Nov 2022 16:20:51 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v9] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 13:28:58 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Rework package-level javadoc for restricted methods Did a full review. Only some minor comments. Also, please add attribution with `/contributor add @` for the people that contributed. (I think you have to add yourself as well, if you do that). src/java.base/share/classes/java/lang/foreign/GroupLayout.java line 57: > 55: > 56: @Override > 57: GroupLayout withName(String name); It looks like this method, and `withBitAlignment` below have no javadoc? Does this need `inheritDoc`? src/java.base/share/classes/java/lang/foreign/Linker.java line 75: > 73: *

> 74: * • if {@code L} is a {@link ValueLayout} with carrier {@code E} then {@code C = E}; or
> 75: * • if {@code L} is a {@link GroupLayout}, then {@code C} is set to {@code MemorySegment.class}

Now that we have `FunctionDescriptor::toMethodType` I think this paragraph could be simplified by just referencing that.

src/java.base/share/classes/java/lang/foreign/Linker.java line 101:

> 99: *
> 100: * • if {@code L} is a {@link ValueLayout} with carrier {@code E} then {@code C = E}; or
> 101: * • if {@code L} is a {@link GroupLayout}, then {@code C} is set to {@code MemorySegment.class}

Same here. This is covered by the doc of `FunctionDescriptor::toMethodType`.

src/java.base/share/classes/java/lang/foreign/Linker.java line 119:

> 117: * • The memory session of {@code A} is {@linkplain MemorySession#isAlive() alive}. Otherwise, the invocation throws
> 118: * {@link IllegalStateException};
> 119: * • The invocation occurs in same thread as the one {@linkplain MemorySession#isOwnedBy(Thread) owning} the memory session of {@code R},

Suggestion:

* • The invocation occurs in same thread as the one {@linkplain MemorySession#isOwnedBy(Thread) owning} the memory session of {@code A},

src/java.base/share/classes/java/lang/foreign/Linker.java line 121:

> 119: * • The invocation occurs in same thread as the one {@linkplain MemorySession#isOwnedBy(Thread) owning} the memory session of {@code R},
> 120: * if said session is confined. Otherwise, the invocation throws {@link WrongThreadException}; and
> 121: * • The memory session of {@code R} is kept alive (and cannot be closed) during the invocation.

Suggestion:

* • The memory session of {@code A} is kept alive (and cannot be closed) during the invocation.

src/java.base/share/classes/java/lang/foreign/StructLayout.java line 43:

> 41:
> 42: @Override
> 43: StructLayout withName(String name);

Missing `inheritDoc`?

src/java.base/share/classes/java/lang/foreign/UnionLayout.java line 43:

> 41:
> 42: @Override
> 43: UnionLayout withName(String name);

Missing `inheritDoc`?

src/java.base/share/classes/java/lang/foreign/VaList.java line 44:

> 42: * Helper class to create and manipulate variable argument lists, similar in functionality to a C {@code va_list}.
> 43: *
      > 44: * A variable argument list segment can be created using the {@link #make(Consumer, MemorySession)} factory, as follows: Suggestion: * A variable argument list can be created using the {@link #make(Consumer, MemorySession)} factory, as follows: src/java.base/share/classes/java/lang/foreign/VaList.java line 50: > 48: * .addVarg(C_DOUBLE, 3.8d)); > 49: *} > 50: * Once created, clients can obtain the platform-dependent {@linkplain #segment() memory segment} associated a variable Suggestion: * Once created, clients can obtain the platform-dependent {@linkplain #segment() memory segment} associated with a variable src/java.base/share/classes/java/lang/foreign/ValueLayout.java line 134: > 132: > 133: @Override > 134: ValueLayout withName(String name); Missing `inheritDoc` here as well, and on other withers below. src/java.base/share/classes/java/lang/foreign/ValueLayout.java line 356: > 354: * Equivalent to the following code: > 355: * {@snippet lang=java : > 356: * ADDRESS.of(ByteOrder.nativeOrder()) This code doesn't look correct. It also looks like OfAddress layouts have their alignment set to the address size already, so the alignment adjustment here seems unnecessary as well. src/java.base/share/classes/java/lang/foreign/ValueLayout.java line 367: > 365: * Equivalent to the following code: > 366: * {@snippet lang=java : > 367: * JAVA_BYTE.of(ByteOrder.nativeOrder()).withBitAlignment(8); Same here (and for the other snippets below), `OfByte` doesn't have an `of` method. This looks maybe like a regex-replace error. 
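The `make(Consumer, MemorySession)` factory shape quoted above can be modeled in plain Java. This is only a sketch of the design pattern (a consumer-driven builder whose partially-built state never escapes the factory); `ArgList`, `Builder` and `addVarg` are hypothetical stand-ins, not the FFM API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Model of a make(Consumer<Builder>, ...) factory: the caller populates
// a builder through a consumer, and the factory freezes the result, so
// no partially-built list is ever observable outside the factory.
final class ArgList {
    private final List<Object> args;

    private ArgList(List<Object> args) {
        this.args = List.copyOf(args); // immutable snapshot
    }

    static ArgList make(Consumer<Builder> actions) {
        Builder b = new Builder();
        actions.accept(b);
        return new ArgList(b.args);
    }

    int size() { return args.size(); }
    Object get(int i) { return args.get(i); }

    static final class Builder {
        private final List<Object> args = new ArrayList<>();
        Builder addVarg(Object value) { args.add(value); return this; }
    }
}
```

A call like `ArgList.make(b -> b.addVarg(8).addVarg(3.8d))` then mirrors the `VaList.make(builder -> builder.addVarg(...))` usage shown in the javadoc snippet under review.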
------------- PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Tue Nov 8 16:20:53 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Tue, 8 Nov 2022 16:20:53 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v6] In-Reply-To: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> References: <6KFOS0uVml9eRkWm9inRT0um8oEV_kUw3UZPKT_p67Q=.f330d3e5-5579-4361-8963-763928018e9a@github.com> Message-ID: On Mon, 7 Nov 2022 15:00:02 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Make memory session a pure lifetime abstraction src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 157: > 155: public long mismatch(MemorySegment other) { > 156: Objects.requireNonNull(other); > 157: return MemorySegment.mismatch(this, 0, byteSize(), other, 0, other.byteSize()); Bit strange to see this calling back up to a method in the interface. Maybe this should just be a `default` method in `MemorySegment`? src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 163: > 161: * Mismatch over long lengths. > 162: */ > 163: public static long vectorizedMismatchLargeForBytes(MemorySessionImpl aSession, MemorySessionImpl bSession, Does this need to be `public`? Only seems to be referenced below. src/java.base/share/classes/jdk/internal/foreign/MemorySessionImpl.java line 179: > 177: @ForceInline > 178: public static MemorySessionImpl toSessionImpl(MemorySession session) { > 179: return (MemorySessionImpl)session; Maybe calls to this method should just be replaced with a cast. 
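The suggestion above (move the two-segment overload into the interface as a `default` method that delegates to the general static entry point) can be sketched with a plain-Java model. `Segment`, `bytes()` and `ByteSeg` are hypothetical stand-ins for the real API; the contract is modeled on `Arrays.mismatch`:

```java
import java.util.Arrays;
import java.util.Objects;

// The convenience overload lives as a default method on the interface
// and delegates to the general static entry point, so concrete classes
// need no override of their own.
interface Segment {
    byte[] bytes(); // stand-in for the segment's contents

    default long byteSize() { return bytes().length; }

    // Mirrors the Arrays.mismatch contract: -1 if the contents are
    // identical, otherwise the offset of the first differing byte
    // (the smaller size when one segment is a proper prefix).
    default long mismatch(Segment other) {
        Objects.requireNonNull(other);
        return Segment.mismatch(this, 0, byteSize(), other, 0, other.byteSize());
    }

    static long mismatch(Segment a, long aFrom, long aTo, Segment b, long bFrom, long bTo) {
        return Arrays.mismatch(a.bytes(), (int) aFrom, (int) aTo,
                               b.bytes(), (int) bFrom, (int) bTo);
    }
}

// Minimal concrete implementation for demonstration purposes.
record ByteSeg(byte[] bytes) implements Segment { }
```

With this shape, an implementation class would not need to call back up into the interface at all, which is the oddity the comment points out.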
src/java.base/share/classes/jdk/internal/foreign/abi/aarch64/linux/LinuxAArch64VaList.java line 136: > 134: long ptr = UNSAFE.allocateMemory(LAYOUT.byteSize()); > 135: MemorySegment ms = MemorySegment.ofAddress(ptr, LAYOUT.byteSize(), > 136: MemorySession.implicit(), () -> UNSAFE.freeMemory(ptr)); pre-existing, but it seems like this could just use `MemorySegment.allocateNative(LAYOUT, MemorySession.implicit())`? Suggestion: MemorySegment base = MemorySegment.allocateNative(LAYOUT, MemorySession.implicit()); (and remove the dependency on `Unsafe` altogether) src/java.base/share/classes/jdk/internal/foreign/abi/aarch64/linux/LinuxAArch64VaList.java line 142: > 140: VH_gr_offs.set(ms, 0); > 141: VH_vr_offs.set(ms, 0); > 142: return ms; I suggest doing Suggestion: return ms.asSlice(0, 0); To create an opaque segment, just like the `segment()` accessor does. Or maybe update the implementation of `SharedUtils.emptyVaList` to do this. src/java.base/share/classes/jdk/internal/foreign/abi/aarch64/linux/LinuxAArch64VaList.java line 408: > 406: @Override > 407: public MemorySegment segment() { > 408: return segment.asSlice(0, 0); A comment about what is happening here would be nice. (making sure the returned segment is opaque?) src/java.base/share/classes/jdk/internal/foreign/abi/aarch64/macos/MacOsAArch64VaList.java line 176: > 174: @Override > 175: public MemorySegment segment() { > 176: return segment.asSlice(0, 0); Same here. 
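The `asSlice(0, 0)` idiom referenced in these comments can be illustrated with a plain-Java model: a zero-length slice keeps the same backing storage and base offset, so it still identifies the data's location, but every read through it is out of bounds, making the returned handle effectively opaque. `Slice` here is a hypothetical stand-in, not `MemorySegment`:

```java
// Model of the asSlice(0, 0) trick: the zero-length view shares the
// base array and offset of its parent, but get(i) always fails, so
// clients can pass the handle around without reading through it.
record Slice(byte[] base, int offset, int length) {
    Slice asSlice(int off, int len) {
        if (off < 0 || len < 0 || off + len > length)
            throw new IndexOutOfBoundsException();
        return new Slice(base, offset + off, len);
    }

    byte get(int i) {
        if (i < 0 || i >= length)
            throw new IndexOutOfBoundsException();
        return base[offset + i];
    }
}
```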
src/java.base/share/classes/jdk/internal/foreign/abi/x64/sysv/SysVVaList.java line 145: > 143: long ptr = U.allocateMemory(LAYOUT.byteSize()); > 144: MemorySegment base = MemorySegment.ofAddress(ptr, LAYOUT.byteSize(), > 145: MemorySession.implicit(), () -> U.freeMemory(ptr)); Same here: `MemorySegment base = MemorySegment.allocateNative(LAYOUT, MemorySession.implicit());` src/java.base/share/classes/jdk/internal/foreign/abi/x64/sysv/SysVVaList.java line 150: > 148: VH_overflow_arg_area.set(base, MemorySegment.NULL); > 149: VH_reg_save_area.set(base, MemorySegment.NULL); > 150: return base; Suggestion: return base.asSlice(0, 0); test/jdk/java/foreign/normalize/TestNormalize.java line 203: > 201: public static Object[][] bools() { > 202: return new Object[][]{ > 203: { 0b01, true }, // zero least significant bit, but non-zero first byte According to the comment this should actually be: Suggestion: { 0b10, true }, // zero least significant bit, but non-zero first byte Looks like I wrote this by mistake :( ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 8 16:30:14 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 16:30:14 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v10] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with two additional commits since the last revision: - Revamp javadoc of Arena/MemorySession Rename MemorySession::isOwnedBy to MemorySession::isAccessibleBy Add Arena::isOwnedBy - Javadoc tweaks in MemorySession/Arena ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/e2840232..fd367106 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=08-09 Stats: 311 lines in 10 files changed: 63 ins; 28 del; 220 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Tue Nov 8 16:36:34 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Tue, 8 Nov 2022 16:36:34 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v10] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 16:30:14 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with two additional commits since the last revision: > > - Revamp javadoc of Arena/MemorySession > Rename MemorySession::isOwnedBy to MemorySession::isAccessibleBy > Add Arena::isOwnedBy > - Javadoc tweaks in MemorySession/Arena src/java.base/share/classes/java/lang/foreign/Linker.java line 119: > 117: *

• The memory session of {@code A} is {@linkplain MemorySession#isAlive() alive}. Otherwise, the invocation throws
> 118: * {@link IllegalStateException};
> 119: * • The invocation occurs in same thread as the one {@linkplain MemorySession#isAccessibleBy(Thread) owning} the memory session of {@code R},

Suggestion:

*
    • The invocation occurs in same thread as the one {@linkplain MemorySession#isAccessibleBy(Thread) owning} the memory session of {@code A}, ------------- PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Tue Nov 8 17:20:39 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 17:20:39 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v8] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 
0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 
0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request incrementally with five additional commits since the last revision: - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash Unroll + Reorder BBs - fixup! Handle size=0 and size=1 in Java - Handle size=0 and size=1 in Java - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups - Unroll loop for cnt1 < 32 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/6f49b5aa..a4d898a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=06-07 Stats: 216 lines in 7 files changed: 154 ins; 19 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From mcimadamore at openjdk.org Tue Nov 8 18:18:07 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 18:18:07 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v11] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with two additional commits since the last revision: - Address review comments - More javadoc tweaks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/fd367106..bb39bef3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=09-10 Stats: 190 lines in 21 files changed: 106 ins; 34 del; 50 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From jvernee at openjdk.org Tue Nov 8 18:18:08 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Tue, 8 Nov 2022 18:18:08 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v11] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 18:14:21 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with two additional commits since the last revision: > > - Address review comments > - More javadoc tweaks Marked as reviewed by jvernee (Reviewer). src/java.base/share/classes/java/lang/foreign/MemorySession.java line 67: > 65: * cannot be easily determined. As shown in the example above, a memory session that is managed implicitly cannot end > 66: * if a program references to one or more segments associated with that session. 
This means that memory segments associated > 67: * with implicitly managed can be safely {@linkplain #isAccessibleBy(Thread) accessed} from multiple threads. Suggestion: * with implicitly managed sessions can be safely {@linkplain #isAccessibleBy(Thread) accessed} from multiple threads. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 8 18:28:40 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 18:28:40 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v12] In-Reply-To: References: Message-ID: <3_cNn7GNS1M_3ouTex59atRvhCZX3_-cTeDtlGsLfuk=.4699a537-73b1-427a-a42c-a81ba874d658@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Update src/java.base/share/classes/java/lang/foreign/MemorySession.java Co-authored-by: Jorn Vernee ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/bb39bef3..fff83ca8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=10-11 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From rriggs at openjdk.org Tue Nov 8 19:22:10 2022 From: rriggs at openjdk.org (Roger Riggs) Date: Tue, 8 Nov 2022 19:22:10 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v8] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 17:20:39 GMT, Claes Redestad wrote: >> 
Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. 
>> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 
44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with five additional commits since the last revision: > > - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash > > Unroll + Reorder BBs > - fixup! Handle size=0 and size=1 in Java > - Handle size=0 and size=1 in Java > - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups > - Unroll loop for cnt1 < 32 src/java.base/share/classes/jdk/internal/module/ModuleHashes.java line 141: > 139: * > 140: * @param supplier supplies the module reader to access the module content > 141: * Revert, there are no other changes to ModuleHashes.java ------------- PR: https://git.openjdk.org/jdk/pull/10847 From mcimadamore at openjdk.org Tue Nov 8 22:07:07 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 22:07:07 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v13] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: More javadoc fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/fff83ca8..9be0c97b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=11-12 Stats: 10 lines in 3 files changed: 0 ins; 1 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 8 22:12:46 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 8 Nov 2022 22:12:46 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v14] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/9be0c97b..df29e6a0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=12-13 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Tue Nov 8 23:48:22 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 23:48:22 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
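The hand-unrolling discussed throughout this thread can be sketched in plain Java. The class below is an illustrative model only (the names PolyHash, hashScalar and hashUnrolled are invented here); the actual change performs the transformation inside a C2 intrinsic rather than in library code:

```java
// Illustrative sketch of the hand-unrolled polynomial hash (not the JDK
// patch). The classic loop h = 31*h + a[i] is unrolled by 4: per
// iteration only h * 31^4 is loop-carried, while the four element terms
// are independent and can execute in parallel.
class PolyHash {
    private static final int P2 = 31 * 31;  // 961
    private static final int P3 = 31 * P2;  // 29791
    private static final int P4 = 31 * P3;  // 923521

    // Reference loop, equivalent to the non-intrinsified code.
    static int hashScalar(int[] a) {
        int h = 0;
        for (int v : a) {
            h = 31 * h + v;
        }
        return h;
    }

    // Unrolled by 4; produces the same value for every input.
    static int hashUnrolled(int[] a) {
        int h = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            h = h * P4 + a[i] * P3 + a[i + 1] * P2 + a[i + 2] * 31 + a[i + 3];
        }
        for (; i < a.length; i++) { // tail of 0..3 elements
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(hashScalar(a) == hashUnrolled(a)); // prints true
    }
}
```

Integer overflow wraps identically in both variants, so the unrolled form is exact for all inputs, not just small ones.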
>
> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>
> Benchmark (size) Mode Cnt Score Error Units
> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>
> Baseline:
>
> Benchmark (size) Mode Cnt Score Error Units
> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>
> I.e. no measurable overhead compared to baseline even for `size == 1`.
>
> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>
> Benchmark for `Arrays.hashCode`:
>
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>
> Baseline:
>
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>
>
> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.

Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits:

 - Revert accidental ModuleHashes change
 - Merge branch 'master' into 8282664-polyhash
 - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash
   Unroll + Reorder BBs
 - fixup! Handle size=0 and size=1 in Java
 - Handle size=0 and size=1 in Java
 - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups
 - Unroll loop for cnt1 < 32
 - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash
   Switch to forward approach for vectorization
 - Fix vector loop
 - fix indexing
 - ...
and 45 more: https://git.openjdk.org/jdk/compare/dd5d4df5...853a7575 ------------- Changes: https://git.openjdk.org/jdk/pull/10847/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=08 Stats: 1186 lines in 33 files changed: 1130 ins; 9 del; 47 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Tue Nov 8 23:48:24 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 23:48:24 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v8] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 19:14:25 GMT, Roger Riggs wrote: >> Claes Redestad has updated the pull request incrementally with five additional commits since the last revision: >> >> - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash >> >> Unroll + Reorder BBs >> - fixup! Handle size=0 and size=1 in Java >> - Handle size=0 and size=1 in Java >> - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups >> - Unroll loop for cnt1 < 32 > > src/java.base/share/classes/jdk/internal/module/ModuleHashes.java line 141: > >> 139: * >> 140: * @param supplier supplies the module reader to access the module content >> 141: * > > Revert, there are no other changes to ModuleHashes.java Fixed. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Tue Nov 8 23:57:02 2022 From: redestad at openjdk.org (Claes Redestad) Date: Tue, 8 Nov 2022 23:57:02 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. 
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>>
>> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized).
I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes) Most optimizations for small arrays are now back - thanks @luhenry! - I'll do a pass tomorrow and see if there's something we can simplify or enhance before calling it done. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From duke at openjdk.org Wed Nov 9 02:48:52 2022 From: duke at openjdk.org (David Schlosnagle) Date: Wed, 9 Nov 2022 02:48:52 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 23:48:22 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits:
>
> - Revert accidental ModuleHashes change
> - Merge branch 'master' into 8282664-polyhash
> - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash
>
> Unroll + Reorder BBs
> - fixup!
Handle size=0 and size=1 in Java
> - Handle size=0 and size=1 in Java
> - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups
> - Unroll loop for cnt1 < 32
> - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash
>
> Switch to forward approach for vectorization
> - Fix vector loop
> - fix indexing
> - ... and 45 more: https://git.openjdk.org/jdk/compare/dd5d4df5...853a7575

Overall I am excited to see these changes land as this will be a nice boost for many string-heavy applications!

src/hotspot/share/opto/matcher.cpp line 1707:

> 1705: if (x >= _LAST_MACH_OPER) {
> 1706: fprintf(stderr, "x = %d, _LAST_MACH_OPER = %d\n", x, _LAST_MACH_OPER);
> 1707: fprintf(stderr, "dump n\n");

Should this be removed before merging? Suggestion:

src/hotspot/share/opto/matcher.cpp line 1709:

> 1707: fprintf(stderr, "dump n\n");
> 1708: n->dump();
> 1709: fprintf(stderr, "dump svec\n");

Remove? Suggestion:

-------------

PR: https://git.openjdk.org/jdk/pull/10847

From alanb at openjdk.org Wed Nov 9 09:20:35 2022
From: alanb at openjdk.org (Alan Bateman)
Date: Wed, 9 Nov 2022 09:20:35 GMT
Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v14]
In-Reply-To:
References:
Message-ID:

On Tue, 8 Nov 2022 22:12:46 GMT, Maurizio Cimadamore wrote:

>> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
>>
>> [1] - https://openjdk.org/jeps/434
>
> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix typo

src/java.base/share/classes/java/lang/foreign/Arena.java line 131:

> 129: * @param thread the thread to be tested.
> 130: */
> 131: boolean isOwnedBy(Thread thread);

A shared Arena can be closed by any thread.
Should a shared Arena be considered as being owned by all threads so that this method always returns true for a non-null thread? In the old API, a shared memory session has no owner so it was a bit clearer. I think my comment is mostly about the method name being about ownership, whereas the javadoc is about who can close. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 9 11:00:33 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 9 Nov 2022 11:00:33 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v14] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 09:16:49 GMT, Alan Bateman wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo > > src/java.base/share/classes/java/lang/foreign/Arena.java line 131: > >> 129: * @param thread the thread to be tested. >> 130: */ >> 131: boolean isOwnedBy(Thread thread); > > A shared Arena can be closed by any thread. Should a shared Arena be considered as being owned by all threads so that this method always returns true for a non-null thread? In the old API, a shared memory session has no owner so it was a bit clearer. I think my comment is mostly about the method name being about ownership, whereas the javadoc is about who can close. Very good point - all threads are owners. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 9 11:42:47 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 9 Nov 2022 11:42:47 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v15] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
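The close-permission semantics this exchange converges on (a confined arena is closeable only by its owner thread, a shared arena by any thread, hence the later rename to `isCloseableBy`) can be modeled with a small toy class. This is not the java.lang.foreign API; the class and factory names below are invented purely for illustration:

```java
// Toy model of the close-permission rules discussed above. NOT the real
// java.lang.foreign.Arena: names and structure here are illustrative only.
final class ToyArena {
    private final Thread owner; // null models a shared arena

    private ToyArena(Thread owner) { this.owner = owner; }

    static ToyArena openConfined() { return new ToyArena(Thread.currentThread()); }

    static ToyArena openShared() { return new ToyArena(null); }

    // Shared arenas are closeable by every thread; confined ones only by
    // their owner, matching the "all threads are owners" conclusion.
    boolean isCloseableBy(Thread thread) {
        return owner == null || owner == thread;
    }

    void close() {
        if (!isCloseableBy(Thread.currentThread())) {
            throw new IllegalStateException("close() called from non-owner thread");
        }
        // resource release would happen here
    }

    public static void main(String[] args) {
        System.out.println(openShared().isCloseableBy(new Thread()));   // true
        System.out.println(openConfined().isCloseableBy(new Thread())); // false
    }
}
```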
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Rename isOwnedBy -> isCloseableBy Fix minor typos Fix StrLenTest/RingAllocator ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/df29e6a0..2d75f954 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=13-14 Stats: 11 lines in 3 files changed: 2 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 9 12:05:35 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 9 Nov 2022 12:05:35 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v14] In-Reply-To: References: Message-ID: On Tue, 8 Nov 2022 22:12:46 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo src/java.base/share/classes/java/lang/foreign/Arena.java line 125: > 123: */ > 124: @Override > 125: void close(); I'm trying to understand how close interacts with whileAlive on its memory session. Does close throw or block when there is a critical action running? The javadoc doesn't say right now. src/java.base/share/classes/java/lang/foreign/MemorySession.java line 43: > 41: *

> 42: * Conversely, a bounded memory session has a start and an end. Bounded memory sessions can be managed either
> 43: * explicitly, (i.e. using an {@linkplain Arena arena}) or implicitly, by the garbage collector. When a bounded memory

A minor style thing here is that this should probably be "using an {@link Arena}" as you really mean the Arena. This helps a bit with the generated docs as it shows up currently as "arenaPREVIEW", if you see what I mean.

-------------

PR: https://git.openjdk.org/jdk/pull/10872

From mcimadamore at openjdk.org Wed Nov 9 12:57:00 2022
From: mcimadamore at openjdk.org (Maurizio Cimadamore)
Date: Wed, 9 Nov 2022 12:57:00 GMT
Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v17]
In-Reply-To:
References:
Message-ID:

> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
>
> [1] - https://openjdk.org/jeps/434

Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:

  Tweak Arena::close javadoc

-------------

Changes:
 - all: https://git.openjdk.org/jdk/pull/10872/files
 - new: https://git.openjdk.org/jdk/pull/10872/files/39521344..cd3fbe7c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=16
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=15-16

Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/10872.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872

PR: https://git.openjdk.org/jdk/pull/10872

From wkemper at openjdk.org Thu Nov 10 00:29:21 2022
From: wkemper at openjdk.org (William Kemper)
Date: Thu, 10 Nov 2022 00:29:21 GMT
Subject: RFR: Merge openjdk/jdk:master
Message-ID:

This PR merges tag jdk-20+22

-------------

Commit messages:
 - Merge tag 'jdk-20+22' into merge-jdk-20+22
 - 8295670: Remove duplication in java/util/Formatter/Basic*.java
 - 8296115: Allow for compiling the JDK with strict standards conformance
 - 8289838: ZGC: OOM before clearing all SoftReferences
 - 8290063: IGV: Give the graphs a unique number in the outline
 - 8296235: IGV: Change shortcut to delete graph from ctrl+del to del
 - 8295991: java/net/httpclient/CancelRequestTest.java fails intermittently
 - 8296142: CertAttrSet::(getName|getElements|delete) are mostly useless
 - 8294845: Make globals accessed by G1 young gen revising atomic
 - 8295990: Improve make handling of strip flags
 - ...
and 141 more: https://git.openjdk.org/shenandoah/compare/998f68b2...db5f25ad The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=168&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=168&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/168/files Stats: 190495 lines in 1429 files changed: 96638 ins; 61219 del; 32638 mod Patch: https://git.openjdk.org/shenandoah/pull/168.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/168/head:pull/168 PR: https://git.openjdk.org/shenandoah/pull/168 From wkemper at openjdk.org Thu Nov 10 00:53:45 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 10 Nov 2022 00:53:45 GMT Subject: RFR: Improve some defaults and remove unused options for generational mode Message-ID: `ShenandoahPromoteTenuredObjects` and `ShenandoahUsePLABs` are removed. The default `NewRatio` for generational mode is changed to 1. The default percentage of garbage for regions included in the old generation collection set is lowered to 10. ------------- Commit messages: - Remove declaration of unused ShenandoahUsePLAB - Adjust defaults Changes: https://git.openjdk.org/shenandoah/pull/169/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=169&range=00 Stats: 37 lines in 3 files changed: 2 ins; 10 del; 25 mod Patch: https://git.openjdk.org/shenandoah/pull/169.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/169/head:pull/169 PR: https://git.openjdk.org/shenandoah/pull/169 From rkennke at openjdk.org Thu Nov 10 12:39:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 10 Nov 2022 12:39:19 GMT Subject: RFR: Improve some defaults and remove unused options for generational mode In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 00:47:14 GMT, William Kemper wrote: > `ShenandoahPromoteTenuredObjects` and `ShenandoahUsePLABs` are removed. 
The default `NewRatio` for generational mode is changed to 1. The default percentage of garbage for regions included in the old generation collection set is lowered to 10. Changes look ok. Does it show performance changes/improvements? ------------- Marked as reviewed by rkennke (Lead). PR: https://git.openjdk.org/shenandoah/pull/169 From rkennke at openjdk.org Thu Nov 10 12:39:23 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 10 Nov 2022 12:39:23 GMT Subject: RFR: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 00:19:55 GMT, William Kemper wrote: > This PR merges tag jdk-20+20 Go for it! Thank you! ------------- Marked as reviewed by rkennke (Lead). PR: https://git.openjdk.org/shenandoah/pull/168 From redestad at openjdk.org Thu Nov 10 14:54:45 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 10 Nov 2022 14:54:45 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v10] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
>
> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>
> Benchmark (size) Mode Cnt Score Error Units
> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>
> Baseline:
>
> Benchmark (size) Mode Cnt Score Error Units
> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>
> I.e. no measurable overhead compared to baseline even for `size == 1`.
>
> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>
> Benchmark for `Arrays.hashCode`:
>
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>
> Baseline:
>
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>
>
> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
Claes Redestad has updated the pull request incrementally with two additional commits since the last revision: - Final touch-ups, restored 2-stride with dependency chain breakage - Minor cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/853a7575..af197062 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=08-09 Stats: 182 lines in 8 files changed: 43 ins; 74 del; 65 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Thu Nov 10 14:54:48 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 10 Nov 2022 14:54:48 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 02:35:24 GMT, David Schlosnagle wrote: >> Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits: >> >> - Revert accidental ModuleHashes change >> - Merge branch 'master' into 8282664-polyhash >> - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash >> >> Unroll + Reorder BBs >> - fixup! Handle size=0 and size=1 in Java >> - Handle size=0 and size=1 in Java >> - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups >> - Unroll loop for cnt1 < 32 >> - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash >> >> Switch to forward approach for vectorization >> - Fix vector loop >> - fix indexing >> - ... 
and 45 more: https://git.openjdk.org/jdk/compare/dd5d4df5...853a7575 > > src/hotspot/share/opto/matcher.cpp line 1707: > >> 1705: if (x >= _LAST_MACH_OPER) { >> 1706: fprintf(stderr, "x = %d, _LAST_MACH_OPER = %d\n", x, _LAST_MACH_OPER); >> 1707: fprintf(stderr, "dump n\n"); > > Should this be removed before merging? > Suggestion: Yes, fixed these in the latest version. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Thu Nov 10 14:57:53 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 10 Nov 2022 14:57:53 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v11] In-Reply-To: References: Message-ID: <2bXXJpyuGGH_dyzzfxu4cN3NFGmwjgjcCxz2mUONkc0=.81046071-5562-4e7e-bf2f-fbfd1076258c@github.com> > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 
0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 
0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/af197062..2522625c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=09-10 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Thu Nov 10 15:03:26 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 10 Nov 2022 15:03:26 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v12] In-Reply-To: References: Message-ID: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> > Continuing the work initiated by @luhenry to 
unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 
0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). 
I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Qualified guess on shenandoahSupport fix-up ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/2522625c..871f6cef Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=10-11 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Thu Nov 10 15:07:14 2022 From: redestad at openjdk.org (Claes Redestad) Date: Thu, 10 Nov 2022 15:07:14 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. 
We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 
0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes) I've restored the 2-stride dependency-chain breaking implementation that got lost in translation when me and @luhenry took turns on this. This helps keep things fast in the 1-31 size range, and allows for a decent speed-up on `byte[]` and `short[]` cases until we can figure out how to vectorize those properly. 
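For readers skimming the thread, the "2-stride with dependency chain breakage" mentioned in the commit message can be modeled in plain Java roughly as follows. This is only an illustrative sketch (the method name is invented; the actual change lives in the JDK library and intrinsic code, not here):

```java
// Illustrative only: process two elements per iteration so the two
// multiplications are independent and can overlap in the pipeline.
// Equivalent to the sequential h = h * 31 + a[i] loop with start value 0
// (as for String.hashCode; Arrays.hashCode starts from 1 instead).
static int hash2Stride(byte[] a) {
    int h = 0;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        // same as (h * 31 + a[i]) * 31 + a[i + 1], with the chain split
        h = h * (31 * 31) + a[i] * 31 + a[i + 1];
    }
    if (i < a.length) {
        h = h * 31 + a[i]; // odd-length tail
    }
    return h;
}
```

Splitting the chain this way helps the small (1-31 element) range because the critical path per pair of elements is one multiply rather than two dependent multiplies.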
@luhenry baseline:

Benchmark (size) Mode Cnt Score Error Units
StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.786 ± 0.005 ns/op
StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.068 ± 0.005 ns/op
StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.513 ± 0.017 ns/op
StringHashCode.Algorithm.defaultLatin1 31 avgt 5 22.837 ± 0.082 ns/op
StringHashCode.Algorithm.defaultLatin1 32 avgt 5 16.622 ± 0.107 ns/op
StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1193.884 ± 1.862 ns/op
StringHashCode.Algorithm.defaultUTF16 0 avgt 5 0.786 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 1 avgt 5 1.884 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 2 avgt 5 2.512 ± 0.011 ns/op
StringHashCode.Algorithm.defaultUTF16 31 avgt 5 23.061 ± 0.119 ns/op
StringHashCode.Algorithm.defaultUTF16 32 avgt 5 16.429 ± 0.044 ns/op
StringHashCode.Algorithm.defaultUTF16 10000 avgt 5 1191.283 ± 4.600 ns/op

Patch:

Benchmark (size) Mode Cnt Score Error Units
StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.787 ± 0.004 ns/op
StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.050 ± 0.009 ns/op
StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.198 ± 0.010 ns/op
StringHashCode.Algorithm.defaultLatin1 31 avgt 5 18.413 ± 0.516 ns/op
StringHashCode.Algorithm.defaultLatin1 32 avgt 5 16.599 ± 0.074 ns/op
StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1189.958 ± 8.420 ns/op
StringHashCode.Algorithm.defaultUTF16 0 avgt 5 0.785 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 1 avgt 5 1.885 ± 0.006 ns/op
StringHashCode.Algorithm.defaultUTF16 2 avgt 5 2.219 ± 0.146 ns/op
StringHashCode.Algorithm.defaultUTF16 31 avgt 5 19.052 ± 1.203 ns/op
StringHashCode.Algorithm.defaultUTF16 32 avgt 5 16.558 ± 0.107 ns/op
StringHashCode.Algorithm.defaultUTF16 10000 avgt 5 1188.122 ± 9.394 ns/op

The switches @luhenry added to help the 0 and 1 cases marginally help here by allowing the compiled code to do early returns in these cases, avoiding jumping around as would be necessary in the inlined intrinsic.
It allowed me to simplify the previous attempt at a 2-element stride routine, while ensuring the routine is correct even if we'd call it directly without the switch preamble. I think this is ready for a final review now. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From wkemper at openjdk.org Thu Nov 10 17:56:12 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 10 Nov 2022 17:56:12 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: <48httiQhrkovidHBPBq2PpRrKwEH_PMEdculuMoTn2E=.19c6dbfb-f91a-4cb3-92cf-8042b976167a@github.com> On Thu, 10 Nov 2022 00:19:55 GMT, William Kemper wrote: > This PR merges tag jdk-20+20 This pull request has now been integrated. Changeset: a0b7ce0e Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/a0b7ce0ebcbf7bdc239bc4859ba6003c0dab2e1a Stats: 190495 lines in 1429 files changed: 96638 ins; 61219 del; 32638 mod Merge openjdk/jdk:master Reviewed-by: rkennke ------------- PR: https://git.openjdk.org/shenandoah/pull/168 From wkemper at openjdk.org Thu Nov 10 18:45:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 10 Nov 2022 18:45:18 GMT Subject: RFR: Improve some defaults and remove unused options for generational mode In-Reply-To: References: Message-ID: <0aEXnsnIfYLWA5z93T2ntas9FlpeNDfF-ANOCiX8yc4=.490f6b8b-aa12-450b-a54c-d146d286ac67@github.com> On Thu, 10 Nov 2022 00:47:14 GMT, William Kemper wrote: > `ShenandoahPromoteTenuredObjects` and `ShenandoahUsePLABs` are removed. The default `NewRatio` for generational mode is changed to 1. The default percentage of garbage for regions included in the old generation collection set is lowered to 10. It shows a substantial reduction in max rss for `lusearch` (which I expect is due to the lowering of the immediate garbage threshold). It also shows reduction in max rss for `tomcat` and `xalan` benchmarks. 
------------- PR: https://git.openjdk.org/shenandoah/pull/169 From wkemper at openjdk.org Thu Nov 10 18:46:49 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 10 Nov 2022 18:46:49 GMT Subject: Integrated: Improve some defaults and remove unused options for generational mode In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 00:47:14 GMT, William Kemper wrote: > `ShenandoahPromoteTenuredObjects` and `ShenandoahUsePLABs` are removed. The default `NewRatio` for generational mode is changed to 1. The default percentage of garbage for regions included in the old generation collection set is lowered to 10. This pull request has now been integrated. Changeset: 1b110a43 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/1b110a43ff270d9aa00c4425c049b1f21838d5ee Stats: 37 lines in 3 files changed: 2 ins; 10 del; 25 mod Improve some defaults and remove unused options for generational mode Reviewed-by: rkennke ------------- PR: https://git.openjdk.org/shenandoah/pull/169 From psandoz at openjdk.org Fri Nov 11 00:51:34 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Fri, 11 Nov 2022 00:51:34 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v17] In-Reply-To: References: Message-ID: On Wed, 9 Nov 2022 13:24:54 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Tweak Arena::close javadoc src/java.base/share/classes/java/lang/foreign/Arena.java line 101: > 99: * @throws IllegalArgumentException if {@code bytesSize < 0}, {@code alignmentBytes <= 0}, or if {@code alignmentBytes} > 100: * is not a power of 2. 
> 101: * @throws IllegalStateException if the session associated with this arena is not {@linkplain MemorySession#isAlive() alive}. Suggestion: * @throws IllegalStateException if arena's session is not {@linkplain MemorySession#isAlive() alive}. src/java.base/share/classes/java/lang/foreign/Arena.java line 121: > 119: * segments associated with that memory session are also released. > 120: * @throws IllegalStateException if the session associated with this arena is not {@linkplain MemorySession#isAlive() alive}. > 121: * @throws IllegalStateException if this session is {@linkplain MemorySession#whileAlive(Runnable) kept alive} by another client. Suggestion: * @throws IllegalStateException if the arena's session is not {@linkplain MemorySession#isAlive() alive}. * @throws IllegalStateException if the arena's session is {@linkplain MemorySession#whileAlive(Runnable) kept alive}. Note i removed "by another client". I wanted to say "by another thread", but then there is the case of calling close from within the Runnable passed to whileAlive, so i wanted to say "by another caller". But, i think this can all be implied and we don't need to say anything. src/java.base/share/classes/java/lang/foreign/MemorySession.java line 66: > 64: * is not critical, or in unstructured cases where the boundaries of the lifetime associated with a memory session > 65: * cannot be easily determined. As shown in the example above, a memory session that is managed implicitly cannot end > 66: * if a program references to one or more segments associated with that session. This means that memory segments associated Suggestion: * if a program references one or more segments associated with that session. 
This means that memory segments associated src/java.base/share/classes/java/lang/foreign/MemorySession.java line 89: > 87: > 88: /** > 89: * {@return {@code true} if the provided thread can access and/or obtain segments associated with this memory session} Is the following accurate and more concise? Suggestion: * {@return {@code true} if the provided thread can access and/or associate segments with this memory session} ------------- PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Fri Nov 11 12:34:34 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 12:34:34 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v12] In-Reply-To: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> References: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> Message-ID: On Thu, 10 Nov 2022 15:03:26 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
>> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 
50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Qualified guess on shenandoahSupport fix-up The test failures in GHA are unrelated. Passed tier1-tier3 in our CI. Full benchmark results pending. 
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Fri Nov 11 12:43:12 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 12:43:12 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v12] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 12:25:43 GMT, Claes Redestad wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3484: >> >>> 3482: decrementl(index); >>> 3483: jmpb(LONG_SCALAR_LOOP_BEGIN); >>> 3484: bind(LONG_SCALAR_LOOP_END); >> >> You can share this loop with the scalar ones above. > > This might be messier than it first looks, since the two different loops use different temp registers based (long scalar can scratch cnt1, short scalar scratches the coef register). I'll have to think about this for a bit. As it happens in the latest version the vector loop drops into the scalar loop after all 32-element chunks has been processed. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From dfuchs at openjdk.org Fri Nov 11 12:43:14 2022 From: dfuchs at openjdk.org (Daniel Fuchs) Date: Fri, 11 Nov 2022 12:43:14 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v12] In-Reply-To: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> References: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> Message-ID: On Thu, 10 Nov 2022 15:03:26 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. 
Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 
0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
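For reference, the "polynomial hash" the quoted description keeps referring to is result = start*31^n + a[0]*31^(n-1) + ... + a[n-1], with the factor hard-coded to 31 and the start value 0 for `String` or 1 for `Arrays`. A minimal sketch of that generalized form in plain Java (class and method names are hypothetical, not the actual `@IntrinsicCandidate` entry point):

```java
public class PolyHash {
    // h = initial * factor^n + a[0] * factor^(n-1) + ... + a[n-1], computed iteratively.
    public static int hash(int initial, byte[] a, int factor) {
        int h = initial;
        for (byte b : a) {
            h = factor * h + (b & 0xff); // zero-extend, as for latin1 bytes
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] latin1 = "hello".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        // With start value 0 and factor 31 this matches String.hashCode
        // for a latin1 string:
        System.out.println(hash(0, latin1, 31) == "hello".hashCode()); // prints true
    }
}
```

For the `Arrays.hashCode(byte[])` flavor the start value would be 1 and the byte would be sign-extended rather than masked.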
> > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Qualified guess on shenandoahSupport fix-up src/java.base/share/classes/java/lang/StringLatin1.java line 194: > 192: return switch (value.length) { > 193: case 0 -> 0; > 194: case 1 -> value[0]; shouldn't that be: case 1 -> value[0] & 0xff; ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Fri Nov 11 12:43:15 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 12:43:15 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v12] In-Reply-To: References: <4DATzQcc3E5BBS0xrbxkKDyI64Lt-vpKvtgTGDh6Rew=.5bb45e2c-65bd-4c38-9a30-47feac3a32ca@github.com> Message-ID: On Fri, 11 Nov 2022 12:36:20 GMT, Daniel Fuchs wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Qualified guess on shenandoahSupport fix-up > > src/java.base/share/classes/java/lang/StringLatin1.java line 194: > >> 192: return switch (value.length) { >> 193: case 0 -> 0; >> 194: case 1 -> value[0]; > > shouldn't that be: > > case 1 -> value[0] & 0xff; Yes, good catch. I'll add a test case for negative latin1 bytes, too. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Fri Nov 11 13:00:06 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 13:00:06 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. 
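The `& 0xff` catch above matters because Java `byte` is signed: without the mask, a latin1 byte above 0x7F sign-extends to a negative `int`, giving the wrong hash for one-character strings. A standalone illustration (not the actual `StringLatin1` code):

```java
public class SignExtension {
    public static void main(String[] args) {
        byte b = (byte) 0xE9;       // latin1 0xE9 ('e' with acute accent)
        int wrong = b;              // sign-extends to a negative value
        int right = b & 0xff;       // zero-extends to the latin1 code point
        System.out.println(wrong);  // prints -23
        System.out.println(right);  // prints 233
        // For a one-char string the hash must equal the char value:
        System.out.println("\u00E9".hashCode()); // prints 233
    }
}
```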
> > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 
0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Missing & 0xff in StringLatin1::hashCode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/871f6cef..f08a656c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=11-12 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From duke at openjdk.org Fri Nov 11 13:25:35 2022 From: duke at openjdk.org (Piotr Tarsa) Date: Fri, 11 Nov 2022 13:25:35 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
>> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 
50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode I think that microbenchmarking the string and array hash code computation with fixed lengths is hiding branch misprediction penalties and they can be pretty high (double digits of cycles lost), even on modern high performance CPU cores that have relatively short pipeline (compared to e.g. Pentium 4). Real world scenarios will probably entail varying, unpredictable, but still short string lengths, so that should be reflected in microbenchmarks and also be given high importance. 
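The varying-length scenario described above can be sketched outside of JMH: precompute many inputs with random short lengths so that any per-length dispatch (such as a size switch) is hard for the branch predictor to lock onto. This is only a toy model of what the `multi*` variants do; the real microbenchmarks are the JMH ones in the PR.

```java
import java.util.Arrays;
import java.util.Random;

public class MixedLengthHash {
    // Sum hashes over many arrays of unpredictable lengths 0..15 so that
    // branch predictors cannot settle on a single length-dependent path.
    public static long hashAll(long seed) {
        Random rnd = new Random(seed);
        long sum = 0;
        for (int i = 0; i < 1024; i++) {
            byte[] a = new byte[rnd.nextInt(16)]; // unpredictable short length
            rnd.nextBytes(a);
            sum += Arrays.hashCode(a);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hashAll(42)); // deterministic for a fixed seed
    }
}
```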
I see you've added benchmarks like that already: https://github.com/openjdk/jdk/pull/10847/files#diff-0b5a3d8f2d9f485100f701d0917ffac9cf090a023055398154fa9ef1a9681b64R126-R156 (multibytes, multiints, etc) but you don't report on their measurements. Could you add their results? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Fri Nov 11 13:37:33 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 13:37:33 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 
0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 
0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see, the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix the `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode > Yes, I had the same concern that we might be optimizing for extremely well-predicted micros (@luhenry was obsessing about 0- and 1-element inputs and added the switches), so he added those multi* variants. The overall result on both our setups is that we behave well even with mixed inputs, and with the new intrinsics the generated code ends up, in total, a bit less branchy than the baseline across the range of input sizes. I'll upload full results for the multi*-micros once I have run the baseline and patched versions thoroughly with no shortcuts.
------------- PR: https://git.openjdk.org/jdk/pull/10847 From rkennke at openjdk.org Fri Nov 11 14:37:38 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 11 Nov 2022 14:37:38 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> References: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> Message-ID: On Fri, 28 Oct 2022 01:47:23 GMT, David Holmes wrote: >> ----- Original Message ----- >>> From: "John R Rose" >>> To: hotspot-dev at openjdk.org, serviceability-dev at openjdk.org, shenandoah-dev at openjdk.org >>> Sent: Thursday, October 27, 2022 10:41:44 PM >>> Subject: Re: RFR: 8291555: Replace stack-locking with fast-locking [v7] >> >>> On Mon, 24 Oct 2022 11:01:01 GMT, Robbin Ehn wrote: >>> >>>> Secondly, a question/suggestion: Many recursive cases do not interleave locks, >>>> meaning the recursive enter will happen with the lock/oop top of lock stack >>>> already. Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push >>>> it again and the locking is done? (instead of inflating) (exit would need to >>>> check if this is the last one and then proper exit) >>> >>> The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. >>> I suspect that a real counter is overkill, and the "unary" representation >>> Robbin mentions would be fine, especially if there were a point (when the >>> per-thread stack gets too big) at which we go and inflate anyway. >>> >>> The CJM paper suggests a full search of the per-thread array to detect the >>> recursive condition, but again I like Robbin's idea of checking only the most >>> recent lock record.
>>> So the data structure for lock records (per thread) could consist of a series of >>> distinct values [ A B C ] and each of the values could be repeated, but only >>> adjacently: [ A A A B C C ] for example. And there could be a depth limit as >>> well. Any sequence of held locks not expressible within those limitations >>> could go to inflation as a backup. >> >> Hi John, >> possibly a stupid question, but I have some trouble seeing how this can be implemented, given that, because of lock coarsening (+ maybe OSR), the number of times a lock is held differs between the interpreted code and the compiled code. >> >> Rémi > >> So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. > @rose00 why only adjacently? Nested locking can be interleaved on different monitors. @dholmes-ora and all: I have prepared an alternative PR #10907 that implements the fast-locking behind a new experimental flag, and preserves the current stack-locking behavior as the default setting. It is currently implemented and tested on x86* and aarch64 arches. It is also less invasive because it keeps everything structurally the same (i.e. no method signature changes, no stack layout changes, etc). On the downside, it also means we can not have any of the associated cleanups and optimizations yet, but those are minor anyway. Also, there still is the risk that I make a mistake with the necessary factoring-out of the current implementation. If we agree that this should be the way to go, then I would close this PR, and continue work on #10907.
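John's per-thread lock-record idea above, where a repeated value is only recognized when it is adjacent (at the top of the stack) and anything else falls back to monitor inflation, can be modeled as a toy in Java. All names here are hypothetical; the real data structure would live in HotSpot's C++ runtime, not in Java code.

```java
import java.util.ArrayDeque;

public class LockStack {
    private final ArrayDeque<Object> stack = new ArrayDeque<>();
    private static final int LIMIT = 8; // arbitrary depth limit for the sketch

    /** Returns true if fast-locked; false means "inflate the monitor". */
    public boolean tryEnter(Object lock) {
        if (stack.size() >= LIMIT) {
            return false;                  // stack too big: inflate anyway
        }
        if (stack.peek() == lock) {
            stack.push(lock);              // adjacent (recursive) re-enter
            return true;
        }
        for (Object held : stack) {
            if (held == lock) {
                return false;              // non-adjacent repeat: inflate
            }
        }
        stack.push(lock);                  // first acquisition of this lock
        return true;
    }

    /** Pops one acquisition; only valid if {@code lock} is on top. */
    public boolean tryExit(Object lock) {
        if (stack.peek() != lock) {
            return false;                  // unbalanced or interleaved exit
        }
        stack.pop();
        return true;
    }

    public static void main(String[] args) {
        LockStack ls = new LockStack();
        Object a = new Object(), b = new Object();
        System.out.println(ls.tryEnter(a) && ls.tryEnter(a)); // prints true
        System.out.println(ls.tryEnter(b) && !ls.tryEnter(a)); // prints true
    }
}
```

The second print shows Rémi's concern in miniature: once B is on top, re-entering A is an interleaved (non-adjacent) acquisition and must inflate.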
------------- PR: https://git.openjdk.org/jdk/pull/10590 From redestad at openjdk.org Fri Nov 11 14:44:41 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 11 Nov 2022 14:44:41 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 
0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 
0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode Full JMH result comparison, linux-x64: https://jmh.morethan.io/?gist=014b1f9242ae3ad84cbbab893b738d48 Faster on all microbenchmarks and all input sizes. Up to 8x faster on large inputs. (Noting that the old StringHashCode::notCached and empty micros are not recalculating the hashCode since https://bugs.openjdk.org/browse/JDK-8221836 - the original intent of those microbenchmark was to test the hashing algorithm as per the new micros. We can probably remove those two..) ------------- PR: https://git.openjdk.org/jdk/pull/10847 From eosterlund at openjdk.org Fri Nov 11 16:23:38 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Fri, 11 Nov 2022 16:23:38 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code Message-ID: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. 
In particular, 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. 3) Refactoring the stack chunk allocation code Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. 
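Point 1) above contrasts compressed oops with ZGC's colored pointers as two ways of encoding in-heap oops. The compressed-oops half of that picture is just base-relative, shift-scaled encoding, which can be sketched as follows (the base and shift constants are made up for illustration, not HotSpot's actual values):

```java
public class NarrowOop {
    static final long HEAP_BASE = 0x0000_0008_0000_0000L; // hypothetical heap base
    static final int SHIFT = 3;                            // 8-byte object alignment

    // Store a 64-bit in-heap address as a 32-bit base-relative, scaled value.
    public static int encode(long addr) {
        return (int) ((addr - HEAP_BASE) >>> SHIFT);
    }

    // Recover the full address from the narrow form.
    public static long decode(int narrow) {
        return HEAP_BASE + ((narrow & 0xFFFF_FFFFL) << SHIFT);
    }

    public static void main(String[] args) {
        long addr = HEAP_BASE + (0x1234_5678L << SHIFT); // some aligned address
        System.out.println(decode(encode(addr)) == addr); // prints true
    }
}
```

ZGC's colored pointers instead keep the full width and use spare high bits for metadata, which is why the refactoring introduces a GC-agnostic encode/decode interface rather than assuming either scheme.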
------------- Commit messages: - Generational ZGC: Loom support Changes: https://git.openjdk.org/jdk/pull/11111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8296875 Stats: 969 lines in 38 files changed: 636 ins; 225 del; 108 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Fri Nov 11 19:44:27 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Fri, 11 Nov 2022 19:44:27 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Fri, 11 Nov 2022 16:16:18 GMT, Erik ?sterlund wrote: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. > > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. 
To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. Nice to have PR 11111. It's gonna take a long time until we see 111111. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From fyang at openjdk.org Sat Nov 12 01:33:34 2022 From: fyang at openjdk.org (Fei Yang) Date: Sat, 12 Nov 2022 01:33:34 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: <9vLlu1jO4Rh1tE1-Fm5xb-79FvYNmyLw6ORbZEkZcvM=.3c47a0b1-1183-40c1-b86b-707d0ca03d18@github.com> On Fri, 11 Nov 2022 19:41:56 GMT, Erik Österlund wrote: > Nice to have PR 11111. It's gonna take a long time until we see 111111. Nice PR number :-) May I ask if you could also add handling for riscv while you are at it? We have ported loom to this platform recently [1]. [1] https://git.openjdk.org/jdk/commit/91292d56a9c2b8010466d105520e6e898ae53679 ------------- PR: https://git.openjdk.org/jdk/pull/11111 From vlivanov at openjdk.org Sat Nov 12 02:06:33 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 12 Nov 2022 02:06:33 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method.
To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized).
I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode I haven't closely looked at the stub itself. Commented mostly on C2 and JDK parts. src/hotspot/cpu/x86/x86_64.ad line 12073: > 12071: legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, rRegI tmp3, rFlagsReg cr) > 12072: %{ > 12073: predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == VectorizedHashCodeNode::LATIN1); If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow to abstract over supported modes and come up with a single AD instruction. Take a look at `VectorMaskCmp` for an example (not a perfect one though since it has both _predicate member and constant input which is redundant). src/hotspot/cpu/x86/x86_64.ad line 12081: > 12079: format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result // KILL all" %} > 12080: ins_encode %{ > 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register, What's the motivation to keep the stub code inlined instead of calling into a stand-alone pre-generated version of the stub? src/hotspot/share/opto/intrinsicnode.hpp line 175: > 173: // as well as adjusting for special treatment of various encoding of String > 174: // arrays. Must correspond to declared constants in jdk.internal.util.ArraysSupport > 175: typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, SHORT = 4, INT = 5 } HashMode; I question the need for `LATIN1` and `UTF16` modes. If you lift some of input adjustments (initial value and input size) into JDK, it becomes indistinguishable from `BYTE`/`CHAR`. Then you can reuse existing constants for basic types. 
src/java.base/share/classes/jdk/internal/util/ArraysSupport.java line 185: > 183: */ > 184: @IntrinsicCandidate > 185: public static int vectorizedHashCode(Object array, byte mode) { The intrinsic can be generalized by: 1. expanding `array` input into `base`, `offset`, and `length`. It will make it applicable to any type of data source (on-heap/off-heap `ByteBuffer`s, `MemorySegment`s. 2. passing initial value as a parameter. Basically, hash code computation can be represented as a reduction: `reduce(initial_val, (acc, v) -> 31 * acc + v, data)`. You hardcode the operation, but can make the rest variable. (Even the operation can be slightly generalized if you make 31 variable and then precompute the table at runtime. But right now I don't see much value in investing into that.) ------------- PR: https://git.openjdk.org/jdk/pull/10847 From vlivanov at openjdk.org Sat Nov 12 02:06:33 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 12 Nov 2022 02:06:33 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 00:55:56 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/x86_64.ad line 12081: > >> 12079: format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result // KILL all" %} >> 12080: ins_encode %{ >> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register, > > What's the motivation to keep the stub code inlined instead of calling into a stand-alone pre-generated version of the stub? Also, switching to stand-alone stubs would enable us to compose a generic stub version (as we do in `StubGenerator::generate_generic_copy()` for arraycopy stubs). But it would be even better to do the dispatching on JDK side and always pass a constant into the intrinsic. 
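The reduction formulation above can be sketched in plain Java; `PolyHash.hash` is a hypothetical name for illustration (the actual JDK method takes an `Object` array and a mode constant instead of base/offset/length):

```java
// Scalar sketch of the reduction described above:
// reduce(initialValue, (acc, v) -> 31 * acc + v, data[offset .. offset+length))
class PolyHash {
    static int hash(int initialValue, int[] data, int offset, int length) {
        int acc = initialValue;
        for (int i = offset; i < offset + length; i++) {
            acc = 31 * acc + data[i]; // int overflow wraps, exactly as hashCode requires
        }
        return acc;
    }
}
```

With `initialValue = 0` over a string's characters this is exactly the `String.hashCode` recurrence; with `initialValue = 1` over a whole array it matches `Arrays.hashCode(int[])`, which is why lifting the initial value out lets one intrinsic serve both.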
------------- PR: https://git.openjdk.org/jdk/pull/10847 From vlivanov at openjdk.org Sat Nov 12 02:10:40 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Sat, 12 Nov 2022 02:10:40 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> [...] > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode Also, I'd like to note that C2 auto-vectorization support is not too far away from being able to optimize hash code computations. At some point, I was able to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ ------------- PR: https://git.openjdk.org/jdk/pull/10847 From fyang at openjdk.org Sat Nov 12 08:12:32 2022 From: fyang at openjdk.org (Fei Yang) Date: Sat, 12 Nov 2022 08:12:32 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Fri, 11 Nov 2022 16:16:18 GMT, Erik Österlund wrote: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC.
> > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. PS: I see JVM crashes when running Skynet with extra VM option: -XX:+VerifyContinuations on linux-aarch64 platform. 
$java --enable-preview -XX:+VerifyContinuations Skynet
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/realfyang/openjdk-jdk/src/hotspot/share/oops/stackChunkOop.cpp:433), pid=1904185, tid=1904206
# assert(_chunk->bitmap().at(index)) failed: Bit not set at index 208 corresponding to 0x0000000637c512d0
#
# JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# after -XX: or in .hotspotrc: SuppressErrorAt=/stackChunkOop.cpp:433
[thread 1904216 also had an error]
------------- PR: https://git.openjdk.org/jdk/pull/11111 From duke at openjdk.org Sat Nov 12 15:29:21 2022 From: duke at openjdk.org (Piotr Tarsa) Date: Sat, 12 Nov 2022 15:29:21 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic.
>> [...] > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode Out of curiosity: how does this intrinsic affect time-to-safepoint? Does it matter? I don't see any safepoint poll, but then I don't precisely know how safepoints work, so I could be missing something. Theoretically, with 2^31 elements count limit in Java, the whole computation is always a fraction of a second, but maybe it would matter with e.g.
ZGC, which could ask for a safepoint while the thread is hashing an array with 2 billion ints. > 1. expanding `array` input into `base`, `offset`, and `length`. It will make it applicable to any type of data source (on-heap/off-heap `ByteBuffer`s, `MemorySegment`s. There could be memory-mapped ByteBuffers and MemorySegments and that would make the whole hashing operation much more prone to be exceedingly long and therefore possibly dramatically affecting time-to-safepoint. Again, this could be misunderstanding on my side, but I'm curious how safepoints interplay with this intrinsic. Also, even without memory mapping, MemorySegments can be much larger than 2^31 elements in size, so hashing a huge MemorySegment could take much longer than hashing even the biggest ordinary (limited by 31-bit indexing) array of primitives. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 19:54:44 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 19:54:44 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 15:27:09 GMT, Piotr Tarsa wrote: > Out of curiosity: how does this intrinsic affect time-to-safepoint? Does it matter? I don't see any safepoint poll, but then I don't precisely know how safepoints work, so I could be missing something. Theoretically, with 2^31 elements count limit in Java, the whole computation is always a fraction of a second, but maybe it would matter with e.g. ZGC, which could ask for a safepoint while the thread is hashing an array with 2 billion ints. This intrinsic - like several others before it - does not add safepoint checks. 
There's at least one RFE filed to address this deficiency, and hopefully we can come up with a shared strategy to interleave safepoint checks in the various intrinsics that operate over Strings and arrays: https://bugs.openjdk.org/browse/JDK-8233300 When I brought this up to an internal discussion with @TobiHartmann and @fisk last week several challenges were brought up to the table, including how to deal with all the different contingencies that might be the result of a safepoint, including deoptimization. I think enhancing these intrinsics to poll for safepoints is important to tackle tail-end latencies for extremely latency sensitive applications. In the meantime those applications could (should?) turn off such intrinsics, avoid huge arrays altogether, or both. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 21:01:08 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 21:01:08 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 01:06:27 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/cpu/x86/x86_64.ad line 12073: > >> 12071: legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, rRegI tmp3, rFlagsReg cr) >> 12072: %{ >> 12073: predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == VectorizedHashCodeNode::LATIN1); > > If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow to abstract over supported modes and come up with a single AD instruction. Take a look at `VectorMaskCmp` for an example (not a perfect one though since it has both _predicate member and constant input which is redundant). Thanks for the pointer, I'll check it out! 
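On the time-to-safepoint concern raised above, a purely hypothetical library-side mitigation (not something the JDK does) would be to process huge arrays in bounded chunks, so the outer Java loop keeps its normal safepoint poll on every back-edge while only the inner chunk would be intrinsified:

```java
class ChunkedHash {
    static final int CHUNK = 1 << 16; // assumed chunk size; purely a tuning parameter

    static int hash(int[] a) {
        int h = 0;
        for (int from = 0; from < a.length; from += CHUNK) { // safepoint poll on this back-edge
            int to = Math.min(a.length, from + CHUNK);
            for (int i = from; i < to; i++) {                // stand-in for the intrinsified kernel
                h = 31 * h + a[i];
            }
        }
        return h;
    }
}
```

Because 31^k is precomputable, the same chunking would also work for an intrinsic that consumes a whole chunk at once: the running hash is multiplied by 31^CHUNK before the next chunk is folded in.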
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 21:01:09 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 21:01:09 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 01:10:50 GMT, Vladimir Ivanov wrote: >> src/hotspot/cpu/x86/x86_64.ad line 12081: >> >>> 12079: format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result // KILL all" %} >>> 12080: ins_encode %{ >>> 12081: __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register, >> >> What's the motivation to keep the stub code inlined instead of calling into a stand-alone pre-generated version of the stub? > > Also, switching to stand-alone stubs would enable us to compose a generic stub version (as we do in `StubGenerator::generate_generic_copy()` for arraycopy stubs). But it would be even better to do the dispatching on JDK side and always pass a constant into the intrinsic. There is no single reason this code evolved the way it did. @luhenry worked on it initially and was guided towards intrinsifying what was originally a JDK-level unrolling. Then I took over and have tried to find a path of least resistance from there. @luhenry and I have discussed rewriting part or all of this as a stub, for various reasons. I've been scoping that out, but with no experience writing stub versions I figured perhaps this could be done in a follow-up. If you think there's a compelling enough reason to rewrite this as a stub up front I can try and find the time to do so.
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 21:03:26 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 21:03:26 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 01:28:51 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/hotspot/share/opto/intrinsicnode.hpp line 175: > >> 173: // as well as adjusting for special treatment of various encoding of String >> 174: // arrays. Must correspond to declared constants in jdk.internal.util.ArraysSupport >> 175: typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, SHORT = 4, INT = 5 } HashMode; > > I question the need for `LATIN1` and `UTF16` modes. If you lift some of input adjustments (initial value and input size) into JDK, it becomes indistinguishable from `BYTE`/`CHAR`. Then you can reuse existing constants for basic types. UTF16 can easily be replaced with CHAR by lifting up the shift as you say, but LATIN1 needs to be distinguished from BYTE since the former needs unsigned semantics. Modeling in a signed/unsigned input is possible, but I figured we might as well call it UNSIGNED_BYTE and decouple it logically from String::LATIN1. 
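The signed-versus-unsigned distinction is easy to demonstrate in a scalar sketch (hypothetical helper names): a Latin-1 byte must be widened as an unsigned 0..255 value, while BYTE-mode hashing sign-extends it the way `Arrays.hashCode(byte[])` does:

```java
class ByteHash {
    // BYTE semantics: the byte is sign-extended to -128..127
    static int hashSigned(byte[] a) {
        int h = 0;
        for (byte b : a) h = 31 * h + b;
        return h;
    }

    // LATIN1 / UNSIGNED_BYTE semantics: the byte is an unsigned code point, 0..255
    static int hashLatin1(byte[] a) {
        int h = 0;
        for (byte b : a) h = 31 * h + (b & 0xff);
        return h;
    }
}
```

The two only diverge once a byte has its high bit set, which is exactly why an UNSIGNED_BYTE mode (or a signedness input) has to exist alongside BYTE.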
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 21:07:38 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 21:07:38 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sat, 12 Nov 2022 01:35:39 GMT, Vladimir Ivanov wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing & 0xff in StringLatin1::hashCode > > src/java.base/share/classes/jdk/internal/util/ArraysSupport.java line 185: > >> 183: */ >> 184: @IntrinsicCandidate >> 185: public static int vectorizedHashCode(Object array, byte mode) { > > The intrinsic can be generalized by: > 1. expanding `array` input into `base`, `offset`, and `length`. It will make it applicable to any type of data source (on-heap/off-heap `ByteBuffer`s, `MemorySegment`s. > 2. passing initial value as a parameter. > > Basically, hash code computation can be represented as a reduction: `reduce(initial_val, (acc, v) -> 31 * acc + v, data)`. You hardcode the operation, but can make the rest variable. > > (Even the operation can be slightly generalized if you make 31 variable and then precompute the table at runtime. But right now I don't see much value in investing into that.) I've been thinking of generalizing it like that as a possible follow-up: get the base operation on entire arrays in, then generalize carefully while ensuring that doesn't add too much complexity, introduce unforeseen overheads etc.
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Nov 13 21:12:21 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 13 Nov 2022 21:12:21 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> Message-ID: On Sat, 12 Nov 2022 02:08:19 GMT, Vladimir Ivanov wrote: > Also, I'd like to note that C2 auto-vectorization support is not too far away from being able to optimize hash code computations. At some point, I was able to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ Intriguing. How far off is this - and do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? If we turn this intrinsic into a stub we might also be able to reuse the optimization in other places, including from within the VM (calculating String hashCodes happens in a couple of places, including String deduplication). So I think there are still a few compelling reasons to go the manual route and continue on this path.
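The "memoized coefficient table" mentioned above can be illustrated with a hand-unrolled scalar version; this is a sketch of the idea, not the actual intrinsic: precompute powers of 31 so each iteration folds in eight elements with mutually independent multiplies:

```java
class UnrolledHash {
    // 31^0 .. 31^8, precomputed once; int multiplication wraps mod 2^32,
    // so this agrees exactly with repeated h = 31 * h + v
    static final int[] POW31 = new int[9];
    static {
        POW31[0] = 1;
        for (int i = 1; i < POW31.length; i++) POW31[i] = 31 * POW31[i - 1];
    }

    static int hash(int[] a) {
        int h = 0, i = 0;
        for (; i + 8 <= a.length; i += 8) {        // 8-way unrolled main loop
            h = h * POW31[8]
                + a[i]     * POW31[7] + a[i + 1] * POW31[6]
                + a[i + 2] * POW31[5] + a[i + 3] * POW31[4]
                + a[i + 4] * POW31[3] + a[i + 5] * POW31[2]
                + a[i + 6] * POW31[1] + a[i + 7];
        }
        for (; i < a.length; i++) {                // scalar tail
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

The eight products carry no dependency on one another, which is what gives both superscalar and SIMD hardware room to work; the `IntVector` variant sketched earlier in the thread is the same recurrence with the per-lane coefficients applied once before the final reduction.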
------------- PR: https://git.openjdk.org/jdk/pull/10847 From dholmes at openjdk.org Sun Nov 13 22:56:20 2022 From: dholmes at openjdk.org (David Holmes) Date: Sun, 13 Nov 2022 22:56:20 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> Message-ID: On Fri, 11 Nov 2022 14:35:22 GMT, Roman Kennke wrote: >>> So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. >> @rose00 why only adjacently? Nested locking can be interleaved on different monitors. > > @dholmes-ora and all: I have prepared an alternative PR #10907 that implements the fast-locking behind a new experimental flag, and preserves the current stack-locking behavior as the default setting. It is currently implemented and tested on x86* and aarch64 arches. It is also less invasive because it keeps everything structurally the same (i.e. no method signature changes, no stack layout changes, etc). On the downside, it also means we can not have any of the associated cleanups and optimizations yet, but those are minor anyway. Also, there still is the risk that I make a mistake with the necessary factoring-out of current implementation. If we agree that this should be the way to go, then I would close this PR, and continue work on #10907. @rkennke not unexpectedly I greatly prefer the optional and opt-in version in PR https://github.com/openjdk/jdk/pull/10907. 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From thartmann at openjdk.org Mon Nov 14 06:16:33 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Nov 2022 06:16:33 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark                               (size)  Mode  Cnt     Score   Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ± 0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ± 0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ± 0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ± 7.020  ns/op
>>
>> Baseline:
>>
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>>
>> Baseline:
>>
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode For the record, we have [JDK-8233300](https://bugs.openjdk.org/browse/JDK-8233300) to investigate safepoint-aware intrinsics. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From eosterlund at openjdk.org Mon Nov 14 10:15:41 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 14 Nov 2022 10:15:41 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v2] In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. > > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless".
So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: Fix verification and RISC-V support ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11111/files - new: https://git.openjdk.org/jdk/pull/11111/files/fc5996f2..7becc31e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=00-01 Stats: 9 lines in 4 files changed: 6 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Mon Nov 14 10:15:41 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 14 Nov 2022 10:15:41 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: <9vLlu1jO4Rh1tE1-Fm5xb-79FvYNmyLw6ORbZEkZcvM=.3c47a0b1-1183-40c1-b86b-707d0ca03d18@github.com> References: 
<2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> <9vLlu1jO4Rh1tE1-Fm5xb-79FvYNmyLw6ORbZEkZcvM=.3c47a0b1-1183-40c1-b86b-707d0ca03d18@github.com> Message-ID: On Sat, 12 Nov 2022 01:30:57 GMT, Fei Yang wrote: > > Nice to have PR 11111. It's gonna take a long time until we see 111111. > > Nice PR number :-) May I ask if you could also add handling for riscv while you are at it? We have ported loom to this platform recently [1]. I can help perform the necessary testing if needed. > > [1] https://git.openjdk.org/jdk/commit/91292d56a9c2b8010466d105520e6e898ae53679 Sure. Included what I think is the required RISC-V fix in my last update. Please check it out, and hope it works for you. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Mon Nov 14 10:15:41 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 14 Nov 2022 10:15:41 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Sat, 12 Nov 2022 08:08:15 GMT, Fei Yang wrote: > PS: I see JVM crashes when running Skynet with extra VM option: -XX:+VerifyContinuations on linux-aarch64 platform. 
> > $java --enable-preview -XX:+VerifyContinuations Skynet
> >
> > ```
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > # Internal Error (/home/realfyang/openjdk-jdk/src/hotspot/share/oops/stackChunkOop.cpp:433), pid=1904185, tid=1904206
> > # assert(_chunk->bitmap().at(index)) failed: Bit not set at index 208 corresponding to 0x0000000637c512d0
> > #
> > # after -XX: or in .hotspotrc: SuppressErrorAt=
> > #
> > # JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk)
> > # Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
> > [thread 1904216 also had an error]
> > ```
Thanks for finding that. Turns out that the verification code for the stack chunk bitmap expected entries even when the value is null, while the logic that added bitmap entries didn't add if it was null. I fixed it by making sure even null entries are added to the bitmap. While it doesn't really matter if they are added or not, I think it would be the least surprising if iterating over the oops with and without the bitmap yields the same result. I have verified manually with all GCs that Skynet works with the verification flag, on x86_64 and AArch64. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From stefank at openjdk.org Mon Nov 14 10:46:36 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Mon, 14 Nov 2022 10:46:36 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v2] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 14 Nov 2022 10:15:41 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC.
>> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: > > Fix verification and RISC-V support Looks good to me. I wrote parts of this code, so I want an extra Reviewer on this patch. I wonder if we should rename the title to something less ZGC specific? src/hotspot/share/prims/stackwalk.hpp line 102: > 100: Method* method() override { return _vfst.method(); } > 101: int bci() override { return _vfst.bci(); } > 102: oop cont() override { return _vfst.continuation(); } Revert ------------- Marked as reviewed by stefank (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Mon Nov 14 10:52:14 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 14 Nov 2022 10:52:14 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v2] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 14 Nov 2022 10:43:23 GMT, Stefan Karlsson wrote: >> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix verification and RISC-V support > > Looks good to me. I wrote parts of this code, so I want an extra Reviewer on this patch. > > I wonder if we should rename the title to something less ZGC specific? Thanks for the review @stefank! ------------- PR: https://git.openjdk.org/jdk/pull/11111 From ngasson at openjdk.org Mon Nov 14 10:53:17 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Mon, 14 Nov 2022 10:53:17 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 17:05:26 GMT, Nick Gasson wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: > > Refactor Any more comments on this?
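For readers following along, the striping idea quoted above can be sketched in plain Java. This is a hypothetical analogue with illustrative names, not the patch's code: the real implementation is C++ and pads each counter to a full cache line, which this sketch only approximates.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StripedCounterDemo {
    // Each thread hashes to one of several counters so concurrent updates land
    // on different cache lines; the total is the sum over all stripes. A real
    // version would pad each cell (e.g. @Contended) to rule out false sharing.
    static final class StripedCounter {
        private final AtomicLong[] cells;
        private final int mask;

        StripedCounter(int stripes) { // stripes must be a power of two
            cells = new AtomicLong[stripes];
            for (int i = 0; i < stripes; i++) {
                cells[i] = new AtomicLong();
            }
            mask = stripes - 1;
        }

        private AtomicLong cellForCurrentThread() {
            // Mix the thread id so nearby ids spread across stripes.
            long h = Thread.currentThread().getId() * 0x9E3779B97F4A7C15L;
            return cells[(int) (h >>> 32) & mask];
        }

        void enter() { cellForCurrentThread().incrementAndGet(); }
        void leave() { cellForCurrentThread().decrementAndGet(); }

        long total() { // summed only on the rare path, like the OOM check
            long sum = 0;
            for (AtomicLong c : cells) {
                sum += c.get();
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        StripedCounter c = new StripedCounter(8);
        c.enter();
        c.enter();
        c.leave();
        System.out.println(c.total()); // 1
    }
}
```

The trade-off is the usual one for striped counters (compare `java.util.concurrent.atomic.LongAdder`): increments on the hot path stay uncontended, while reading the total requires summing every stripe.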
------------- PR: https://git.openjdk.org/jdk/pull/10573 From rkennke at openjdk.org Mon Nov 14 11:48:42 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 14 Nov 2022 11:48:42 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 17:05:26 GMT, Nick Gasson wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: > > Refactor This looks good to me now. Just one minor style nit; changing this doesn't require another review from me. Maybe @shipilev wants to review it, too? Thank you! src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 117: > 115: } > 116: > 117: ShenandoahEvacOOMCounter *ShenandoahEvacOOMHandler::counter_for_thread(Thread* t) { Very minor nit: we usually put the * next to the type, i.e. ShenandoahEvacOOMCounter*. ------------- Marked as reviewed by rkennke (Reviewer).
PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Mon Nov 14 14:19:52 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Mon, 14 Nov 2022 14:19:52 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v3] In-Reply-To: References: Message-ID: <-edTPYr00w2RIDq4OdbxvI9S1h8-Bdmh_MIda5eLE9g=.d0aeb76f-afc2-4e55-abcc-2246efd26f66@github.com> > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: Put the * next to the type ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10573/files - new: https://git.openjdk.org/jdk/pull/10573/files/14cec5ed..09447b38 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=01-02 Stats: 11 lines in 3 files changed: 0 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Mon Nov 14 14:43:12 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Mon, 14 Nov 2022 14:43:12 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v4] In-Reply-To: References: Message-ID: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. 
Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Nick Gasson has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into 8294775 - Put the * next to the type - Refactor - 8294775: Shenandoah: reduce contention on _threads_in_evac ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10573/files - new: https://git.openjdk.org/jdk/pull/10573/files/09447b38..2d8327cc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=02-03 Stats: 241174 lines in 3117 files changed: 124576 ins; 76223 del; 40375 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Mon Nov 14 15:15:29 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Nov 2022 15:15:29 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v4] In-Reply-To: <7mpUhXJtmnGLJ1qqMtbAYNnGPIdTaYVjQjEIAhecNds=.7d5d02c4-6506-440f-969e-4a26e5f057ca@github.com> References: <7mpUhXJtmnGLJ1qqMtbAYNnGPIdTaYVjQjEIAhecNds=.7d5d02c4-6506-440f-969e-4a26e5f057ca@github.com> Message-ID: On Tue, 11 Oct 2022 12:28:27 GMT, Nick Gasson wrote: >> src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 55: >> >>> 53: // *and* the counter is zero. 
>>> 54: while (Atomic::load_acquire(ptr) != OOM_MARKER_MASK) { >>> 55: os::naked_short_sleep(1); >> >> Not sure if SpinPause() may be better here? @shipilev probably knows more. > > I think we'd probably want some back-off here rather than spinning indefinitely? E.g. spin N times and then start sleeping. This code does what old code did, and on that grounds it is fine to keep `naked_short_sleep`. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Mon Nov 14 15:15:28 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Nov 2022 15:15:28 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v4] In-Reply-To: References: Message-ID: On Mon, 14 Nov 2022 14:43:12 GMT, Nick Gasson wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > Nick Gasson has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into 8294775 > - Put the * next to the type > - Refactor > - 8294775: Shenandoah: reduce contention on _threads_in_evac This looks fine to me with minor nits. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 35: > 33: > 34: ShenandoahEvacOOMCounter::ShenandoahEvacOOMCounter() > 35: : _bits(0) { `:` should remain on the same line. 
src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 93: > 91: _threads_in_evac = NEW_C_HEAP_ARRAY(ShenandoahEvacOOMCounter, _num_counters, mtGC); > 92: for (int i = 0; i < _num_counters; i++) { > 93: new (&_threads_in_evac[i]) ShenandoahEvacOOMCounter; Suggestion: new (&_threads_in_evac[i]) ShenandoahEvacOOMCounter(); src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 119: > 117: ShenandoahEvacOOMCounter* ShenandoahEvacOOMHandler::counter_for_thread(Thread* t) { > 118: const uint64_t key = hash_pointer(t); > 119: assert(is_power_of_2(_num_counters), "must be"); I suggest asserting this once in constructor, to make debug builds faster. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10573 From luhenry at openjdk.org Mon Nov 14 15:32:32 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 14 Nov 2022 15:32:32 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> Message-ID: On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad wrote: > Also, I'd like to note that C2 auto-vectorization support is not too far away from being able to optimize hash code computations. At some point, I was able to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/ That would be extremely helpful not just for this case but for many other cases that today require the Vector API or handrolled intrinsics. For cases that would be great to support, a good guide is the [gcc autovectorization support](https://gcc.gnu.org/projects/tree-ssa/vectorization.html) given they use SLP as well. 
------------- PR: https://git.openjdk.org/jdk/pull/10847 From eosterlund at openjdk.org Mon Nov 14 16:07:34 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 14 Nov 2022 16:07:34 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v3] In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. > > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. 
Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: Indentation fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11111/files - new: https://git.openjdk.org/jdk/pull/11111/files/7becc31e..b20563f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From ngasson at openjdk.org Mon Nov 14 17:28:17 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Mon, 14 Nov 2022 17:28:17 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v5] In-Reply-To: References: Message-ID: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. 
Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: Formatting fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10573/files - new: https://git.openjdk.org/jdk/pull/10573/files/2d8327cc..c2ec2e5c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=03-04 Stats: 8 lines in 1 file changed: 2 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Mon Nov 14 17:38:24 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Nov 2022 17:38:24 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v5] In-Reply-To: References: Message-ID: <0PnLsg5KSWmGcIVkRfUsoyNlYlCxQFxkVSgrsvVsCbw=.264c6713-a42a-46fe-a7fb-e51d5021bec3@github.com> On Mon, 14 Nov 2022 17:28:17 GMT, Nick Gasson wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: > > Formatting fixes Marked as reviewed by shade (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10573 From vlivanov at openjdk.org Mon Nov 14 17:51:38 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 14 Nov 2022 17:51:38 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sun, 13 Nov 2022 21:01:21 GMT, Claes Redestad wrote: >> src/hotspot/share/opto/intrinsicnode.hpp line 175: >> >>> 173: // as well as adjusting for special treatment of various encoding of String >>> 174: // arrays. Must correspond to declared constants in jdk.internal.util.ArraysSupport >>> 175: typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, SHORT = 4, INT = 5 } HashMode; >> >> I question the need for `LATIN1` and `UTF16` modes. If you lift some of input adjustments (initial value and input size) into JDK, it becomes indistinguishable from `BYTE`/`CHAR`. Then you can reuse existing constants for basic types. > > UTF16 can easily be replaced with CHAR by lifting up the shift as you say, but LATIN1 needs to be distinguished from BYTE since the former needs unsigned semantics. Modeling in a signed/unsigned input is possible, but I figured we might as well call it UNSIGNED_BYTE and decouple it logically from String::LATIN1. FTR `T_BOOLEAN` effectively represents unsigned byte. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From vlivanov at openjdk.org Mon Nov 14 18:18:21 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 14 Nov 2022 18:18:21 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Sun, 13 Nov 2022 19:50:46 GMT, Claes Redestad wrote: > ... several challenges were brought up to the table, including how to deal with all the different contingencies that might be the result of a safepoint, including deoptimization. 
FTR if the intrinsic is represented as a stand-alone stub, there's no need to care about deoptimization. (In such cases, deopts happen on return from the stub.) It wouldn't be allowed to be a leaf call anymore, but a safepoint check and an OOP map would do the job. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From vlivanov at openjdk.org Mon Nov 14 18:32:38 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 14 Nov 2022 18:32:38 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: <6lAQI6kDDTGbskylHcWReX8ExaB6qkwgqoai7E6ikZY=.8a69a63c-453d-4bbd-8c76-4d477bfb77fe@github.com> Message-ID: On Sun, 13 Nov 2022 21:08:53 GMT, Claes Redestad wrote: > How far off is this ...? Back then it looked way too constrained (tight constraints on code shapes). But I considered it as a generally applicable optimization. > ... do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc? Yes, it is able to build the constant table at runtime when folding multiplications of constant coefficients produced during loop unrolling and then packing scalars into a constant vector. Moreover, briefly looking at the code shape, the vectorizer would produce a more optimal loop shape (pre-loop would align vector accesses and would use 512-bit vectors when available; vector post-loop could help as well). ------------- PR: https://git.openjdk.org/jdk/pull/10847 From adubrouski at linkedin.com Mon Nov 14 21:06:53 2022 From: adubrouski at linkedin.com (Alex Dubrouski) Date: Mon, 14 Nov 2022 21:06:53 +0000 Subject: Allocation pacing and graceful degradation in ShenandoahGC Message-ID: Good afternoon everyone, I checked all video presentations and slides by Alex Shipilev and Roman Kennke about ShenandoahGC to find the answer for my question with no luck. 
I am trying to find more details about the transitions between modes in ShenandoahGC, because I am looking for a way to assess concurrent collector health in real time using different metrics. Here is the schema of transitions; it shows that an allocation failure causes a degenerated GC cycle, but it does not mention allocation pacing at all: https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L361 I tried to dig further into this logic, but I need your help to put all the pieces together. I was not able to trace the entry point precisely, but this might work: an allocation on the heap outside of a TLAB: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/memAllocator.cpp#L258 In the case of ShenandoahGC I assume we call https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L901 which then calls https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L821 If the mutator is allocating and the pacer is enabled (the default), we enter the Pacer: https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L828 https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L229 and, I assume, try to handle it nicely; if not, we start pacing: https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L253 I have a few questions here: - Could you please explain a bit how the system of taxes works? I assume mutators claim budget while the GC replenishes it asynchronously, but the details are missing and there are no comments in the code. - To pace, we use the wait function from the Monitor class https://github.com/openjdk/jdk/blame/master/src/hotspot/share/runtime/mutex.cpp#L232 but the first thing it does is get the current Java thread.
Does that mean that each Java thread goes through the runtime -> heap path to allocate, and that is how the pacer paces it? So we just pace any allocating thread, and threads that allocate more will simply hit this code more often. - The Pacer uses ShenandoahPacingMaxDelay (10ms) as a maximum, but pace_for_allocation returns void https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L826 https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L225 so I assume that if there is no budget available it will pace a thread for up to 10ms, but that does not imply an allocation failure. The heap class tries to allocate under a lock and, if unsuccessful, treats this as an allocation failure and handles it by calling ShenandoahControlThread. Does that mean the Pacer can't cause the GC to switch to degenerated mode, or am I missing something? - If the Pacer doesn't have budget to allocate memory it paces the thread, but is there any global budget for pacing time, or is it only a per-thread maximum (ShenandoahPacingMaxDelay)? - It would be really nice if you could shed some light on these transitions: Concurrent Mode -> Pacing (single-thread and total pacing time for all threads) -> most importantly, the logic of transitioning from pacing to degenerated GC. I am trying to build a model which can tell me whether the GC is healthy (fully concurrent), a bit unhealthy (pacing), or unhealthy (degenerated or full GC), and how close we are to the edge of the next state (a bit unhealthy -> unhealthy). No rush, and thanks a lot in advance. Regards, Alex Dubrouski -------------- next part -------------- An HTML attachment was scrubbed... 
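(For readers following along: the tax-and-budget scheme asked about above can be modeled in a few lines of Java. This is only my illustrative reading of the C++ code, not the real ShenandoahPacer API; all names here are made up. GC threads deposit budget as they make progress, mutators claim a "tax" before allocating, and a thread that cannot claim waits up to the maximum delay but is never failed by the pacer itself.)

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the pacer's tax-and-budget scheme. Hypothetical names,
// not the actual ShenandoahPacer interface.
final class ToyPacer {
    private final AtomicLong budget = new AtomicLong();
    static final long MAX_DELAY_MS = 10; // stand-in for ShenandoahPacingMaxDelay

    // Called by GC threads as they make progress (mark/evac/update-refs).
    void replenish(long words) { budget.addAndGet(words); }

    // Fast path: try to claim budget without blocking.
    boolean claim(long words) {
        long cur;
        do {
            cur = budget.get();
            if (cur < words) return false;          // not enough budget
        } while (!budget.compareAndSet(cur, cur - words));
        return true;
    }

    // Slow path: delay the allocating thread, but never fail the allocation.
    // After MAX_DELAY_MS the thread proceeds regardless; pacing is advisory.
    void paceForAllocation(long words) throws InterruptedException {
        long deadline = System.nanoTime() + MAX_DELAY_MS * 1_000_000;
        while (!claim(words) && System.nanoTime() < deadline) {
            Thread.sleep(1);                        // stand-in for Monitor::wait
        }
    }
}
```

On this reading, the pacer alone cannot trigger a degenerated cycle: only a genuine allocation failure under the heap lock notifies the control thread, which matches the question above.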
URL: From wkemper at openjdk.org Mon Nov 14 22:18:38 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 14 Nov 2022 22:18:38 GMT Subject: RFR: Change affiliation representation In-Reply-To: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Mon, 14 Nov 2022 21:14:53 GMT, Kelvin Nilsen wrote: > With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. > > Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 356: > 354: > 355: _regions = NEW_C_HEAP_ARRAY(ShenandoahHeapRegion*, _num_regions, mtGC); > 356: _affiliations = NEW_C_HEAP_ARRAY(uint8_t, _num_regions, mtGC); Should we only initialize this for generational mode? or are there calls to check affiliation in the other modes too? 
------------- PR: https://git.openjdk.org/shenandoah/pull/170 From kdnilsen at openjdk.org Mon Nov 14 22:18:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Mon, 14 Nov 2022 22:18:39 GMT Subject: RFR: Change affiliation representation In-Reply-To: References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Mon, 14 Nov 2022 21:32:59 GMT, William Kemper wrote: >> With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. >> >> Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. > > src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 356: > >> 354: >> 355: _regions = NEW_C_HEAP_ARRAY(ShenandoahHeapRegion*, _num_regions, mtGC); >> 356: _affiliations = NEW_C_HEAP_ARRAY(uint8_t, _num_regions, mtGC); > > Should we only initialize this for generational mode? or are there calls to check affiliation in the other modes too? Good point. I think it's only needed for generational mode. 
------------- PR: https://git.openjdk.org/shenandoah/pull/170 From kdnilsen at openjdk.org Mon Nov 14 22:18:38 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Mon, 14 Nov 2022 22:18:38 GMT Subject: RFR: Change affiliation representation Message-ID: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. 
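(A small Java sketch of the representational change described in this RFR may help; the real code is C++ and these names are hypothetical. The point is replacing a dependent load through each region object with a single load from a dense side table.)

```java
// Illustrative contrast between the two affiliation representations.
// Field and method names are made up for illustration.
final class Regions {
    enum Affiliation { FREE, YOUNG, OLD }
    private static final Affiliation[] VALUES = Affiliation.values();

    static final class Region {
        Affiliation affiliation = Affiliation.FREE; // old layout: per-object field
    }

    private final Region[] regions;
    private final byte[] affiliations;              // new layout: flat side table

    Regions(int numRegions) {
        regions = new Region[numRegions];
        for (int i = 0; i < numRegions; i++) regions[i] = new Region();
        affiliations = new byte[numRegions];
    }

    // Old path: index -> region pointer -> field (two dependent loads).
    Affiliation affiliationViaField(int idx) { return regions[idx].affiliation; }

    // New path: one load from a dense array; many regions share a cache line.
    Affiliation affiliationViaTable(int idx) { return VALUES[affiliations[idx]]; }

    void setAffiliation(int idx, Affiliation a) {
        regions[idx].affiliation = a;               // keep both views in sync
        affiliations[idx] = (byte) a.ordinal();
    }
}
```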
------------- Commit messages: - Fix white space - Merge remote-tracking branch 'GitFarmBranch/change-affiliation-rep-rebase' into change-affiliation-representation - Add detail to assert message - Fix syntax errorx - Make ShenandoahHeap::is_in_active_generation() mimic previous behavior - Inline check for in-active-generation - Force pipeline tests - Optimize implementation of region affiliation Changes: https://git.openjdk.org/shenandoah/pull/170/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=170&range=00 Stats: 137 lines in 6 files changed: 89 ins; 33 del; 15 mod Patch: https://git.openjdk.org/shenandoah/pull/170.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/170/head:pull/170 PR: https://git.openjdk.org/shenandoah/pull/170 From jrose at openjdk.org Mon Nov 14 23:03:01 2022 From: jrose at openjdk.org (John R Rose) Date: Mon, 14 Nov 2022 23:03:01 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> References: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> Message-ID: <_C2oCFsbq1QdFO_HjwfXHNt0XrtV06TqRK1a8lpiXsI=.4650c115-d734-4655-bc6a-ec46314ab5ed@github.com> On Fri, 28 Oct 2022 01:47:23 GMT, David Holmes wrote: > So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. > @rose00 why only adjacently? Nested locking can be interleaved on different monitors. Yes it can; you can have nesting A, B, A. But the thread-based fast-locking list might not cover that case. If it were restricted to only adjacent records in the way I sketched, it would need to use a different, slower technique for the A, B, A case. 
The trade-off is that if you only allow adjacent recursive locks on the list, you don't need to search the list beyond the first element to detect re-locking. Dunno if that pencils out to a real advantage, though, since the fallback is slow. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From jrose at openjdk.org Mon Nov 14 23:17:03 2022 From: jrose at openjdk.org (John R Rose) Date: Mon, 14 Nov 2022 23:17:03 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v8] In-Reply-To: References: Message-ID: On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. 
Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread. 
IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change makes it possible to simplify (and speed up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR. >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
>> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
>> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | 
-0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some Renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing by running them much more often). >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. 
>> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) >> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: > > - Merge remote-tracking branch 'upstream/master' into fast-locking > - Merge remote-tracking branch 'upstream/master' into fast-locking > - Merge remote-tracking branch 'upstream/master' into fast-locking > - More RISC-V fixes > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - RISC-V port > - Revert "Re-use r0 in call to unlock_object()" > > This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() > - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4 FTR I agree with Holmes that a conditional opt-in is better. While we are uncertain of the viability of the new scheme (which FTR I like!) for all our customers, we need to have a dynamic selection of the technique, so we can turn it on and off. Off by default at first, then later on by default, then on with no option at all if all goes well (which I hope it does). Perhaps Lilliput can have it turned on by default, and throw an error if (for some reason) the user tries to turn it off again. That's the way we phased in, and then phased out, biased locking, and it seems to me that this is a closely similar situation. Eventually, if all goes well, we can remove the stack locking code, as we did with biased locking. For more details about that long-running saga, one might look at the history of `-XX:+UseBiasedLocking` in the source base, perhaps starting with `$ git log -S UseBiasedLocking`. 
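(The thread-local lock-stack quoted in this thread, and John Rose's adjacent-recursion variant of it, can be sketched in ordinary Java. This is a hypothetical shape for illustration only; the real structure lives inside the JVM and holds oops.)

```java
import java.util.Arrays;

// Illustrative per-thread lock-stack: a small growable array of lock objects.
final class LockStack {
    private Object[] stack = new Object[4];
    private int top = 0;

    void push(Object lock) {
        if (top == stack.length) stack = Arrays.copyOf(stack, top * 2);
        stack[top++] = lock;
    }

    Object pop() {
        Object o = stack[--top];
        stack[top] = null; // don't retain the reference
        return o;
    }

    // 'does the current thread own me?' is a linear scan of a tiny array.
    boolean owns(Object lock) {
        for (int i = 0; i < top; i++) {
            if (stack[i] == lock) return true;
        }
        return false;
    }

    // With adjacent-only recursion (Rose's suggestion), detecting re-locking
    // only needs to inspect the topmost entry instead of scanning the array.
    boolean topIs(Object lock) { return top > 0 && stack[top - 1] == lock; }
}
```

The interleaved case (A, B, A) is exactly where `topIs` fails and a slower fallback would be needed, which is the trade-off discussed above.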
------------- PR: https://git.openjdk.org/jdk/pull/10590 From ysr at openjdk.org Mon Nov 14 23:26:32 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Mon, 14 Nov 2022 23:26:32 GMT Subject: RFR: Change affiliation representation In-Reply-To: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Mon, 14 Nov 2022 21:14:53 GMT, Kelvin Nilsen wrote: > With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. > > Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. The change looks good, modulo William's comment. (I also wonder if it might make sense to have a GenerationalShenandoahHeap derived from ShenandoahHeap that might make some of this kind of separation a bit cleaner. But that would be a bigger change for the future if one thought it was worthwhile to do a clean separation of the code for the two collectors.) A high level question, unrelated to the changes here. 
I imagine that we never consult a region's affiliation while it may be concurrently subject to update? (or when that happens, that any resulting race is benign)? Would it be worth adding a brief comment to that effect somewhere? (maybe where the _affiliation array is declared.) I am guessing the affiliation state graph looks like: free -> {young, old}; young -> {old, free}; old -> free. Would this be worth adding as a comment, maybe where set_affiliation() is implemented? None of these need to be done right away, but I thought I'd leave these thoughts in the review anyway. ------------- Marked as reviewed by ysr (Author). PR: https://git.openjdk.org/shenandoah/pull/170 From wkemper at openjdk.org Tue Nov 15 00:04:54 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 15 Nov 2022 00:04:54 GMT Subject: RFR: Change affiliation representation In-Reply-To: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: <2CO667eCc3G9zETOaCRk78-LwuNjyrXpaEas5V-Td2w=.e54415ca-ce18-459b-a4c9-35b79396768f@github.com> On Mon, 14 Nov 2022 21:14:53 GMT, Kelvin Nilsen wrote: > With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. > > Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. 
With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. Marked as reviewed by wkemper (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/170 From eosterlund at openjdk.org Tue Nov 15 08:27:59 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 15 Nov 2022 08:27:59 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Sat, 12 Nov 2022 08:08:15 GMT, Fei Yang wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. 
>> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > PS: I see JVM crashes when running Skynet with extra VM option: -XX:+VerifyContinuations on linux-aarch64 platform. > > $java --enable-preview -XX:+VerifyContinuations Skynet > > > # A fatal error has been detected by the Java Runtime Environment: > > # after -XX: or in .hotspotrc: SuppressErrorAt=# > # Internal Error/stackChunkOop.cpp (/home/realfyang/openjdk-jdk/src/hotspot/share/oops/stackChunkOop.cpp:433), pid=1904185:433, tid=1904206 > > [thread 1904216 also had an error]# assert(_chunk->bitmap().at(index)) failed: Bit not set at index 208 corresponding to 0x0000000637c512d0 > > # > # JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk) > # Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) @RealFYang did you have a chance to see if my RISC-V changes worked out for you? ------------- PR: https://git.openjdk.org/jdk/pull/11111 From ngasson at openjdk.org Tue Nov 15 09:27:58 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 15 Nov 2022 09:27:58 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v5] In-Reply-To: References: Message-ID: <21c7UgQbws1el5OEqpzNflxzmiQGWjLh9_nyRKXAbAg=.73855ffd-98aa-4bc9-ad5f-87ecf33c0495@github.com> On Mon, 14 Nov 2022 17:28:17 GMT, Nick Gasson wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. 
This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: > > Formatting fixes Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Tue Nov 15 09:34:55 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 15 Nov 2022 09:34:55 GMT Subject: Integrated: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. This pull request has now been integrated. 
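The striping scheme described in the PR above (split one hot counter into several independent cells and pick a cell by hashing the current thread) can be sketched in plain Java. This is an illustrative analogue, not the Shenandoah C++ code: the cell count, the 64-byte spacing trick, and the `StripedCounter` name are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch: a contended counter split over several cells so that
// threads hashing to different cells do not fight over one cache line.
// The real HotSpot code pads each counter to a full cache line; here we
// approximate that by spacing live slots 8 longs (64 bytes) apart.
final class StripedCounter {
    private static final int CELLS = 16;   // power of two, chosen for illustration
    private static final int STRIDE = 8;   // 8 longs = 64-byte spacing between cells
    private final AtomicLongArray cells = new AtomicLongArray(CELLS * STRIDE);

    // Hash the current thread to one cell, as the PR hashes on Thread*.
    private int index() {
        int h = System.identityHashCode(Thread.currentThread());
        return (h & (CELLS - 1)) * STRIDE;
    }

    void increment() { cells.incrementAndGet(index()); }
    void decrement() { cells.decrementAndGet(index()); }

    // Only a coordinator that needs the exact total pays for summing all cells.
    long sum() {
        long s = 0;
        for (int i = 0; i < CELLS; i++) s += cells.get(i * STRIDE);
        return s;
    }
}
```

Uncontended updates from different threads now usually land on different cache lines, which is exactly the contention the PR is trying to remove from the load-barrier slow path.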
Changeset: 8ab70d3b Author: Nick Gasson URL: https://git.openjdk.org/jdk/commit/8ab70d3b592db58f47ff538ae0a796237cd29f36 Stats: 181 lines in 4 files changed: 136 ins; 15 del; 30 mod 8294775: Shenandoah: reduce contention on _threads_in_evac Reviewed-by: rkennke, shade ------------- PR: https://git.openjdk.org/jdk/pull/10573 From fyang at openjdk.org Tue Nov 15 09:41:57 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 15 Nov 2022 09:41:57 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Sat, 12 Nov 2022 08:08:15 GMT, Fei Yang wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. 
No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > PS: I see JVM crashes when running Skynet with extra VM option: -XX:+VerifyContinuations on linux-aarch64 platform. > > $java --enable-preview -XX:+VerifyContinuations Skynet > > > # A fatal error has been detected by the Java Runtime Environment: > > # > > # Internal Error (/home/realfyang/openjdk-jdk/src/hotspot/share/oops/stackChunkOop.cpp:433), pid=1904185, tid=1904206 > > # assert(_chunk->bitmap().at(index)) failed: Bit not set at index 208 corresponding to 0x0000000637c512d0 > > # after -XX: or in .hotspotrc: SuppressErrorAt=/stackChunkOop.cpp:433 > > # > > # JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk) > > # Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) > > [thread 1904216 also had an error] > @RealFYang did you have a chance to see if my RISC-V changes worked out for you? Hi, I have performed tier1-3 tests on my linux-riscv64 HiFive Unmatched boards. Results look good. Thanks for handling riscv at the same time :-) ------------- PR: https://git.openjdk.org/jdk/pull/11111 From mcimadamore at openjdk.org Tue Nov 15 10:08:22 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 10:08:22 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v18] In-Reply-To: References: Message-ID: <0aDgn8bkT3gjULRqLX7_1doqGRJhDlva7S3Q-uYBtZ4=.23b372a9-8775-4d0c-900f-c8a12d1769b1@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Tweak preview feature description for JEP 434 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/cd3fbe7c..9b97bad6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=16-17 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 10:12:12 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 10:12:12 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v19] In-Reply-To: References: Message-ID: <43YEgUwCbX4IMeM2AjG_ZAytW-ibfIqCPW1fmBoYDpQ=.e2ef76bd-b10b-4785-976b-974501043f28@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 49 additional commits since the last revision: - Merge branch 'master' into PR_20 - Tweak preview feature description for JEP 434 - Tweak Arena::close javadoc - Merge pull request #15 from minborg/test Add @apiNote to package-info - Add @apiNote to package-info - Merge pull request #16 from minborg/fix-tests2 Fix failing tests - Fix failing tests - Rename isOwnedBy -> isCloseableBy Fix minor typos Fix StrLenTest/RingAllocator - Fix typo - More javadoc fixes - ... 
and 39 more: https://git.openjdk.org/jdk/compare/0ecc71f0...20ee6e8d ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/9b97bad6..20ee6e8d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=18 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=17-18 Stats: 15095 lines in 530 files changed: 6855 ins; 6001 del; 2239 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 11:14:35 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 11:14:35 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v20] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Rename MemorySession -> SegmentScope Improve javadoc of SegmentScope/Arena Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/20ee6e8d..5ae5864a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=19 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=18-19 Stats: 1298 lines in 125 files changed: 174 ins; 177 del; 947 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 11:16:22 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 11:16:22 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v19] In-Reply-To: <43YEgUwCbX4IMeM2AjG_ZAytW-ibfIqCPW1fmBoYDpQ=.e2ef76bd-b10b-4785-976b-974501043f28@github.com> References: <43YEgUwCbX4IMeM2AjG_ZAytW-ibfIqCPW1fmBoYDpQ=.e2ef76bd-b10b-4785-976b-974501043f28@github.com> Message-ID: <8jsBP6xJ2lT5UEIEHaGfI_Juqtj_pD1Plp7oynz81Zo=.695ba1b6-bfa9-442d-9cf2-425a7ed5a352@github.com> On Tue, 15 Nov 2022 10:12:12 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 49 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Tweak preview feature description for JEP 434 > - Tweak Arena::close javadoc > - Merge pull request #15 from minborg/test > > Add @apiNote to package-info > - Add @apiNote to package-info > - Merge pull request #16 from minborg/fix-tests2 > > Fix failing tests > - Fix failing tests > - Rename isOwnedBy -> isCloseableBy > Fix minor typos > Fix StrLenTest/RingAllocator > - Fix typo > - More javadoc fixes > - ... and 39 more: https://git.openjdk.org/jdk/compare/3ebf94de...20ee6e8d I've renamed `MemorySession` to `SegmentScope`, following some internal and external feedback. I've also greatly improved the javadoc of both `Arena` and `SegmentScope`. A javadoc of the API contained in this iteration can be found here: http://cr.openjdk.java.net/~mcimadamore/jdk/8295044/v3/javadoc/java.base/module-summary.html ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 11:19:26 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 11:19:26 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v21] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/5ae5864a..3d9cebde Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=20 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=19-20 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 12:34:43 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 12:34:43 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Add `since` tag in Module/ModuleLayer preview methods ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/3d9cebde..b2dd8926 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=21 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=20-21 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 14:43:08 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 14:43:08 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 32: > 30: > 31: /** > 32: * An arena controls the lifecycle of one or more memory segments, providing both flexible allocation and timely deallocation. Strictly: "An arena controls the lifecycle of zero or more ...". A newly created Arena, for example, does not control the lifecycle of any segment. 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 14:48:01 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 14:48:01 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: <213057Saw0-m7uFTwDAgWYorxtjExq17nhZ0ULRUWGk=.1a10e475-baf5-4c71-a5af-c6288d3db6cc@github.com> On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 35: > 33: *

      > 34: * An arena has a {@linkplain #scope() scope}, called the arena scope. When the arena is {@linkplain #close() closed}, > 35: * the arena scope becomes not {@linkplain SegmentScope#isAlive() alive}. As a result, all the Suggest "the arena scope is no longer {@linkplain SegmentScope#isAlive() alive}" instead. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 14:52:05 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 14:52:05 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 63: > 61: * after the arena has been closed. The cost of providing this guarantee varies based on the > 62: * number of threads that have access to the memory segments allocated by the arena. For instance, if an arena > 63: * is always created and closed by one thread, and the memory segments associated with the arena's scope are always Strictly, if a shared segment is created and is only accessed by a single thread, then we need to track thread usage in order to trivially ensure safety. I think we could reword so that if access is only *allowed* by a single thread, it is trivial. 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 14:55:14 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 14:55:14 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 14:49:30 GMT, Per Minborg wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Add `since` tag in Module/ModuleLayer preview methods > > src/java.base/share/classes/java/lang/foreign/Arena.java line 63: > >> 61: * after the arena has been closed. The cost of providing this guarantee varies based on the >> 62: * number of threads that have access to the memory segments allocated by the arena. For instance, if an arena >> 63: * is always created and closed by one thread, and the memory segments associated with the arena's scope are always > > ~~Strictly, if a shared segment is created and is only accessed by a single thread, then we need to track thread usage in order to trivially ensure safety. I think we could reword so that if access is only *allowed* by a single thread, it is trivial.~~ ok. So reading on the initial text makes sense. So, my comment above should be disregarded. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:07:04 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:07:04 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 100: > 98: * MemorySegment.allocateNative(bytesSize, byteAlignment, scope()); > 99: *} > 100: * More generally implementations of this method must return a native method featuring the requested size, ... must return a native ~~method~~*segment* featuring ... ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:11:15 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:11:15 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: <7T_4KxIMV_s9h3OjbniZId7ridKlYeJ1sXu3tgBor2c=.9e0d8f95-554b-4dc3-8b1b-27e78f85578d@github.com> On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 89: > 87: > 88: /** > 89: * Returns a native memory segment with the given size (in bytes) and alignment constraint (in bytes). It is noted that the current documentation does not require a **new** native memory segment to be returned. Would it not be better with: Creates a new native memory segment ... The new shared segment might share actual backing memory though. 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:15:45 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:15:45 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 119: > 117: > 118: /** > 119: * {@return the arena scope} Add a period ('.') after the closing curly bracket. src/java.base/share/classes/java/lang/foreign/Arena.java line 124: > 122: > 123: /** > 124: * Closes this arena. 
If this method completes normally, the arena scope becomes not {@linkplain SegmentScope#isAlive() alive}, See comment above "not alive" -> "is no longer alive" ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 15:28:50 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 15:28:50 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: <7T_4KxIMV_s9h3OjbniZId7ridKlYeJ1sXu3tgBor2c=.9e0d8f95-554b-4dc3-8b1b-27e78f85578d@github.com> References: <7T_4KxIMV_s9h3OjbniZId7ridKlYeJ1sXu3tgBor2c=.9e0d8f95-554b-4dc3-8b1b-27e78f85578d@github.com> Message-ID: On Tue, 15 Nov 2022 15:09:02 GMT, Per Minborg wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Add `since` tag in Module/ModuleLayer preview methods > > src/java.base/share/classes/java/lang/foreign/Arena.java line 89: > >> 87: >> 88: /** >> 89: * Returns a native memory segment with the given size (in bytes) and alignment constraint (in bytes). > > It is noted that the current documentation does not require a **new** native memory segment to be returned. Would it not be better with: > > Creates a new native memory segment ... > > The new shared segment might share actual backing memory though. My feeling is that being overly precise over identity might backfire. It is not important whether the segment is a new instance or not. But there is, perhaps, another invariant that is more semantically relevant: e.g. the returned segments (whether new or not, we don't care) should be backed by "disjoint" regions of memory. That is, if the method returns a segment with address `0` and size `100`, calling the method again cannot return a segment whose address is `50` and size is `100`. 
In principle, the segment allocator interface allows for this (see `SegmentAllocator::prefixAllocator`) - but for an arena, a behavior such as this would be undesirable, IMHO. > src/java.base/share/classes/java/lang/foreign/Arena.java line 119: > >> 117: >> 118: /** >> 119: * {@return the arena scope} > > Add a period ('.') after the closing curly bracket. This is a general comment. I don't think we did this consistently in other places, I'd prefer to leave as is. > src/java.base/share/classes/java/lang/foreign/Arena.java line 136: > >> 134: >> 135: /** >> 136: * {@return {@code true} if the provided thread can close this arena} > > I think this is equivalent and simpler: > > {@return if the provided thread can close this arena}. > > But I know there are many examples of {@code true} in the JDK. I'll leave as is - we can deal with these cosmetic javadoc issues at a later point. > src/java.base/share/classes/java/lang/foreign/GroupLayout.java line 46: > >> 44: >> 45: /** >> 46: * Returns the member layouts associated with this group. > > We may use {@return the member layouts associated with this group}. Same - I'll leave these tweaks for later. > src/java.base/share/classes/java/lang/foreign/Linker.java line 264: > >> 262: >> 263: /** >> 264: * Returns a symbol lookup for symbols in a set of commonly used libraries. > > Use {@return ...} Same - I'll leave these tweaks for later. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:34:17 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:34:17 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote:
>> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 163: > 161: * segment is derived from the address of the original segment, by adding an offset (expressed in bytes). The size of > 162: * the sliced segment is either derived implicitly (by subtracting the specified offset from the size of the original segment), > 163: * or provided explicitly. In other words, a sliced segment has stricter spatial bounds than those of the original segment: Strictly, a sliced segment can have the *same* spatial bounds as the original segment. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 15:38:49 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 15:38:49 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: <7T_4KxIMV_s9h3OjbniZId7ridKlYeJ1sXu3tgBor2c=.9e0d8f95-554b-4dc3-8b1b-27e78f85578d@github.com> Message-ID: On Tue, 15 Nov 2022 15:22:07 GMT, Maurizio Cimadamore wrote: >> src/java.base/share/classes/java/lang/foreign/Arena.java line 89: >> >>> 87: >>> 88: /** >>> 89: * Returns a native memory segment with the given size (in bytes) and alignment constraint (in bytes). >> >> It is noted that the current documentation does not require a **new** native memory segment to be returned. Would it not be better with: >> >> Creates a new native memory segment ... >> >> The new shared segment might share actual backing memory though. > > My feeling is that being overly precise over identity might backfire. It is not important whether the segment is a new instance or not. But there is, perhaps, another invariant that is more semantically relevant: e.g. 
the returned segments (whether new or not, we don't care) should be backed by "disjoint" regions of memory. That is, if the method returns a segment with address `0` and size `100`, calling the method again cannot return a segment whose address is `50` and size is `100`. In principle, the segment allocator interface allows for this (see `SegmentAllocator::prefixAllocator`) - but for an arena, a behavior such as this would be undesirable, IMHO. I will add: Furthermore, for any two segments S1, S2 returned by this method, the following invariant must hold: `S1.overlappingSlice(S2).isEmpty() == true` ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:00:08 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:00:08 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
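The non-overlap invariant proposed above can be checked with ordinary address arithmetic. Since `MemorySegment` is a preview API in this discussion, the standalone sketch below models an allocation as a plain (address, size) pair; the names `Region`, `overlaps`, and `allDisjoint` are mine, not part of the FFM API.

```java
import java.util.List;

// Models the invariant discussed above for arena allocation: any two
// segments handed out by one arena must cover disjoint address ranges.
record Region(long address, long size) {

    // Half-open ranges [address, address + size) intersect iff each one
    // starts before the other one ends.
    boolean overlaps(Region other) {
        return address < other.address + other.size
            && other.address < address + size;
    }

    // The arena-wide form of the invariant: no pair of allocations overlaps.
    static boolean allDisjoint(List<Region> allocations) {
        for (int i = 0; i < allocations.size(); i++)
            for (int j = i + 1; j < allocations.size(); j++)
                if (allocations.get(i).overlaps(allocations.get(j)))
                    return false;
        return true;
    }
}
```

With this model, the example from the thread (a segment at address 0 with size 100, followed by one at address 50) violates the invariant, while back-to-back allocations at 0 and 100 satisfy it.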

      > 78: * Shared arenas, on the other hand, have no owner thread. The segments created by a shared arena > 79: * can be {@linkplain SegmentScope#isAccessibleBy(Thread) accessed} by multiple threads. This might be useful when Suggest "can be {@linkplain SegmentScope#isAccessibleBy(Thread) accessed} by ~~multiple~~ *any* thread" ------------- PR: https://git.openjdk.org/jdk/pull/10872 From pminborg at openjdk.org Tue Nov 15 15:28:52 2022 From: pminborg at openjdk.org (Per Minborg) Date: Tue, 15 Nov 2022 15:28:52 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 12:34:43 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Add `since` tag in Module/ModuleLayer preview methods src/java.base/share/classes/java/lang/foreign/Arena.java line 136: > 134: > 135: /** > 136: * {@return {@code true} if the provided thread can close this arena} I think this is equivalent and simpler: {@return if the provided thread can close this arena}. But I know there are many examples of {@code true} in the JDK. src/java.base/share/classes/java/lang/foreign/GroupLayout.java line 46: > 44: > 45: /** > 46: * Returns the member layouts associated with this group. We may use {@return the member layouts associated with this group}. src/java.base/share/classes/java/lang/foreign/Linker.java line 264: > 262: > 263: /** > 264: * Returns a symbol lookup for symbols in a set of commonly used libraries. 
Use {@return ...} ------------- PR: https://git.openjdk.org/jdk/pull/10872 From rkennke at openjdk.org Tue Nov 15 15:51:08 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 15 Nov 2022 15:51:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_C2oCFsbq1QdFO_HjwfXHNt0XrtV06TqRK1a8lpiXsI=.4650c115-d734-4655-bc6a-ec46314ab5ed@github.com> References: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> <_C2oCFsbq1QdFO_HjwfXHNt0XrtV06TqRK1a8lpiXsI=.4650c115-d734-4655-bc6a-ec46314ab5ed@github.com> Message-ID: On Mon, 14 Nov 2022 22:59:22 GMT, John R Rose wrote: > > So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. > > @rose00 why only adjacently? Nested locking can be interleaved on different monitors. > > Yes it can; you can have nesting A, B, A. But the thread-based fast-locking list might not cover that case. If it were restricted to only adjacent records in the way I sketched, it would need to use a different, slower technique for the A, B, A case. The trade-off is that if you only allow adjacent recursive locks on the list, you don't need to search the list beyond the first element, to detect re-locking. Dunno if that pencils out to a real advantage, though, since the fallback is slow. TBH, I don't currently think that making fast-locking recursive is very important. In fact, the need for the fast-locking appears somewhat questionable to begin with - the scenario where it performs better than OM-locking is rather narrow and really only relevant for legacy code. Stack-locking and fast-locking only help workloads that 1. Do lots of uncontended, e.g. single-threaded locking and 2. Churn lots of monitor objects. It is not enough to use a single Vector a lot - the cost of allocating the OM would soon be amortized by lots of OM action.
In order for stack-/fast-locking to be useful, you have to have a workload that keeps allocating new lock objects and uses them only once or very few times. For example, I have seen this in OpenJDK's XML code, where the XSLT compiler would generate code that uses an ungodly amount of StringBuffers (this probably warrants a separate fix). Now, where would recursive locking support for the fast-locking path be useful? I have yet to see a workload that suffers because of a lack of recursive locking support. Implementing recursive fast-locking means we'd have to add code in the fast-path, and that would affect non-recursive locking as well. I'd rather keep the implementation simple and fast. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From mcimadamore at openjdk.org Tue Nov 15 15:58:46 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 15:58:46 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v24] In-Reply-To: References: Message-ID: <2K-hydg-uLovxuhq4-WgeYlZPtj-INuCGlEKieRg77E=.de717cd6-8104-4402-b935-7ccb90199e4f@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix tests broken by MemorySession rename ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/19e0f6d5..54fb4856 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=23 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=22-23 Stats: 290 lines in 37 files changed: 0 ins; 2 del; 288 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 15:38:52 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 15:38:52 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v22] In-Reply-To: References: Message-ID: On Tue, 15 Nov 2022 15:31:58 GMT, Per Minborg wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Add `since` tag in Module/ModuleLayer preview methods > > src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 163: > >> 161: * segment is derived from the address of the original segment, by adding an offset (expressed in bytes). The size of >> 162: * the sliced segment is either derived implicitly (by subtracting the specified offset from the size of the original segment), >> 163: * or provided explicitly. In other words, a sliced segment has stricter spatial bounds than those of the original segment: > > Strictly, a sliced segment can have the *same* spatial bounds as the original segment. True - but I think the current text is a good compromise e.g. it is narrative text. You can always dive into `MemorySegment::asSlice` and find out more. 
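As an aside, the "stricter or, degenerately, equal bounds" behavior of slicing can be demonstrated with the long-stable `ByteBuffer` API, used here purely as an analogy (this sketch deliberately avoids the preview `MemorySegment` API):

```java
import java.nio.ByteBuffer;

public class SliceBoundsDemo {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(16);

        // Typical slice: offset 4, length 8 -> strictly narrower bounds.
        ByteBuffer narrower = original.slice(4, 8);

        // Degenerate slice: offset 0, full length -> same bounds as the original.
        ByteBuffer same = original.slice(0, original.capacity());

        System.out.println(narrower.capacity()); // 8
        System.out.println(same.capacity());     // 16
    }
}
```

The degenerate full-length slice is exactly the case Per points out: same spatial bounds, but a distinct view object.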
------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 17:54:16 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 17:54:16 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v25] In-Reply-To: References: Message-ID: <-RoscJ-7QuJ7y50zTBcxRISETzsAnuWdhDjOKhkcLoU=.99cc6f49-c850-4d93-a40b-4cd953a99cb2@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix MapToMemorySegmentTest ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/54fb4856..b331a4fd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=24 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=23-24 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 15:48:28 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 15:48:28 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v23] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/b2dd8926..19e0f6d5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=21-22 Stats: 11 lines in 1 file changed: 5 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 18:03:39 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 18:03:39 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v26] In-Reply-To: References: Message-ID: <-rih5SODHs0oMsQlaTc_lny0Cz6YvYLa4Arjr3Sf0fA=.755847f0-6a14-4784-85ba-97be21e6656b@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix @since tag in SegmentScope ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/b331a4fd..5f60d052 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=25 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=24-25 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Tue Nov 15 18:47:39 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 15 Nov 2022 18:47:39 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: References: Message-ID: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix typo in SegmentScope javadoc ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/5f60d052..876587c3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=26 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=25-26 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From eosterlund at openjdk.org Wed Nov 16 14:11:01 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 16 Nov 2022 14:11:01 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Tue, 15 Nov 2022 09:39:27 GMT, Fei Yang wrote: >> PS: I see JVM crashes when running Skynet with extra VM option: -XX:+VerifyContinuations on linux-aarch64 platform. 
>> >> $java --enable-preview -XX:+VerifyContinuations Skynet >> >> >> # A fatal error has been detected by the Java Runtime Environment: >> >> # after -XX: or in .hotspotrc: SuppressErrorAt=# >> # Internal Error/stackChunkOop.cpp (/home/realfyang/openjdk-jdk/src/hotspot/share/oops/stackChunkOop.cpp:433), pid=1904185:433, tid=1904206 >> >> [thread 1904216 also had an error]# assert(_chunk->bitmap().at(index)) failed: Bit not set at index 208 corresponding to 0x0000000637c512d0 >> >> # >> # JRE version: OpenJDK Runtime Environment (20.0) (fastdebug build 20-internal-adhoc.realfyang.openjdk-jdk) >> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 20-internal-adhoc.realfyang.openjdk-jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) > >> @RealFYang did you have a chance to see if my RISC-V changes worked out for you? > > Hi, I have performed tier1-3 tests on my linux-riscv64 HiFive Unmatched boards. Results looks good. > Thanks for handling riscv at the same time :-) > > PS: Also passed Skynet test with all GCs plus extra VM options: -XX:+VerifyStack -XX:+VerifyContinuations Thanks @RealFYang! ------------- PR: https://git.openjdk.org/jdk/pull/11111 From kdnilsen at openjdk.org Wed Nov 16 14:43:08 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 16 Nov 2022 14:43:08 GMT Subject: RFR: Change affiliation representation [v2] In-Reply-To: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: > With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. 
In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. > > Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. Kelvin Nilsen has updated the pull request incrementally with six additional commits since the last revision: - Add comment to clarify that _affiliation required for non-generational - Add assertions to clarify locks required for setting affiliation - Adjust asserts for heap lock - Undo change to _affiliations add assertion to clarify heap lock - Only initialize _affiliations if mode is generational - Only initialize _affiiations in generational mode ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/170/files - new: https://git.openjdk.org/shenandoah/pull/170/files/1033b0cd..15a0825f Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=170&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=170&range=00-01 Stats: 56 lines in 5 files changed: 54 ins; 0 del; 2 mod Patch: https://git.openjdk.org/shenandoah/pull/170.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/170/head:pull/170 PR: https://git.openjdk.org/shenandoah/pull/170 From kdnilsen at openjdk.org Wed Nov 16 14:46:30 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 16 Nov 2022 14:46:30 GMT Subject: RFR: Change affiliation representation [v2] In-Reply-To: References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Wed, 16 
Nov 2022 14:43:08 GMT, Kelvin Nilsen wrote: >> With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. >> >> Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. > > Kelvin Nilsen has updated the pull request incrementally with six additional commits since the last revision: > > - Add comment to clarify that _affiliation required for non-generational > - Add assertions to clarify locks required for setting affiliation > - Adjust asserts for heap lock > - Undo change to _affiliations add assertion to clarify heap lock > - Only initialize _affiliations if mode is generational > - Only initialize _affiiations in generational mode In the most recently committed refinements, I have added comments and assertions to clarify the use of heap lock and safepoints as mechanisms to assure coherency of region affiliations. 
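The representation change under review can be sketched in miniature — Java is used only for illustration (the actual change is in HotSpot C++, and all names below are made up): affiliations live in a flat side array indexed by region number, so a lookup is a single array load rather than a pointer chase through the per-region object.

```java
public class RegionAffiliations {
    public enum Affiliation { FREE, YOUNG, OLD }

    private static final Affiliation[] VALUES = Affiliation.values();

    // One byte per region; index = region number. New regions start FREE (ordinal 0).
    private final byte[] affiliations;

    public RegionAffiliations(int numRegions) {
        this.affiliations = new byte[numRegions];
    }

    public Affiliation get(int regionIndex) {
        // One load from a densely packed array; no per-region object dereference.
        return VALUES[affiliations[regionIndex]];
    }

    // In the real collector, mutation happens under the heap lock or at a safepoint.
    public void set(int regionIndex, Affiliation affiliation) {
        affiliations[regionIndex] = (byte) affiliation.ordinal();
    }

    public static void main(String[] args) {
        RegionAffiliations regions = new RegionAffiliations(1024);
        regions.set(7, Affiliation.OLD);
        System.out.println(regions.get(7)); // OLD
        System.out.println(regions.get(8)); // FREE
    }
}
```

Compared to storing the affiliation in a field of each region object, consecutive lookups touch one contiguous array, which is where the cache-locality win reported in the measurements comes from.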
------------- PR: https://git.openjdk.org/shenandoah/pull/170 From kdnilsen at openjdk.org Wed Nov 16 14:46:33 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 16 Nov 2022 14:46:33 GMT Subject: RFR: Change affiliation representation [v2] In-Reply-To: References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Mon, 14 Nov 2022 21:53:49 GMT, Kelvin Nilsen wrote: >> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 356: >> >>> 354: >>> 355: _regions = NEW_C_HEAP_ARRAY(ShenandoahHeapRegion*, _num_regions, mtGC); >>> 356: _affiliations = NEW_C_HEAP_ARRAY(uint8_t, _num_regions, mtGC); >> >> Should we only initialize this for generational mode? or are there calls to check affiliation in the other modes too? > > Good point. I think it's only needed for generational mode. After further investigation, I see that we need _affiliation for non-generational mode to keep track of FREE status. I've added a comment to this effect. ------------- PR: https://git.openjdk.org/shenandoah/pull/170 From kdnilsen at openjdk.org Wed Nov 16 14:50:05 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Wed, 16 Nov 2022 14:50:05 GMT Subject: Integrated: Change affiliation representation In-Reply-To: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Mon, 14 Nov 2022 21:14:53 GMT, Kelvin Nilsen wrote: > With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. 
This saves one level of indirection and improves cache locality for looking up the affiliation of each region. > > Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. This pull request has now been integrated. Changeset: 419a3a2a Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/419a3a2ab30f92a15a3c7205dfa0f22502ee69e5 Stats: 192 lines in 8 files changed: 143 ins; 33 del; 16 mod Change affiliation representation Reviewed-by: ysr, wkemper ------------- PR: https://git.openjdk.org/shenandoah/pull/170 From rrich at openjdk.org Wed Nov 16 15:50:17 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 16 Nov 2022 15:50:17 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v3] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: <6p2iTiK-RvQtQUUvZHID1kpjZEB1wb72CHEi5X_-zuA=.c447ad16-8a1c-4af3-a062-0b1acbbcc1d0@github.com> On Mon, 14 Nov 2022 16:07:34 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom.
>> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: > > Indentation fix Hi @fisk, I've skimmed the changes. They look good to me. I do have a few comments/questions also. src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp line 876: > 874: > 875: OopMap* map = new OopMap(((int)ContinuationEntry::size() + wordSize) / VMRegImpl::stack_slot_size, 0 /* arg_slots*/); > 876: ContinuationEntry::setup_oopmap(map); I'd suggest to add a comment where the oops are handled. src/hotspot/share/gc/shared/barrierSetStackChunk.cpp line 68: > 66: > 67: virtual void do_oop(oop* p) override { > 68: if (UseCompressedOops) { Wouldn't it be better to hoist the check for `UseCompressedOops`? src/hotspot/share/gc/shenandoah/shenandoahBarrierSetStackChunk.cpp line 30: > 28: > 29: void ShenandoahBarrierSetStackChunk::encode_gc_mode(stackChunkOop chunk, OopIterator* oop_iterator) { > 30: // Nothing to do Shenandoah allows `UseCompressedOops` enabled, doesn't it? Isn't it necessary then to do the encoding as in the super class?
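Regarding the question above about hoisting the `UseCompressedOops` check out of `do_oop`: the general pattern — test an invariant flag once and select the per-element behavior up front, instead of re-branching inside the hot callback — can be sketched in Java (illustrative only; the real code is HotSpot C++, and every name here is made up):

```java
import java.util.function.IntUnaryOperator;

public class HoistedCheckDemo {
    // useCompressed stands in for the UseCompressedOops flag, which is
    // invariant for the whole iteration.
    static int[] encodeAll(int[] oops, boolean useCompressed) {
        // The flag is consulted once, not once per element.
        IntUnaryOperator encoder = useCompressed
                ? v -> v >>> 3   // pretend "compression": drop alignment bits
                : v -> v;        // identity when not compressing
        int[] out = new int[oops.length];
        for (int i = 0; i < oops.length; i++) {
            out[i] = encoder.applyAsInt(oops[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] raw = {8, 64};
        System.out.println(encodeAll(raw, true)[0]);  // 1
        System.out.println(encodeAll(raw, false)[0]); // 8
    }
}
```

In the C++ original a JIT will often hoist such a loop-invariant branch anyway, so the question is largely about clarity rather than guaranteed codegen.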
------------- PR: https://git.openjdk.org/jdk/pull/11111 From alanb at openjdk.org Wed Nov 16 16:04:17 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 16 Nov 2022 16:04:17 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Tue, 15 Nov 2022 18:47:39 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in SegmentScope javadoc src/java.base/share/classes/java/lang/foreign/Arena.java line 132: > 130: * and all the memory segments associated with it can no longer be accessed. Furthermore, any off-heap region of memory backing the > 131: * segments associated with that scope are also released. > 132: * @throws IllegalStateException if the arena has already been {@linkplain #close() closed}. It's not wrong to specify that close throw if already closed but it goes against the advice in AutoCloseable to try to have close methods be idempotent. There may be a good reason for this but I can't help wondering if there are error cases when wrapping that might lead to close being called more than once. 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 16 16:16:21 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 16 Nov 2022 16:16:21 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Wed, 16 Nov 2022 16:01:52 GMT, Alan Bateman wrote: >> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix typo in SegmentScope javadoc > > src/java.base/share/classes/java/lang/foreign/Arena.java line 132: > >> 130: * and all the memory segments associated with it can no longer be accessed. Furthermore, any off-heap region of memory backing the >> 131: * segments associated with that scope are also released. >> 132: * @throws IllegalStateException if the arena has already been {@linkplain #close() closed}. > > It's not wrong to specify that close throw if already closed but it goes against the advice in AutoCloseable to try to have close methods be idempotent. There may be a good reason for this but I can't help wondering if there are error cases when wrapping that might lead to close being called more than once. In our experience with using the API, having exceptions when something is funny about close is very valuable info (as also stated in the javadoc). Almost always there's a subtle temporal bug going on which the ISE catches. I'm not sure if here you refer to the fact that the javadoc is being overly broad in saying "already been closed" instead of "already been closed _successfully_" ? What kind of problems are you thinking of? 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 16 16:16:22 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 16 Nov 2022 16:16:22 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Tue, 15 Nov 2022 18:47:39 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in SegmentScope javadoc src/java.base/share/classes/java/lang/foreign/SegmentScope.java line 8: > 6: > 7: /** > 8: * A segment scope controls access to a memory segment. A passing comment here is that "to a memory segment" hints of one-to-one relationship when it's actually one-to-many. Arena is specified to control the lifecycle "of memory segments". ------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 16 16:40:31 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 16 Nov 2022 16:40:31 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Wed, 16 Nov 2022 16:13:16 GMT, Maurizio Cimadamore wrote: >> src/java.base/share/classes/java/lang/foreign/Arena.java line 132: >> >>> 130: * and all the memory segments associated with it can no longer be accessed. 
Furthermore, any off-heap region of memory backing the >>> 131: * segments associated with that scope are also released. >>> 132: * @throws IllegalStateException if the arena has already been {@linkplain #close() closed}. >> >> It's not wrong to specify that close throw if already closed but it goes against the advice in AutoCloseable to try to have close methods be idempotent. There may be a good reason for this but I can't help wondering if there are error cases when wrapping that might lead to close being called more than once. > > In our experience with using the API, having exceptions when something is funny about close is very valuable info (as also stated in the javadoc). Almost always there's a subtle temporal bug going on which the ISE catches. I'm not sure if here you refer to the fact that the javadoc is being overly broad in saying "already been closed" instead of "already been closed _successfully_" ? What kind of problems are you thinking of? Most of the AutoCloseable in the platform are Closeables where close is specified to have no effect when already closed. With a confined Arena it would be benign for the owner to invoke close again. If it's been useful at finding bugs then okay. The scenario that made me wonder about this is something like the following, where MyWrapper::close invokes Arena::close.
try (var arena = Arena.openConfined(); var wrapper = new MyWrapper(arena)) { : } ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 16 16:44:31 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 16 Nov 2022 16:44:31 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Wed, 16 Nov 2022 16:38:10 GMT, Alan Bateman wrote: >> In our experience with using the API, having exceptions when something is funny about close is very valuable info (as also stated in the javadoc). Almost always there's a subtle temporal bug going on which the ISE catches. I'm not sure if here you refer to the fact that the javadoc is being overly broad in saying "already been closed" instead of "already been closed _successfully_" ? What kind of problems are you thinking of? > > Most of the AutoCloseable in the platform are Closeables where close is specified to have no effect when already closed. With a confined Arena it would be benign for the owner to invoke close again. If it's been useful at finding bugs then okay. The scenario that made me wonder about this is something like the follow where MyWrapper::close invokes Arena::close. > > try (var arena = Arena.openConfined(); > var wrapper = new MyWrapper(arena)) { > : > } Actually, I see that the `@apiNote` we used to have has disappeared in the API reshuffling. I will add it back. 
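Alan's wrapping scenario above can be made concrete without the preview API — `StrictArena` below simulates a resource whose `close()` throws on a second invocation, and `MyWrapper` is the hypothetical wrapper from the comment. Because try-with-resources closes resources in reverse declaration order, the wrapper closes the arena first, and the implicit close of the arena itself then throws:

```java
public class DoubleCloseDemo {
    // Simulates an arena-like resource that rejects repeated close() calls.
    public static final class StrictArena implements AutoCloseable {
        private boolean closed;
        @Override public void close() {
            if (closed) throw new IllegalStateException("Already closed");
            closed = true;
        }
    }

    // Hypothetical wrapper whose close() delegates to the arena it was given.
    public static final class MyWrapper implements AutoCloseable {
        private final StrictArena arena;
        public MyWrapper(StrictArena arena) { this.arena = arena; }
        @Override public void close() { arena.close(); }
    }

    public static void main(String[] args) {
        try (var arena = new StrictArena();
             var wrapper = new MyWrapper(arena)) {
            // use the arena...
        } catch (IllegalStateException e) {
            // wrapper.close() already closed the arena; arena.close() runs second.
            System.out.println("second close rejected: " + e.getMessage());
        }
    }
}
```

An idempotent `close()` (the `Closeable` convention) would make this pattern benign, at the cost of masking the kind of temporal bugs Maurizio describes.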
------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 16 16:54:41 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 16 Nov 2022 16:54:41 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Wed, 16 Nov 2022 16:41:45 GMT, Maurizio Cimadamore wrote: >> Most of the AutoCloseable in the platform are Closeables where close is specified to have no effect when already closed. With a confined Arena it would be benign for the owner to invoke close again. If it's been useful at finding bugs then okay. The scenario that made me wonder about this is something like the follow where MyWrapper::close invokes Arena::close. >> >> try (var arena = Arena.openConfined(); >> var wrapper = new MyWrapper(arena)) { >> : >> } > > Actually, I see that the `@apiNote` we used to have has disappeared in the API reshuffling. I will add it back. > Most of the AutoCloseable in the platform are Closeables where close is specified to have no effect when already closed. With a confined Arena it would be benign for the owner to invoke close again. If it's been useful at finding bugs then okay. The scenario that made me wonder about this is something like the follow where MyWrapper::close invokes Arena::close. > > ``` > try (var arena = Arena.openConfined(); > var wrapper = new MyWrapper(arena)) { > : > } > ``` Sure - this would be problematic - however it seems an edge case (could the TWR just use MyWrapper?) I'd prefer to leave it as is for now, and revisit - so far we had no indications of this being a real problem, whereas we had cases where the thrown exception has been useful to spot issues. If consistency with the rest of the JDK is considered more important we can fix it later. 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Wed Nov 16 16:54:47 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Wed, 16 Nov 2022 16:54:47 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Tue, 15 Nov 2022 18:47:39 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in SegmentScope javadoc src/java.base/share/classes/java/lang/foreign/Arena.java line 132: > 130: * and all the memory segments associated with it can no longer be accessed. Furthermore, any off-heap region of memory backing the > 131: * segments associated with that scope are also released. > 132: * @throws IllegalStateException if the arena has already been {@linkplain #close() closed}. JavaDoc was pointing to itself. Suggestion: * @throws IllegalStateException if the arena has already been closed. src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 109: > 107: * Finally, access operations on a memory segment are subject to the thread-confinement checks enforced by the associated > 108: * scope; that is, if the segment is the {@linkplain SegmentScope#global() global scope} or an {@linkplain SegmentScope#auto() automatic scope}, > 109: * it can be accessed by multiple threads. If the segment is associatd with an arena scope, then it can only be Typo: Suggestion: * it can be accessed by multiple threads. 
If the segment is associated with an arena scope, then it can only be src/java.base/share/classes/java/lang/foreign/SegmentScope.java line 10: > 8: * A segment scope controls access to a memory segment. > 9: *

      > 10: * A memory segment can only be accessed while its scope is {@linkplain #isAlive() alive}. Moreoever, Typo: Suggestion: * A memory segment can only be accessed while its scope is {@linkplain #isAlive() alive}. Moreover, ------------- PR: https://git.openjdk.org/jdk/pull/10872 From redestad at openjdk.org Wed Nov 16 18:22:30 2022 From: redestad at openjdk.org (Claes Redestad) Date: Wed, 16 Nov 2022 18:22:30 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 
7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 
0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode I'm getting pulled into other tasks and would request for this to be either accepted as-is, rejected or picked up by someone else to rewrite it to something that can be accepted. Obviously I'm biased towards acceptance: While imperfect, it provides improved testing - both functional and performance-wise - and establishes a significantly improved benchmark for more future-proof solutions to beat. There are many ways to iteratively improve upon this solution, some of which would even simplify the implementation. But in the face of upcoming changes that might allow C2 to optimize these kinds of loops without intrinsic support I am not sure spending more time on perfecting the current patch is worth our while. Rejecting it might be the reasonable thing to do, too, especially if the C2 loop optimizations @iwanowww points out might be coming around sooner rather than later. 
Even if that's not coming soon, the PR at hand adds a chunk of complexity for the compiler team to maintain. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From ysr at openjdk.org Wed Nov 16 23:38:59 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Wed, 16 Nov 2022 23:38:59 GMT Subject: RFR: Change affiliation representation [v2] In-Reply-To: References: <1pJV7ot6gQZCl7rnp_Pj5m-HF24GxEXHxCqHn5-6FYU=.c54f95e4-7b9f-4dc6-b8c3-5e4d76f9d26f@github.com> Message-ID: On Wed, 16 Nov 2022 14:43:08 GMT, Kelvin Nilsen wrote: >> With generational mode of Shenandoah, each region is associated with OLD, YOUNG, or FREE. During certain marking and update-refs activities, the region affiliation is frequently consulted. In the original implementation, the region affiliation was stored in a field of the ShenandoahHeapRegion object. In this code, we maintain a separate array, indexed by region number, to represent the affiliation of each region. This saves one level of indirection and improves cache locality for looking up the affiliation of each region. >> >> Measurements show significant improvement in throughput. One workload that was configured to perform back-to-back old-gen collections was able to increase the frequency of old-gen collections by almost 5 fold. With a 20-minute Extremem workload using a 48G heap, 20G old-gen, the P95 latency improvement was 0.54% (2.395 ms) and the P99.999 latency improvement was 58.21% (29.195 ms) in comparison to the implementation before this patch. 
> > Kelvin Nilsen has updated the pull request incrementally with six additional commits since the last revision: > > - Add comment to clarify that _affiliation required for non-generational > - Add assertions to clarify locks required for setting affiliation > - Adjust asserts for heap lock > - Undo change to _affiliations add assertion to clarify heap lock > - Only initialize _affiliations if mode is generational > - Only initialize _affiiations in generational mode src/hotspot/share/gc/shenandoah/shenandoahHeap.inline.hpp line 619: > 617: // YOUNG L X > 618: // OLD L X X > 619: // X means state transition won't happen (so don't care) Nits: 1. Instead of the blank entry in the cell for Young->Old, may be use `N` and replace `Blank` in the legend below? 2. If we don't expect some transitions (and believe those are incorrect), it might make sense to assert that in `set_affiliation()`? 3. Further below, it looks like all current (and expected) uses of `assert_lock_for_affiliation` will be from assertion checking code in non-release builds. In that case, you can avoid the `#ifdef ASSERT` verbage at call sites by declaring the method as `PRODUCT_RETURN` which should elide the empty method in release builds. ------------- PR: https://git.openjdk.org/shenandoah/pull/170 From eosterlund at openjdk.org Thu Nov 17 09:24:04 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 17 Nov 2022 09:24:04 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4] In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. 
> > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. 
Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: Fix Richard comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11111/files - new: https://git.openjdk.org/jdk/pull/11111/files/b20563f5..3de25624 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=02-03 Stats: 5 lines in 2 files changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Thu Nov 17 09:30:25 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 17 Nov 2022 09:30:25 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v3] In-Reply-To: <6p2iTiK-RvQtQUUvZHID1kpjZEB1wb72CHEi5X_-zuA=.c447ad16-8a1c-4af3-a062-0b1acbbcc1d0@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> <6p2iTiK-RvQtQUUvZHID1kpjZEB1wb72CHEi5X_-zuA=.c447ad16-8a1c-4af3-a062-0b1acbbcc1d0@github.com> Message-ID: <6mluvJmDKsrEmQ7eHAGIkJFkTeAtBugp2J0ZxG7bx_E=.d57b6f9f-e14c-490e-b455-f5eaa7c99da4@github.com> On Wed, 16 Nov 2022 15:47:37 GMT, Richard Reingruber wrote: >> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Indentation fix > Hi @fisk, I've skimmed the changes. They look good to me. I do have a few comments/questions also. Thanks @reinrich for the review! I have pushed a comment for the continuation oops where they are handled as requested. > src/hotspot/share/gc/shared/barrierSetStackChunk.cpp line 68: > >> 66: >> 67: virtual void do_oop(oop* p) override { >> 68: if (UseCompressedOops) { > Wouldn't it be better to hoist the check for `UseCompressedOops`? The compiler should be able to do that already. 
We devirtualize calls into oop closures, and the closure is stack allocated. So the compiler should be able to do that if it finds that it is a good idea. I'd prefer to leave that to the compiler. > src/hotspot/share/gc/shenandoah/shenandoahBarrierSetStackChunk.cpp line 30: > >> 28: >> 29: void ShenandoahBarrierSetStackChunk::encode_gc_mode(stackChunkOop chunk, OopIterator* oop_iterator) { >> 30: // Nothing to do > Shenandoah allows `UseCompressedOops` enabled, doesn't it? Isn't it necessary then to do the encoding as in the super class? No we don't convert the oops for Shenandoah. Instead, Shenandoah's closures know how to deal with both oop* and narrowOop* on the heap, and will get passed the appropriate type of pointer. So it doesn't use the bitmap. I have tested that it works with Shenandoah as well. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From rrich at openjdk.org Thu Nov 17 11:20:59 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 17 Nov 2022 11:20:59 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v3] In-Reply-To: <6mluvJmDKsrEmQ7eHAGIkJFkTeAtBugp2J0ZxG7bx_E=.d57b6f9f-e14c-490e-b455-f5eaa7c99da4@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> <6p2iTiK-RvQtQUUvZHID1kpjZEB1wb72CHEi5X_-zuA=.c447ad16-8a1c-4af3-a062-0b1acbbcc1d0@github.com> <6mluvJmDKsrEmQ7eHAGIkJFkTeAtBugp2J0ZxG7bx_E=.d57b6f9f-e14c-490e-b455-f5eaa7c99da4@github.com> Message-ID: On Thu, 17 Nov 2022 09:23:48 GMT, Erik Österlund wrote: >> src/hotspot/share/gc/shared/barrierSetStackChunk.cpp line 68: >> >>> 66: >>> 67: virtual void do_oop(oop* p) override { >>> 68: if (UseCompressedOops) { >> >> Wouldn't it be better to hoist the check for `UseCompressedOops`? > > The compiler should be able to do that already. 
I'd prefer to leave that to the compiler. `CompressOopsOopClosure::do_oop()` and `FrameOopIterator::oops_do()` are defined in different compilation units. So calls to `do_oop()` cannot be devirtualized or am I missing something? Mistaken or not, I'm ok with this version. >> src/hotspot/share/gc/shenandoah/shenandoahBarrierSetStackChunk.cpp line 30: >> >>> 28: >>> 29: void ShenandoahBarrierSetStackChunk::encode_gc_mode(stackChunkOop chunk, OopIterator* oop_iterator) { >>> 30: // Nothing to do >> >> Shenandoah allows `UseCompressedOops` enabled, doesn't it? Isn't it necessary then to do the encoding as in the super class? > > No we don't convert the oops for Shenandoah. Instead, Shenandoah's closures know how to deal with both oop* and narrowOop* on the heap, and will get passed the appropriate type of pointer. So it doesn't use the bitmap. I have tested that it works with Shenandoah as well. Interesting and good to know. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From rrich at openjdk.org Thu Nov 17 11:26:22 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 17 Nov 2022 11:26:22 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Thu, 17 Nov 2022 09:24:04 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. 
>> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: > > Fix Richard comments Not an expert of every aspect but the changes look good to me. Thanks, Richard. Marked as reviewed by rrich (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Thu Nov 17 12:10:58 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 17 Nov 2022 12:10:58 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Thu, 17 Nov 2022 11:23:07 GMT, Richard Reingruber wrote: >> Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Richard comments > > Marked as reviewed by rrich (Reviewer). Thanks for the review, @reinrich! 
------------- PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Thu Nov 17 12:10:59 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Thu, 17 Nov 2022 12:10:59 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v3] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> <6p2iTiK-RvQtQUUvZHID1kpjZEB1wb72CHEi5X_-zuA=.c447ad16-8a1c-4af3-a062-0b1acbbcc1d0@github.com> <6mluvJmDKsrEmQ7eHAGIkJFkTeAtBugp2J0ZxG7bx_E=.d57b6f9f-e14c-490e-b455-f5eaa7c99da4@github.com> Message-ID: On Thu, 17 Nov 2022 11:16:52 GMT, Richard Reingruber wrote: >> The compiler should be able to do that already. We devirtualize calls into oop closures, and the closure is stack allocated. So the compiler should be able to do that if it finds that it is a good idea. I'd prefer to leave that to the compiler. > > `CompressOopsOopClosure::do_oop()` and `FrameOopIterator::oops_do()` are defined in different compilation units. So calls to `do_oop()` cannot be devirtualized or am I missing something? > Mistaken or not, I'm ok with this version. Sorry, my bad. You are right - it can't devirtualize. Anyway, I'd like to keep it the way it is as I don't think it's worth optimizing this. ------------- PR: https://git.openjdk.org/jdk/pull/11111 From duke at openjdk.org Thu Nov 17 16:20:29 2022 From: duke at openjdk.org (ExE Boss) Date: Thu, 17 Nov 2022 16:20:29 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v27] In-Reply-To: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> References: <-Lw-dDGfVAZlOT815DeyvfwP0NTWWbj4X0lrl9ek_iQ=.70a5ad19-062f-488d-97fb-f8d923c2dc17@github.com> Message-ID: On Tue, 15 Nov 2022 18:47:39 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Fix typo in SegmentScope javadoc src/java.base/share/classes/jdk/internal/foreign/MemorySessionImpl.java line 77: > 75: } catch (Throwable ex) { > 76: throw new ExceptionInInitializerError(ex); > 77: } The above `catch` clause should only catch `Exception`s, not `Throwable`s, as the latter would hide VM errors such as `StackOverflowError` or `OutOfMemoryError`. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From wkemper at openjdk.org Thu Nov 17 23:21:27 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 17 Nov 2022 23:21:27 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode Message-ID: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). 
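[Editor's note] The `catch (Throwable)` review comment earlier in this thread can be illustrated with a self-contained sketch; the class and reflective lookup below are hypothetical, not the actual `MemorySessionImpl` code. Catching `Exception` wraps ordinary initialization failures in `ExceptionInInitializerError`, while VM errors such as `StackOverflowError` or `OutOfMemoryError` raised during the lookup still propagate instead of being hidden:

```java
import java.lang.reflect.Method;

// Hypothetical sketch of the narrower-catch pattern suggested in the review.
final class InitDemo {
    static final Method LENGTH;

    static {
        try {
            // Stand-in for a reflective lookup that can fail with a checked exception.
            LENGTH = String.class.getMethod("length");
        } catch (Exception ex) {
            // Narrower than Throwable: an Error (e.g. a StackOverflowError thrown
            // during the lookup) propagates instead of being wrapped here.
            throw new ExceptionInInitializerError(ex);
        }
    }
}
```

With `catch (Throwable)`, the same block would also wrap VM errors, masking the real failure mode behind an `ExceptionInInitializerError`.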
------------- Commit messages: - Do not apply evacuation budgets in non-generational mode Changes: https://git.openjdk.org/shenandoah/pull/171/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=171&range=00 Stats: 738 lines in 2 files changed: 227 ins; 238 del; 273 mod Patch: https://git.openjdk.org/shenandoah/pull/171.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/171/head:pull/171 PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Thu Nov 17 23:27:23 2022 From: wkemper at openjdk.org (William Kemper) Date: Thu, 17 Nov 2022 23:27:23 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: Merge tag jdk-20+24. This includes [8294775](https://bugs.openjdk.org/browse/JDK-8294775): Shenandoah: reduce contention on _threads_in_evac ------------- Commit messages: - Merge tag 'jdk-20+24' into merge-jdk-20-24 - 8296453: Simplify resource_area uses in ClassPathDirEntry::open_stream - 8296956: [JVMCI] HotSpotResolvedJavaFieldImpl.getIndex returns wrong value - 8296442: EncryptedPrivateKeyInfo can be created with an uninitialized AlgorithmParameters - 8296967: [JVMCI] rationalize relationship between getCodeSize and getCode in ResolvedJavaMethod - 8296960: [JVMCI] list HotSpotConstantPool.loadReferencedType to ConstantPool - 8283238: make/scripts/compare.sh should show the diff when classlist does not match - 8297006: JFR: AbstractEventStream should not hold thread instance - 8296961: [JVMCI] Access to j.l.r.Method/Constructor/Field for ResolvedJavaMethod/ResolvedJavaField - 8296958: [JVMCI] add API for retrieving ConstantValue attributes - ... 
and 180 more: https://git.openjdk.org/shenandoah/compare/419a3a2a...9611198d The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=172&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=172&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/172/files Stats: 25672 lines in 939 files changed: 13672 ins; 7992 del; 4008 mod Patch: https://git.openjdk.org/shenandoah/pull/172.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/172/head:pull/172 PR: https://git.openjdk.org/shenandoah/pull/172 From kdnilsen at openjdk.org Thu Nov 17 23:50:06 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Thu, 17 Nov 2022 23:50:06 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). Thanks for this fix. I think I may have introduced the bad code. 
------------- PR: https://git.openjdk.org/shenandoah/pull/171 From ysr at openjdk.org Fri Nov 18 00:12:58 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 18 Nov 2022 00:12:58 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > I'm not sure why the diff algorithm is struggling so much with these changes. Are the diff struggles in git (I assume), or in github (I assume not). I have seen git have issues with indentation that caused issues with merges in like manner recently. :-( I wonder if there's a git option for indentation that would render this diff better? > ``` > --indent-heuristic > Enable the heuristic that shifts diff hunk boundaries to make patches easier to read. This is the default. > > --no-indent-heuristic > Disable the indent heuristic. > ``` ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 00:16:37 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 00:16:37 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). 
This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). The diff looked hairy even in `p4merge`. ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From ysr at openjdk.org Fri Nov 18 00:16:37 2022 From: ysr at openjdk.org (Y. Srinivas Ramakrishna) Date: Fri, 18 Nov 2022 00:16:37 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: <74DvJ_udsR7P2gj7WiAAX8LrcbyOUismazIIaPTTux8=.d1253899-33bc-4587-85eb-6a496ab9eaa1@github.com> On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). 
Interesting, I wonder if this would make any diff (pun not intended): --diff-algorithm={patience|minimal|histogram|myers} Choose a diff algorithm. The variants are as follows: default, myers The basic greedy diff algorithm. Currently, this is the default. minimal Spend extra time to make sure the smallest possible diff is produced. patience Use "patience diff" algorithm when generating patches. histogram This algorithm extends the patience algorithm to "support low-occurrence common elements". For instance, if you configured the diff.algorithm variable to a non-default value and want to use the default one, then you have to use --diff-algorithm=default option. ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 00:22:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 00:22:17 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). 
I think my editor applied some bonus formatting... One second, I will fix up the patch. ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 00:34:18 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 00:34:18 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode [v2] In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). 
William Kemper has updated the pull request incrementally with one additional commit since the last revision: Try to improve diff ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/171/files - new: https://git.openjdk.org/shenandoah/pull/171/files/725448fa..b586761f Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=171&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=171&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/shenandoah/pull/171.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/171/head:pull/171 PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 00:34:19 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 00:34:19 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). 
Try hiding whitespace on github - looks much better to me: ![image](https://user-images.githubusercontent.com/71722661/202588802-a768c9f4-f3fb-45d6-99f5-08b1b75f270a.png) ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 00:34:20 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 00:34:20 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode [v2] In-Reply-To: References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Fri, 18 Nov 2022 00:30:00 GMT, William Kemper wrote: >> CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. >> >> I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Try to improve diff src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 795: > 793: } > 794: > 795: heap->assert_pinned_region_status(); This is called again immediately inside collection set selection so I removed this invocation. 
------------- PR: https://git.openjdk.org/shenandoah/pull/171 From kdnilsen at openjdk.org Fri Nov 18 19:24:39 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 18 Nov 2022 19:24:39 GMT Subject: RFR: Do not apply evacuation budgets in non-generational mode [v2] In-Reply-To: References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: <6icokonz57EwK_I2KqKKfIT0pyV-kGFK5hLseJRHT70=.483d68a0-2773-4dc3-b8be-8b73c9c8bba7@github.com> On Fri, 18 Nov 2022 00:34:18 GMT, William Kemper wrote: >> CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. >> >> I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). > > William Kemper has updated the pull request incrementally with one additional commit since the last revision: > > Try to improve diff Marked as reviewed by kdnilsen (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From wkemper at openjdk.org Fri Nov 18 19:49:12 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 19:49:12 GMT Subject: RFR: Merge openjdk/jdk:master [v2] In-Reply-To: References: Message-ID: <_ug00T22lcJp_iwb-n0ru9SoltlmmFXF1enDt3HESkA=.f9943f59-2405-4182-9c91-3b3cb960cd99@github.com> > Merge tag jdk-20+24. 
This includes [8294775](https://bugs.openjdk.org/browse/JDK-8294775): Shenandoah: reduce contention on _threads_in_evac William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 181 commits: - Merge tag 'jdk-20+24' into merge-jdk-20-24 Added tag jdk-20+24 for changeset 2159170b - Change affiliation representation Reviewed-by: ysr, wkemper - Improve some defaults and remove unused options for generational mode Reviewed-by: rkennke - Merge openjdk/jdk:master Reviewed-by: rkennke - Improve evacuation instrumentation Reviewed-by: kdnilsen - Fix preemption of coalesce and fill Reviewed-by: wkemper - Fix assertion error with advance promotion budgeting Reviewed-by: rkennke - Merge openjdk/jdk:master - Merge openjdk/jdk:master - Shenandoah unified logging Reviewed-by: wkemper, shade - ... and 171 more: https://git.openjdk.org/shenandoah/compare/2159170b...9611198d ------------- Changes: https://git.openjdk.org/shenandoah/pull/172/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=172&range=01 Stats: 14912 lines in 147 files changed: 13654 ins; 507 del; 751 mod Patch: https://git.openjdk.org/shenandoah/pull/172.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/172/head:pull/172 PR: https://git.openjdk.org/shenandoah/pull/172 From wkemper at openjdk.org Fri Nov 18 19:50:42 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 19:50:42 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Thu, 17 Nov 2022 23:20:06 GMT, William Kemper wrote: > Merge tag jdk-20+24. This includes [8294775](https://bugs.openjdk.org/browse/JDK-8294775): Shenandoah: reduce contention on _threads_in_evac This pull request has now been integrated. 
Changeset: 3da2fd54 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/3da2fd548950eb066e8685add60ae44b56c471f6 Stats: 25672 lines in 939 files changed: 13672 ins; 7992 del; 4008 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/172 From wkemper at openjdk.org Fri Nov 18 19:52:42 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 19:52:42 GMT Subject: Integrated: Do not apply evacuation budgets in non-generational mode In-Reply-To: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> References: <3GP1obWNkWDbcIuskAM2jVoiIYWfXUUX8n3JwAl1Jzc=.a4078a85-f647-401c-8090-be748e41d978@github.com> Message-ID: On Thu, 17 Nov 2022 23:15:28 GMT, William Kemper wrote: > CSet selection for non-generational mode was using a value configured on the young generation. In non-generational modes we do not maintain all the attributes of the young generation (instead we maintain the _global_ generation). This caused many cycles in which the maximum CSet was zero (or close to it). This, in turn, caused the collector to run much more frequently (approximately 3x on specjbb) which caused severe performance regression in critical jops. > > I'm not sure why the diff algorithm is struggling so much with these changes. I pulled up the `mode()->is_generational()` out of `compute_evacuation_budgets` and `adjust_evacuation_budgets` into the caller and replaced the check with an assert (re-indenting the code in the method). This pull request has now been integrated. 
Changeset: fe51a9bb Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/fe51a9bb089787bc82f96c33b950fe5ac1f4149d Stats: 721 lines in 2 files changed: 210 ins; 219 del; 292 mod Do not apply evacuation budgets in non-generational mode Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/171 From kdnilsen at openjdk.org Fri Nov 18 22:06:24 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 18 Nov 2022 22:06:24 GMT Subject: RFR: Load balance remset scan Message-ID: Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts. This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size): Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%. Average time for concurrent remembered set marking scan improved by 1.92%. The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%. 
------------- Commit messages: - Fix whitespace and typos in comments - Merge remote-tracking branch 'GitFarmBranch/smaller-remset-rebase' into load-balance-remset-scan - Fix typo in comment - Remove instrumentation - Add detail to assertion failure message - Fix calculation of adjustment to last ShenandoahRegionChunk group - Fix handling of smaller heap regions and disable debug messages - Increase size of largest ShenandoahRegionChunk - Fix initialization of ShenandoahRegionChunkIterator - Disable instrumentation - ... and 3 more: https://git.openjdk.org/shenandoah/compare/419a3a2a...9704b348 Changes: https://git.openjdk.org/shenandoah/pull/173/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=173&range=00 Stats: 150 lines in 4 files changed: 87 ins; 6 del; 57 mod Patch: https://git.openjdk.org/shenandoah/pull/173.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/173/head:pull/173 PR: https://git.openjdk.org/shenandoah/pull/173 From wkemper at openjdk.org Fri Nov 18 22:06:25 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 22:06:25 GMT Subject: RFR: Load balance remset scan In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 18:49:08 GMT, Kelvin Nilsen wrote: > Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts. > > This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. 
With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size): > > Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%. > > Average time for concurrent remembered set marking scan improved by 1.92%. The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%. I could use an overview here. I understand the advantage of lowering the maximum chunk size, but what is the advantage of using successively smaller chunks? src/hotspot/share/gc/shenandoah/shenandoahScanRemembered.cpp line 184: > 182: size_t smallest_group_span = _smallest_chunk_size_words * _regular_group_size; > 183: > 184: // Tyhe first group gets special handling because the first chunk size can be no larger than _largets_chunk_size_words `s/Thye/The` ------------- PR: https://git.openjdk.org/shenandoah/pull/173 From wkemper at openjdk.org Fri Nov 18 22:50:09 2022 From: wkemper at openjdk.org (William Kemper) Date: Fri, 18 Nov 2022 22:50:09 GMT Subject: RFR: Load balance remset scan In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 18:49:08 GMT, Kelvin Nilsen wrote: > Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts. > > This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. 
With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size): > > Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%. > > Average time for concurrent remembered set marking scan improved by 1.92%. The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%. Marked as reviewed by wkemper (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/173 From kdnilsen at openjdk.org Fri Nov 18 22:57:55 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 18 Nov 2022 22:57:55 GMT Subject: RFR: Load balance remset scan In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 18:49:08 GMT, Kelvin Nilsen wrote: > Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts. > > This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size): > > Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%. > > Average time for concurrent remembered set marking scan improved by 1.92%. 
The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%.

I'll add a comment to explain the rationale behind successively smaller chunks. General idea is that as we get closer to the end of the total effort, we want to be more careful to avoid giving one of the worker threads a disproportionately large amount of work to do. Early in the total effort, it's ok for one thread to get a larger assignment than the others. In this case, the thread with the larger effort will chew away on that large assignment while all the other threads repeatedly receive and finish work assignments that can be completed more quickly.

-------------

PR: https://git.openjdk.org/shenandoah/pull/173

From wkemper at openjdk.org  Fri Nov 18 23:14:18 2022
From: wkemper at openjdk.org (William Kemper)
Date: Fri, 18 Nov 2022 23:14:18 GMT
Subject: RFR: Various build fixes
Message-ID: 

A hodge-podge of fixes:
* 32bit/64bit int assignment errors on windows
* missing precompiled header
* incorrect format specifiers
* skip initialization of shenandoah unit tests when shenandoah is not running

-------------

Commit messages:
 - Merge branch 'openjdk:master' into build-fixes
 - Fix another use of MARKING.
 - Fix type warning, disable setup for Shenandoah tests when not running ShenandoahGC.
 - Fix typo
 - Various build fixes

Changes: https://git.openjdk.org/shenandoah/pull/174/files
Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=174&range=00
Stats: 26 lines in 5 files changed: 12 ins; 0 del; 14 mod
Patch: https://git.openjdk.org/shenandoah/pull/174.diff
Fetch: git fetch https://git.openjdk.org/shenandoah pull/174/head:pull/174

PR: https://git.openjdk.org/shenandoah/pull/174

From kdnilsen at openjdk.org  Fri Nov 18 23:22:26 2022
From: kdnilsen at openjdk.org (Kelvin Nilsen)
Date: Fri, 18 Nov 2022 23:22:26 GMT
Subject: RFR: Load balance remset scan [v2]
In-Reply-To: 
References: 
Message-ID: <0UpmRC3LWsmALWSDckm_7V0kuYheUEh_LTOm73em6Ds=.42c0f408-a756-46aa-8d84-80517fc753fb@github.com>

> Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts.
>
> This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size):
>
> Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%.
>
> Average time for concurrent remembered set marking scan improved by 1.92%. The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%.
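[Editorial note] The work-division scheme described above (large assignments early, successively smaller assignments as the scan nears completion, with a hard cap on the largest chunk) can be sketched as a decreasing-chunk iterator. This is a simplified illustration with invented names; the real ShenandoahRegionChunkIterator is C++ and also deals with chunk groups, alignment, and region boundaries:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch (invented names) of decreasing-chunk work
// distribution: early claims hand out large chunks, and the chunk size
// halves as the scan approaches completion, down to a floor, so no thread
// is left holding a disproportionately large assignment near the end.
class ShrinkingChunkIterator {
    private final long totalWords;
    private final long maxChunk;   // e.g. a 4 MiB cap, expressed in words
    private final long minChunk;
    private final AtomicLong claimed = new AtomicLong();

    ShrinkingChunkIterator(long totalWords, long maxChunk, long minChunk) {
        this.totalWords = totalWords;
        this.maxChunk = maxChunk;
        this.minChunk = minChunk;
    }

    // Chunk size as a function of how much work has been claimed so far:
    // halve the chunk size each time half of the remaining work is consumed.
    private long chunkSizeAt(long offset) {
        long size = maxChunk;
        long boundary = totalWords / 2;
        while (size > minChunk && offset >= boundary) {
            size /= 2;
            boundary += (totalWords - boundary) / 2;
        }
        return size;
    }

    /** Atomically claim the next chunk; returns {start, length}, or null when done. */
    long[] next() {
        while (true) {
            long start = claimed.get();
            if (start >= totalWords) {
                return null;
            }
            long len = Math.min(chunkSizeAt(start), totalWords - start);
            if (claimed.compareAndSet(start, start + len)) {
                return new long[] { start, len };
            }
        }
    }

    public static void main(String[] args) {
        ShrinkingChunkIterator it = new ShrinkingChunkIterator(1024, 128, 16);
        long[] c;
        while ((c = it.next()) != null) {
            System.out.println("chunk at " + c[0] + " size " + c[1]);
        }
    }
}
```

With 1024 words of work, a 128-word cap and a 16-word floor, this hands out four 128-word chunks, then four 64s, four 32s, and eight 16s: exactly the "larger assignments early, smaller assignments late" shape described in the reply.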
Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision: Add comment to explain ShenandoahRegionChunkIterator ------------- Changes: - all: https://git.openjdk.org/shenandoah/pull/173/files - new: https://git.openjdk.org/shenandoah/pull/173/files/9704b348..256d88da Webrevs: - full: https://webrevs.openjdk.org/?repo=shenandoah&pr=173&range=01 - incr: https://webrevs.openjdk.org/?repo=shenandoah&pr=173&range=00-01 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/shenandoah/pull/173.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/173/head:pull/173 PR: https://git.openjdk.org/shenandoah/pull/173 From kdnilsen at openjdk.org Fri Nov 18 23:26:00 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 18 Nov 2022 23:26:00 GMT Subject: RFR: Various build fixes In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 23:07:00 GMT, William Kemper wrote: > A hodge-podge of fixes: > * 32bit/64bit int assignment errors on windows > * missing precompiled header > * incorrect format specifiers > * skip initialization of shenandoah unit tests when shenandoah is not running Marked as reviewed by kdnilsen (Committer). ------------- PR: https://git.openjdk.org/shenandoah/pull/174 From kdnilsen at openjdk.org Fri Nov 18 23:26:06 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Fri, 18 Nov 2022 23:26:06 GMT Subject: Integrated: Load balance remset scan In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 18:49:08 GMT, Kelvin Nilsen wrote: > Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time. We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads. 
A few threads assigned to scan 16 MiB regions with high density of "interesting pointers" were still scanning after all other worker threads finished their scanning efforts. > > This change caps the maximum assignment size for worker threads at 4 MiB. This results in better distribution of efforts between multiple concurrent threads. With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size): > > Latency for customer preparation processing improved by 0.79% for P50, 2.26% for P95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999. The p100 response improved only slightly, by 1.99%. > > Average time for concurrent remembered set marking scan improved by 1.92%. The average time for concurrent update refs time, which includes remembered set scanning, improved by 1.72%. This pull request has now been integrated. Changeset: 264f9c23 Author: Kelvin Nilsen URL: https://git.openjdk.org/shenandoah/commit/264f9c2343244bdbb6f6e6d1d290f18495409da4 Stats: 161 lines in 4 files changed: 98 ins; 6 del; 57 mod Load balance remset scan Reviewed-by: wkemper ------------- PR: https://git.openjdk.org/shenandoah/pull/173 From wkemper at openjdk.org Sat Nov 19 00:17:05 2022 From: wkemper at openjdk.org (William Kemper) Date: Sat, 19 Nov 2022 00:17:05 GMT Subject: Integrated: Various build fixes In-Reply-To: References: Message-ID: On Fri, 18 Nov 2022 23:07:00 GMT, William Kemper wrote: > A hodge-podge of fixes: > * 32bit/64bit int assignment errors on windows > * missing precompiled header > * incorrect format specifiers > * skip initialization of shenandoah unit tests when shenandoah is not running This pull request has now been integrated. 
Changeset: b594e541 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/b594e541311eae053aea0664be3f2453c62b95f4 Stats: 26 lines in 5 files changed: 12 ins; 0 del; 14 mod Various build fixes Reviewed-by: kdnilsen ------------- PR: https://git.openjdk.org/shenandoah/pull/174 From jcking at openjdk.org Mon Nov 21 04:35:55 2022 From: jcking at openjdk.org (Justin King) Date: Mon, 21 Nov 2022 04:35:55 GMT Subject: RFR: JDK-8297309: Memory leak in ShenandoahFullGC Message-ID: Signed-off-by: Justin King ------------- Commit messages: - Fix memory leak in ShenandoahFullGC Changes: https://git.openjdk.org/jdk/pull/11255/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11255&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297309 Stats: 5 lines in 2 files changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11255.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11255/head:pull/11255 PR: https://git.openjdk.org/jdk/pull/11255 From rkennke at openjdk.org Mon Nov 21 10:29:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 21 Nov 2022 10:29:19 GMT Subject: RFR: JDK-8297309: Memory leak in ShenandoahFullGC In-Reply-To: References: Message-ID: <-Zej2h2pnRfZDAC-v6mcdh1pfs6ieO86MpIFHROSvgU=.1d44b16a-debe-40c7-b79c-1a132f9198cb@github.com> On Mon, 21 Nov 2022 04:25:13 GMT, Justin King wrote: > Signed-off-by: Justin King Ouch. Good catch. Patch looks good, thank you! ------------- Marked as reviewed by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/11255 From shade at openjdk.org Mon Nov 21 10:44:22 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 21 Nov 2022 10:44:22 GMT Subject: RFR: JDK-8297309: Memory leak in ShenandoahFullGC In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 04:25:13 GMT, Justin King wrote: > Signed-off-by: Justin King I thought `ShenandoahFullGC` is eternal, but apparently we create it as scoped object every time we need to go for Full GC. Oof. Looks good! 
-------------

Marked as reviewed by shade (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11255

From pchilanomate at openjdk.org  Mon Nov 21 12:20:10 2022
From: pchilanomate at openjdk.org (Patricio Chilano Mateo)
Date: Mon, 21 Nov 2022 12:20:10 GMT
Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4]
In-Reply-To: 
References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com>
Message-ID: 

On Thu, 17 Nov 2022 09:24:04 GMT, Erik Österlund wrote:

>> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC.
>>
>> In particular,
>> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom.
>>
>> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill.
>>
>> 3) Refactoring the stack chunk allocation code
>>
>> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now.
>
> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix Richard comments

I went through the changes and all looks good to me. Only minor comments.

Thanks,
Patricio

src/hotspot/share/gc/shared/memAllocator.cpp line 381:

> 379: }
> 380:
> 381: oop MemAllocator::try_allocate_in_existing_tlab() {

try_allocate_in_existing_tlab() is now unused in memAllocator.hpp.

src/hotspot/share/gc/shared/memAllocator.hpp line 98:

> 96: virtual oop initialize(HeapWord* mem) const;
> 97:
> 98: using MemAllocator::allocate;

Do we need these declarations? I thought this would be needed if allocate() would not be public on the base class or to avoid hiding it if here we define a method with the same name but different signature.

src/hotspot/share/runtime/continuationFreezeThaw.cpp line 1393:

> 1391: // Guaranteed to be in young gen / newly allocated memory
> 1392: assert(!chunk->requires_barriers(), "Unfamiliar GC requires barriers on TLAB allocation");
> 1393: _barriers = false;

Do we need to explicitly set _barriers to false? It's already initialized to be false (same above for the UseZGC case). That would also allow to simplify the code a bit I think to be just an if statement that calls requires_barriers() for the "ZGC_ONLY(!UseZGC &&) (SHENANDOAHGC_ONLY(UseShenandoahGC ||) allocator.took_slow_path())" case, and then ZGC and the fast path could use just separate asserts outside conditionals.

-------------

Marked as reviewed by pchilanomate (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11111

From asmehra at redhat.com  Mon Nov 21 21:09:49 2022
From: asmehra at redhat.com (Ashutosh Mehra)
Date: Mon, 21 Nov 2022 16:09:49 -0500
Subject: Directions on fixing JDK-8297285
Message-ID: 

Hi,

I am trying to find a way to fix JDK-8297285: Shenandoah pacing causes assertion failure during VM initialization.
The scenario is during VM init if the main thread exhausts the allocation budget, then the shenandoah pacer would cause it to wait. The call to wait can hit assertion self->is_active_Java_thread() in Monitor::wait() as the main thread is not yet added to the thread list, and therefore is_active_Java_thread() returns false.

The approach I have in mind is to delay initialization of ShenandoahPacer until the main thread is considered an active Java thread. But there is no hook to capture that event in the GC. The closest I found is `CollectedHeap::post_initialize()`. The comments for `CollectedHeap::post_initialize()` indicate this *may* be the right place to initialize the ShenandoahPacer:

```
// In many heaps, there will be a need to perform some initialization activities
// after the Universe is fully formed, but before general heap allocation is allowed.
// This is the correct place to place such initialization methods.
virtual void post_initialize();
```

However, there are still some VM init activities between the call to CollectedHeap::post_initialize() in universe_post_init and the call to Threads::add(main_thread) in Threads::create_vm(). So initializing in CollectedHeap::post_initialize() can still break if enough heap allocations happen during those activities.

I am wondering if there is a better place to initialize the ShenandoahPacer, or another approach to fix this situation.

Here is the current set of changes [0] that initialize ShenandoahPacer in CollectedHeap::post_initialize().

[0] https://github.com/openjdk/jdk/compare/master...ashu-mehra:jdk:JDK-8297285

Thanks,
Ashutosh Mehra

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mcimadamore at openjdk.org Mon Nov 21 21:56:01 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 21 Nov 2022 21:56:01 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v28] In-Reply-To: References: Message-ID: <0y8JsbYwjCSOwW3GHC-7W3u4__AjjxlyQ_GZHR1YBtk=.f17cdd0e-652b-4906-a9a3-bf53e3d4768a@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with three additional commits since the last revision: - Address more review comments - Fix bad @throws in MemorySegment::copy methods - Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/876587c3..a0cee7b0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=27 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=26-27 Stats: 19 lines in 4 files changed: 8 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From sviswanathan at openjdk.org Tue Nov 22 00:07:26 2022 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 22 Nov 2022 00:07:26 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13] In-Reply-To: References: Message-ID: On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. 
To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 
0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). 
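[Editor's note] The hand-unrolled loop discussed in this PR produces bit-identical results to the scalar loop because of the polynomial identity 31*(31*h + a) + b == 961*h + 31*a + b. A minimal standalone sketch of that equivalence (illustrative only, not the JDK intrinsic; the class and method names are invented):

```java
import java.util.Arrays;

public class UnrolledHash {
    // Baseline: result = 1, then result = 31 * result + e per element
    // (the java.util.Arrays.hashCode(int[]) contract).
    static int scalar(int[] a) {
        int h = 1;
        for (int v : a) h = 31 * h + v;
        return h;
    }

    // Two-way unroll: fold two elements per iteration using 31^2 = 961.
    // Identical result, since 31 * (31 * h + a) + b == 961 * h + 31 * a + b
    // holds under Java's wrapping int arithmetic.
    static int unrolled2(int[] a) {
        int h = 1, i = 0;
        for (; i + 1 < a.length; i += 2) {
            h = 961 * h + 31 * a[i] + a[i + 1];
        }
        if (i < a.length) h = 31 * h + a[i]; // odd-length tail
        return h;
    }

    public static void main(String[] args) {
        int[] data = {7, -3, 42, 0, 123456, -98765, 11};
        if (scalar(data) != Arrays.hashCode(data)) throw new AssertionError();
        if (unrolled2(data) != Arrays.hashCode(data)) throw new AssertionError();
    }
}
```

The same identity generalizes to wider unrolls (31^4, 31^8, ...), which is what makes both the manual unrolling and the vectorized reduction discussed earlier in the thread possible.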
I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Missing & 0xff in StringLatin1::hashCode We have seen this as a hotspot in workloads. It will be good to optimize the StringUTF16 and StringLatin1 hash code computation. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From mcimadamore at openjdk.org Tue Nov 22 14:48:04 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Tue, 22 Nov 2022 14:48:04 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v29] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix wrong check in MemorySegment::spliterator/elements (The check which ensures that the segment size is multiple of spliterator element size is bogus) ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/a0cee7b0..66dd888d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=28 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=27-28 Stats: 29 lines in 2 files changed: 21 ins; 0 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From asmehra at redhat.com Tue Nov 22 17:13:43 2022 From: asmehra at redhat.com (Ashutosh Mehra) Date: Tue, 22 Nov 2022 12:13:43 -0500 Subject: Allocation pacing and graceful degradation in ShenandoahGC In-Reply-To: References: Message-ID: Hi Alex, I have spent some time understanding the Shenandoah Pacer and I will try to answer your questions as best I can. Does that mean that each Java thread goes through runtime -> heap to > allocate, and that's how pacer paces it? So we just pace any allocating > thread and threads that allocate more will just hit this code more often. Allocations from a TLAB do not go through the pacer, but allocating a new TLAB does go through the pacer. And yes, a thread allocating more is more likely to hit the pacer. so I assume if there is no budget available it will pace a thread for up to > 10ms, but it does not imply allocation failure. Yes, it does not imply allocation failure. It is just a mechanism to ensure the concurrent GC is able to keep pace with the allocation rate. Heap class tries to allocate under lock and if unsuccessful considers this > as allocation failure and handles it by calling ShenandoahControlThread.
> Does it mean that Pacer can't cause GC to switch to degenerated mode or I > am missing something? Pacer itself does not cause GC to degenerate. It only delays the mutator thread. As you mentioned earlier, after the expiry of the wait time the mutator thread would still attempt the allocation, which may succeed. If the allocation rate is high, the pacer may not be able to keep up, and in that case the mutator thread may suffer an allocation failure, which would result in running a degenerated GC cycle. If Pacer doesn't have budget to allocate memory it paces thread, but is > there any global budget for pacing time or it is only per thread max > (ShenandoahPacingMaxDelay)? The wait time introduced by the Pacer is per thread and bounded by ShenandoahPacingMaxDelay. I don't think there is any global budget for pacing time. It would be really nice if you can shed some light on these transitions: > Concurrent Mode -> Pacing (single thread and total pacing time for all > threads) -> most importantly logic of transitioning from pacing to > degenerated GC I will try to summarize the transition to degenerated GC. Allocation failures are signaled by the mutator thread by setting the flag _alloc_failure_gc [0] in ShenandoahControlThread::handle_alloc_failure(); the mutator then waits on _alloc_failure_waiters_lock [1] for notification from the control thread after it has handled the allocation failure. The control thread executing run_service() checks if the flag _alloc_failure_gc is set [2]; if so, it indicates a pending allocation failure. It then tries to handle the allocation failure by running either a Degenerated GC or a Full GC cycle [3]. That decision depends on ShenandoahHeuristics::should_degenerate_cycle(), which performs a simple counter check on the number of consecutive degenerated GC cycles. There are some details in the wiki [4] for pacing and degenerated GC in case you have not looked at that. I hope this helps you to move forward in your effort.
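[Editor's note] The "simple counter check" described above can be modeled in a few lines. This is an illustrative sketch, not HotSpot source: the class name, the threshold parameter (loosely modeled on the ShenandoahFullGCThreshold flag), and the reset-on-Full-GC behavior are assumptions.

```java
// Toy model: an allocation failure triggers a Degenerated GC while the
// streak of consecutive degenerated cycles stays within a threshold,
// and escalates to a Full GC once the streak exceeds it.
public class DegenerationPolicy {
    private final int fullGcThreshold;   // cf. ShenandoahFullGCThreshold (assumed)
    private int degeneratedCyclesInARow; // consecutive degenerated cycles

    public DegenerationPolicy(int fullGcThreshold) {
        this.fullGcThreshold = fullGcThreshold;
    }

    // cf. ShenandoahHeuristics::should_degenerate_cycle()
    public boolean shouldDegenerateCycle() {
        return degeneratedCyclesInARow <= fullGcThreshold;
    }

    // Invoked when the control thread observes a pending allocation failure.
    public String handleAllocFailure() {
        if (shouldDegenerateCycle()) {
            degeneratedCyclesInARow++;
            return "DEGENERATED";
        }
        degeneratedCyclesInARow = 0; // assumption: a Full GC resets the streak
        return "FULL";
    }

    public static void main(String[] args) {
        DegenerationPolicy p = new DegenerationPolicy(1);
        if (!p.handleAllocFailure().equals("DEGENERATED")) throw new AssertionError();
        if (!p.handleAllocFailure().equals("DEGENERATED")) throw new AssertionError();
        if (!p.handleAllocFailure().equals("FULL")) throw new AssertionError();
    }
}
```

With a threshold of 1, two back-to-back allocation failures degenerate and the third escalates to a Full GC, which is the qualitative behavior the paragraph above describes.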
[0] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L531 [1] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L543 [2] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L100 [3] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L127 [4] https://wiki.openjdk.org/display/shenandoah/Main Thanks, Ashutosh Mehra On Mon, Nov 14, 2022 at 4:07 PM Alex Dubrouski wrote: > Good afternoon everyone, > > > > I checked all video presentations and slides by Alex Shipilev and Roman > Kennke about ShenandoahGC to find the answer for my question with no luck. > I am trying to find more details about transitions between modes in > ShenandoahGC > > I am looking for solution to assess concurrent collector health in real > time using different metrics. 
> > > > Here is the schema of transitions, and allocation failure causes > degenerated GC cycle, but it does not mention allocation pacing at all: > > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L361 > > > > I tried to dig further into this logic, but need your help to put all the > pieces together > > I was not able to effectively trace entry point, but this might work, > allocation on heap outside of TLAB: > > > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/memAllocator.cpp#L258 > > in case of ShenandoahGC I assume we call > > > https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L901 > > which then calls > > > https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L821 > > if mutator is allocating and pacer enabled (default) we enter Pacer: > > > https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L828 > > > https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L229 > > and I assume try to handle it nicely, if not we start pacing: > > > https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L253 > > > > I have few questions here: > > - Could you please explain a bit how the system of taxes works? I assume > mutators claim budget, while GC replenishes it async, but the details are > missing and no comments in the code > > - To pace we use wait function from Monitor class > > > https://github.com/openjdk/jdk/blame/master/src/hotspot/share/runtime/mutex.cpp#L232 > > but the first thing it gets current Java thread. Does that mean that > each Java thread goes throw runtime -> heap to allocate, and that's how > pacer paces it? 
So we just pace any allocating thread and threads that > allocate more will just hit this code more often. > > > > - Pacer uses ShenandoahPacingMaxDelay (10ms) as max, but > pace_for_allocation returns void > > > https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L826 > > > https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L225 > > so I assume if there is no budget available it will pace a thread for up > to 10ms, but it does not imply allocation failure. > > Heap class tries to allocate under lock and if unsuccessful considers this > as allocation failure and handles it by calling ShenandoahControlThread. > Does it mean that Pacer can?t cause GC to switch to degenerated mode or I > am missing something? > > > > - If Pacer doesn't have budget to allocate memory it paces thread, but is > there any global budget for pacing time or it is only per thread max > (ShenandoahPacingMaxDelay)? > > - It would really nice if you can sched some light on these transitions: > Concurrent Mode -> Pacing (single thread and total pacing time for all > threads) -> most importantly logic of transitioning from pacing to > degenerated GC > > > > I am trying to build a model which can tell me whether GC is healthy > (fully concurrent), a bit unhealthy (pacing), unhealthy (degenerated or > full GC) and how close are to the edge of the next state (a bit unhealthy > -> unhealthy) > > > > No rush and thanks a lot in advance. > > > > Regards, > > Alex Dubrouski > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adubrouski at linkedin.com Tue Nov 22 19:49:48 2022 From: adubrouski at linkedin.com (Alex Dubrouski) Date: Tue, 22 Nov 2022 19:49:48 +0000 Subject: Allocation pacing and graceful degradation in ShenandoahGC In-Reply-To: References: Message-ID: <2ECF447B-965C-4398-A6EB-91A2BAAC91A8@linkedin.biz> Hi Ashutosh, Thanks a lot for your response. I checked the wiki but it did not contain details, so had to take a look into source code. Plus, I noticed one thing: the Wiki mentions that degenerated STW GC continues the concurrent cycle, while the code shows that the logic is more complex: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shenandoah/shenandoahDegeneratedGC.cpp#L134 Thanks again. Regards, Alex Dubrouski From: Ashutosh Mehra Date: Tuesday, November 22, 2022 at 9:14 AM To: Alex Dubrouski Cc: "shenandoah-dev at openjdk.org" Subject: Re: Allocation pacing and graceful degradation in ShenandoahGC Hi Alex, I have spent some time understanding the Shenandoah Pacer and I will try to answer your questions as best I can. Does that mean that each Java thread goes through runtime -> heap to allocate, and that's how pacer paces it? So we just pace any allocating thread and threads that allocate more will just hit this code more often. Allocations from a TLAB do not go through the pacer, but allocating a new TLAB does go through the pacer. And yes, a thread allocating more is more likely to hit the pacer. so I assume if there is no budget available it will pace a thread for up to 10ms, but it does not imply allocation failure. Yes, it does not imply allocation failure. It is just a mechanism to ensure the concurrent GC is able to keep pace with the allocation rate. Heap class tries to allocate under lock and if unsuccessful considers this as allocation failure and handles it by calling ShenandoahControlThread. Does it mean that Pacer can't cause GC to switch to degenerated mode or I am missing something? Pacer itself does not cause GC to degenerate.
It only delays the mutator thread. As you mentioned earlier, after the expiry of the wait time the mutator thread would still attempt the allocation, which may succeed. If the allocation rate is high, the pacer may not be able to keep up, and in that case the mutator thread may suffer an allocation failure, which would result in running a degenerated GC cycle. If Pacer doesn't have budget to allocate memory it paces thread, but is there any global budget for pacing time or it is only per thread max (ShenandoahPacingMaxDelay)? The wait time introduced by the Pacer is per thread and bounded by ShenandoahPacingMaxDelay. I don't think there is any global budget for pacing time. It would be really nice if you can shed some light on these transitions: Concurrent Mode -> Pacing (single thread and total pacing time for all threads) -> most importantly logic of transitioning from pacing to degenerated GC I will try to summarize the transition to degenerated GC. Allocation failures are signaled by the mutator thread by setting the flag _alloc_failure_gc [0] in ShenandoahControlThread::handle_alloc_failure(); the mutator then waits on _alloc_failure_waiters_lock [1] for notification from the control thread after it has handled the allocation failure. The control thread executing run_service() checks if the flag _alloc_failure_gc is set [2]; if so, it indicates a pending allocation failure. It then tries to handle the allocation failure by running either a Degenerated GC or a Full GC cycle [3]. That decision depends on ShenandoahHeuristics::should_degenerate_cycle(), which performs a simple counter check on the number of consecutive degenerated GC cycles. There are some details in the wiki [4] for pacing and degenerated GC in case you have not looked at that. I hope this helps you to move forward in your effort.
[0] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L531 [1] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L543 [2] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L100 [3] https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L127 [4] https://wiki.openjdk.org/display/shenandoah/Main Thanks, Ashutosh Mehra On Mon, Nov 14, 2022 at 4:07 PM Alex Dubrouski > wrote: Good afternoon everyone, I checked all video presentations and slides by Alex Shipilev and Roman Kennke about ShenandoahGC to find the answer for my question with no luck. I am trying to find more details about transitions between modes in ShenandoahGC I am looking for solution to assess concurrent collector health in real time using different metrics. 
Here is the schema of transitions, and allocation failure causes degenerated GC cycle, but it does not mention allocation pacing at all: https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L361 I tried to dig further into this logic, but need your help to put all the pieces together I was not able to effectively trace entry point, but this might work, allocation on heap outside of TLAB: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/memAllocator.cpp#L258 in case of ShenandoahGC I assume we call https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L901 which then calls https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L821 if mutator is allocating and pacer enabled (default) we enter Pacer: https://github.com/openjdk/jdk/blame/739769c8fc4b496f08a92225a12d07414537b6c0/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L828 https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L229 and I assume try to handle it nicely, if not we start pacing: https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L253 I have few questions here: - Could you please explain a bit how the system of taxes works? I assume mutators claim budget, while GC replenishes it async, but the details are missing and no comments in the code - To pace we use wait function from Monitor class https://github.com/openjdk/jdk/blame/master/src/hotspot/share/runtime/mutex.cpp#L232 but the first thing it gets current Java thread. Does that mean that each Java thread goes throw runtime -> heap to allocate, and that's how pacer paces it? So we just pace any allocating thread and threads that allocate more will just hit this code more often. 
- Pacer uses ShenandoahPacingMaxDelay (10ms) as max, but pace_for_allocation returns void https://github.com/openjdk/jdk/blob/fba763f82528d2825831a26b4ae4e090c602208f/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L826 https://github.com/openjdk/jdk/blame/master/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L225 so I assume if there is no budget available it will pace a thread for up to 10ms, but it does not imply allocation failure. Heap class tries to allocate under lock and if unsuccessful considers this as allocation failure and handles it by calling ShenandoahControlThread. Does it mean that Pacer can?t cause GC to switch to degenerated mode or I am missing something? - If Pacer doesn't have budget to allocate memory it paces thread, but is there any global budget for pacing time or it is only per thread max (ShenandoahPacingMaxDelay)? - It would really nice if you can sched some light on these transitions: Concurrent Mode -> Pacing (single thread and total pacing time for all threads) -> most importantly logic of transitioning from pacing to degenerated GC I am trying to build a model which can tell me whether GC is healthy (fully concurrent), a bit unhealthy (pacing), unhealthy (degenerated or full GC) and how close are to the edge of the next state (a bit unhealthy -> unhealthy) No rush and thanks a lot in advance. Regards, Alex Dubrouski -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcimadamore at openjdk.org Wed Nov 23 10:54:53 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 23 Nov 2022 10:54:53 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v30] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Fix bit vs. byte mismatch in test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/66dd888d..3c75e097 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=29 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=28-29 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 23 17:33:06 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 23 Nov 2022 17:33:06 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v31] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: * remove unused Scoped interface * re-add trusting of final fields in layout class implementations * Fix BulkOps benchmark, which had alignment issues ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/3c75e097..97168155 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=30 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=29-30 Stats: 56 lines in 5 files changed: 8 ins; 39 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From asmehra at redhat.com Thu Nov 24 14:59:53 2022 From: asmehra at redhat.com (Ashutosh Mehra) Date: Thu, 24 Nov 2022 09:59:53 -0500 Subject: Fwd: Directions on fixing JDK-8297285 In-Reply-To: References: Message-ID: Any suggestions on the problem mentioned in the previous email? I would like to add that the problem is likely to become more apparent with https://bugs.openjdk.org/browse/JDK-8293650 as it would increase heap utilization during VM init. Thanks, Ashutosh Mehra ---------- Forwarded message --------- From: Ashutosh Mehra Date: Mon, Nov 21, 2022 at 4:09 PM Subject: Directions on fixing JDK-8297285 To: Hi, I am trying to find a way to fix JDK-8297285: Shenandoah pacing causes assertion failure during VM initialization. The scenario is during VM init if the main thread exhausts the allocation budget, then the shenandoah pacer would cause it to wait. The call to wait can hit assertion self->is_active_Java_thread() in Monitor::wait() as the main thread is not yet added to the thread list, and therefore is_active_Java_thread() returns false. 
The approach I have in mind is to delay initialization of ShenandoahPacer until the main thread is considered an active Java thread. But there is no hook to capture that event in the GC. The closest I found is `CollectedHeap::post_initialize()`. The comments for `CollectedHeap::post_initialize()` indicate this *may* be the right place to initialize the ShenandoahPacer: ``` // In many heaps, there will be a need to perform some initialization activities // after the Universe is fully formed, but before general heap allocation is allowed. // This is the correct place to place such initialization methods. virtual void post_initialize(); ``` However, there are still some VM init activities between the call to CollectedHeap::post_initialize() in universe_post_init and the call to Threads::add(main_thread) in Threads::create_vm(). So initializing in CollectedHeap::post_initialize() can still break if enough heap allocations happen during those activities. I am wondering if there is a better place to initialize the ShenandoahPacer, or another approach to fix this situation. Here is the current set of changes [0] that initialize ShenandoahPacer in CollectedHeap::post_initialize(). [0] https://github.com/openjdk/jdk/compare/master...ashu-mehra:jdk:JDK-8297285 Thanks, Ashutosh Mehra -------------- next part -------------- An HTML attachment was scrubbed... URL: From rkennke at amazon.de Thu Nov 24 15:48:14 2022 From: rkennke at amazon.de (Kennke, Roman) Date: Thu, 24 Nov 2022 16:48:14 +0100 Subject: Directions on fixing JDK-8297285 In-Reply-To: References: Message-ID: <41c4f41d-972d-5737-89a6-558b3aecd9d1@amazon.de> Hi Ashu, In ShenandoahPacer::pace_for_alloc(), there is a check like this: // Threads that are attaching should not block at all: they are not // fully initialized yet. Blocking them would be awkward. // This is probably the path that allocates the thread oop itself. 
if (JavaThread::current()->is_attaching_via_jni()) { return; } Maybe the right thing to do is to add a check for whether or not the thread is in the threads list, or is_active_Java_thread()? Also, CCing Aleksey, because he originally wrote this code, IIRC. Thanks, Roman > Any suggestions on the problem mentioned in the previous email? > I would like to add that the problem is likely to become more apparent > with https://bugs.openjdk.org/browse/JDK-8293650 > as it would increase heap > utilization during VM init. > > Thanks, > Ashutosh Mehra > > > ---------- Forwarded message --------- > From: *Ashutosh Mehra* > > Date: Mon, Nov 21, 2022 at 4:09 PM > Subject: Directions on fixing JDK-8297285 > To: > > > > Hi, > > I am trying to find a way to fix JDK-8297285: Shenandoah pacing causes > assertion failure during VM initialization. > > The scenario is during VM init if the main thread exhausts the > allocation budget, then the shenandoah pacer would cause it to wait. The > call to wait can hit assertion self->is_active_Java_thread() in > Monitor::wait() as the main thread is not yet added to the thread list, > and therefore is_active_Java_thread() returns false. > > The approach I have in mind is to delay initialization of > ShenandoahPacer until the main thread is considered an active Java > thread. But there is no hook to capture that event in the GC. The > closest I found is `CollectedHeap::post_initialize()`. The comments for > `CollectedHeap::post_initialize()` indicate this /may/ be the right > place to initialize the ShenandoahPacer: > > ``` > // In many heaps, there will be a need to perform some initialization > activities > // after the Universe is fully formed, but before general heap > allocation is allowed. > // This is the correct place to place such initialization methods. >
virtual void post_initialize(); > ``` > > However, there are still some VM init activities between the call to > CollectedHeap::post_initialize() in universe_post_init and the call to > Threads::add(main_thread) in Threads::create_vm(). So initializing in > CollectedHeap::post_initialize() can still break if enough heap > allocations happen during those activities. > I am wondering if there is a better place to initialize the > ShenandoahPacer, or another approach to fix this situation. > Here is the current set of changes [0] that initialize ShenandoahPacer > in CollectedHeap::post_initialize(). > > [0] > https://github.com/openjdk/jdk/compare/master...ashu-mehra:jdk:JDK-8297285 > > Thanks, > Ashutosh Mehra Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879 From asmehra at redhat.com Thu Nov 24 16:22:41 2022 From: asmehra at redhat.com (Ashutosh Mehra) Date: Thu, 24 Nov 2022 11:22:41 -0500 Subject: Directions on fixing JDK-8297285 In-Reply-To: <41c4f41d-972d-5737-89a6-558b3aecd9d1@amazon.de> References: <41c4f41d-972d-5737-89a6-558b3aecd9d1@amazon.de> Message-ID: > > Maybe the right thing to do is to add a check for whether or not the > thread is in threads list or is_active_Java_thread() ? > I was trying to avoid adding another check in the allocation path, especially since it would only be used during init. But if there is no other easy way to fix it, I can live with it. Thanks, Ashutosh Mehra On Thu, Nov 24, 2022 at 10:49 AM Kennke, Roman wrote: > Hi Ashu, > > In ShenandoahPacer::pace_for_alloc(), there is a check like this: > > // Threads that are attaching should not block at all: they are not > // fully initialized yet. Blocking them would be awkward. > // This is probably the path that allocates the thread oop itself.
> if (JavaThread::current()->is_attaching_via_jni()) { > return; > } > > Maybe the right thing to do is to add a check for whether or not the > thread is in threads list or is_active_Java_thread() ? > > Also, CCing Aleksey, because he originally wrote this code, IIRC. > > Thanks, > Roman > > > > Any suggestions on the problem mentioned in the previous email? > > I would like to add that the problem is likely to become more apparent > > with https://bugs.openjdk.org/browse/JDK-8293650 > > as it would increase heap > > utilization during VM init. > > > > Thanks, > > Ashutosh Mehra > > > > > > ---------- Forwarded message --------- > > From: *Ashutosh Mehra* > > > Date: Mon, Nov 21, 2022 at 4:09 PM > > Subject: Directions on fixing JDK-8297285 > > To: > > > > > > > Hi, > > > > I am trying to find a way to fix JDK-8297285: Shenandoah pacing causes > > assertion failure during VM initialization. > > > > The scenario is during VM init if the main thread exhausts the > > allocation budget, then the shenandoah pacer would cause it to wait. The > > call to wait can hit assertion self->is_active_Java_thread() in > > Monitor::wait() as the main thread is not yet added to the thread list, > > and therefore is_active_Java_thread() returns false. > > > > The approach I have in mind is to delay initialization of > > ShenandoahPacer until the main thread is considered an active Java > > thread. But there is no hook to capture that event in the GC. The > > closest I found is `CollectedHeap::post_initialize()`. The comments for > > `CollectedHeap::post_initialize()` indicate this /may/ be the right > > place to initialize the ShenandoahPacer: > > > > ``` > > // In many heaps, there will be a need to perform some initialization > > activities > > // after the Universe is fully formed, but before general heap > > allocation is allowed. > > // This is the correct place to place such initialization methods. 
> > virtual void post_initialize(); > > ``` > > > > However, there are still some VM init activities between the call to > > CollectedHeap::post_initialize() in universe_post_init and the call to > > Threads::add(main_thread) in Threads::create_vm(). So initializing in > > CollectedHeap::post_initialize() can still break if enough heap > > allocations happen during those activities. > > I am wondering if there is a better place to initialize the > > ShenandoahPacer, or another approach to fix this situation. > > Here is the current set of changes [0] that initialize ShenandoahPacer > > in CollectedHeap::post_initialize(). > > > > [0] > > > https://github.com/openjdk/jdk/compare/master...ashu-mehra:jdk:JDK-8297285 > < > https://github.com/openjdk/jdk/compare/master...ashu-mehra:jdk:JDK-8297285 > > > > > > Thanks, > > Ashutosh Mehra > > > > Amazon Development Center Germany GmbH > Krausenstr. 38 > 10117 Berlin > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B > Sitz: Berlin > Ust-ID: DE 289 237 879 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shade at openjdk.org Thu Nov 24 19:30:24 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 24 Nov 2022 19:30:24 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods Message-ID: With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). 
Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. Additional testing: - [x] Linux x86_64 fastdebug `tier1` - [x] Linux x86_64 fastdebug `tier2` - [ ] Linux x86_32 fastdebug `tier1` - [ ] Linux x86_32 fastdebug `tier2` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/11359/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11359&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297600 Stats: 32 lines in 8 files changed: 32 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11359.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11359/head:pull/11359 PR: https://git.openjdk.org/jdk/pull/11359 From duke at openjdk.org Thu Nov 24 22:05:15 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 24 Nov 2022 22:05:15 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization Message-ID: Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. 
Signed-off-by: Ashutosh Mehra ------------- Commit messages: - 8297285: Shenandoah pacing causes assertion failure during VM Changes: https://git.openjdk.org/jdk/pull/11360/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11360&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8297285 Stats: 7 lines in 1 file changed: 6 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/11360.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11360/head:pull/11360 PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Thu Nov 24 23:53:44 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 24 Nov 2022 23:53:44 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: References: Message-ID: > Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. > The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. > > Signed-off-by: Ashutosh Mehra Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to fix compile failure Signed-off-by: Ashutosh Mehra ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11360/files - new: https://git.openjdk.org/jdk/pull/11360/files/d8d439f0..60f174fc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11360&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11360&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11360.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11360/head:pull/11360 PR: https://git.openjdk.org/jdk/pull/11360 From duke at openjdk.org Thu Nov 24 23:53:44 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 24 Nov 2022 23:53:44 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization In-Reply-To: 
References: Message-ID: On Thu, 24 Nov 2022 21:57:06 GMT, Ashutosh Mehra wrote: > Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. > The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. > > Signed-off-by: Ashutosh Mehra zero build was failing with following compile error: ------------- PR: https://git.openjdk.org/jdk/pull/11360 From dholmes at openjdk.org Sat Nov 26 08:21:13 2022 From: dholmes at openjdk.org (David Holmes) Date: Sat, 26 Nov 2022 08:21:13 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 19:23:29 GMT, Aleksey Shipilev wrote: > With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. > > We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_32 fastdebug `tier1` > - [x] Linux x86_32 fastdebug `tier2` Unclear for many JVMCI functions that the thread argument is actually intended/required to be the current thread. It seems unused in many cases so why is it passed? 
src/hotspot/share/jvmci/jvmciRuntime.cpp line 584: > 582: > 583: JRT_LEAF(void, JVMCIRuntime::log_object(JavaThread* thread, oopDesc* obj, bool as_string, bool newline)) > 584: assert(thread == JavaThread::current(), "pre-condition"); `thread` seems unused in this function and so it is not obvious it has to be the current thread. src/hotspot/share/jvmci/jvmciRuntime.cpp line 611: > 609: > 610: void JVMCIRuntime::write_barrier_pre(JavaThread* thread, oopDesc* obj) { > 611: assert(thread == JavaThread::current(), "pre-condition"); Not obvious thread is expected/required to be current src/hotspot/share/jvmci/jvmciRuntime.cpp line 616: > 614: > 615: void JVMCIRuntime::write_barrier_post(JavaThread* thread, volatile CardValue* card_addr) { > 616: assert(thread == JavaThread::current(), "pre-condition"); Not obvious thread is expected/required to be current ------------- PR: https://git.openjdk.org/jdk/pull/11359 From jcking at openjdk.org Mon Nov 28 11:06:31 2022 From: jcking at openjdk.org (Justin King) Date: Mon, 28 Nov 2022 11:06:31 GMT Subject: Integrated: JDK-8297309: Memory leak in ShenandoahFullGC In-Reply-To: References: Message-ID: On Mon, 21 Nov 2022 04:25:13 GMT, Justin King wrote: > Signed-off-by: Justin King This pull request has now been integrated. 
Changeset: b80f5af6 Author: Justin King Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/b80f5af6981440aec14f4dedbc5ee46364d0254c Stats: 5 lines in 2 files changed: 5 ins; 0 del; 0 mod 8297309: Memory leak in ShenandoahFullGC Reviewed-by: rkennke, shade ------------- PR: https://git.openjdk.org/jdk/pull/11255 From jvernee at openjdk.org Mon Nov 28 12:11:00 2022 From: jvernee at openjdk.org (Jorn Vernee) Date: Mon, 28 Nov 2022 12:11:00 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v31] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 17:33:06 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues Latest version looks good to me as well ------------- Marked as reviewed by jvernee (Reviewer). PR: https://git.openjdk.org/jdk/pull/10872 From eosterlund at openjdk.org Mon Nov 28 12:14:59 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 28 Nov 2022 12:14:59 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v5] In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. 
> > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. 
Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: Patricio concerns ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11111/files - new: https://git.openjdk.org/jdk/pull/11111/files/3de25624..ad52bc7c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 7 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Mon Nov 28 12:15:03 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 28 Nov 2022 12:15:03 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 21 Nov 2022 12:17:02 GMT, Patricio Chilano Mateo wrote: >> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Richard comments > > I went through the changes and all looks good to me. Only minor comments. > > Thanks, > Patricio Thanks for the review @pchilano! I made the changes you requested. > src/hotspot/share/runtime/continuationFreezeThaw.cpp line 1393: > >> 1391: // Guaranteed to be in young gen / newly allocated memory >> 1392: assert(!chunk->requires_barriers(), "Unfamiliar GC requires barriers on TLAB allocation"); >> 1393: _barriers = false; > > Do we need to explicitly set _barriers to false? It's already initialized to be false (same above for the UseZGC case). 
That would also allow to simplify the code a bit I think to be just an if statement that calls requires_barriers() for the "ZGC_ONLY(!UseZGC &&) (SHENANDOAHGC_ONLY(UseShenandoahGC ||) allocator.took_slow_path())" case, and then ZGC and the fast path could use just separate asserts outside conditionals. It's mainly there to improve readability at the moment. The simplification you have in mind applies well now, but unfortunately doesn't apply well for generational ZGC. And the main point of this PR is to prepare for generational ZGC integration. So I would prefer to leave it the way it is, if you are okay with that? ------------- PR: https://git.openjdk.org/jdk/pull/11111 From shade at openjdk.org Mon Nov 28 12:26:41 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 28 Nov 2022 12:26:41 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods [v2] In-Reply-To: References: Message-ID: > With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. > > We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. 
> > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_32 fastdebug `tier1` > - [x] Linux x86_32 fastdebug `tier2` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Revert some additions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/11359/files - new: https://git.openjdk.org/jdk/pull/11359/files/cfd86289..cde0c198 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=11359&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11359&range=00-01 Stats: 13 lines in 3 files changed: 0 ins; 13 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/11359.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11359/head:pull/11359 PR: https://git.openjdk.org/jdk/pull/11359 From shade at openjdk.org Mon Nov 28 12:26:42 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 28 Nov 2022 12:26:42 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods [v2] In-Reply-To: References: Message-ID: <5_vN_U3k0tUArJD0XcWBn4_a7pFKxt-PmQT677--DcQ=.673f717d-ae71-460e-93fc-0036473330c9@github.com> On Sat, 26 Nov 2022 08:18:42 GMT, David Holmes wrote: > Unclear for many JVMCI functions that the thread argument is actually intended/required to be the current thread. It seems unused in many cases so why is it passed? Yes, I agree the initial patch over-reached in some places. Please see new commit, which reduces it. I left `thread == JavaThread::current()` checks where I can argue the threads are expected to be current. 
------------- PR: https://git.openjdk.org/jdk/pull/11359 From mdoerr at openjdk.org Mon Nov 28 13:11:13 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 28 Nov 2022 13:11:13 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v5] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 28 Nov 2022 12:14:59 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. 
> > Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: > > Patricio concerns I think PPC64 needs the change, too, now: https://github.com/openjdk/jdk/blob/c05dc80234a6beff3fa4d2de3228928c639da083/src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp#L1660 ------------- PR: https://git.openjdk.org/jdk/pull/11111 From pchilanomate at openjdk.org Mon Nov 28 15:18:10 2022 From: pchilanomate at openjdk.org (Patricio Chilano Mateo) Date: Mon, 28 Nov 2022 15:18:10 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v4] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 21 Nov 2022 12:17:02 GMT, Patricio Chilano Mateo wrote: >> Erik ?sterlund has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix Richard comments > > I went through the changes and all looks good to me. Only minor comments. > > Thanks, > Patricio > Thanks for the review @pchilano! I made the changes you requested. > Looks good, thanks Erik! ------------- PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Mon Nov 28 15:49:30 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Mon, 28 Nov 2022 15:49:30 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v6] In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: <_P0NXex3w0yz8V-4FXZdTKT4Jt_eskqOYRykIoWqVrI=.4553b958-0425-460c-be29-a1ce884d0f27@github.com> > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. 
> > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. Erik Österlund has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: - PPC support - Merge branch 'master' into 8296875_refactor_loom_code - Patricio concerns - Fix Richard comments - Indentation fix - Fix verification and RISC-V support - Generational ZGC: Loom support ------------- Changes: https://git.openjdk.org/jdk/pull/11111/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11111&range=05 Stats: 978 lines in 42 files changed: 641 ins; 228 del; 109 mod Patch: https://git.openjdk.org/jdk/pull/11111.diff Fetch: git fetch https://git.openjdk.org/jdk pull/11111/head:pull/11111 PR: https://git.openjdk.org/jdk/pull/11111 From mdoerr at openjdk.org Mon Nov 28 15:49:32 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 28 Nov 2022 15:49:32 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v5] In-Reply-To: References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Mon, 28 Nov 2022 12:14:59 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. 
To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > Erik Österlund has updated the pull request incrementally with one additional commit since the last revision: > > Patricio concerns Thanks for the update! ------------- PR: https://git.openjdk.org/jdk/pull/11111 From pminborg at openjdk.org Mon Nov 28 16:44:47 2022 From: pminborg at openjdk.org (Per Minborg) Date: Mon, 28 Nov 2022 16:44:47 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v31] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 17:33:06 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues Looks good on API level. ------------- Marked as reviewed by pminborg (no project role). 
PR: https://git.openjdk.org/jdk/pull/10872 From psandoz at openjdk.org Mon Nov 28 18:30:46 2022 From: psandoz at openjdk.org (Paul Sandoz) Date: Mon, 28 Nov 2022 18:30:46 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v31] In-Reply-To: References: Message-ID: On Wed, 23 Nov 2022 17:33:06 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues Marked as reviewed by psandoz (Reviewer). src/java.base/share/classes/jdk/internal/foreign/FunctionDescriptorImpl.java line 57: > 55: * {@return the return layout (if any) associated with this function descriptor} > 56: */ > 57: public final Optional returnLayout() { No need for `final` since class is final. Suggestion: public Optional returnLayout() { src/java.base/share/classes/jdk/internal/foreign/SlicingAllocator.java line 33: > 31: public final class SlicingAllocator implements SegmentAllocator { > 32: > 33: public static final long DEFAULT_BLOCK_SIZE = 4 * 1024; Not used. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Mon Nov 28 19:29:08 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Mon, 28 Nov 2022 19:29:08 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v32] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/97168155..6699ad99 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=31 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=30-31 Stats: 8 lines in 2 files changed: 0 ins; 2 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From dholmes at openjdk.org Tue Nov 29 02:03:20 2022 From: dholmes at openjdk.org (David Holmes) Date: Tue, 29 Nov 2022 02:03:20 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods [v2] In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 12:26:41 GMT, Aleksey Shipilev wrote: >> With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. >> >> We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. 
>> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier2` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Revert some additions That seems more recognisably correct. It would be even better if `thread` were named `current` in those cases where it must be, but that is a separate RFE. Thanks. ------------- Marked as reviewed by dholmes (Reviewer). PR: https://git.openjdk.org/jdk/pull/11359 From wkemper at openjdk.org Tue Nov 29 23:11:31 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 29 Nov 2022 23:11:31 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: Merge tag jdk-20+25 ------------- Commit messages: - Merge tag 'jdk-20+25' into merge-jdk-20-25 - 8297533: ProblemList java/io/File/TempDirDoesNotExist.java test failing on windows-x64 - 8297529: ProblemList javax/swing/JFileChooser/8046391/bug8046391.java on windows-x64 - 8297525: jdk/jshell/ToolBasicTest.java fails after JDK-8295984 - 7181214: Need specify SKF translateKey(SecurityKey) method requires instance of PBEKey for PBKDF2 algorithms - 8297338: JFR: RemoteRecordingStream doesn't respect setMaxAge and setMaxSize - 8290313: Produce warning when user specified java.io.tmpdir directory doesn't exist - 8297154: Improve safepoint cleanup logging - 8297507: Update header after JDK-8297230 - 8295984: Remove unexpected JShell feature - ... and 113 more: https://git.openjdk.org/shenandoah/compare/b594e541...a1efb66c The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. 
Changes: https://git.openjdk.org/shenandoah/pull/175/files Stats: 44630 lines in 581 files changed: 16639 ins; 16876 del; 11115 mod Patch: https://git.openjdk.org/shenandoah/pull/175.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/175/head:pull/175 PR: https://git.openjdk.org/shenandoah/pull/175 From coleenp at openjdk.org Wed Nov 30 00:23:17 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 30 Nov 2022 00:23:17 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods [v2] In-Reply-To: References: Message-ID: <2GEzNJsx32XuhuFq7rkgF3C6bEUxZ_MzOh7pptfUuHY=.e3791947-02e0-408e-97a6-d2c71744b2b4@github.com> On Mon, 28 Nov 2022 12:26:41 GMT, Aleksey Shipilev wrote: >> With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. >> >> We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier2` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Revert some additions I thought this was going to be in the JRT_LEAF macro in interfaceSupport.inline.hpp but this seems fine. ------------- Marked as reviewed by coleenp (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/11359 From darcy at openjdk.org Wed Nov 30 04:58:24 2022 From: darcy at openjdk.org (Joe Darcy) Date: Wed, 30 Nov 2022 04:58:24 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v32] In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 19:29:08 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments src/java.base/share/classes/java/lang/foreign/Linker.java line 288: > 286: > 287: /** > 288: * {@return A linker option used to denote the index of the first variadic argument layout in a Typo: "A linker" vs "a linker" ------------- PR: https://git.openjdk.org/jdk/pull/10872 From shade at openjdk.org Wed Nov 30 09:08:16 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 30 Nov 2022 09:08:16 GMT Subject: RFR: 8297600: Check current thread in selected JRT_LEAF methods [v2] In-Reply-To: References: Message-ID: On Mon, 28 Nov 2022 12:26:41 GMT, Aleksey Shipilev wrote: >> With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. 
>> >> We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier2` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Revert some additions Thanks! I am integrating then. ------------- PR: https://git.openjdk.org/jdk/pull/11359 From shade at openjdk.org Wed Nov 30 09:12:26 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 30 Nov 2022 09:12:26 GMT Subject: Integrated: 8297600: Check current thread in selected JRT_LEAF methods In-Reply-To: References: Message-ID: <4NaGNlP74men4NBqQkg5eRSEtjYo4RCeRM7hrSrTwCs=.0dce735b-71a7-4fc4-b0a9-2ee7fdf2d5d1@github.com> On Thu, 24 Nov 2022 19:23:29 GMT, Aleksey Shipilev wrote: > With [JDK-8275286](https://bugs.openjdk.org/browse/JDK-8275286), we added the `Thread::current()` checks for most of the JRT entries. But `JRT_LEAF` is still not checked, because not every `JRT_LEAF` carries a `JavaThread` argument. Having assertions there helps for two reasons. First, these methods can be called from the stub/compiler code, which might be erroneous with thread handling (especially in x86_32 that does not have a dedicated thread register). Second, in the post-Loom world, current thread can change suddenly, as evidenced here: https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-November/060779.html. > > We can add the thread checks to relevant `JRT_LEAF` methods that accept `JavaThread*` too. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] Linux x86_32 fastdebug `tier1` > - [x] Linux x86_32 fastdebug `tier2` This pull request has now been integrated. 
Changeset: b3501fd1 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/b3501fd11c59813515b46f80283e22b094c6e251 Stats: 19 lines in 7 files changed: 19 ins; 0 del; 0 mod 8297600: Check current thread in selected JRT_LEAF methods Reviewed-by: dholmes, coleenp ------------- PR: https://git.openjdk.org/jdk/pull/11359 From mcimadamore at openjdk.org Wed Nov 30 12:30:50 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 12:30:50 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v33] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Polish javadoc: * Make sure that first para of class javadoc is succinct and descriptive * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. 
* Other minor typos fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/6699ad99..5a75118b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=31-32 Stats: 59 lines in 10 files changed: 19 ins; 18 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From eosterlund at openjdk.org Wed Nov 30 14:11:39 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 30 Nov 2022 14:11:39 GMT Subject: RFR: 8296875: Generational ZGC: Refactor loom code [v6] In-Reply-To: <_P0NXex3w0yz8V-4FXZdTKT4Jt_eskqOYRykIoWqVrI=.4553b958-0425-460c-be29-a1ce884d0f27@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> <_P0NXex3w0yz8V-4FXZdTKT4Jt_eskqOYRykIoWqVrI=.4553b958-0425-460c-be29-a1ce884d0f27@github.com> Message-ID: On Mon, 28 Nov 2022 15:49:30 GMT, Erik Österlund wrote: >> The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. >> >> In particular, >> 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops. For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. >> >> 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors.
But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. >> >> 3) Refactoring the stack chunk allocation code >> >> Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. > > Erik Österlund has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - PPC support > - Merge branch 'master' into 8296875_refactor_loom_code > - Patricio concerns > - Fix Richard comments > - Indentation fix > - Fix verification and RISC-V support > - Generational ZGC: Loom support Thanks for the reviews @coleenp and @TheRealMDoerr! ------------- PR: https://git.openjdk.org/jdk/pull/11111 From eosterlund at openjdk.org Wed Nov 30 14:11:39 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 30 Nov 2022 14:11:39 GMT Subject: Integrated: 8296875: Generational ZGC: Refactor loom code In-Reply-To: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> References: <2o2G0DQuCzMxGA0hq148c5E5ysEXUTKf9ymWsa7emOc=.35fa21f1-374e-4d0b-9619-68c81ac89301@github.com> Message-ID: On Fri, 11 Nov 2022 16:16:18 GMT, Erik Österlund wrote: > The current loom code makes some assumptions about GC that will not work with generational ZGC. We should make this code more GC agnostic, and provide a better interface for talking to the GC. > > In particular, > 1) All GCs have a way of encoding oops inside of the heap differently to oops outside of the heap. For non-ZGC collectors, that is compressed oops.
For ZGC, that is colored pointers. With generational ZGC, pointers on-heap will be colored and pointers off-heap will be "colorless". So we need to generalize encoding and decoding of oops in the heap, for loom. > > 2) The cont_oop is located on a stack. In order to access it we need to start_processing on that thread, if it isn't the current thread. This happened to work so far for ZGC, because the stale pointers had enough colors. But with generational ZGC, these on-stack oops will be colorless, so we have to be more accurate here and ensure processing really has started on any thread that cont_oop is used on. To make life a bit easier, I'm moving the oop processing responsibility for these oops to the thread instead. Currently there is no more than one of these, so doing it lazily per frame seems a bit overkill. > > 3) Refactoring the stack chunk allocation code > > Tested with tier1-5 and manually running Skynet. No regressions detected. We have also been running with this (yet a slightly different backend) in the generational ZGC repo for a while now. This pull request has now been integrated. Changeset: be99e84c Author: Erik Österlund URL: https://git.openjdk.org/jdk/commit/be99e84c98786ff9c2c9ca1a979dc17ba810ae09 Stats: 978 lines in 42 files changed: 641 ins; 228 del; 109 mod 8296875: Generational ZGC: Refactor loom code Co-authored-by: Stefan Karlsson Co-authored-by: Axel Boldt-Christmas Reviewed-by: stefank, rrich, pchilanomate ------------- PR: https://git.openjdk.org/jdk/pull/11111 From mcimadamore at openjdk.org Wed Nov 30 15:14:26 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 15:14:26 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v34] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Address review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/5a75118b..ce85d182 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=32-33 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 30 15:30:40 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 15:30:40 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 67 additional commits since the last revision: - Merge branch 'master' into PR_20 - Address review comment - Polish javadoc: * Make sure that first para of class javadoc is succinct and descriptive * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. 
* Other minor typos fixes - Address review comments - * remove unused Scoped interface * re-add trusting of final fields in layout class implementations * Fix BulkOps benchmark, which had alignment issues - Fix bit vs. byte mismatch in test - Fix wrong check in MemorySegment::spliterator/elements (The check which ensures that the segment size is multiple of spliterator element size is bogus) - Address more review comments - Fix bad @throws in MemorySegment::copy methods - Address review comments - ... and 57 more: https://git.openjdk.org/jdk/compare/d0d99ae1...8668fb39 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/ce85d182..8668fb39 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=33-34 Stats: 65983 lines in 1282 files changed: 30320 ins; 21180 del; 14483 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 16:41:16 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 16:41:16 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 15:30:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 67 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Address review comment > - Polish javadoc: > * Make sure that first para of class javadoc is succinct and descriptive > * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) > * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. > * Other minor typos fixes > - Address review comments > - * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues > - Fix bit vs. byte mismatch in test > - Fix wrong check in MemorySegment::spliterator/elements > (The check which ensures that the segment size is multiple of spliterator element size is bogus) > - Address more review comments > - Fix bad @throws in MemorySegment::copy methods > - Address review comments > - ... and 57 more: https://git.openjdk.org/jdk/compare/4c9f206a...8668fb39 src/java.base/share/classes/java/nio/channels/FileChannel.java line 1004: > 1002: * Maps a region of this channel's file into a new mapped memory segment, with the given offset, > 1003: * size and memory session. The {@linkplain MemorySegment#address() address} of the returned memory segment > 1004: * is the starting address of the mapped off-heap region backing the segment. Would you mind reflowing this paragraph so that the line lengths are a bit more consistent with the paragraphs that follow? That would also help with side-by-side views when looking at changes.
------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 16:44:42 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 16:44:42 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: <2wV4OEuvJQXQGnRNZ7qhv1PZuMlEYFBqnDgOp5L6D9U=.76a9f864-872a-4c39-a02e-2b0646414571@github.com> On Wed, 30 Nov 2022 15:30:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 67 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Address review comment > - Polish javadoc: > * Make sure that first para of class javadoc is succinct and descriptive > * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) > * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. > * Other minor typos fixes > - Address review comments > - * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues > - Fix bit vs. byte mismatch in test > - Fix wrong check in MemorySegment::spliterator/elements > (The check which ensures that the segment size is multiple of spliterator element size is bogus) > - Address more review comments > - Fix bad @throws in MemorySegment::copy methods > - Address review comments > - ... 
and 57 more: https://git.openjdk.org/jdk/compare/e1da2b11...8668fb39 src/java.base/share/classes/java/lang/foreign/SegmentScope.java line 1: > 1: package java.lang.foreign; This one is missing a header. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 16:49:00 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 16:49:00 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: <1Ao-HvZlCHoGgLIJSJTXOOnvoR1pRtaZoljYUZpFEv0=.1a93bb09-06aa-4fad-905a-41f5f12b6945@github.com> On Wed, 30 Nov 2022 15:30:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 67 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Address review comment > - Polish javadoc: > * Make sure that first para of class javadoc is succinct and descriptive > * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) > * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. > * Other minor typos fixes > - Address review comments > - * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues > - Fix bit vs. 
byte mismatch in test > - Fix wrong check in MemorySegment::spliterator/elements > (The check which ensures that the segment size is multiple of spliterator element size is bogus) > - Address more review comments > - Fix bad @throws in MemorySegment::copy methods > - Address review comments > - ... and 57 more: https://git.openjdk.org/jdk/compare/3e822e72...8668fb39 src/java.base/share/classes/java/lang/ModuleLayer.java line 313: > 311: * where possible. > 312: * > 313: * @since 20 We usually put the "@since 20" after the params/return/throws. ------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 16:56:56 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 16:56:56 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: On Wed, 30 Nov 2022 15:30:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 67 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Address review comment > - Polish javadoc: > * Make sure that first para of class javadoc is succinct and descriptive > * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) > * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. 
> * Other minor typos fixes > - Address review comments > - * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues > - Fix bit vs. byte mismatch in test > - Fix wrong check in MemorySegment::spliterator/elements > (The check which ensures that the segment size is multiple of spliterator element size is bogus) > - Address more review comments > - Fix bad @throws in MemorySegment::copy methods > - Address review comments > - ... and 57 more: https://git.openjdk.org/jdk/compare/9a07ecac...8668fb39 src/java.base/share/classes/java/lang/foreign/SegmentScope.java line 69: > 67: * Creates a new scope that is managed, automatically, by the garbage collector. > 68: * Segments associated with the returned scope can be > 69: * {@linkplain SegmentScope#isAccessibleBy(Thread) accessed} by multiple threads. "can be accessed by multiple threads" hints at a bit of concurrency. It might be clearer to say "by any thread". ------------- PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 17:08:13 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 17:08:13 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v35] In-Reply-To: References: Message-ID: <9lmq8wD3c1YD4s0NGrHQXV6CgVJyJ9S42xUmx1FzXJ0=.2f64ed69-a54d-4517-ae29-1539f06cffe0@github.com> On Wed, 30 Nov 2022 15:30:40 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 67 additional commits since the last revision: > > - Merge branch 'master' into PR_20 > - Address review comment > - Polish javadoc: > * Make sure that first para of class javadoc is succinct and descriptive > * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) > * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. > * Other minor typos fixes > - Address review comments > - * remove unused Scoped interface > * re-add trusting of final fields in layout class implementations > * Fix BulkOps benchmark, which had alignment issues > - Fix bit vs. byte mismatch in test > - Fix wrong check in MemorySegment::spliterator/elements > (The check which ensures that the segment size is multiple of spliterator element size is bogus) > - Address more review comments > - Fix bad @throws in MemorySegment::copy methods > - Address review comments > - ... and 57 more: https://git.openjdk.org/jdk/compare/ddc274f3...8668fb39 src/java.base/share/classes/java/lang/foreign/Arena.java line 135: > 133: * @apiNote This operation is not idempotent; that is, closing an already closed arena always results in an > 134: * exception being thrown. This reflects a deliberate design choice: arena state transitions should be > 135: * manifest in the client code; a failure in any of these transitions reveals a bug in the underlying application Not important but I'm not sure about the wording here. Maybe you mean "manifested" or "should manifest" ? src/java.base/share/classes/java/lang/foreign/Arena.java line 155: > 153: > 154: /** > 155: * {@return a new confined arena} For completeness, this should probably say "a new confined arena owned by the current thread". 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From duke at openjdk.org Wed Nov 30 17:55:30 2022 From: duke at openjdk.org (Ashutosh Mehra) Date: Wed, 30 Nov 2022 17:55:30 GMT Subject: RFR: 8297285: Shenandoah pacing causes assertion failure during VM initialization [v2] In-Reply-To: References: Message-ID: On Thu, 24 Nov 2022 23:53:44 GMT, Ashutosh Mehra wrote: >> Please review the fix for the assertion failure seen during VM init due to pacing in shenandoah gc. >> The fix is to avoid pacing during VM initialization as the main thread is not yet an active java thread. >> >> Signed-off-by: Ashutosh Mehra > > Ashutosh Mehra has updated the pull request incrementally with one additional commit since the last revision: > > Include runtime/javaThread.inline.hpp for JavaThread::is_terminated() to > fix compile failure > > Signed-off-by: Ashutosh Mehra Can I please get reviews for this PR. ------------- PR: https://git.openjdk.org/jdk/pull/11360 From wkemper at openjdk.org Wed Nov 30 18:06:48 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 30 Nov 2022 18:06:48 GMT Subject: RFR: Merge openjdk/jdk:master [v2] In-Reply-To: References: Message-ID: > Merge tag jdk-20+25 William Kemper has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 185 commits: - Merge tag 'jdk-20+25' into merge-jdk-20-25 Added tag jdk-20+25 for changeset 09ac9eb5 - Various build fixes Reviewed-by: kdnilsen - Load balance remset scan Reviewed-by: wkemper - Do not apply evacuation budgets in non-generational mode Reviewed-by: kdnilsen - Merge openjdk/jdk:master - Change affiliation representation Reviewed-by: ysr, wkemper - Improve some defaults and remove unused options for generational mode Reviewed-by: rkennke - Merge openjdk/jdk:master Reviewed-by: rkennke - Improve evacuation instrumentation Reviewed-by: kdnilsen - Fix preemption of coalesce and fill Reviewed-by: wkemper - ... 
and 175 more: https://git.openjdk.org/shenandoah/compare/09ac9eb5...a1efb66c ------------- Changes: https://git.openjdk.org/shenandoah/pull/175/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=175&range=01 Stats: 15009 lines in 148 files changed: 13749 ins; 507 del; 753 mod Patch: https://git.openjdk.org/shenandoah/pull/175.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/175/head:pull/175 PR: https://git.openjdk.org/shenandoah/pull/175 From wkemper at openjdk.org Wed Nov 30 18:06:52 2022 From: wkemper at openjdk.org (William Kemper) Date: Wed, 30 Nov 2022 18:06:52 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Tue, 29 Nov 2022 23:03:14 GMT, William Kemper wrote: > Merge tag jdk-20+25 This pull request has now been integrated. Changeset: f90a7701 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/f90a77016d934dd06b6e9cff000b35e00778834e Stats: 44630 lines in 581 files changed: 16639 ins; 16876 del; 11115 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/175 From mcimadamore at openjdk.org Wed Nov 30 18:14:00 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 18:14:00 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v36] In-Reply-To: References: Message-ID: <1oR6S6K1w-GPz7Mw67Sqw9s8mPI4YDyC9_FOOjIqJU4=.9e645539-2bba-4740-be1e-e61493a3252f@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. 
> > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/8668fb39..df8a4a63 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=35 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=34-35 Stats: 34 lines in 3 files changed: 29 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From alanb at openjdk.org Wed Nov 30 20:35:29 2022 From: alanb at openjdk.org (Alan Bateman) Date: Wed, 30 Nov 2022 20:35:29 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v36] In-Reply-To: <1oR6S6K1w-GPz7Mw67Sqw9s8mPI4YDyC9_FOOjIqJU4=.9e645539-2bba-4740-be1e-e61493a3252f@github.com> References: <1oR6S6K1w-GPz7Mw67Sqw9s8mPI4YDyC9_FOOjIqJU4=.9e645539-2bba-4740-be1e-e61493a3252f@github.com> Message-ID: On Wed, 30 Nov 2022 18:14:00 GMT, Maurizio Cimadamore wrote: >> This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. >> >> [1] - https://openjdk.org/jeps/434 > > Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Marked as reviewed by alanb (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 30 21:56:51 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 21:56:51 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v37] In-Reply-To: References: Message-ID: <60l3vr69yRpaCUeht5gNEVsf7ODpvdqFHpHdjxnfkAo=.a9b6e1de-4f38-4cf2-a840-e5cb249c522c@github.com> > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10872/files - new: https://git.openjdk.org/jdk/pull/10872/files/df8a4a63..198f30c0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=35-36 Stats: 6 lines in 2 files changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Wed Nov 30 22:05:59 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Wed, 30 Nov 2022 22:05:59 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) [v38] In-Reply-To: References: Message-ID: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Maurizio Cimadamore has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 70 commits: - Merge branch 'master' into PR_20 - Address review comments - Address review comments - Merge branch 'master' into PR_20 - Address review comment - Polish javadoc: * Make sure that first para of class javadoc is succinct and descriptive * Remove references to "access" var handle or "memory segment view" var handle (just use var handle) * Minor tweak to layout classes javadoc - use `@see` in value layouts instead of a dedicated para. * Other minor typos fixes - Address review comments - * remove unused Scoped interface * re-add trusting of final fields in layout class implementations * Fix BulkOps benchmark, which had alignment issues - Fix bit vs. byte mismatch in test - Fix wrong check in MemorySegment::spliterator/elements (The check which ensures that the segment size is multiple of spliterator element size is bogus) - ... and 60 more: https://git.openjdk.org/jdk/compare/4485d4e5...8b5dc0f0 ------------- Changes: https://git.openjdk.org/jdk/pull/10872/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=37 Stats: 13807 lines in 254 files changed: 5780 ins; 4448 del; 3579 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872