From vladimir.kempik at gmail.com Thu Jul 28 20:50:00 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 28 Jul 2022 23:50:00 +0300
Subject: Unaligned memory access with JDK
Message-ID: <4BBDE959-CB10-4B54-9CB6-BE126AFE0065@gmail.com>

Hello
I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
Here is what I found:

void MacroAssembler::stop(const char* msg) {
  const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
  BLOCK_COMMENT(msg1);
  illegal_instruction(Assembler::csr::time);
  emit_int64((uintptr_t)msg1);
}

and emit_int64 is:

void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }

The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:

void emit_int32(int32_t x) {
  address curr = end();
  *((int32_t*) curr) = x;
  set_end(curr + sizeof(int32_t));
}

void emit_int8(int8_t x1) {
  address curr = end();
  *((int8_t*) curr++) = x1;
  set_end(curr);
}

So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?

From palmer at dabbelt.com Thu Jul 28 21:19:10 2022
From: palmer at dabbelt.com (Palmer Dabbelt)
Date: Thu, 28 Jul 2022 14:19:10 -0700 (PDT)
Subject: Unaligned memory access with JDK
In-Reply-To: <4BBDE959-CB10-4B54-9CB6-BE126AFE0065@gmail.com>
Message-ID:

On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
> Hello
> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
> Here is what I found:
>
> void MacroAssembler::stop(const char* msg) {
>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>   BLOCK_COMMENT(msg1);
>   illegal_instruction(Assembler::csr::time);
>   emit_int64((uintptr_t)msg1);
> }
>
> and emit_int64 is:
> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>
> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>
> void emit_int32(int32_t x) {
>   address curr = end();
>   *((int32_t*) curr) = x;
>   set_end(curr + sizeof(int32_t));
> }
>
> void emit_int8(int8_t x1) {
>   address curr = end();
>   *((int8_t*) curr++) = x1;
>   set_end(curr);
> }
>
> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?

Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).

That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V. IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).
It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.

From vladimir.kempik at gmail.com Thu Jul 28 21:22:33 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Fri, 29 Jul 2022 00:22:33 +0300
Subject: Unaligned memory access with JDK
In-Reply-To:
References:
Message-ID: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com>

Right, the system I was playing with doesn't have misaligned access emulation enabled in M-mode, but that can be enabled.
Thanks for clarifying; I was wondering whether it's a bug or a feature.

> On 29 July 2022, at 00:19, Palmer Dabbelt wrote:
>
> On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
>> Hello
>> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
>> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
>> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
>> Here is what I found:
>>
>> void MacroAssembler::stop(const char* msg) {
>>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>>   BLOCK_COMMENT(msg1);
>>   illegal_instruction(Assembler::csr::time);
>>   emit_int64((uintptr_t)msg1);
>> }
>>
>> and emit_int64 is:
>> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>>
>> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>>
>> void emit_int32(int32_t x) {
>>   address curr = end();
>>   *((int32_t*) curr) = x;
>>   set_end(curr + sizeof(int32_t));
>> }
>>
>> void emit_int8(int8_t x1) {
>>   address curr = end();
>>   *((int8_t*) curr++) = x1;
>>   set_end(curr);
>> }
>>
>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>
> Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).
>
> That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V. IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).
>
> It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.
From palmer at dabbelt.com Thu Jul 28 21:30:03 2022
From: palmer at dabbelt.com (Palmer Dabbelt)
Date: Thu, 28 Jul 2022 14:30:03 -0700 (PDT)
Subject: Unaligned memory access with JDK
In-Reply-To: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com>
Message-ID:

On Thu, 28 Jul 2022 14:22:33 PDT (-0700), vladimir.kempik at gmail.com wrote:
> Right, the system I was playing with doesn't have misaligned access emulation enabled in M-mode, but that can be enabled.

I suppose it's also undefined whether these accesses need to be handled in M-mode or if Linux should also do so, but there's currently no code in Linux to do that and no way for userspace to control that handling.

In the long run we probably want to stop pretending misaligned accesses are supported when they're actually emulated, via something like PR_SET_UNALIGN (and some M-mode interface). That'd require some code, though, and either way we'd need to leave the default as is.

> Thanks for clarifying; I was wondering whether it's a bug or a feature.

I guess it's both ;)

>
>> On 29 July 2022, at 00:19, Palmer Dabbelt wrote:
>>
>> On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
>>> Hello
>>> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
>>> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
>>> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
>>> Here is what I found:
>>>
>>> void MacroAssembler::stop(const char* msg) {
>>>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>>>   BLOCK_COMMENT(msg1);
>>>   illegal_instruction(Assembler::csr::time);
>>>   emit_int64((uintptr_t)msg1);
>>> }
>>>
>>> and emit_int64 is:
>>> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>>>
>>> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>>>
>>> void emit_int32(int32_t x) {
>>>   address curr = end();
>>>   *((int32_t*) curr) = x;
>>>   set_end(curr + sizeof(int32_t));
>>> }
>>>
>>> void emit_int8(int8_t x1) {
>>>   address curr = end();
>>>   *((int8_t*) curr++) = x1;
>>>   set_end(curr);
>>> }
>>>
>>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>>
>> Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).
>>
>> That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V. IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).
>>
>> It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.
From yadonn.wang at huawei.com Fri Jul 29 07:03:58 2022
From: yadonn.wang at huawei.com (wangyadong (E))
Date: Fri, 29 Jul 2022 07:03:58 +0000
Subject: Unaligned memory access with JDK
In-Reply-To:
References: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com>
Message-ID:

> IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function
> correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).

Agreed. The RISC-V port should cover hardware without support for user-invisible misaligned accesses.

> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?

We'll fix it, and that's great if you're interested :)

Yadong

-----Original Message-----
From: riscv-port-dev [mailto:riscv-port-dev-retn at openjdk.org] On Behalf Of Palmer Dabbelt
Sent: Friday, July 29, 2022 5:30 AM
To: vladimir.kempik at gmail.com
Cc: riscv-port-dev at openjdk.org
Subject: Re: Unaligned memory access with JDK

On Thu, 28 Jul 2022 14:22:33 PDT (-0700), vladimir.kempik at gmail.com wrote:
> Right, the system I was playing with doesn't have misaligned access emulation enabled in M-mode, but that can be enabled.

I suppose it's also undefined whether these accesses need to be handled in M-mode or if Linux should also do so, but there's currently no code in Linux to do that and no way for userspace to control that handling.

In the long run we probably want to stop pretending misaligned accesses are supported when they're actually emulated, via something like PR_SET_UNALIGN (and some M-mode interface). That'd require some code, though, and either way we'd need to leave the default as is.

> Thanks for clarifying; I was wondering whether it's a bug or a feature.

I guess it's both ;)

>
>> On 29 July 2022, at
00:19, Palmer Dabbelt wrote:
>>
>> On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
>>> Hello
>>> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
>>> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
>>> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
>>> Here is what I found:
>>>
>>> void MacroAssembler::stop(const char* msg) {
>>>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>>>   BLOCK_COMMENT(msg1);
>>>   illegal_instruction(Assembler::csr::time);
>>>   emit_int64((uintptr_t)msg1);
>>> }
>>>
>>> and emit_int64 is:
>>> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>>>
>>> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>>>
>>> void emit_int32(int32_t x) {
>>>   address curr = end();
>>>   *((int32_t*) curr) = x;
>>>   set_end(curr + sizeof(int32_t));
>>> }
>>>
>>> void emit_int8(int8_t x1) {
>>>   address curr = end();
>>>   *((int8_t*) curr++) = x1;
>>>   set_end(curr);
>>> }
>>>
>>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>>
>> Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).
>>
>> That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V.
IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux). It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.

From vladimir.kempik at gmail.com Fri Jul 29 07:24:19 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Fri, 29 Jul 2022 10:24:19 +0300
Subject: Unaligned memory access with JDK
In-Reply-To:
References: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com>
Message-ID: <72FB087D-396F-4029-A4AD-37EE6EF54560@gmail.com>

Hello,
Should I file a JBS bug then?

I have also found misaligned accesses in the stack setup prologue of template-interpreter-generated methods.
For example, the putstatic code:

0x3f89d033c0: ff8a0a13 addi s4,s4,-8
0x3f89d033c4: 00aa3023 sd a0,0(s4)
0x3f89d033c8: 0380006f j 56 # 0x3f89d03400
0x3f89d033cc: ff8a0a13 addi s4,s4,-8
0x3f89d033d0: 00aa2027 fsw fa0,0(s4)
0x3f89d033d4: 02c0006f j 44 # 0x3f89d03400
0x3f89d033d8: ff0a0a13 addi s4,s4,-16
0x3f89d033dc: 00aa3027 fsd fa0,0(s4)
0x3f89d033e0: 0200006f j 32 # 0x3f89d03400
0x3f89d033e4: ff0a0a13 addi s4,s4,-16
0x3f89d033e8: 000a3423 sd zero,8(s4)
0x3f89d033ec: 00aa3023 sd a0,0(s4)
0x3f89d033f0: 0100006f j 16 # 0x3f89d03400
0x3f89d033f4: ff8a0a13 addi s4,s4,-8
0x3f89d033f8: 0005053b addw a0,a0,zero
0x3f89d033fc: 00aa3023 sd a0,0(s4)
0x3f89d03400: 001b5683 lhu a3,1(s6)

On 29 July 2022, at
10:03, wangyadong (E) wrote:
>
>> IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function
>> correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).
>
> Agreed. The RISC-V port should cover hardware without support for user-invisible misaligned accesses.
>
>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>
> We'll fix it, and that's great if you're interested :)
>
> Yadong
>
> -----Original Message-----
> From: riscv-port-dev [mailto:riscv-port-dev-retn at openjdk.org] On Behalf Of Palmer Dabbelt
> Sent: Friday, July 29, 2022 5:30 AM
> To: vladimir.kempik at gmail.com
> Cc: riscv-port-dev at openjdk.org
> Subject: Re: Unaligned memory access with JDK
>
> On Thu, 28 Jul 2022 14:22:33 PDT (-0700), vladimir.kempik at gmail.com wrote:
>> Right, the system I was playing with doesn't have misaligned access emulation enabled in M-mode, but that can be enabled.
>
> I suppose it's also undefined whether these accesses need to be handled in M-mode or if Linux should also do so, but there's currently no code in Linux to do that and no way for userspace to control that handling.
>
> In the long run we probably want to stop pretending misaligned accesses are supported when they're actually emulated, via something like PR_SET_UNALIGN (and some M-mode interface). That'd require some code, though, and either way we'd need to leave the default as is.
>
>> Thanks for clarifying; I was wondering whether it's a bug or a feature.
>
> I guess it's both ;)
>
>>
>>> On 29 July 2022, at 00:19, Palmer Dabbelt wrote:
>>>
>>> On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
>>>> Hello
>>>> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
>>>> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
>>>> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
>>>> Here is what I found:
>>>>
>>>> void MacroAssembler::stop(const char* msg) {
>>>>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>>>>   BLOCK_COMMENT(msg1);
>>>>   illegal_instruction(Assembler::csr::time);
>>>>   emit_int64((uintptr_t)msg1);
>>>> }
>>>>
>>>> and emit_int64 is:
>>>> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>>>>
>>>> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>>>>
>>>> void emit_int32(int32_t x) {
>>>>   address curr = end();
>>>>   *((int32_t*) curr) = x;
>>>>   set_end(curr + sizeof(int32_t));
>>>> }
>>>>
>>>> void emit_int8(int8_t x1) {
>>>>   address curr = end();
>>>>   *((int8_t*) curr++) = x1;
>>>>   set_end(curr);
>>>> }
>>>>
>>>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>>>
>>> Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).
>>>
>>> That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V.
IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux). It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.

From yadonn.wang at huawei.com Fri Jul 29 07:41:54 2022
From: yadonn.wang at huawei.com (wangyadong (E))
Date: Fri, 29 Jul 2022 07:41:54 +0000
Subject: Unaligned memory access with JDK
In-Reply-To: <72FB087D-396F-4029-A4AD-37EE6EF54560@gmail.com>
References: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com> <72FB087D-396F-4029-A4AD-37EE6EF54560@gmail.com>
Message-ID: <4588851b17b4465caf6c4fc8cc7787d5@huawei.com>

> Hello, Should I file a JBS bug then?

Of course.

-----Original Message-----
From: Vladimir Kempik [mailto:vladimir.kempik at gmail.com]
Sent: Friday, July 29, 2022 3:24 PM
To: wangyadong (E)
Cc: Palmer Dabbelt ; riscv-port-dev at openjdk.org
Subject: Re: Unaligned memory access with JDK

Hello,
Should I file a JBS bug then?

I have also found misaligned accesses in the stack setup prologue of template-interpreter-generated methods.
For example, the putstatic code:

0x3f89d033c0: ff8a0a13 addi s4,s4,-8
0x3f89d033c4: 00aa3023 sd a0,0(s4)
0x3f89d033c8: 0380006f j 56 # 0x3f89d03400
0x3f89d033cc: ff8a0a13 addi s4,s4,-8
0x3f89d033d0: 00aa2027 fsw fa0,0(s4)
0x3f89d033d4: 02c0006f j 44 # 0x3f89d03400
0x3f89d033d8: ff0a0a13 addi s4,s4,-16
0x3f89d033dc: 00aa3027 fsd fa0,0(s4)
0x3f89d033e0: 0200006f j 32 # 0x3f89d03400
0x3f89d033e4: ff0a0a13 addi s4,s4,-16
0x3f89d033e8: 000a3423 sd zero,8(s4)
0x3f89d033ec: 00aa3023 sd a0,0(s4)
0x3f89d033f0: 0100006f j 16 # 0x3f89d03400
0x3f89d033f4: ff8a0a13 addi s4,s4,-8
0x3f89d033f8: 0005053b addw a0,a0,zero
0x3f89d033fc: 00aa3023 sd a0,0(s4)
0x3f89d03400: 001b5683 lhu a3,1(s6)
On 29 July 2022, at 10:03, wangyadong (E) wrote:
>
>> IIUC most of the RISC-V OpenJDK port was done on systems that have
>> hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux).
>
> Agreed. The RISC-V port should cover hardware without support for user-invisible misaligned accesses.
>
>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>
> We'll fix it, and that's great if you're interested :)
>
> Yadong
>
> -----Original Message-----
> From: riscv-port-dev [mailto:riscv-port-dev-retn at openjdk.org] On
> Behalf Of Palmer Dabbelt
> Sent: Friday, July 29, 2022 5:30 AM
> To: vladimir.kempik at gmail.com
> Cc: riscv-port-dev at openjdk.org
> Subject: Re: Unaligned memory access with JDK
>
> On Thu, 28 Jul 2022 14:22:33 PDT (-0700), vladimir.kempik at gmail.com wrote:
>> Right, the system I was playing with doesn't have misaligned access emulation enabled in M-mode, but that can be enabled.
>
> I suppose it's also undefined whether these accesses need to be handled in M-mode or if Linux should also do so, but there's currently no code in Linux to do that and no way for userspace to control that handling.
>
> In the long run we probably want to stop pretending misaligned accesses are supported when they're actually emulated, via something like PR_SET_UNALIGN (and some M-mode interface). That'd require some code, though, and either way we'd need to leave the default as is.
>
>> Thanks for clarifying; I was wondering whether it's a bug or a feature.
>
> I guess it's both ;)
>
>>
>>> On 29 July 2022, at 00:19, Palmer Dabbelt wrote:
>>>
>>> On Thu, 28 Jul 2022 13:50:00 PDT (-0700), vladimir.kempik at gmail.com wrote:
>>>> Hello
>>>> I was recently playing with a simple RISC-V core running on an FPGA and found that the JDK crashes on it.
>>>> It crashes with SIGILL : ILL_TRP, on a simple load-from-memory instruction.
>>>> So I figured the main issue is an unaligned memory access in MacroAssembler::stop(); that RISC-V core is pretty simple and didn't support unaligned memory access.
>>>> Here is what I found:
>>>>
>>>> void MacroAssembler::stop(const char* msg) {
>>>>   const char * msg1 = ((uint64_t)msg) & ~0x07 + 0x08;
>>>>   BLOCK_COMMENT(msg1);
>>>>   illegal_instruction(Assembler::csr::time);
>>>>   emit_int64((uintptr_t)msg1);
>>>> }
>>>>
>>>> and emit_int64 is:
>>>> void emit_int64( int64_t x) { *((int64_t*) end()) = x; set_end(end() + sizeof(int64_t)); }
>>>>
>>>> The problem is that the end() pointer is shared between emit_int64, emit_int32, emit_int8, etc., and none of them cares about natural memory alignment for the processed types:
>>>>
>>>> void emit_int32(int32_t x) {
>>>>   address curr = end();
>>>>   *((int32_t*) curr) = x;
>>>>   set_end(curr + sizeof(int32_t));
>>>> }
>>>>
>>>> void emit_int8(int8_t x1) {
>>>>   address curr = end();
>>>>   *((int8_t*) curr++) = x1;
>>>>   set_end(curr);
>>>> }
>>>>
>>>> So my question is: RISC-V cores without unaligned memory access support - are they supported by the RISC-V OpenJDK port?
>>>
>>> Support for misaligned accesses lives in a weird grey area in RISC-V: misaligned accesses used to be mandated by the ISA, but that requirement was removed in 2018 via 61cadb9 ("Provide new description of misaligned load/store behavior compatible with privileged architecture."). I just sent a patch to document this; it looks like we never bothered to write it down (probably because nobody was watching for the ISA change).
>>>
>>> That said, some implementations support misaligned accesses via an M-mode trap handler, as implementations can do essentially anything they want in RISC-V.
IIUC most of the RISC-V OpenJDK port was done on systems that have hardware support for misaligned accesses, but even on systems that trap to M-mode the port should function correctly -- sure it'll be slow, but the support should otherwise be transparent to userspace (and even to Linux). It might be worth fixing that performance issue, but if you're seeing a SIGILL from a misaligned access then there's likely also a functional bug in the emulation routines or Linux.

From vladimir.kempik at gmail.com Fri Jul 29 10:30:44 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Fri, 29 Jul 2022 13:30:44 +0300
Subject: The usage of fence.i in openjdk
Message-ID: <845742B4-B2D8-40C1-8BD0-D60142EFD45E@gmail.com>

Hello
I was looking at how generated executable code is synced across all harts in OpenJDK and found a few things not in line with the spec.
Looking at the spec: https://github.com/riscv/riscv-isa-manual/blob/master/src/zifencei.tex , there are a few important points:

> Because FENCE.I only orders stores with a hart's own instruction
> fetches, application code should only rely upon FENCE.I if the
> application thread will not be migrated to a different hart. The EEI
> can provide mechanisms for efficient multiprocessor instruction-stream
> synchronization.

I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.

There are a few places where fence.i (via fence_i()) is used in OpenJDK at the moment:

void Assembler::ifence() {
  fence_i();
  if (UseConservativeFence) {
    fence(ir, ir);
  }
}

void MacroAssembler::safepoint_ifence() {
  ifence();
  ....
}

void MacroAssembler::emit_static_call_stub() {
  // CompiledDirectStaticCall::set_to_interpreted knows the
  // exact layout of this stub.
  ifence();
  mov_metadata(xmethod, (Metadata*)NULL);
  // Jump to the entry point of the i2c stub.
  int32_t offset = 0;
  movptr_with_offset(t0, 0, offset);
  jalr(x0, t0, offset);
}

Maybe it would be good to get rid of them.

Another interesting point is:

> FENCE.I does not ensure that other RISC-V harts'
> instruction fetches will observe the local hart's stores in a
> multiprocessor system. To make a store to instruction memory visible
> to all RISC-V harts, the writing hart also has to execute a data FENCE
> before requesting that all remote RISC-V harts execute a FENCE.I.

Here is how we do the flush_icache call:

static void icache_flush(long int start, long int end)
{
  const int SYSCALL_RISCV_FLUSH_ICACHE = 259;
  register long int __a7 asm ("a7") = SYSCALL_RISCV_FLUSH_ICACHE;
  register long int __a0 asm ("a0") = start;
  register long int __a1 asm ("a1") = end;
  // the flush can be applied to either all threads or only the current.
  // 0 means a global icache flush, and the icache flush will be applied
  // to other harts concurrently executing.
  register long int __a2 asm ("a2") = 0;
  __asm__ volatile ("ecall\n\t"
      : "+r" (__a0)
      : "r" (__a0), "r" (__a1), "r" (__a2), "r" (__a7)
      : "memory");
}

Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.

Let's discuss these points.
Regards, Vladimir.

From yangfei at iscas.ac.cn Fri Jul 29 13:56:47 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Fri, 29 Jul 2022 21:56:47 +0800 (GMT+08:00)
Subject: Unaligned memory access with JDK
In-Reply-To: <72FB087D-396F-4029-A4AD-37EE6EF54560@gmail.com>
References: <29811AA7-733F-4674-9BEC-B01F48A92D65@gmail.com> <72FB087D-396F-4029-A4AD-37EE6EF54560@gmail.com>
Message-ID: <4f56abf8.4c001.1824a3ed234.Coremail.yangfei@iscas.ac.cn>

> -----Original Messages-----
> From: "Vladimir Kempik"
> Sent Time: 2022-07-29 15:24:19 (Friday)
> To: "wangyadong (E)"
> Cc: "Palmer Dabbelt" , "riscv-port-dev at openjdk.org"
> Subject: Re: Unaligned memory access with JDK
>
> Hello, Should I file a JBS bug then?
>
> I have also found misaligned accesses in the stack setup prologue of template-interpreter-generated methods.
> For example, the putstatic code:
>
> 0x3f89d033c0: ff8a0a13 addi s4,s4,-8
> 0x3f89d033c4: 00aa3023 sd a0,0(s4)
> 0x3f89d033c8: 0380006f j 56 # 0x3f89d03400
> 0x3f89d033cc: ff8a0a13 addi s4,s4,-8
> 0x3f89d033d0: 00aa2027 fsw fa0,0(s4)
> 0x3f89d033d4: 02c0006f j 44 # 0x3f89d03400
> 0x3f89d033d8: ff0a0a13 addi s4,s4,-16
> 0x3f89d033dc: 00aa3027 fsd fa0,0(s4)
> 0x3f89d033e0: 0200006f j 32 # 0x3f89d03400
> 0x3f89d033e4: ff0a0a13 addi s4,s4,-16
> 0x3f89d033e8: 000a3423 sd zero,8(s4)
> 0x3f89d033ec: 00aa3023 sd a0,0(s4)
> 0x3f89d033f0: 0100006f j 16 # 0x3f89d03400
> 0x3f89d033f4: ff8a0a13 addi s4,s4,-8
> 0x3f89d033f8: 0005053b addw a0,a0,zero
> 0x3f89d033fc: 00aa3023 sd a0,0(s4)
> 0x3f89d03400: 001b5683 lhu a3,1(s6)   <-- MISALIGNED ACCESS
> 0x3f89d03404: 00569613 slli a2,a3,5
> 0x3f89d03408: 00cd0633 add a2,s10,a2
> 0x3f89d0340c: 02860493 addi s1,a2,40
> 0x3f89d03410: 00048493 mv s1,s1
> 0x3f89d03414: 0ff0000f fence iorw,iorw
> 0x3f89d03418: 0004e483 lwu s1,0(s1)
> 0x3f89d0341c: 0af0000f fence ir,iorw

Yes. With the current status of unaligned access described by Palmer, I think we should identify these cases in the port and put them under the control of the existing JVM option AvoidUnalignedAccesses.

Thanks,
Fei

From yadonn.wang at huawei.com Fri Jul 29 15:12:21 2022
From: yadonn.wang at huawei.com (wangyadong (E))
Date: Fri, 29 Jul 2022 15:12:21 +0000
Subject: Re: The usage of fence.i in openjdk
In-Reply-To: <845742B4-B2D8-40C1-8BD0-D60142EFD45E@gmail.com>
References: <845742B4-B2D8-40C1-8BD0-D60142EFD45E@gmail.com>
Message-ID:

Hi, Vladimir,

> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.

Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
To a certain extent, this code is a hangover from the AArch64 port, and we use fence.i to mimic isb.

> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.

You're right. It'd be better to place a full data fence before the syscall, because we cannot guarantee the syscall leaves a data fence there before the IPI requests a remote fence.i on the other harts.

Yadong

-----Original Message-----
From: riscv-port-dev On Behalf Of Vladimir Kempik
Sent: Friday, July 29, 2022 6:31 PM
To: riscv-port-dev at openjdk.org
Subject: The usage of fence.i in openjdk

Hello
I was looking at how generated executable code is synced across all harts in OpenJDK and found a few things not in line with the spec.
Looking at the spec: https://github.com/riscv/riscv-isa-manual/blob/master/src/zifencei.tex , there are a few important points:

> Because FENCE.I only orders stores with a hart's own instruction
> fetches, application code should only rely upon FENCE.I if the
> application thread will not be migrated to a different hart. The EEI
> can provide mechanisms for efficient multiprocessor instruction-stream
> synchronization.

I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.

There are a few places where fence.i (via fence_i()) is used in OpenJDK at the moment:

void Assembler::ifence() {
  fence_i();
  if (UseConservativeFence) {
    fence(ir, ir);
  }
}

void MacroAssembler::safepoint_ifence() {
  ifence();
  ....
}

void MacroAssembler::emit_static_call_stub() {
  // CompiledDirectStaticCall::set_to_interpreted knows the
  // exact layout of this stub.
  ifence();
  mov_metadata(xmethod, (Metadata*)NULL);
  // Jump to the entry point of the i2c stub.
  int32_t offset = 0;
  movptr_with_offset(t0, 0, offset);
  jalr(x0, t0, offset);
}

Maybe it would be good to get rid of them.

Another interesting point is:

> FENCE.I does not ensure that other RISC-V harts'
> instruction fetches will observe the local hart's stores in a
> multiprocessor system.
> To make a store to instruction memory visible to
> all RISC-V harts, the writing hart also has to execute a data FENCE
> before requesting that all remote RISC-V harts execute a FENCE.I.

Here is how we do the flush_icache call:

static void icache_flush(long int start, long int end)
{
    const int SYSCALL_RISCV_FLUSH_ICACHE = 259;
    register long int __a7 asm ("a7") = SYSCALL_RISCV_FLUSH_ICACHE;
    register long int __a0 asm ("a0") = start;
    register long int __a1 asm ("a1") = end;
    // the flush can be applied to either all threads or only the current.
    // 0 means a global icache flush, and the icache flush will be applied
    // to other harts concurrently executing.
    register long int __a2 asm ("a2") = 0;
    __asm__ volatile ("ecall\n\t"
                      : "+r" (__a0)
                      : "r" (__a0), "r" (__a1), "r" (__a2), "r" (__a7)
                      : "memory");
}

Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.

Let's discuss these points.

Regards, Vladimir.

From vladimir.kempik at gmail.com Fri Jul 29 17:41:18 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Fri, 29 Jul 2022 20:41:18 +0300
Subject: The usage of fence.i in openjdk
In-Reply-To: 
References: <845742B4-B2D8-40C1-8BD0-D60142EFD45E@gmail.com>
Message-ID: <00EDAECF-F0AE-473A-B124-5C24CC1B8542@gmail.com>

> On 29 Jul 2022, at 18:12, wangyadong (E) wrote:
>
> Hi, Vladimir,
>
>> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
> Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
> To a certain extent, this code is a holdover from the aarch64 port and we use fence.i to mimic isb.
>

Basically, a fence.i executed on a hart only does anything for that hart. And your thread can be rescheduled to another hart soon after the fence.i.

Let's say you have a thread A running on hart 1.
You've changed some code in region 0x11223300 and need a fence.i before executing that code.
You execute a fence.i in your thread A running on hart 1.
Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it wasn't refreshed, because the fence.i was executed on hart 1 (and never on hart 2). Your thread will execute old code, or a mix of old and new code.

Regards, Vladimir.

>> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
> You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall itself issues a data fence before the IPI remote fence.i to the other harts.
>
> Yadong

From palmer at dabbelt.com Fri Jul 29 18:02:59 2022
From: palmer at dabbelt.com (Palmer Dabbelt)
Date: Fri, 29 Jul 2022 11:02:59 -0700 (PDT)
Subject: Re: The usage of fence.i in openjdk
In-Reply-To: 
Message-ID: 

On Fri, 29 Jul 2022 08:12:21 PDT (-0700), yadonn.wang at huawei.com wrote:
> Hi, Vladimir,
>
>> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
> Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
There's no way to trap on fence.i, so instead the Linux uABI just requires userspace to make a syscall (or a VDSO library call). If userspace directly executes a fence.i then the kernel won't know, and thus can't ensure the thread state is adequately moved to the new hart during scheduling, which may result in incorrect behavior.

We've known for a while that this will cause performance issues for JITs on some implementations, but so far it's just not been a priority. I poked around the RISC-V OpenJDK port a few months ago and I think there are some improvements that can be made there, but we're probably also going to want some kernel support. Exactly how to fix it is probably going to depend on the workloads and implementations, though, and while I think I understand the OpenJDK part pretty well it's not clear what the other fence.i implementations are doing.

In the long run we're also going to need some ISA support for doing this sanely, but that's sort of a different problem. I've been kind of crossing my fingers and hoping that anyone who has a system where JIT performance is important is also going to have some better write/fetch ordering instructions, but given how long it's been maybe that's a bad plan.

That said, the direct fence.i is incorrect and it's likely that the long-term solution involves making the VDSO call, so it's probably best to swap over. I remember having written a patch to do that at some point, but I can't find it, so maybe I just forgot to send it?

> To a certain extent, this code is a holdover from the aarch64 port and we use fence.i to mimic isb.
>> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
> You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall itself issues a data fence before the IPI remote fence.i to the other harts.

That's not necessary with the current implementation, but it's not
I think we should probably just have strongly ordered memory as implicit in all user/kernel transitions, but I'm not sure if there's an issue there.

>
> Yadong
>
> -----Original Message-----
> From: riscv-port-dev on behalf of Vladimir Kempik
> Sent: 29 July 2022 18:31
> To: riscv-port-dev at openjdk.org
> Subject: The usage of fence.i in openjdk
>
> Hello
> I was looking at how generated executable code is synchronized across all harts in openjdk and found a few things not in line with the spec.
>
> Looking at the spec: https://github.com/riscv/riscv-isa-manual/blob/master/src/zifencei.tex , there are a few important points:
>
>> Because FENCE.I only orders stores with a hart's own instruction
>> fetches, application code should only rely upon FENCE.I if the
>> application thread will not be migrated to a different hart. The EEI
>> can provide mechanisms for efficient multiprocessor instruction-stream
>> synchronization.
>
> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
> There are a few places where fence.i (via fence_i()) is used in openjdk at the moment:
>
> void Assembler::ifence() {
>   fence_i();
>   if (UseConservativeFence) {
>     fence(ir, ir);
>   }
> }
>
> void MacroAssembler::safepoint_ifence() {
>   ifence();
>   ....
> }
>
> void MacroAssembler::emit_static_call_stub() {
>   // CompiledDirectStaticCall::set_to_interpreted knows the
>   // exact layout of this stub.
>
>   ifence();
>   mov_metadata(xmethod, (Metadata*)NULL);
>
>   // Jump to the entry point of the i2c stub.
>   int32_t offset = 0;
>   movptr_with_offset(t0, 0, offset);
>   jalr(x0, t0, offset);
> }
>
> Maybe it would be good to get rid of them.
>
> Another interesting point is:
>> FENCE.I does not ensure that other RISC-V harts'
>> instruction fetches will observe the local hart's stores in a
>> multiprocessor system.
>> To make a store to instruction memory visible to
>> all RISC-V harts, the writing hart also has to execute a data FENCE
>> before requesting that all remote RISC-V harts execute a FENCE.I.
>
> Here is how we do the flush_icache call:
>
> static void icache_flush(long int start, long int end)
> {
>     const int SYSCALL_RISCV_FLUSH_ICACHE = 259;
>     register long int __a7 asm ("a7") = SYSCALL_RISCV_FLUSH_ICACHE;
>     register long int __a0 asm ("a0") = start;
>     register long int __a1 asm ("a1") = end;
>     // the flush can be applied to either all threads or only the current.
>     // 0 means a global icache flush, and the icache flush will be applied
>     // to other harts concurrently executing.
>     register long int __a2 asm ("a2") = 0;
>     __asm__ volatile ("ecall\n\t"
>                       : "+r" (__a0)
>                       : "r" (__a0), "r" (__a1), "r" (__a2), "r" (__a7)
>                       : "memory");
> }
>
> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
>
> Let's discuss these points.
>
> Regards, Vladimir.
>

From palmer at dabbelt.com Fri Jul 29 18:07:42 2022
From: palmer at dabbelt.com (Palmer Dabbelt)
Date: Fri, 29 Jul 2022 11:07:42 -0700 (PDT)
Subject: The usage of fence.i in openjdk
In-Reply-To: <00EDAECF-F0AE-473A-B124-5C24CC1B8542@gmail.com>
Message-ID: 

On Fri, 29 Jul 2022 10:41:18 PDT (-0700), vladimir.kempik at gmail.com wrote:
>
>> On 29 Jul 2022, at 18:12, wangyadong (E) wrote:
>>
>> Hi, Vladimir,
>>
>>> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
>> Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
>> To a certain extent, this code is a holdover from the aarch64 port and we use fence.i to mimic isb.
>>
> Basically, a fence.i executed on a hart only does anything for that hart. And your thread can be rescheduled to another hart soon after the fence.i.
>
> Let's say you have a thread A running on hart 1.
> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
> You execute a fence.i in your thread A running on hart 1.
> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it wasn't refreshed, because the fence.i was executed on hart 1 (and never on hart 2). Your thread will execute old code, or a mix of old and new code.

Sorry for forking the thread, I saw this come in right after I sent my message. This is correct; the performance-related reasons we're not doing a fence.i when scheduling are described in that message:

>
> Regards, Vladimir.
>
>>> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
>> You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall itself issues a data fence before the IPI remote fence.i to the other harts.
>>
>> Yadong
>>

From yadonn.wang at huawei.com Sat Jul 30 02:41:36 2022
From: yadonn.wang at huawei.com (wangyadong (E))
Date: Sat, 30 Jul 2022 02:41:36 +0000
Subject: The usage of fence.i in openjdk
In-Reply-To: 
References: <00EDAECF-F0AE-473A-B124-5C24CC1B8542@gmail.com>
Message-ID: <303ab75147704124b9759934da0107e5@huawei.com>

> Let's say you have a thread A running on hart 1.
> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
> You execute a fence.i in your thread A running on hart 1.
> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it wasn't refreshed, because the fence.i was executed on hart 1 (and never on hart 2). Your thread will execute old code, or a mix of old and new code.

@vladimir Thanks for your explanation.
I understand your concern now. We know the fence.i's scope, so the writing hart does not rely solely on the fence.i in the RISC-V port, but calls the icache_flush syscall in ICache::invalidate_range() every time after modifying the code.

For example:

Hart 1
void MacroAssembler::emit_static_call_stub() {
  // CompiledDirectStaticCall::set_to_interpreted knows the
  // exact layout of this stub.

  ifence();
  mov_metadata(xmethod, (Metadata*)NULL);    <- patchable code here

  // Jump to the entry point of the i2c stub.
  int32_t offset = 0;
  movptr_with_offset(t0, 0, offset);
  jalr(x0, t0, offset);
}

Hart 2 (write hart)
void NativeMovConstReg::set_data(intptr_t x) {
  // ...
  // Store x into the instruction stream.
  MacroAssembler::pd_patch_instruction_size(instruction_address(), (address)x);   <- write code
  ICache::invalidate_range(instruction_address(), movptr_instruction_size);       <- syscall here
  // ...
}

The syscall here:

void flush_icache_mm(struct mm_struct *mm, bool local)
{
	unsigned int cpu;
	cpumask_t others, *mask;

	preempt_disable();

	/* Mark every hart's icache as needing a flush for this MM. */
	mask = &mm->context.icache_stale_mask;
	cpumask_setall(mask);

	/* Flush this hart's I$ now, and mark it as flushed. */
	cpu = smp_processor_id();
	cpumask_clear_cpu(cpu, mask);
	local_flush_icache_all();

	/*
	 * Flush the I$ of other harts concurrently executing, and mark them as
	 * flushed.
	 */
	cpumask_andnot(&others, mm_cpumask(mm), cpumask_of(cpu));
	local |= cpumask_empty(&others);
	if (mm == current->active_mm && local) {
		/*
		 * It's assumed that at least one strongly ordered operation is
		 * performed on this hart between setting a hart's cpumask bit
		 * and scheduling this MM context on that hart. Sending an SBI
		 * remote message will do this, but in the case where no
		 * messages are sent we still need to order this hart's writes
		 * with flush_icache_deferred().
		 */
		smp_mb();
	} else if (IS_ENABLED(CONFIG_RISCV_SBI)) {
		sbi_remote_fence_i(&others);
	} else {
		on_each_cpu_mask(&others, ipi_remote_fence_i, NULL, 1);
	}

	preempt_enable();
}

> Maybe it would be good to get rid of them.

So maybe we can remove the fence.i used in user space from a performance perspective, but not because it's unsafe. Please correct me if I'm wrong.

Yadong

-----Original Message-----
From: Palmer Dabbelt [mailto:palmer at dabbelt.com]
Sent: Saturday, July 30, 2022 2:08 AM
To: vladimir.kempik at gmail.com
Cc: wangyadong (E) ; riscv-port-dev at openjdk.org
Subject: Re: The usage of fence.i in openjdk

On Fri, 29 Jul 2022 10:41:18 PDT (-0700), vladimir.kempik at gmail.com wrote:
>
>> On 29 Jul 2022, at 18:12, wangyadong (E) wrote:
>>
>> Hi, Vladimir,
>>
>>> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
>> Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
>> To a certain extent, this code is a holdover from the aarch64 port and we use fence.i to mimic isb.
>>
> Basically, a fence.i executed on a hart only does anything for
> that hart. And your thread can be rescheduled to another hart
> soon after the fence.i.
>
> Let's say you have a thread A running on hart 1.
> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
> You execute a fence.i in your thread A running on hart 1.
> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it wasn't refreshed, because the fence.i was executed on hart 1 (and never on hart 2). Your thread will execute old code, or a mix of old and new code.

Sorry for forking the thread, I saw this come in right after I sent my message.
This is correct; the performance-related reasons we're not doing a fence.i when scheduling are described in that message:

>
> Regards, Vladimir.
>
>>> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
>> You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall itself issues a data fence before the IPI remote fence.i to the other harts.
>>
>> Yadong
>>

From vladimir.kempik at gmail.com Sat Jul 30 10:29:59 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Sat, 30 Jul 2022 13:29:59 +0300
Subject: The usage of fence.i in openjdk
In-Reply-To: <303ab75147704124b9759934da0107e5@huawei.com>
References: <00EDAECF-F0AE-473A-B124-5C24CC1B8542@gmail.com> <303ab75147704124b9759934da0107e5@huawei.com>
Message-ID: <1CADD7EC-49F0-4665-BF59-E8526D6AF54C@gmail.com>

Hello
Thanks for the explanation. That sounds like the fence.i in userspace code is not needed at all.

Regards, Vladimir

> On 30 Jul 2022, at 05:41, wangyadong (E) wrote:
>
>> Let's say you have a thread A running on hart 1.
>> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
>> You execute a fence.i in your thread A running on hart 1.
>> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
>> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it wasn't refreshed, because the fence.i was executed on hart 1 (and never on hart 2). Your thread will execute old code, or a mix of old and new code.
>
> @vladimir Thanks for your explanation. I understand your concern now. We know the fence.i's scope, so the writing hart does not rely solely on the fence.i in the RISC-V port, but calls the icache_flush syscall in ICache::invalidate_range() every time after modifying the code.
>
> For example:
> Hart 1
> void MacroAssembler::emit_static_call_stub() {
>   // CompiledDirectStaticCall::set_to_interpreted knows the
>   // exact layout of this stub.
>
>   ifence();
>   mov_metadata(xmethod, (Metadata*)NULL);    <- patchable code here
>
>   // Jump to the entry point of the i2c stub.
>   int32_t offset = 0;
>   movptr_with_offset(t0, 0, offset);
>   jalr(x0, t0, offset);
> }
>
> Hart 2 (write hart)
> void NativeMovConstReg::set_data(intptr_t x) {
>   // ...
>   // Store x into the instruction stream.
>   MacroAssembler::pd_patch_instruction_size(instruction_address(), (address)x);   <- write code
>   ICache::invalidate_range(instruction_address(), movptr_instruction_size);       <- syscall here
>   // ...
> }
>

From yangfei at iscas.ac.cn Sat Jul 30 13:35:11 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Sat, 30 Jul 2022 21:35:11 +0800 (GMT+08:00)
Subject: Re: Re: The usage of fence.i in openjdk
In-Reply-To: 
References: 
Message-ID: <37edd86e.4d5b0.1824f5168bf.Coremail.yangfei@iscas.ac.cn>

Hi Palmer,

> -----Original Messages-----
> From: "Palmer Dabbelt"
> Sent Time: 2022-07-30 02:02:59 (Saturday)
> To: yadonn.wang at huawei.com
> Cc: vladimir.kempik at gmail.com, riscv-port-dev at openjdk.org
> Subject: Re: The usage of fence.i in openjdk
>
> On Fri, 29 Jul 2022 08:12:21 PDT (-0700), yadonn.wang at huawei.com wrote:
> > Hi, Vladimir,
> >
> >> I believe Java's threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
> > Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
>
> The issue here is that fence.i applies to the current hart, whereas
> Linux userspace processes just know the current thread. Executing a
> fence.i in userspace adds a bunch of orderings to the thread's state
> that can only be moved to a new hart via another fence.i.
> Normally that sort of thing isn't such a big deal (there's similar
> state for things like fence and lr), but the SiFive chips implement
> fence.i by flushing the instruction cache, which is a slow operation
> to put on the scheduling path.
>
> The only way for the kernel to avoid that fence.i on the scheduling path
> is for it to know if one's been executed in userspace. There's no way
> to trap on fence.i, so instead the Linux uABI just requires userspace to
> make a syscall (or a VDSO library call). If userspace directly executes
> a fence.i then the kernel won't know, and thus can't ensure the thread
> state is adequately moved to the new hart during scheduling, which may
> result in incorrect behavior.
>
> We've known for a while that this will cause performance issues for JITs
> on some implementations, but so far it's just not been a priority. I
> poked around the RISC-V OpenJDK port a few months ago and I think
> there are some improvements that can be made there, but we're probably
> also going to want some kernel support. Exactly how to fix it is
> probably going to depend on the workloads and implementations, though,
> and while I think I understand the OpenJDK part pretty well it's not
> clear what the other fence.i implementations are doing.
>
> In the long run we're also going to need some ISA support for doing this
> sanely, but that's sort of a different problem. I've been kind of
> crossing my fingers and hoping that anyone who has a system where JIT
> performance is important is also going to have some better write/fetch
> ordering instructions, but given how long it's been maybe that's a bad
> plan.
>
> That said, the direct fence.i is incorrect and it's likely that the
> long-term solution involves making the VDSO call, so it's probably best
> to swap over. I remember having written a patch to do that at some
> point, but I can't find it, so maybe I just forgot to send it?

Thanks for all those considerations about the design. It's very helpful.
> > To a certain extent, this code is a holdover from the aarch64 port and we use fence.i to mimic isb.
> >> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
> > You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall itself issues a data fence before the IPI remote fence.i to the other harts.
>
> That's not necessary with the current implementation, but it's not

But it looks like this is not reflected in the kernel function flush_icache_mm? I checked the code, and it looks to me like the data fence is issued on only one path:

 59         if (mm == current->active_mm && local) {
 60                 /*
 61                  * It's assumed that at least one strongly ordered operation is
 62                  * performed on this hart between setting a hart's cpumask bit
 63                  * and scheduling this MM context on that hart. Sending an SBI
 64                  * remote message will do this, but in the case where no
 65                  * messages are sent we still need to order this hart's writes
 66                  * with flush_icache_deferred().
 67                  */
 68                 smp_mb();
 69         }

I just want to make sure that the data fence is there in this syscall with the current implementation. But I am not familiar with the riscv Linux kernel code, and it would be appreciated if you have more details.

Thanks,
Fei