From John.Rose at Sun.COM Sat Mar 1 14:58:38 2008
From: John.Rose at Sun.COM (John Rose)
Date: Sat, 01 Mar 2008 14:58:38 -0800
Subject: Is 'optimized' a legit target?
In-Reply-To: <025e01c87ae4$e99bc090$bcd341b0$@com>
References: <025e01c87ae4$e99bc090$bcd341b0$@com>
Message-ID: <1AE218E2-9540-430C-A0BE-428240022A99@sun.com>

Yes, 'optimized' is legit. It supports more flags, for tuning experiments, etc.
Its performance characteristics are closer to product, because it omits all the 'assert' code.

Here are the various build subdirectories, in brief:

product -- hardwires many flag values, no asserts, code is optimized
optimized -- most flag values variable, no asserts, code is optimized
fastdebug -- all flag values variable, asserts enabled, code is optimized
jvmg -- all flag values variable, asserts enabled, code not optimized (debuggable)
generated -- machine-generated source code and other stuff created during the build process
debug -- old name for jvmg; this one should go away
profiled -- dead a long time; this one should have gone away years ago

Best,
-- John

From ted at tedneward.com Sat Mar 1 23:18:20 2008
From: ted at tedneward.com (Ted Neward)
Date: Sat, 1 Mar 2008 23:18:20 -0800
Subject: Is 'optimized' a legit target?
In-Reply-To: <1AE218E2-9540-430C-A0BE-428240022A99@sun.com>
References: <025e01c87ae4$e99bc090$bcd341b0$@com> <1AE218E2-9540-430C-A0BE-428240022A99@sun.com>
Message-ID: <022401c87c35$9b417b90$d1c472b0$@com>

When I tried to build optimized (b24), it failed very early in the process. I'm assuming that's not supposed to happen? :-)

Ted Neward
Java, .NET, XML Services
Consulting, Teaching, Speaking, Writing
http://www.tedneward.com

> -----Original Message-----
> From: John.Rose at Sun.COM [mailto:John.Rose at Sun.COM]
> Sent: Saturday, March 01, 2008 2:59 PM
> To: Ted Neward
> Cc: build-dev at openjdk.java.net; hotspot-dev at openjdk.dev.java.net
> Subject: Re: Is 'optimized' a legit target?
>
> Yes, 'optimized' is legit. It supports more flags, for tuning
> experiments, etc.
> Its performance characteristics are closer to product, because
> it omits all the 'assert' code.
>
> Here are the various build subdirectories, in brief:
>
> product -- hardwires many flag values, no asserts, code is optimized
> optimized -- most flag values variable, no asserts, code is optimized
> fastdebug -- all flag values variable, asserts enabled, code is
> optimized
> jvmg -- all flag values variable, asserts enabled, code not optimized
> (debuggable)
> generated -- machine-generated source code and other stuff created
> during the build process
> debug -- old name for jvmg; this one should go away
> profiled -- dead a long time; this one should have gone away years ago
>
> Best,
> -- John
>

From andi at firstfloor.org Sun Mar 2 16:24:54 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 01:26:28 +0100
Subject: [PATCH] Linux NUMA support for HotSpot
Message-ID: <20080303002454.GA28952@basil.nowhere.org>

Hi,

Some time ago I played a bit with the NUMA heap support in the hotspot sources.
In particularly I implemented an interface to the Linux libnuma interface and wrote some simple benchmarks to see if it was any faster on a Opteron system (it wasn't unfortunately). After some debugging I concluded that my Linux NUMA interface was likely right, but the NUMA heap implementation seemed to be broken (I doubt it'll work very well even on Solaris) The implementation does not require to link libnuma always in, but dlopen()s it as needed to avoid a little bit of dll/so hell. It also uses the Linux getcpu() call which means it can actually adapt to migrating threads over nodes (unlike the Solaris implementation) Anyways just in case someone else wants to play with it here's libnuma support for Linux for HotSpot. It can be enabled with the usual options, but is disabled by default. The patch was originally against a fairly old snapshot (b13). It still applies perfectly to the latest 6 snapshot I downloaded. I wasn't able to retest it thought because I was unable to build the latest snapshot even after fiddling for an hour with all the undocumented environment variables. -Andi diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp --- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o 2007-05-24 09:30:53.000000000 +0200 +++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp 2007-06-24 16:56:57.000000000 +0200 @@ -56,6 +56,11 @@ # include # include # include +# include +# include +#ifdef __amd64__ +# include +#endif #define MAX_PATH (2 * K) @@ -81,6 +86,15 @@ char * os::Linux::_glibc_version = NULL; char * os::Linux::_libpthread_version = NULL; +void (*os::Linux::_numa_interleave_memory)(void *, size_t, const nodemask_t *) = NULL; +void (*os::Linux::_numa_setlocal_memory)(void *, size_t) = NULL; +int (*os::Linux::_numa_max_node)(void) = NULL; +nodemask_t (*os::Linux::_numa_get_run_node_mask)(void) = NULL; +int (*os::Linux::_numa_available)(void) = NULL; +int (*os::Linux::_getcpu)(unsigned *, unsigned *, void *) = NULL; +int (*os::Linux::_numa_node_to_cpus)(int node, unsigned long *buffer, int buffer_len); +bool os::Linux::getcpu_broken = false; + static jlong initial_time_count=0; static int clock_tics_per_sec = 100; @@ -739,10 +753,7 @@ osthread->set_thread_id(os::Linux::gettid()); if (UseNUMA) { - int lgrp_id = os::numa_get_group_id(); - if (lgrp_id != -1) { - thread->set_lgrp_id(lgrp_id); - } + thread->set_lgrp_id(-1); } // initialize signal mask for this thread os::Linux::hotspot_sigmask(thread); @@ -916,10 +927,10 @@ thread->set_osthread(osthread); if (UseNUMA) { - int lgrp_id = os::numa_get_group_id(); - if (lgrp_id != -1) { - thread->set_lgrp_id(lgrp_id); - } + // let it be retrieved on first access in thread context + // actually this really needs a timeout or calling getcpu with + // the cache + thread->set_lgrp_id(-1); } if (os::Linux::is_initial_thread()) { @@ -2224,25 +2231,230 @@ void os::realign_memory(char *addr, size_t bytes, size_t alignment_hint) { } void os::free_memory(char *addr, size_t bytes) { } -void os::numa_make_global(char *addr, size_t bytes) { } -void os::numa_make_local(char *addr, size_t bytes) { } -bool os::numa_topology_changed() { return false; } -size_t os::numa_get_groups_num() { return 1; } -int os::numa_get_group_id() { return 0; } -size_t os::numa_get_leaf_groups(int *ids, size_t size) { - if (size > 0) { - ids[0] = 0; - return 1; + +#ifndef __amd64__ // x86-64 has a vsyscall + +#ifndef SYS_getcpu +#ifdef __i386__ +#define SYS_getcpu 318 +#else +#error define getcpu for architecture +#endif +#endif + +static 
int getcpu_syscall(unsigned *cpu, unsigned *node, void *cache) +{ + return syscall(SYS_getcpu, cpu, node, cache); +} +#endif + +static __thread nodemask_t current_node_mask; + +void os::Linux::numa_init(void) +{ + int err = 0; + // load libnuma lazily because it is not on all systems + void *lnuma = dlopen("libnuma.so.1", RTLD_LAZY); + if (!lnuma) { + warning("NUMA requested but cannot open libnuma.so.1. NUMA disabled"); + UseNUMA = false; + return; + } + +#define NSYM(sym) \ + { typedef typeof(sym) f; \ + _##sym = (f *)dlsym(lnuma, #sym); err += (_##sym == NULL); \ + } + NSYM(numa_available); + NSYM(numa_interleave_memory); + NSYM(numa_setlocal_memory); + NSYM(numa_get_run_node_mask); + NSYM(numa_max_node); + NSYM(numa_node_to_cpus); + if (err) { + warning("NUMA requested but cannot find required symbol in libnuma. NUMA disabled"); + UseNUMA = false; + return; + } +#undef NSYM + // libnuma is never unloaded + +#ifdef __x86_64__ + // will return ENOSYS or work on all Linux x86-64 kernels + _getcpu = (int (*)(unsigned *,unsigned *,void *))VSYSCALL_ADDR(2); +#else + _getcpu = getcpu_syscall; +#endif + + if (_numa_available() < 0) { + // don't warn here for now for a simple non numa system + UseNUMA = false; + return; + } + + current_node_mask = _numa_get_run_node_mask(); +} + +// Make pages global: interleave over current cpuset limit +void os::numa_make_global(char *addr, size_t bytes) { +// os::Linux::_numa_interleave_memory(addr, bytes, ¤t_node_mask); +} + +// local memory is default, but set it anyways in case the global +// policy was differently set by numactl +// this only sets first-touch policy +// Linux also supports real page migration, but the NUMA allocator +// here doesn't seem to. +void os::numa_make_local(char *addr, size_t bytes) { + os::Linux::_numa_setlocal_memory(addr, bytes); +} + +// We just return true if the cpuset changed. 
+// RED-PEN this should be probably per thread +bool os::numa_topology_changed() { + nodemask_t newmask = os::Linux::_numa_get_run_node_mask(); + if (nodemask_equal(&newmask, ¤t_node_mask)) + return false; + fprintf(stderr,"numa topology changed\n"); + current_node_mask = newmask; + return true; +} + +size_t os::numa_get_groups_num() { + // older version of libnuma are not cpuset aware + // compute it from the runnode mask instead + nodemask_t mask = os::Linux::_numa_get_run_node_mask(); + int i, k; + k = 0; + for (i = 0; i < NUMA_NUM_NODES; i++) + if (nodemask_isset(&mask, i)) + k++; + return k; +} + +// copy because header file is still often missing +struct getcpu_cache { + unsigned long blob[128 / sizeof(long)]; +}; +static __thread getcpu_cache node_cache; + +// slow and complicated fallback method when getcpu is not working +// we do this only once per thread +int os::Linux::fallback_get_group_id() { + FILE *f = fopen("/proc/self/stat", "r"); + if (!f) + return 0; + + size_t linesz = 0; + char *line = NULL; + + int n = getline(&line, &linesz, f); + fclose(f); + if (n <= 0) { + free(line); + return 0; + } + + // find processor field + + // skip non numbers (3 fields) at the beginning + char ch; + size_t offset; + if (sscanf(line, "%*d %*s %c%n", &ch, &offset) != 1) { + free(line); + return 0; + } + + // process numbers; processor is the 39th field + char *p = line + offset; + int i, cpu; + for (i = 0; i < 39-3; i++) { + char *end; + cpu = strtol(p, &end, 0); + if (p == end) { + cpu = 0; + break; + } + p = end; + } + free(line); + + // convert to node using libnuma + int max_node = os::Linux::_numa_max_node(); + for (i = 0; i <= max_node; i++) { + unsigned long cpus[128]; + if (os::Linux::_numa_node_to_cpus(i, cpus, sizeof(cpus)) < 0) + continue; + if (cpus[cpu / sizeof(long)] & (1ULL<<(cpu%sizeof(long)))) { + return i; + } } return 0; } +int os::numa_get_group_id() { + // fast method only in Linux 2.6.19+ + // this should be fast enough it can be done everytime + unsigned cpu, node; + if (!os::Linux::getcpu_broken) { + if (os::Linux::_getcpu(&cpu, &node, &node_cache) == 0) + return node; + os::Linux::getcpu_broken = true; + } + + // otherwise use fallback once and then keep using the + // cached value. + // it would be better to have some kind of timeout + // but i don't know of a fast way to do this + int id = Thread::current()->lgrp_id(); + if (id != -1) + return id; + + id = os::Linux::fallback_get_group_id(); + Thread::current()->set_lgrp_id(id); + return id; +} + +size_t os::numa_get_leaf_groups(int *ids, size_t size) { + nodemask_t nodes; + nodes = os::Linux::_numa_get_run_node_mask(); + unsigned i, k; + k = 0; + for (i = 0; i < NUMA_NUM_NODES; i++) + if (nodemask_isset(&nodes, i)) { + // could happen when the cpuset shrinks during runtime I think + if (k >= size) + return k; + ids[k++] = i; + } + return k; +} + bool os::get_page_info(char *start, page_info* info) { + unsigned pol; + info->size = 0; + info->lgrp_id = -1; + // no nice way to detect huge pages here + if (syscall(SYS_get_mempolicy, &pol, NULL, 0, start, MPOL_F_NODE|MPOL_F_ADDR) == 0) { + info->size = os::Linux::_page_size; + info->lgrp_id = pol; + return true; + } return false; } + +// Scan the pages from start to end until a page different than +// the one described in the info parameter is encountered. 
char *os::scan_pages(char *start, char* end, page_info* page_expected, page_info* page_found) { - return end; + while (start < end) { + if (!get_page_info(start, page_found)) + return NULL; + if (page_expected->lgrp_id != page_found->lgrp_id) + return start; + start += os::Linux::_page_size; + } + return start; } bool os::uncommit_memory(char* addr, size_t size) { @@ -3571,6 +3783,9 @@ // initialize thread priority policy prio_init(); + if (UseNUMA) + Linux::numa_init(); + return JNI_OK; } diff -u openjdk/hotspot/src/os/linux/vm/os_linux.hpp-o openjdk/hotspot/src/os/linux/vm/os_linux.hpp --- openjdk/hotspot/src/os/linux/vm/os_linux.hpp-o 2007-05-24 09:30:53.000000000 +0200 +++ openjdk/hotspot/src/os/linux/vm/os_linux.hpp 2007-05-29 03:05:23.000000000 +0200 @@ -52,6 +52,17 @@ static int (*_clock_gettime)(clockid_t, struct timespec *); static int (*_pthread_getcpuclockid)(pthread_t, clockid_t *); + static void (*_numa_interleave_memory)(void *, size_t, const nodemask_t *); + static void (*_numa_setlocal_memory)(void *, size_t); + static nodemask_t (*_numa_get_run_node_mask)(void); + static int (*_numa_node_to_cpus)(int node, unsigned long *buffer, int buffer_len); + static int (*_numa_available)(void); + static int (*_numa_max_node)(void); + static int (*_getcpu)(unsigned *, unsigned *, void *); + static void numa_init(); + static bool getcpu_broken; + static int fallback_get_group_id(); + static address _initial_thread_stack_bottom; static uintptr_t _initial_thread_stack_size; From andi at firstfloor.org Sun Mar 2 16:26:28 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 3 Mar 2008 01:26:28 +0100 Subject: [PATCH] Fix a /tmp race in the linux code Message-ID: <20080303002628.GA28974@basil.nowhere.org> /tmp races are bad for you. Use mkstemp instead of a predictable name for the debugging file. -Andi diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp --- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o 2007-05-24 09:30:53.000000000 +0200 +++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp 2007-06-24 16:56:57.000000000 +0200 @@ -2185,15 +2196,12 @@ return; } - char buf[40]; int num = Atomic::add(1, &cnt); - sprintf(buf, "/tmp/hs-vm-%d-%d", os::current_process_id(), num); - unlink(buf); - - int fd = open(buf, O_CREAT | O_RDWR, S_IRWXU); + int fd = mkstemp("/tmp/hs-vm-XXXXXX"); if (fd != -1) { + fchmod(fd, S_IRWXU); off_t rv = lseek(fd, size-2, SEEK_SET); if (rv != (off_t)-1) { if (write(fd, "", 1) == 1) { @@ -2203,7 +2211,6 @@ } } close(fd); - unlink(buf); } } From andi at firstfloor.org Sun Mar 2 16:28:49 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 3 Mar 2008 01:28:49 +0100 Subject: [PATCH] Fix some dodgy inline assembler Message-ID: <20080303002849.GA28985@basil.nowhere.org> Potential miscompilation, there is no guarantee the clobbers won't conflict with the input registers. Also the ifdef is not really needed. Write it properly. Just spotted while looking at the linux lowlevel code. 
-Andi diff -u openjdk/hotspot/src/os/linux/launcher/java_md.c-o openjdk/hotspot/src/os/linux/launcher/java_md.c --- openjdk/hotspot/src/os/linux/launcher/java_md.c-o 2007-05-24 09:30:53.000000000 +0200 +++ openjdk/hotspot/src/os/linux/launcher/java_md.c 2007-05-28 15:42:23.000000000 +0200 @@ -1233,57 +1233,16 @@ uint32_t* ebxp, uint32_t* ecxp, uint32_t* edxp) { -#ifdef _LP64 - __asm__ volatile (/* Instructions */ - " movl %4, %%eax \n" - " cpuid \n" - " movl %%eax, (%0)\n" - " movl %%ebx, (%1)\n" - " movl %%ecx, (%2)\n" - " movl %%edx, (%3)\n" - : /* Outputs */ - : /* Inputs */ - "r" (eaxp), - "r" (ebxp), - "r" (ecxp), - "r" (edxp), - "r" (arg) - : /* Clobbers */ - "%rax", "%rbx", "%rcx", "%rdx", "memory" - ); -#else - uint32_t value_of_eax = 0; - uint32_t value_of_ebx = 0; - uint32_t value_of_ecx = 0; - uint32_t value_of_edx = 0; - __asm__ volatile (/* Instructions */ - /* ebx is callee-save, so push it */ - /* even though it's in the clobbers section */ - " pushl %%ebx \n" - " movl %4, %%eax \n" - " cpuid \n" - " movl %%eax, %0 \n" - " movl %%ebx, %1 \n" - " movl %%ecx, %2 \n" - " movl %%edx, %3 \n" - /* restore ebx */ - " popl %%ebx \n" - - : /* Outputs */ - "=m" (value_of_eax), - "=m" (value_of_ebx), - "=m" (value_of_ecx), - "=m" (value_of_edx) - : /* Inputs */ - "m" (arg) - : /* Clobbers */ - "%eax", "%ebx", "%ecx", "%edx" - ); - *eaxp = value_of_eax; - *ebxp = value_of_ebx; - *ecxp = value_of_ecx; - *edxp = value_of_edx; -#endif + asm(/* Instructions */ + "cpuid" + : /* Outputs */ + "=a" (*eaxp), + "=b" (*ebxp), + "=c" (*ecxp), + "=d" (*edxp) + : /* Inputs */ + "0" (arg) + ); } #endif /* __linux__ && i586 */ From David.Holmes at Sun.COM Sun Mar 2 21:11:24 2008 From: David.Holmes at Sun.COM (David Holmes) Date: Mon, 03 Mar 2008 15:11:24 +1000 Subject: [PATCH] Fix a /tmp race in the linux code In-Reply-To: <20080303002628.GA28974@basil.nowhere.org> References: <20080303002628.GA28974@basil.nowhere.org> Message-ID: <47CB887C.7020803@sun.com> Andi, Where is the race given the temp filename is unique? David Holmes Andi Kleen wrote: > /tmp races are bad for you. Use mkstemp instead of a predictable name for > the debugging file. > > -Andi > > diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp > --- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o 2007-05-24 09:30:53.000000000 +0200 > +++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp 2007-06-24 16:56:57.000000000 +0200 > @@ -2185,15 +2196,12 @@ > return; > } > > - char buf[40]; > int num = Atomic::add(1, &cnt); > > - sprintf(buf, "/tmp/hs-vm-%d-%d", os::current_process_id(), num); > - unlink(buf); > - > - int fd = open(buf, O_CREAT | O_RDWR, S_IRWXU); > + int fd = mkstemp("/tmp/hs-vm-XXXXXX"); > > if (fd != -1) { > + fchmod(fd, S_IRWXU); > off_t rv = lseek(fd, size-2, SEEK_SET); > if (rv != (off_t)-1) { > if (write(fd, "", 1) == 1) { > @@ -2203,7 +2211,6 @@ > } > } > close(fd); > - unlink(buf); > } > } > From Igor.Veresov at Sun.COM Mon Mar 3 01:52:40 2008 From: Igor.Veresov at Sun.COM (Igor Veresov) Date: Mon, 03 Mar 2008 12:52:40 +0300 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <20080303002454.GA28952@basil.nowhere.org> References: <20080303002454.GA28952@basil.nowhere.org> Message-ID: <200803031252.40200.igor.veresov@sun.com> Andi, I haven't studied your changes in detail but I have a NUMA-aware allocator for Linux in works and I do see speedups, which are similar to what I was able to get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. 
So it's all working quite well, actually. igor On Monday 03 March 2008 03:24:54 Andi Kleen wrote: > Hi, > > Some time ago I played a bit with the NUMA heap support in the hotspot > sources. In particularly I implemented an interface to the Linux libnuma > interface and wrote some simple benchmarks to see if it was any faster on a > Opteron system (it wasn't unfortunately). After some debugging I concluded > that my Linux NUMA interface was likely right, but the NUMA heap > implementation seemed to be broken (I doubt it'll work very well even on > Solaris) > > The implementation does not require to link libnuma always in, > but dlopen()s it as needed to avoid a little bit of dll/so hell. > > It also uses the Linux getcpu() call which means it can actually > adapt to migrating threads over nodes (unlike the Solaris implementation) > > Anyways just in case someone else wants to play with it here's > libnuma support for Linux for HotSpot. It can be enabled with the usual > options, but is disabled by default. > > The patch was originally against a fairly old snapshot (b13). It still > applies perfectly to the latest 6 snapshot I downloaded. I wasn't able > to retest it thought because I was unable to build the latest snapshot > even after fiddling for an hour with all the undocumented environment > variables. > > -Andi From Igor.Veresov at Sun.COM Mon Mar 3 04:19:13 2008 From: Igor.Veresov at Sun.COM (Igor Veresov) Date: Mon, 03 Mar 2008 15:19:13 +0300 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <20080303113017.GC5085@one.firstfloor.org> References: <20080303002454.GA28952@basil.nowhere.org> <200803031252.40200.igor.veresov@sun.com> <20080303113017.GC5085@one.firstfloor.org> Message-ID: <200803031519.13736.igor.veresov@sun.com> On Monday 03 March 2008 14:30:17 Andi Kleen wrote: > On Mon, Mar 03, 2008 at 12:52:40PM +0300, Igor Veresov wrote: > > I haven't studied your changes in detail but I have a NUMA-aware > > allocator for Linux in works > > Ok maybe you can do something with my patch then. > > > and I do see speedups, which are similar to what I was able to > > get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So > > it's > > Ok I only did micro benchmarks. Maybe they were not strong enough. > For some simple allocations I didn't get any numa local placement > at least according to the benchmark numbers. > Well, obviously on microbenchmarks you should see even more speedup. To be exact, there is a 30% difference in latency in a 2-socket Opteron (1 HT hop) system and there's even more for 2 and 3-hop systems. > > all working quite well, actually. > > One obvious issue I found in the Solaris code was that it neither > binds threads to nodes (afaik), nor tries to keep up with a thread > migrating to another node. It just assumes always the same thread:node > mapping which surely cannot be correct? It is however correct. Solaris assigns a home locality group (a node) to each lwp. And while lwps can be temporary migrated to a remote group, page allocation still happens on a home node and the lwp is predominantly scheduled to run in its home lgroup. For more information you could refer to the NUMA chapter of the "Solaris Internals" book or to blogs of Jonathan Chew and Alexander Kolbasov from Solaris CMT/NUMA team. 
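As an illustration of what the home lgroup means in practice, a thread can ask for it directly from liblgrp. A minimal stand-alone sketch, not taken from the HotSpot sources, assuming liblgrp is linked in directly (and with the file name lgrp_home.c as a placeholder):

/* Print the home locality group of the calling LWP on Solaris.
 * Build with: cc lgrp_home.c -llgrp
 */
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>
#include <stdio.h>

int main(void) {
    /* P_LWPID + P_MYID means "the LWP making this call". */
    lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
    if (home == -1) {
        perror("lgrp_home");
        return 1;
    }
    printf("home lgroup: %d\n", (int)home);
    return 0;
}

Since both page placement and (mostly) dispatching are keyed on that home lgroup, the VM can cache the value per thread instead of re-querying it on every allocation.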
> > On the Linux implementation I solved that by using getcpu() on > each allocation (on recent Linux it is a special optimized fast path > that is quite fast) I doubt that executing a syscall on every allocation (even if it's a TLAB allocation) is a good idea. It's many times slower than the original "bump the pointer with the CAS spin" allocator. Linux scheduler is quite reluctant to move lwps between the nodes, so checking the lwp position every, say, 64 TLAB allocations proved to be adequate. On Solaris even that is not necessary. igor From Igor.Veresov at Sun.COM Mon Mar 3 05:25:53 2008 From: Igor.Veresov at Sun.COM (Igor Veresov) Date: Mon, 03 Mar 2008 16:25:53 +0300 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <20080303125719.GA9853@one.firstfloor.org> References: <20080303002454.GA28952@basil.nowhere.org> <200803031519.13736.igor.veresov@sun.com> <20080303125719.GA9853@one.firstfloor.org> Message-ID: <200803031625.53418.igor.veresov@sun.com> On Monday 03 March 2008 15:57:19 Andi Kleen wrote: > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote: > > > > all working quite well, actually. > > > > > > One obvious issue I found in the Solaris code was that it neither > > > binds threads to nodes (afaik), nor tries to keep up with a thread > > > migrating to another node. It just assumes always the same thread:node > > > mapping which surely cannot be correct? > > > > It is however correct. Solaris assigns a home locality group (a node) to > > each lwp. And while lwps can be temporary migrated to a remote group, > > page allocation still happens on a home node and the lwp is predominantly > > scheduled to run in its home lgroup. For more information you could refer > > to > > Interesting. I tried a similar scheduling algorithm on Linux a long time > ago (it was called the "homenode scheduler") and it was a general loss on > my testing on smaller AMD systems. But maybe Solaris does it all different. > > Anyways on Linux that won't work because it doesn't have the concept > of a homenode. Yes, but it has static memory binding instead, which alleviates this problem. > > The other problems is that it seemed to always assume all the threads > will consume the whole system and set up for all nodes, which seemed dodgy. You mean the allocator? Actually it is adaptive to the allocation rate on a node, which in effect makes the whole eden space usable for applications with asymmetric per-thread allocation rate. This of course also helps with the case when the number of threads is less than the number of nodes. igor From Igor.Veresov at Sun.COM Mon Mar 3 06:20:26 2008 From: Igor.Veresov at Sun.COM (Igor Veresov) Date: Mon, 03 Mar 2008 17:20:26 +0300 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <20080303133205.GB9853@one.firstfloor.org> References: <20080303002454.GA28952@basil.nowhere.org> <200803031625.53418.igor.veresov@sun.com> <20080303133205.GB9853@one.firstfloor.org> Message-ID: <200803031720.26828.igor.veresov@sun.com> On Monday 03 March 2008 16:32:05 Andi Kleen wrote: > On Mon, Mar 03, 2008 at 04:25:53PM +0300, Igor Veresov wrote: > > On Monday 03 March 2008 15:57:19 Andi Kleen wrote: > > > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote: > > > > > > all working quite well, actually. > > > > > > > > > > One obvious issue I found in the Solaris code was that it neither > > > > > binds threads to nodes (afaik), nor tries to keep up with a thread > > > > > migrating to another node. 
It just assumes always the same > > > > > thread:node mapping which surely cannot be correct? > > > > > > > > It is however correct. Solaris assigns a home locality group (a node) > > > > to each lwp. And while lwps can be temporary migrated to a remote > > > > group, page allocation still happens on a home node and the lwp is > > > > predominantly scheduled to run in its home lgroup. For more > > > > information you could refer to > > > > > > Interesting. I tried a similar scheduling algorithm on Linux a long > > > time ago (it was called the "homenode scheduler") and it was a general > > > loss on my testing on smaller AMD systems. But maybe Solaris does it > > > all different. > > > > > > Anyways on Linux that won't work because it doesn't have the concept > > > of a homenode. > > > > Yes, but it has static memory binding instead, which alleviates this > > problem. > > That would require statically binding the threads too which is by default > not a good idea without explicit user configuration Not necessarily. It works fine without the static cpu binding. Keep in mind, that most data we have in young generation is short-lived anyway and if the scheduler is reluctant enough to move threads between the nodes the application will have enough time to manipulate the data locally. For long-living data, yes, this won't work. > > The reasoning is that not using a CPU is always worse than using > remote memory at least on systems with reasonable small NUMA factor. > > (that is what killed the homenode scheduler too) As I've mentioned before, Solaris will run the thread remotely if there is a significant load imbalance. Because indeed, it's better to run remotely than not to run at all. But this thread will return to its home node at first opportunity. > > > > The other problems is that it seemed to always assume all the threads > > > will consume the whole system and set up for all nodes, which seemed > > > dodgy. > > > > You mean the allocator? Actually it is adaptive to the allocation rate on > > a node, which in effect makes the whole eden space usable for > > applications with asymmetric per-thread allocation rate. This of course > > also helps with the case when the number of threads is less than the > > number of nodes. > > It didn't seem to adapt though. Or maybe I'm misremembering the code, > it was some time ago. It will start adapting after 5 minor GCs or so, after it has enough statistics to make a decision. Try running with -XX:+UseNUMA and -XX: +PrintGCDetails -XX:+PrintHeapAtGC on Solaris and you'll see how the heap is being reshaped. igor From andi at firstfloor.org Mon Mar 3 00:53:20 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 03 Mar 2008 09:53:20 +0100 Subject: [PATCH] Fix a /tmp race in the linux code In-Reply-To: <47CB887C.7020803@sun.com> References: <20080303002628.GA28974@basil.nowhere.org> <47CB887C.7020803@sun.com> Message-ID: <47CBBC80.10200@firstfloor.org> David Holmes wrote: > Andi, > > Where is the race given the temp filename is unique? 
pids tend to be easily predictable within a small range and I suspect the number is also not entirely unpredictable -Andi From andi at firstfloor.org Mon Mar 3 03:30:17 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 3 Mar 2008 12:30:17 +0100 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <200803031252.40200.igor.veresov@sun.com> References: <20080303002454.GA28952@basil.nowhere.org> <200803031252.40200.igor.veresov@sun.com> Message-ID: <20080303113017.GC5085@one.firstfloor.org> On Mon, Mar 03, 2008 at 12:52:40PM +0300, Igor Veresov wrote: > I haven't studied your changes in detail but I have a NUMA-aware allocator for > Linux in works Ok maybe you can do something with my patch then. > and I do see speedups, which are similar to what I was able to > get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So it's Ok I only did micro benchmarks. Maybe they were not strong enough. For some simple allocations I didn't get any numa local placement at least according to the benchmark numbers. > all working quite well, actually. One obvious issue I found in the Solaris code was that it neither binds threads to nodes (afaik), nor tries to keep up with a thread migrating to another node. It just assumes always the same thread:node mapping which surely cannot be correct? On the Linux implementation I solved that by using getcpu() on each allocation (on recent Linux it is a special optimized fast path that is quite fast) But I think there were some other issues too. -Andi From andi at firstfloor.org Mon Mar 3 04:57:19 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 3 Mar 2008 13:57:19 +0100 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <200803031519.13736.igor.veresov@sun.com> References: <20080303002454.GA28952@basil.nowhere.org> <200803031252.40200.igor.veresov@sun.com> <20080303113017.GC5085@one.firstfloor.org> <200803031519.13736.igor.veresov@sun.com> Message-ID: <20080303125719.GA9853@one.firstfloor.org> On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote: > > > > all working quite well, actually. > > > > One obvious issue I found in the Solaris code was that it neither > > binds threads to nodes (afaik), nor tries to keep up with a thread > > migrating to another node. It just assumes always the same thread:node > > mapping which surely cannot be correct? > > It is however correct. Solaris assigns a home locality group (a node) to each > lwp. And while lwps can be temporary migrated to a remote group, page > allocation still happens on a home node and the lwp is predominantly > scheduled to run in its home lgroup. For more information you could refer to Interesting. I tried a similar scheduling algorithm on Linux a long time ago (it was called the "homenode scheduler") and it was a general loss on my testing on smaller AMD systems. But maybe Solaris does it all different. Anyways on Linux that won't work because it doesn't have the concept of a homenode. The other problems is that it seemed to always assume all the threads will consume the whole system and set up for all nodes, which seemed dodgy. > the NUMA chapter of the "Solaris Internals" book or to blogs of Jonathan Chew > and Alexander Kolbasov from Solaris CMT/NUMA team. > > > > > On the Linux implementation I solved that by using getcpu() on > > each allocation (on recent Linux it is a special optimized fast path > > that is quite fast) > > I doubt that executing a syscall on every allocation (even if it's a TLAB > allocation) is a good idea. 
It's many times slower than the original "bump A vsyscall is not a real syscall. It keeps running in ring 3 and just is some code the kernel maps into each user process. It's not more expensive than any indirect function call. The getcpu() vsyscall was especially designed for use by such NUMA aware allocators. > the pointer with the CAS spin" allocator. Linux scheduler is quite reluctant > to move lwps between the nodes, so checking the lwp position every, say, 64 > TLAB allocations proved to be adequate. On Solaris even that is not > necessary. getcpu() does this already by keeping a cache and a time stamp to check once per clock tick. This means it used to, in the latest kernels it is so fast now that even that was removed. -Andi From andi at firstfloor.org Mon Mar 3 05:32:05 2008 From: andi at firstfloor.org (Andi Kleen) Date: Mon, 3 Mar 2008 14:32:05 +0100 Subject: [PATCH] Linux NUMA support for HotSpot In-Reply-To: <200803031625.53418.igor.veresov@sun.com> References: <20080303002454.GA28952@basil.nowhere.org> <200803031519.13736.igor.veresov@sun.com> <20080303125719.GA9853@one.firstfloor.org> <200803031625.53418.igor.veresov@sun.com> Message-ID: <20080303133205.GB9853@one.firstfloor.org> On Mon, Mar 03, 2008 at 04:25:53PM +0300, Igor Veresov wrote: > On Monday 03 March 2008 15:57:19 Andi Kleen wrote: > > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote: > > > > > all working quite well, actually. > > > > > > > > One obvious issue I found in the Solaris code was that it neither > > > > binds threads to nodes (afaik), nor tries to keep up with a thread > > > > migrating to another node. It just assumes always the same thread:node > > > > mapping which surely cannot be correct? > > > > > > It is however correct. Solaris assigns a home locality group (a node) to > > > each lwp. And while lwps can be temporary migrated to a remote group, > > > page allocation still happens on a home node and the lwp is predominantly > > > scheduled to run in its home lgroup. For more information you could refer > > > to > > > > Interesting. I tried a similar scheduling algorithm on Linux a long time > > ago (it was called the "homenode scheduler") and it was a general loss on > > my testing on smaller AMD systems. But maybe Solaris does it all different. > > > > Anyways on Linux that won't work because it doesn't have the concept > > of a homenode. > > Yes, but it has static memory binding instead, which alleviates this problem. That would require statically binding the threads too which is by default not a good idea without explicit user configuration The reasoning is that not using a CPU is always worse than using remote memory at least on systems with reasonable small NUMA factor. (that is what killed the homenode scheduler too) > > > > > The other problems is that it seemed to always assume all the threads > > will consume the whole system and set up for all nodes, which seemed dodgy. > > You mean the allocator? Actually it is adaptive to the allocation rate on a > node, which in effect makes the whole eden space usable for applications with > asymmetric per-thread allocation rate. This of course also helps with the > case when the number of threads is less than the number of nodes. It didn't seem to adapt though. Or maybe I'm misremembering the code, it was some time ago. 
-Andi From gbenson at redhat.com Mon Mar 10 08:06:35 2008 From: gbenson at redhat.com (Gary Benson) Date: Mon, 10 Mar 2008 15:06:35 +0000 Subject: Linux current_stack_region() Message-ID: <20080310150634.GC3824@redhat.com> Hi all, Recently I've been investigating some stack-related failures on ppc, and trying to figure out how to make the stack region code work on ia64. The first thing I discovered is that the current linux code is wrong when there are guard pages. The comment above current_stack_region in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the region reported by pthread_attr_getstack(), which is not the case. It needs to use pthread_attr_getguardsize() and trim that many bytes from the bottom of the region reported by pthread_attr_getstack(). I started modifying current_stack_region to do just that, but its comments contain warnings that pthread_getattr_np() returns bogus values for initial threads. os::Linux::capture_initial_stack() has more such warnings, though neither mentions exactly _what_ was bogus. Does anyone know? Without a working pthread_getattr_np() you can't use pthread_attr_getguardsize(), and without that it's not possible to implement current_stack_region() in the form it's currently defined. I spoke with our glibc maintainer and he assured me that pthread_getattr_np() returns good values for all threads, albeit more slowly for the initial thread. I rewrote current_stack_region() to use it. I attached what I wrote. Does it look ok? Cheers, Gary -- http://gbenson.net/ -------------- next part -------------- static void current_stack_region(address *bottom, size_t *size) { pthread_attr_t attr; int res = pthread_getattr_np(pthread_self(), &attr); if (res != 0) { if (res == ENOMEM) { vm_exit_out_of_memory(0, "pthread_getattr_np"); } else { fatal1("pthread_getattr_np failed with errno = %d", res); } } address stack_bottom; size_t stack_bytes; res = pthread_attr_getstack(&attr, (void **) &stack_bottom, &stack_bytes); if (res != 0) { fatal1("pthread_attr_getstack failed with errno = %d", res); } address stack_top = stack_bottom + stack_bytes; // The block of memory returned by pthread_attr_getstack() includes // guard pages where present. We need to trim these off. size_t page_bytes = os::Linux::page_size(); assert(((intptr_t) stack_bottom & (page_bytes - 1)) == 0, "unaligned stack"); size_t guard_bytes; res = pthread_attr_getguardsize(&attr, &guard_bytes); if (res != 0) { fatal1("pthread_attr_getguardsize failed with errno = %d", res); } int guard_pages = align_size_up(guard_bytes, page_bytes) / page_bytes; assert(guard_bytes == guard_pages * page_bytes, "unaligned guard"); #ifdef IA64 // IA64 has two stacks sharing the same area of memory, a normal // stack growing downwards and a register stack growing upwards. // Guard pages, if present, are in the centre. This code splits // the stack in two even without guard pages, though in theory // there's nothing to stop us allocating more to the normal stack // or more to the register stack if one or the other were found // to grow faster. int total_pages = align_size_down(stack_bytes, page_bytes) / page_bytes; stack_bottom += (total_pages - guard_pages) / 2 * page_bytes; #endif // IA64 stack_bottom += guard_bytes; pthread_attr_destroy(&attr); // The initial thread has a growable stack, and the size reported // by pthread_attr_getstack is the maximum size it could possibly // be given what currently mapped. This can be huge, so we cap it. 
if (os::Linux::is_initial_thread()) { stack_bytes = stack_top - stack_bottom; if (stack_bytes > JavaThread::stack_size_at_create()) stack_bytes = JavaThread::stack_size_at_create(); stack_bottom = stack_top - stack_bytes; } assert(os::current_stack_pointer() >= stack_bottom, "should do"); assert(os::current_stack_pointer() < stack_top, "should do"); *bottom = stack_bottom; *size = stack_top - stack_bottom; } From David.Holmes at Sun.COM Mon Mar 10 17:34:38 2008 From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems) Date: Tue, 11 Mar 2008 10:34:38 +1000 Subject: Linux current_stack_region() In-Reply-To: <20080310150634.GC3824@redhat.com> References: <20080310150634.GC3824@redhat.com> Message-ID: <47D5D39E.3070402@sun.com> Hi Gary, Disclaimer: this isn't code I've worked with - though I did review the most recent changes - and stack management is a particularly confusing area. :) Gary Benson said the following on 11/03/08 01:06 AM: > The first thing I discovered is that the current linux code is wrong > when there are guard pages. The comment above current_stack_region > in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the > region reported by pthread_attr_getstack(), which is not the case. Reading the POSIX specification I don't see anything that explicitly states this, but I would infer that the guard pages are not part of the region reported by pthread_attr_getstack from the statement: "The stack attributes specify the area of storage to be used for the created thread's stack." i.e. getstack reports the _usable_ stack for the thread. Hence any guard region is outside that. > I started modifying current_stack_region to do just that, but its > comments contain warnings that pthread_getattr_np() returns bogus > values for initial threads. os::Linux::capture_initial_stack() > has more such warnings, though neither mentions exactly _what_ was > bogus. Does anyone know? Without a working pthread_getattr_np() > you can't use pthread_attr_getguardsize(), and without that it's > not possible to implement current_stack_region() in the form it's > currently defined. The comment re pthread_getattr_np is about 5 years old and I couldn't find anything more specific than the inference that they discovered that it returned the wrong values on the initial thread on the distributions of the day (whatever they may have been). Hotspot is full of this kind of historical baggage with workarounds for a range of now defunct linux systems (and old Solaris versions too). Cheers, David Holmes From David.Holmes at Sun.COM Mon Mar 10 17:52:10 2008 From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems) Date: Tue, 11 Mar 2008 10:52:10 +1000 Subject: Linux current_stack_region() In-Reply-To: <47D5D39E.3070402@sun.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> Message-ID: <47D5D7BA.3080106@sun.com> Note also that pthread_attr_setguardsize may internally round-up the guardsize to a multiple of page-size; but pthread_attr_getguardsize returns the original supplied value not the rounded one. So you would have a problem trying to adjust for the true guardsize in a portable way. David David Holmes - Sun Microsystems said the following on 11/03/08 10:34 AM: > Hi Gary, > > Disclaimer: this isn't code I've worked with - though I did review the > most recent changes - and stack management is a particularly confusing > area. 
:) > > Gary Benson said the following on 11/03/08 01:06 AM: >> The first thing I discovered is that the current linux code is wrong >> when there are guard pages. The comment above current_stack_region >> in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the >> region reported by pthread_attr_getstack(), which is not the case. > > Reading the POSIX specification I don't see anything that explicitly > states this, but I would infer that the guard pages are not part of the > region reported by pthread_attr_getstack from the statement: > > "The stack attributes specify the area of storage to be used for the > created thread's stack." > > i.e. getstack reports the _usable_ stack for the thread. Hence any guard > region is outside that. > >> I started modifying current_stack_region to do just that, but its >> comments contain warnings that pthread_getattr_np() returns bogus >> values for initial threads. os::Linux::capture_initial_stack() >> has more such warnings, though neither mentions exactly _what_ was >> bogus. Does anyone know? Without a working pthread_getattr_np() >> you can't use pthread_attr_getguardsize(), and without that it's >> not possible to implement current_stack_region() in the form it's >> currently defined. > > The comment re pthread_getattr_np is about 5 years old and I couldn't > find anything more specific than the inference that they discovered that > it returned the wrong values on the initial thread on the distributions > of the day (whatever they may have been). Hotspot is full of this kind > of historical baggage with workarounds for a range of now defunct linux > systems (and old Solaris versions too). > > > Cheers, > David Holmes From gbenson at redhat.com Tue Mar 11 02:32:57 2008 From: gbenson at redhat.com (Gary Benson) Date: Tue, 11 Mar 2008 09:32:57 +0000 Subject: Linux current_stack_region() In-Reply-To: <47D5D7BA.3080106@sun.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> <47D5D7BA.3080106@sun.com> Message-ID: <20080311093257.GA3727@redhat.com> David Holmes wrote: > Gary Benson said the following on 11/03/08 01:06 AM: > > The first thing I discovered is that the current linux code is wrong > > when there are guard pages. The comment above current_stack_region > > in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the > > region reported by pthread_attr_getstack(), which is not the case. > > Reading the POSIX specification I don't see anything that > explicitly states this, but I would infer that the guard pages are > not part of the region reported by pthread_attr_getstack from the > statement: > > "The stack attributes specify the area of storage to be used for > the created thread's stack." > > i.e. getstack reports the _usable_ stack for the thread. Hence any > guard region is outside that. That was how I understood it at first, but this is very definitely not the case with glibc. The glibc implementation takes the view that if a thread asks for X bytes of stack then it will get a region X bytes in size -- regardless of how that is then parcelled out. The ia64 case makes it a bit clearer, for me at least. ia64 has two stacks, both of which share the same region of memory. There's a normal stack which grows downwards, and a register stack which grows upwards. Guard pages, when specified, go in the middle. 
You can't really compensate for this without also compensating for the register stack, so you'd have this extra, unreported region (and threads silently allocating twice as much memory as you expected them to). It's awkward whichever way you decide to interpret the spec. > > I started modifying current_stack_region to do just that, but its > > comments contain warnings that pthread_getattr_np() returns bogus > > values for initial threads. os::Linux::capture_initial_stack() > > has more such warnings, though neither mentions exactly _what_ was > > bogus. Does anyone know? Without a working pthread_getattr_np() > > you can't use pthread_attr_getguardsize(), and without that it's > > not possible to implement current_stack_region() in the form it's > > currently defined. > > The comment re pthread_getattr_np is about 5 years old and I > couldn't find anything more specific than the inference that they > discovered that it returned the wrong values on the initial thread > on the distributions of the day (whatever they may have > been). Hotspot is full of this kind of historical baggage with > workarounds for a range of now defunct linux systems (and old > Solaris versions too). Ok. > Note also that pthread_attr_setguardsize may internally round-up the > guardsize to a multiple of page-size; but pthread_attr_getguardsize > returns the original supplied value not the rounded one. So you > would have a problem trying to adjust for the true guardsize in a > portable way. No, this only matters if the pthread_attr_t you're using is the same. If you create one using pthread_attr_init() then yes, the value returned by pthread_attr_getguardsize() will be exactly the value you set with pthread_attr_setguardsize(). The pthread_attr_t returned by pthread_getattr_np(), however, is created from the actual properties of the thread, so it contains the actual guard size in use. Cheers, Gary -- http://gbenson.net/ From David.Holmes at Sun.COM Tue Mar 11 07:07:17 2008 From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems) Date: Wed, 12 Mar 2008 00:07:17 +1000 Subject: Linux current_stack_region() In-Reply-To: <20080311093257.GA3727@redhat.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> <47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com> Message-ID: <47D69215.1090409@sun.com> Hi Gary, I'd argue that glibc is incorrect then. But that aside what are the implications for hotspot? Does it mean we're placing our guard pages on top of glibc's? (Fortunately this will only affect natively attached threads that happen to use glibc guards). Thanks for clarifying my misreading regarding the rounding issue. Cheers, David Holmes Gary Benson said the following on 11/03/08 07:32 PM: > David Holmes wrote: >> Gary Benson said the following on 11/03/08 01:06 AM: >>> The first thing I discovered is that the current linux code is wrong >>> when there are guard pages. The comment above current_stack_region >>> in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the >>> region reported by pthread_attr_getstack(), which is not the case. >> Reading the POSIX specification I don't see anything that >> explicitly states this, but I would infer that the guard pages are >> not part of the region reported by pthread_attr_getstack from the >> statement: >> >> "The stack attributes specify the area of storage to be used for >> the created thread's stack." >> >> i.e. getstack reports the _usable_ stack for the thread. Hence any >> guard region is outside that. 
> > That was how I understood it at first, but this is very definitely not > the case with glibc. The glibc implementation takes the view that if > a thread asks for X bytes of stack then it will get a region X bytes > in size -- regardless of how that is then parcelled out. > > The ia64 case makes it a bit clearer, for me at least. ia64 has two > stacks, both of which share the same region of memory. There's a > normal stack which grows downwards, and a register stack which grows > upwards. Guard pages, when specified, go in the middle. You can't > really compensate for this without also compensating for the register > stack, so you'd have this extra, unreported region (and threads > silently allocating twice as much memory as you expected them to). > It's awkward whichever way you decide to interpret the spec. > >>> I started modifying current_stack_region to do just that, but its >>> comments contain warnings that pthread_getattr_np() returns bogus >>> values for initial threads. os::Linux::capture_initial_stack() >>> has more such warnings, though neither mentions exactly _what_ was >>> bogus. Does anyone know? Without a working pthread_getattr_np() >>> you can't use pthread_attr_getguardsize(), and without that it's >>> not possible to implement current_stack_region() in the form it's >>> currently defined. >> The comment re pthread_getattr_np is about 5 years old and I >> couldn't find anything more specific than the inference that they >> discovered that it returned the wrong values on the initial thread >> on the distributions of the day (whatever they may have >> been). Hotspot is full of this kind of historical baggage with >> workarounds for a range of now defunct linux systems (and old >> Solaris versions too). > > Ok. > >> Note also that pthread_attr_setguardsize may internally round-up the >> guardsize to a multiple of page-size; but pthread_attr_getguardsize >> returns the original supplied value not the rounded one. So you >> would have a problem trying to adjust for the true guardsize in a >> portable way. > > No, this only matters if the pthread_attr_t you're using is the same. > If you create one using pthread_attr_init() then yes, the value > returned by pthread_attr_getguardsize() will be exactly the value you > set with pthread_attr_setguardsize(). The pthread_attr_t returned by > pthread_getattr_np(), however, is created from the actual properties > of the thread, so it contains the actual guard size in use. > > Cheers, > Gary > From gbenson at redhat.com Thu Mar 13 09:05:03 2008 From: gbenson at redhat.com (Gary Benson) Date: Thu, 13 Mar 2008 16:05:03 +0000 Subject: Linux current_stack_region() In-Reply-To: <47D69215.1090409@sun.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> <47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com> <47D69215.1090409@sun.com> Message-ID: <20080313160503.GG4394@redhat.com> David Holmes - Sun Microsystems wrote: > I'd argue that glibc is incorrect then. You'd not be the only one: https://bugzilla.redhat.com/show_bug.cgi?id=435337 > But that aside what are the implications for hotspot? Does it mean > we're placing our guard pages on top of glibc's? (Fortunately this > will only affect natively attached threads that happen to use glibc > guards). In Java threads with guard pages (attached threads, basically) HotSpot's red page will be in the same place as the glibc guard. In non-Java threads there will be a guard page at the end of the stack. 
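If anyone wants to see the glibc behaviour directly, a small stand-alone probe along these lines (not part of the patch, error checking omitted) prints what pthread_getattr_np() reports for a thread; on glibc the guard bytes fall inside the pthread_attr_getstack() range, which is exactly the discrepancy discussed above:

/* Print the stack region and guard size glibc reports for a thread.
 * Build with: gcc -pthread stackprobe.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static void *report(void *arg) {
  pthread_attr_t attr;
  void *stack_bottom;
  size_t stack_bytes, guard_bytes;

  pthread_getattr_np(pthread_self(), &attr);
  pthread_attr_getstack(&attr, &stack_bottom, &stack_bytes);
  pthread_attr_getguardsize(&attr, &guard_bytes);
  pthread_attr_destroy(&attr);

  printf("%s: stack [%p, %p), size %zu, guard %zu\n", (const char *)arg,
         stack_bottom, (void *)((char *)stack_bottom + stack_bytes),
         stack_bytes, guard_bytes);
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, report, (void *)"child");
  pthread_join(t, NULL);
  report((void *)"initial");   /* initial thread, for comparison */
  return 0;
}

Keep in mind that for the initial thread pthread_getattr_np() has to reconstruct the values (which is why it is slower there), so the two lines of output are not directly comparable.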
> Thanks for clarifying my misreading regarding the rounding issue. Thanks for bringing it up. Someone else had mentioned something similar to me and I'd meant to check it but forgotten. Cheers, Gary -- http://gbenson.net/ From David.Holmes at Sun.COM Thu Mar 13 16:57:15 2008 From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems) Date: Fri, 14 Mar 2008 09:57:15 +1000 Subject: Linux current_stack_region() In-Reply-To: <20080313160503.GG4394@redhat.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> <47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com> <47D69215.1090409@sun.com> <20080313160503.GG4394@redhat.com> Message-ID: <47D9BF5B.4060806@sun.com> Gary, Thanks for the bugzilla pointer - I've added my opinion there. I will file a bug against hotspot so that we can track this. David Gary Benson said the following on 14/03/08 02:05 AM: > David Holmes - Sun Microsystems wrote: >> I'd argue that glibc is incorrect then. > > You'd not be the only one: > https://bugzilla.redhat.com/show_bug.cgi?id=435337 > >> But that aside what are the implications for hotspot? Does it mean >> we're placing our guard pages on top of glibc's? (Fortunately this >> will only affect natively attached threads that happen to use glibc >> guards). > > In Java threads with guard pages (attached threads, basically) > HotSpot's red page will be in the same place as the glibc guard. > In non-Java threads there will be a guard page at the end of the > stack. > >> Thanks for clarifying my misreading regarding the rounding issue. > > Thanks for bringing it up. Someone else had mentioned something > similar to me and I'd meant to check it but forgotten. > > Cheers, > Gary > From David.Holmes at Sun.COM Thu Mar 13 17:14:07 2008 From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems) Date: Fri, 14 Mar 2008 10:14:07 +1000 Subject: Linux current_stack_region() In-Reply-To: <47D9BF5B.4060806@sun.com> References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com> <47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com> <47D69215.1090409@sun.com> <20080313160503.GG4394@redhat.com> <47D9BF5B.4060806@sun.com> Message-ID: <47D9C34F.3040908@sun.com> Filed: 6675312 Linux glibc stack guard-pages can overlap with hotspot guard pages David David Holmes - Sun Microsystems said the following on 14/03/08 09:57 AM: > Gary, > > Thanks for the bugzilla pointer - I've added my opinion there. > > I will file a bug against hotspot so that we can track this. > > David > > Gary Benson said the following on 14/03/08 02:05 AM: >> David Holmes - Sun Microsystems wrote: >>> I'd argue that glibc is incorrect then. >> >> You'd not be the only one: >> https://bugzilla.redhat.com/show_bug.cgi?id=435337 >> >>> But that aside what are the implications for hotspot? Does it mean >>> we're placing our guard pages on top of glibc's? (Fortunately this >>> will only affect natively attached threads that happen to use glibc >>> guards). >> >> In Java threads with guard pages (attached threads, basically) >> HotSpot's red page will be in the same place as the glibc guard. >> In non-Java threads there will be a guard page at the end of the >> stack. >> >>> Thanks for clarifying my misreading regarding the rounding issue. >> >> Thanks for bringing it up. Someone else had mentioned something >> similar to me and I'd meant to check it but forgotten. 
>> >> Cheers, >> Gary >> From Scott.Oaks at Sun.COM Thu Mar 20 09:54:01 2008 From: Scott.Oaks at Sun.COM (Scott Oaks) Date: Thu, 20 Mar 2008 12:54:01 -0400 Subject: Infinite Looping code and jstack Message-ID: <1206032041.64035.184.camel@sr1-unyc10-08> We have an appserver installation where some set of threads get into an infinite loop. jstack always reports that the threads in question are at a specific line. Here's a snippet of the jstack output: "http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable [0x26bdf000..0x26bdf9f0] java.lang.Thread.State: RUNNABLE at org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805) at org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667) parseHeader (line 805) isn't really a loop at all; there's no way I can account for the infinite loop being in that method. It's conceivable somehow that parseHeaders (line 607) could be in a loop: while (parseHeader()) doSomething But in that case, wouldn't I get somewhat differing results from successive invocations jstack (or even one jstack where multiple threads are in the same method)? I guess what I'm really asking is what the granularity of jstack is -- if I were in an infinite loop over two methods and 40-50 lines of code, is it really conceivable that jstack would always show me I was on the very same line (because that line corresponds to a safepoint or something)? Or is something else more likely going on? This is with JDK 1.6.0_02 and the server compiler. -Scott From Thomas.Rodriguez at Sun.COM Thu Mar 20 10:27:04 2008 From: Thomas.Rodriguez at Sun.COM (Tom Rodriguez) Date: Thu, 20 Mar 2008 10:27:04 -0700 Subject: Infinite Looping code and jstack In-Reply-To: <1206032041.64035.184.camel@sr1-unyc10-08> References: <1206032041.64035.184.camel@sr1-unyc10-08> Message-ID: <47E29E68.10706@sun.com> I believe that jstack may induce a safepoint in the target VM so the stack trace will always occur at safepoints in the code. This might cause the trace to tend to point at a particular locations. A trick that might work is to gcore the process and use jstack on that. That assumes your cores aren't huge though. If you are on s10 or later, pstack on s10 will decode Java stack traces though there are sometimes issues with getting full traces. It's pretty good at the top of the stack though which is where you're interested. What platform are you on? tom Scott Oaks wrote: > We have an appserver installation where some set of threads get into an > infinite loop. jstack always reports that the threads in question are at > a specific line. > > Here's a snippet of the jstack output: > "http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable > [0x26bdf000..0x26bdf9f0] > java.lang.Thread.State: RUNNABLE > at > org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805) > at > org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667) > > parseHeader (line 805) isn't really a loop at all; there's no way I can > account for the infinite loop being in that method. 
> somehow that parseHeaders (line 607) could be in a loop:
>
> while (parseHeader())
> doSomething
>
> But in that case, wouldn't I get somewhat differing results from
> successive invocations of jstack (or even one jstack where multiple threads
> are in the same method)?
>
> I guess what I'm really asking is what the granularity of jstack is --
> if I were in an infinite loop over two methods and 40-50 lines of code,
> is it really conceivable that jstack would always show me I was on the
> very same line (because that line corresponds to a safepoint or
> something)? Or is something else more likely going on?
>
> This is with JDK 1.6.0_02 and the server compiler.
>
> -Scott
>

From Scott.Oaks at Sun.COM  Thu Mar 20 10:34:08 2008
From: Scott.Oaks at Sun.COM (Scott Oaks)
Date: Thu, 20 Mar 2008 13:34:08 -0400
Subject: Infinite Looping code and jstack
In-Reply-To: <47E29E68.10706@sun.com>
References: <1206032041.64035.184.camel@sr1-unyc10-08> <47E29E68.10706@sun.com>
Message-ID: <1206034447.64035.210.camel@sr1-unyc10-08>

I am on S10 (S10 U3), but I've never had much luck with pstack and busy
java processes (and/or perhaps large -- the appserver has a few hundred
threads and a 3GB heap) -- pstack itself seems to loop infinitely, or
perhaps I'm just never patient enough...but after an hour or two, I
still never have output.

I'll try the gcore trick.

-Scott

On Thu, 2008-03-20 at 13:27, Tom Rodriguez wrote:
> I believe that jstack may induce a safepoint in the target VM so the
> stack trace will always occur at safepoints in the code. This might
> cause the trace to tend to point at a particular location. A trick
> that might work is to gcore the process and use jstack on that. That
> assumes your cores aren't huge though.
>
> If you are on s10 or later, pstack on s10 will decode Java stack traces
> though there are sometimes issues with getting full traces. It's pretty
> good at the top of the stack though which is where you're interested.
> What platform are you on?
>
> tom
>
> Scott Oaks wrote:
> > We have an appserver installation where some set of threads get into an
> > infinite loop. jstack always reports that the threads in question are at
> > a specific line.
> >
> > Here's a snippet of the jstack output:
> > "http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable
> > [0x26bdf000..0x26bdf9f0]
> > java.lang.Thread.State: RUNNABLE
> > at org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805)
> > at org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607)
> > at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667)
> >
> > parseHeader (line 805) isn't really a loop at all; there's no way I can
> > account for the infinite loop being in that method.
> >
> > -Scott
> >

From David.Herron at Sun.COM  Mon Mar 24 20:20:43 2008
From: David.Herron at Sun.COM (David Herron)
Date: Mon, 24 Mar 2008 20:20:43 -0700
Subject: Compiling openjdk6 w/ gcc 4.2.x
Message-ID: <47E86F8B.1080008@sun.com>

I'm testing compilation of OpenJDK 6 on Ubuntu 8.04 beta using gcc 4.2.x.

I got some compile errors in hotspot files that are compiled with
-Werror. The warning that got upgraded to an error had to do with string
constants being used with a "char *" of some kind. In some cases there
was a struct definition that had a "char *" which then was initialized
with a string constant. In other cases functions were called where the
parameter was declared "char *" but a string constant was given to it.
In the following patches I either changed the struct definition or made
a cast on the string constant, depending on the context.

I also had an intermittent failure on hotspot/src/share/vm/opto/classes.cpp saying

In file included from .../hotspot/src/share/vm/opto/classes.cpp:36:
.../hotspot/src/share/vm/opto/classes.hpp: In member function 'virtual const Type* PartialSubtypeCheckNode::bottom_type() const':
.../hotspot/src/share/vm/opto/classes.hpp:309: internal compiler error: Segmentation fault

None of the above issues happen on my Ubuntu 7.10 machine, which has
GCC 4.1.x installed. They happen on my Ubuntu 8.04 beta machine, which
has GCC 4.2.x.

- David Herron

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diffs2
Url: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20080324/804481ee/attachment.ksh

From Alan.Bateman at Sun.COM  Fri Mar 28 04:42:39 2008
From: Alan.Bateman at Sun.COM (Alan Bateman)
Date: Fri, 28 Mar 2008 11:42:39 +0000
Subject: Infinite Looping code and jstack
In-Reply-To: <1206034447.64035.210.camel@sr1-unyc10-08>
References: <1206032041.64035.184.camel@sr1-unyc10-08> <47E29E68.10706@sun.com> <1206034447.64035.210.camel@sr1-unyc10-08>
Message-ID: <47ECD9AF.5090601@sun.com>

Scott Oaks wrote:
> I am on S10 (S10 U3), but I've never had much luck with pstack and busy
> java processes (and/or perhaps large -- the appserver has a few hundred
> threads and a 3GB heap) -- pstack itself seems to loop infinitely, or
> perhaps I'm just never patient enough...but after an hour or two, I
> still never have output.
>
> I'll try the gcore trick.
>
or instead of gcore + jstack, try "jstack -F <pid>" (the -F option is
used to "force" a thread dump by invoking SA).

-Alan.

From volker.simonis at gmail.com  Mon Mar 31 10:22:17 2008
From: volker.simonis at gmail.com (Volker Simonis)
Date: Mon, 31 Mar 2008 19:22:17 +0200
Subject: Using VTune for HotSpot profiling (is UseVTune still functional and required?)
Message-ID:

Hi,

does anybody know if the UseVTune option is still supported and
functional? I browsed the code and saw that the vtune interface in
"src/share/vm/runtime/vtune.hpp" is only implemented for Windows, but
empty for Linux.

Does anybody have experience in profiling HotSpot with VTune on
Linux? Do I need to implement the vtune interface mentioned above, or
does it work out of the box?

The VTune documentation states that it gets JIT information from the
VM, but I couldn't find out until now how exactly this works. An older
VTune paper stated JVMPI, but I suppose this should be JVMTI now. As
far as I know, it is possible to get JIT compilation information
through JVMTI, but no information about generated template code, stubs
and adapters. This would line up with the VTune documentation, which
states that there may be a fair amount of non-assignable PCs which
correspond to VM code that shouldn't bother the Java programmer :)

Any hints are greatly appreciated!

Regards,
Volker
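To illustrate the class of warning behind the -Werror failures David Herron reports above: in C++ the conversion from a string literal to a plain char* is deprecated, and gcc 4.2 started warning about it by default where gcc 4.1 stayed quiet, so -Werror turns it into a hard error. His actual diffs are in the scrubbed attachment, so the code below is made up purely for illustration -- the struct and flag names are hypothetical and are not taken from the HotSpot sources. It shows both of the fixes he mentions: changing the declaration to const char*, or casting at the use site.

// Illustration only: hypothetical names, not lines from the actual patch.
// gcc 4.2 warns "deprecated conversion from string constant to 'char*'"
// wherever a string literal initializes or is passed as a non-const char*,
// and -Werror turns that warning into an error.

struct flag_entry {
  char *name;        // gcc 4.2 warns at the initializer below
  int   value;
};

flag_entry table_bad[] = { { "UseFoo", 1 } };        // warning, error with -Werror

// Fix 1: change the declaration, since a string literal really is const.
struct flag_entry_fixed {
  const char *name;
  int         value;
};

flag_entry_fixed table_ok[] = { { "UseFoo", 1 } };   // compiles cleanly

// Fix 2: when the declaration can't be changed, cast at the use site.
void register_flag(char *name) { (void) name; }

void example() {
  register_flag((char *) "UseBar");  // silences the warning; the string must
                                     // still never be written through 'name'
}

Which of the two is appropriate depends on whether the declaration is under HotSpot's control, which is presumably why the patch mixes both, as described above.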
From Alan.Bateman at Sun.COM  Mon Mar 31 10:39:16 2008
From: Alan.Bateman at Sun.COM (Alan Bateman)
Date: Mon, 31 Mar 2008 18:39:16 +0100
Subject: Using VTune for HotSpot profiling (is UseVTune still functional and required?)
In-Reply-To:
References:
Message-ID: <47F121C4.1010507@sun.com>

Volker Simonis wrote:
> Hi,
>
> does anybody know if the UseVTune option is still supported and
> functional? I browsed the code and saw that the vtune interface in
> "src/share/vm/runtime/vtune.hpp" is only implemented for Windows, but
> empty for Linux.
>
> Does anybody have experience in profiling HotSpot with VTune on
> Linux? Do I need to implement the vtune interface mentioned above, or
> does it work out of the box?
>
> The VTune documentation states that it gets JIT information from the
> VM, but I couldn't find out until now how exactly this works. An older
> VTune paper stated JVMPI, but I suppose this should be JVMTI now. As
> far as I know, it is possible to get JIT compilation information
> through JVMTI, but no information about generated template code, stubs
> and adapters. This would line up with the VTune documentation, which
> states that there may be a fair amount of non-assignable PCs which
> correspond to VM code that shouldn't bother the Java programmer :)
>
> Any hints are greatly appreciated!
>
AFAIK, VTune uses JVM TI on Linux but still uses the older (and
deprecated) JVMPI on Windows. In JVM TI the DynamicCodeGenerated event
is invoked with the address range for the interpreter, stubs, and other
generated code. It would be interesting to track down the
"non-assignable PCs" in case there are any places where we are missing
events.

-Alan.
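As a concrete illustration of the mechanism Alan describes -- this is generic JVM TI usage written as a stand-alone sketch, not VTune's actual collector and not HotSpot code -- an agent can register for both DynamicCodeGenerated (interpreter, stubs, adapters) and CompiledMethodLoad (JIT-compiled Java methods) and record the address ranges. PCs that fall outside every reported range are what a profiler would have to show as non-assignable.

// Sketch of a JVM TI agent listening for generated-code events;
// illustrative only, error handling omitted.
#include <jvmti.h>
#include <stdio.h>
#include <string.h>

static void JNICALL
dynamic_code_generated(jvmtiEnv *jvmti, const char *name,
                       const void *address, jint length) {
  // interpreter, stubs, adapters and other VM-generated code
  printf("stub/adapter %-40s [%p, %p)\n",
         name, address, (const void *) ((const char *) address + length));
}

static void JNICALL
compiled_method_load(jvmtiEnv *jvmti, jmethodID method, jint code_size,
                     const void *code_addr, jint map_length,
                     const jvmtiAddrLocationMap *map, const void *compile_info) {
  // JIT-compiled Java methods
  char *name = NULL;
  jvmti->GetMethodName(method, &name, NULL, NULL);
  printf("compiled     %-40s [%p, %p)\n",
         name != NULL ? name : "?", code_addr,
         (const void *) ((const char *) code_addr + code_size));
  if (name != NULL) jvmti->Deallocate((unsigned char *) name);
}

extern "C" JNIEXPORT jint JNICALL
Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
  jvmtiEnv *jvmti = NULL;
  if (vm->GetEnv((void **) &jvmti, JVMTI_VERSION_1_0) != JNI_OK)
    return JNI_ERR;

  jvmtiCapabilities caps;
  memset(&caps, 0, sizeof(caps));
  caps.can_generate_compiled_method_load_events = 1;
  jvmti->AddCapabilities(&caps);

  jvmtiEventCallbacks callbacks;
  memset(&callbacks, 0, sizeof(callbacks));
  callbacks.DynamicCodeGenerated = &dynamic_code_generated;
  callbacks.CompiledMethodLoad   = &compiled_method_load;
  jvmti->SetEventCallbacks(&callbacks, (jint) sizeof(callbacks));

  jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                  JVMTI_EVENT_DYNAMIC_CODE_GENERATED, NULL);
  jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                  JVMTI_EVENT_COMPILED_METHOD_LOAD, NULL);
  return JNI_OK;
}

Built as a shared library against the JDK's include/ and include/linux headers and loaded with -agentpath:, something like this prints an address range for each piece of generated code. A profiler that attaches late can additionally call GenerateEvents() to have the VM replay CompiledMethodLoad and DynamicCodeGenerated events for code emitted before the agent arrived.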