From John.Rose at Sun.COM  Sat Mar  1 14:58:38 2008
From: John.Rose at Sun.COM (John Rose)
Date: Sat, 01 Mar 2008 14:58:38 -0800
Subject: Is 'optimized' a legit target?
In-Reply-To: <025e01c87ae4$e99bc090$bcd341b0$@com>
References: <025e01c87ae4$e99bc090$bcd341b0$@com>
Message-ID: <1AE218E2-9540-430C-A0BE-428240022A99@sun.com>

Yes, 'optimized' is legit.  It supports more flags, for tuning  
experiments, etc.
Its performance characteristics are closer to product, because
it omits all the 'assert' code,

Here are the various build subdirectories, in brief:

product -- hardwires many flag values, no asserts, code is optimized
optimized -- most flag values variable, no asserts, code is optimized
fastdebug -- all flag values variable, asserts enabled, code is  
optimized
jvmg -- all flag values variable, asserts enabled, code not optimized  
(debuggable)
generated -- machine-generated source code and other stuff created  
during the build process
debug -- old name for jvmg; this one should go away
profiled -- dead a long time; this one should have gone away years ago

Best,
-- John


From ted at tedneward.com  Sat Mar  1 23:18:20 2008
From: ted at tedneward.com (Ted Neward)
Date: Sat, 1 Mar 2008 23:18:20 -0800
Subject: Is 'optimized' a legit target?
In-Reply-To: <1AE218E2-9540-430C-A0BE-428240022A99@sun.com>
References: <025e01c87ae4$e99bc090$bcd341b0$@com>
	<1AE218E2-9540-430C-A0BE-428240022A99@sun.com>
Message-ID: <022401c87c35$9b417b90$d1c472b0$@com>

When I tried to build optimized (b24), it failed very early in the process.
I'm assuming that's not supposed to happen? :-)

Ted Neward
Java, .NET, XML Services
Consulting, Teaching, Speaking, Writing
http://www.tedneward.com
 
> -----Original Message-----
> From: John.Rose at Sun.COM [mailto:John.Rose at Sun.COM]
> Sent: Saturday, March 01, 2008 2:59 PM
> To: Ted Neward
> Cc: build-dev at openjdk.java.net; hotspot-dev at openjdk.dev.java.net
> Subject: Re: Is 'optimized' a legit target?
> 
> Yes, 'optimized' is legit.  It supports more flags, for tuning
> experiments, etc.
> Its performance characteristics are closer to product, because
> it omits all the 'assert' code,
> 
> Here are the various build subdirectories, in brief:
> 
> product -- hardwires many flag values, no asserts, code is optimized
> optimized -- most flag values variable, no asserts, code is optimized
> fastdebug -- all flag values variable, asserts enabled, code is
> optimized
> jvmg -- all flag values variable, asserts enabled, code not optimized
> (debuggable)
> generated -- machine-generated source code and other stuff created
> during the build process
> debug -- old name for jvmg; this one should go away
> profiled -- dead a long time; this one should have gone away years ago
> 
> Best,
> -- John
> 
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.5.516 / Virus Database: 269.21.2/1305 - Release Date:
> 2/29/2008 6:32 PM
> 

No virus found in this outgoing message.
Checked by AVG Free Edition. 
Version: 7.5.516 / Virus Database: 269.21.2/1305 - Release Date: 2/29/2008
6:32 PM
 

From andi at firstfloor.org  Sun Mar  2 16:24:54 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 01:24:54 +0100
Subject: [PATCH] Linux NUMA support for HotSpot
Message-ID: <20080303002454.GA28952@basil.nowhere.org>


Hi,

Some time ago I played a bit with the NUMA heap support in the hotspot sources.
In particularly I implemented an interface to the Linux libnuma interface
and wrote some simple benchmarks to see if it was any faster on a Opteron
system (it wasn't unfortunately). After some debugging I concluded that
my Linux NUMA interface was likely right, but the NUMA heap implementation
seemed to be broken (I doubt it'll work very well even on Solaris)

The implementation does not require to link libnuma always in,
but dlopen()s it as needed to avoid a little bit of dll/so hell.

It also uses the Linux getcpu() call which means it can actually
adapt to migrating threads over nodes (unlike the Solaris implementation) 

Anyways just in case someone else wants to play with it here's
libnuma support for Linux for HotSpot. It can be enabled with the usual 
options, but is disabled by default.

The patch was originally against a fairly old snapshot (b13). It still
applies perfectly to the latest 6 snapshot I downloaded. I wasn't able
to retest it thought because I was unable to build the latest snapshot
even after fiddling for an hour with all the undocumented environment
variables.

-Andi


diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp
--- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o	2007-05-24 09:30:53.000000000 +0200
+++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp	2007-06-24 16:56:57.000000000 +0200
@@ -56,6 +56,11 @@
 # include <sys/ipc.h>
 # include <sys/shm.h>
 # include <link.h>
+# include <numa.h>
+# include <numaif.h>
+#ifdef __amd64__
+# include <asm/vsyscall.h>
+#endif
 
 #define MAX_PATH    (2 * K)
 
@@ -81,6 +86,15 @@
 char * os::Linux::_glibc_version = NULL;
 char * os::Linux::_libpthread_version = NULL;
 
+void (*os::Linux::_numa_interleave_memory)(void *, size_t, const nodemask_t *) = NULL;
+void (*os::Linux::_numa_setlocal_memory)(void *, size_t) = NULL;
+int (*os::Linux::_numa_max_node)(void) = NULL;
+nodemask_t (*os::Linux::_numa_get_run_node_mask)(void) = NULL;
+int (*os::Linux::_numa_available)(void) = NULL;
+int (*os::Linux::_getcpu)(unsigned *, unsigned *, void *) = NULL;
+int (*os::Linux::_numa_node_to_cpus)(int node, unsigned long *buffer, int buffer_len);
+bool os::Linux::getcpu_broken = false;
+
 static jlong initial_time_count=0;
 
 static int clock_tics_per_sec = 100;
@@ -739,10 +753,7 @@
   osthread->set_thread_id(os::Linux::gettid());
 
   if (UseNUMA) {
-    int lgrp_id = os::numa_get_group_id();
-    if (lgrp_id != -1) {
-      thread->set_lgrp_id(lgrp_id);
-    }
+    thread->set_lgrp_id(-1);
   }
   // initialize signal mask for this thread
   os::Linux::hotspot_sigmask(thread);
@@ -916,10 +927,10 @@
   thread->set_osthread(osthread);
 
   if (UseNUMA) {
-    int lgrp_id = os::numa_get_group_id();
-    if (lgrp_id != -1) {
-      thread->set_lgrp_id(lgrp_id);
-    }
+    // let it be retrieved on first access in thread context
+    // actually this really needs a timeout or calling getcpu with
+    // the cache
+    thread->set_lgrp_id(-1);
   }
 
   if (os::Linux::is_initial_thread()) {
@@ -2224,25 +2231,230 @@
 
 void os::realign_memory(char *addr, size_t bytes, size_t alignment_hint) { }
 void os::free_memory(char *addr, size_t bytes)         { }
-void os::numa_make_global(char *addr, size_t bytes)    { }
-void os::numa_make_local(char *addr, size_t bytes)     { }
-bool os::numa_topology_changed()                       { return false; }
-size_t os::numa_get_groups_num()                       { return 1; }
-int os::numa_get_group_id()                            { return 0; }
-size_t os::numa_get_leaf_groups(int *ids, size_t size) {
-  if (size > 0) {
-    ids[0] = 0;
-    return 1;
+
+#ifndef __amd64__ // x86-64 has a vsyscall
+
+#ifndef SYS_getcpu
+#ifdef __i386__
+#define SYS_getcpu 318
+#else
+#error define getcpu for architecture
+#endif
+#endif
+
+static int getcpu_syscall(unsigned *cpu, unsigned *node, void *cache)
+{
+  return syscall(SYS_getcpu, cpu, node, cache);
+}
+#endif
+
+static __thread nodemask_t current_node_mask;
+
+void os::Linux::numa_init(void)
+{
+  int err = 0;
+  // load libnuma lazily because it is not on all systems
+  void *lnuma = dlopen("libnuma.so.1", RTLD_LAZY);
+  if (!lnuma) { 
+    warning("NUMA requested but cannot open libnuma.so.1. NUMA disabled");
+    UseNUMA = false;
+    return;
+  }
+
+#define NSYM(sym) \
+  { typedef typeof(sym) f; \
+    _##sym = (f *)dlsym(lnuma, #sym); err += (_##sym == NULL); \
+  }
+  NSYM(numa_available);
+  NSYM(numa_interleave_memory);
+  NSYM(numa_setlocal_memory);
+  NSYM(numa_get_run_node_mask);
+  NSYM(numa_max_node);
+  NSYM(numa_node_to_cpus);
+  if (err) { 
+    warning("NUMA requested but cannot find required symbol in libnuma. NUMA disabled");
+    UseNUMA = false;
+    return;
+  } 
+#undef NSYM
+  // libnuma is never unloaded
+    
+#ifdef __x86_64__
+  // will return ENOSYS or work on all Linux x86-64 kernels
+  _getcpu = (int (*)(unsigned *,unsigned *,void *))VSYSCALL_ADDR(2);
+#else
+  _getcpu = getcpu_syscall; 
+#endif
+
+  if (_numa_available() < 0) { 
+    // don't warn here for now for a simple non numa system
+    UseNUMA = false;
+    return;
+  }
+
+  current_node_mask = _numa_get_run_node_mask();
+}
+
+// Make pages global: interleave over current cpuset limit
+void os::numa_make_global(char *addr, size_t bytes) {
+//  os::Linux::_numa_interleave_memory(addr, bytes, &current_node_mask);   
+}
+
+// local memory is default, but set it anyways in case the global
+// policy was differently set by numactl
+// this only sets first-touch policy
+// Linux also supports real page migration, but the NUMA allocator
+// here doesn't seem to.
+void os::numa_make_local(char *addr, size_t bytes) {
+  os::Linux::_numa_setlocal_memory(addr, bytes);
+}
+
+// We just return true if the cpuset changed.
+// RED-PEN this should be probably per thread
+bool os::numa_topology_changed() { 
+  nodemask_t newmask = os::Linux::_numa_get_run_node_mask();
+  if (nodemask_equal(&newmask, &current_node_mask))
+    return false;
+  fprintf(stderr,"numa topology changed\n");
+  current_node_mask = newmask;
+  return true;
+}
+
+size_t os::numa_get_groups_num() {
+  // older version of libnuma are not cpuset aware
+  // compute it from the runnode mask instead
+  nodemask_t mask = os::Linux::_numa_get_run_node_mask();
+  int i, k;
+  k = 0;
+  for (i = 0; i < NUMA_NUM_NODES; i++)
+    if (nodemask_isset(&mask, i))
+      k++;
+  return k;
+}
+
+// copy because header file is still often missing
+struct getcpu_cache { 
+	unsigned long blob[128 / sizeof(long)];	
+};
+static __thread getcpu_cache node_cache;
+
+// slow and complicated fallback method when getcpu is not working
+// we do this only once per thread
+int os::Linux::fallback_get_group_id() {
+  FILE *f = fopen("/proc/self/stat", "r");
+  if (!f)
+    return 0;
+
+  size_t linesz = 0;
+  char *line = NULL; 
+  
+  int n = getline(&line, &linesz, f);
+  fclose(f);
+  if (n <= 0) { 
+    free(line);
+    return 0;
+  }
+  
+  // find processor field
+  
+  // skip non numbers (3 fields) at the beginning
+  char ch; 
+  size_t offset;
+  if (sscanf(line, "%*d %*s %c%n", &ch, &offset) != 1) {
+    free(line);
+    return 0;
+  }
+
+  // process numbers; processor is the 39th field
+  char *p = line + offset;
+  int i, cpu;
+  for (i = 0; i < 39-3; i++) {
+    char *end;
+    cpu = strtol(p, &end, 0);
+    if (p == end) {
+      cpu = 0;
+      break;
+    }
+    p = end;
+  }
+  free(line);
+  
+  // convert to node using libnuma
+  int max_node = os::Linux::_numa_max_node();
+  for (i = 0; i <= max_node; i++) {
+    unsigned long cpus[128]; 
+    if (os::Linux::_numa_node_to_cpus(i, cpus, sizeof(cpus)) < 0) 
+      continue;
+    if (cpus[cpu / sizeof(long)] & (1ULL<<(cpu%sizeof(long)))) {
+      return i;
+    }
   }
   return 0;
 }
 
+int os::numa_get_group_id() {
+    // fast method only in Linux 2.6.19+
+  // this should be fast enough it can be done everytime
+  unsigned cpu, node;
+  if (!os::Linux::getcpu_broken) {
+    if (os::Linux::_getcpu(&cpu, &node, &node_cache) == 0)
+      return node;  
+    os::Linux::getcpu_broken = true;
+  }
+
+  // otherwise use fallback once and then keep using the
+  // cached value. 
+  // it would be better to have some kind of timeout
+  // but i don't know of a fast way to do this
+  int id = Thread::current()->lgrp_id();
+  if (id != -1)
+    return id;
+
+  id = os::Linux::fallback_get_group_id();
+  Thread::current()->set_lgrp_id(id);
+  return id;
+}
+
+size_t os::numa_get_leaf_groups(int *ids, size_t size) {
+  nodemask_t nodes;
+  nodes = os::Linux::_numa_get_run_node_mask();
+  unsigned i, k;
+  k = 0;
+  for (i = 0; i < NUMA_NUM_NODES; i++)
+    if (nodemask_isset(&nodes, i)) { 
+      // could happen when the cpuset shrinks during runtime I think
+      if (k >= size)
+	return k;
+      ids[k++] = i;
+    }
+  return k;
+}
+
 bool os::get_page_info(char *start, page_info* info) {
+  unsigned pol;
+  info->size = 0;
+  info->lgrp_id = -1;
+  // no nice way to detect huge pages here
+  if (syscall(SYS_get_mempolicy, &pol, NULL, 0, start, MPOL_F_NODE|MPOL_F_ADDR) == 0) { 
+    info->size = os::Linux::_page_size;
+    info->lgrp_id = pol;
+    return true;
+  }
   return false;
 }
 
+
+// Scan the pages from start to end until a page different than
+// the one described in the info parameter is encountered.
 char *os::scan_pages(char *start, char* end, page_info* page_expected, page_info* page_found) {
-  return end;
+  while (start < end) { 
+    if (!get_page_info(start, page_found))
+      return NULL;
+    if (page_expected->lgrp_id != page_found->lgrp_id)
+      return start;
+    start += os::Linux::_page_size;
+  }
+  return start;
 }
 
 bool os::uncommit_memory(char* addr, size_t size) {
@@ -3571,6 +3783,9 @@
   // initialize thread priority policy
   prio_init();
 
+  if (UseNUMA)
+    Linux::numa_init();
+
   return JNI_OK;
 }
 
diff -u openjdk/hotspot/src/os/linux/vm/os_linux.hpp-o openjdk/hotspot/src/os/linux/vm/os_linux.hpp
--- openjdk/hotspot/src/os/linux/vm/os_linux.hpp-o	2007-05-24 09:30:53.000000000 +0200
+++ openjdk/hotspot/src/os/linux/vm/os_linux.hpp	2007-05-29 03:05:23.000000000 +0200
@@ -52,6 +52,17 @@
   static int (*_clock_gettime)(clockid_t, struct timespec *);
   static int (*_pthread_getcpuclockid)(pthread_t, clockid_t *);
 
+  static void (*_numa_interleave_memory)(void *, size_t, const nodemask_t *);
+  static void (*_numa_setlocal_memory)(void *, size_t);
+  static nodemask_t (*_numa_get_run_node_mask)(void);
+  static int (*_numa_node_to_cpus)(int node, unsigned long *buffer, int buffer_len);
+  static int (*_numa_available)(void);
+  static int (*_numa_max_node)(void);
+  static int (*_getcpu)(unsigned *, unsigned *, void *);
+  static void numa_init();
+  static bool getcpu_broken; 
+  static int fallback_get_group_id();
+
   static address   _initial_thread_stack_bottom;
   static uintptr_t _initial_thread_stack_size;
 

From andi at firstfloor.org  Sun Mar  2 16:26:28 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 01:26:28 +0100
Subject: [PATCH] Fix a /tmp race in the linux code
Message-ID: <20080303002628.GA28974@basil.nowhere.org>


/tmp races are bad for you. Use mkstemp instead of a predictable name for 
the debugging file.

-Andi

diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp
--- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o	2007-05-24 09:30:53.000000000 +0200
+++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp	2007-06-24 16:56:57.000000000 +0200
@@ -2185,15 +2196,12 @@
     return;
   }
 
-  char buf[40];
   int num = Atomic::add(1, &cnt);
 
-  sprintf(buf, "/tmp/hs-vm-%d-%d", os::current_process_id(), num);
-  unlink(buf);
- 
-  int fd = open(buf, O_CREAT | O_RDWR, S_IRWXU);
+  int fd = mkstemp("/tmp/hs-vm-XXXXXX");
 
   if (fd != -1) {
+    fchmod(fd, S_IRWXU);
     off_t rv = lseek(fd, size-2, SEEK_SET);
     if (rv != (off_t)-1) {
       if (write(fd, "", 1) == 1) {
@@ -2203,7 +2211,6 @@
       }
     }
     close(fd);
-    unlink(buf);
   }
 }
 

From andi at firstfloor.org  Sun Mar  2 16:28:49 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 01:28:49 +0100
Subject: [PATCH] Fix some dodgy inline assembler
Message-ID: <20080303002849.GA28985@basil.nowhere.org>


Potential miscompilation, there is no guarantee the clobbers
won't conflict with the input registers. Also the ifdef is not really needed.
Write it properly.

Just spotted while looking at the linux lowlevel code.

-Andi


diff -u openjdk/hotspot/src/os/linux/launcher/java_md.c-o openjdk/hotspot/src/os/linux/launcher/java_md.c
--- openjdk/hotspot/src/os/linux/launcher/java_md.c-o	2007-05-24 09:30:53.000000000 +0200
+++ openjdk/hotspot/src/os/linux/launcher/java_md.c	2007-05-28 15:42:23.000000000 +0200
@@ -1233,57 +1233,16 @@
           uint32_t* ebxp,
           uint32_t* ecxp,
           uint32_t* edxp) {
-#ifdef _LP64
-  __asm__ volatile (/* Instructions */
-                    "	movl	%4, %%eax  \n"
-                    "	cpuid              \n"
-                    "	movl    %%eax, (%0)\n"
-                    "	movl    %%ebx, (%1)\n"
-                    "	movl    %%ecx, (%2)\n"
-                    "	movl    %%edx, (%3)\n"
-                    : /* Outputs */
-                    : /* Inputs */
-                    "r" (eaxp),
-                    "r" (ebxp),
-                    "r" (ecxp),
-                    "r" (edxp),
-                    "r" (arg)
-                    : /* Clobbers */
-                    "%rax", "%rbx", "%rcx", "%rdx", "memory"
-                    );
-#else
-  uint32_t value_of_eax = 0;
-  uint32_t value_of_ebx = 0;
-  uint32_t value_of_ecx = 0;
-  uint32_t value_of_edx = 0;
-  __asm__ volatile (/* Instructions */
-                        /* ebx is callee-save, so push it */
-                        /* even though it's in the clobbers section */
-                    "   pushl   %%ebx      \n"
-                    "	movl	%4, %%eax  \n"
-                    "	cpuid              \n"
-                    "	movl    %%eax, %0  \n"
-                    "	movl    %%ebx, %1  \n"
-                    "	movl    %%ecx, %2  \n"
-                    "	movl    %%edx, %3  \n"
-                        /* restore ebx */
-                    "   popl    %%ebx      \n"
-
-                    : /* Outputs */
-                    "=m" (value_of_eax),
-                    "=m" (value_of_ebx),
-                    "=m" (value_of_ecx),
-                    "=m" (value_of_edx)
-                    : /* Inputs */
-                    "m" (arg)
-                    : /* Clobbers */
-                    "%eax", "%ebx", "%ecx", "%edx"
-                    );
-  *eaxp = value_of_eax;
-  *ebxp = value_of_ebx;
-  *ecxp = value_of_ecx;
-  *edxp = value_of_edx;
-#endif
+  asm(/* Instructions */
+	  "cpuid"
+	  : /* Outputs */
+	  "=a" (*eaxp),
+	  "=b" (*ebxp),
+	  "=c" (*ecxp),
+	  "=d" (*edxp)
+	  : /* Inputs */
+	  "0" (arg)
+	  );
 }
 
 #endif /* __linux__ && i586 */


From David.Holmes at Sun.COM  Sun Mar  2 21:11:24 2008
From: David.Holmes at Sun.COM (David Holmes)
Date: Mon, 03 Mar 2008 15:11:24 +1000
Subject: [PATCH] Fix a /tmp race in the linux code
In-Reply-To: <20080303002628.GA28974@basil.nowhere.org>
References: <20080303002628.GA28974@basil.nowhere.org>
Message-ID: <47CB887C.7020803@sun.com>

Andi,

Where is the race given the temp filename is unique?

David Holmes

Andi Kleen wrote:
> /tmp races are bad for you. Use mkstemp instead of a predictable name for 
> the debugging file.
> 
> -Andi
> 
> diff -u openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o openjdk/hotspot/src/os/linux/vm/os_linux.cpp
> --- openjdk/hotspot/src/os/linux/vm/os_linux.cpp-o	2007-05-24 09:30:53.000000000 +0200
> +++ openjdk/hotspot/src/os/linux/vm/os_linux.cpp	2007-06-24 16:56:57.000000000 +0200
> @@ -2185,15 +2196,12 @@
>      return;
>    }
>  
> -  char buf[40];
>    int num = Atomic::add(1, &cnt);
>  
> -  sprintf(buf, "/tmp/hs-vm-%d-%d", os::current_process_id(), num);
> -  unlink(buf);
> - 
> -  int fd = open(buf, O_CREAT | O_RDWR, S_IRWXU);
> +  int fd = mkstemp("/tmp/hs-vm-XXXXXX");
>  
>    if (fd != -1) {
> +    fchmod(fd, S_IRWXU);
>      off_t rv = lseek(fd, size-2, SEEK_SET);
>      if (rv != (off_t)-1) {
>        if (write(fd, "", 1) == 1) {
> @@ -2203,7 +2211,6 @@
>        }
>      }
>      close(fd);
> -    unlink(buf);
>    }
>  }
>  


From Igor.Veresov at Sun.COM  Mon Mar  3 01:52:40 2008
From: Igor.Veresov at Sun.COM (Igor Veresov)
Date: Mon, 03 Mar 2008 12:52:40 +0300
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <20080303002454.GA28952@basil.nowhere.org>
References: <20080303002454.GA28952@basil.nowhere.org>
Message-ID: <200803031252.40200.igor.veresov@sun.com>

Andi,

I haven't studied your changes in detail but I have a NUMA-aware allocator for 
Linux in works and I do see speedups, which are similar to what I was able to 
get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So it's 
all working quite well, actually.

igor

On Monday 03 March 2008 03:24:54 Andi Kleen wrote:
> Hi,
>
> Some time ago I played a bit with the NUMA heap support in the hotspot
> sources. In particularly I implemented an interface to the Linux libnuma
> interface and wrote some simple benchmarks to see if it was any faster on a
> Opteron system (it wasn't unfortunately). After some debugging I concluded
> that my Linux NUMA interface was likely right, but the NUMA heap
> implementation seemed to be broken (I doubt it'll work very well even on
> Solaris)
>
> The implementation does not require to link libnuma always in,
> but dlopen()s it as needed to avoid a little bit of dll/so hell.
>
> It also uses the Linux getcpu() call which means it can actually
> adapt to migrating threads over nodes (unlike the Solaris implementation)
>
> Anyways just in case someone else wants to play with it here's
> libnuma support for Linux for HotSpot. It can be enabled with the usual
> options, but is disabled by default.
>
> The patch was originally against a fairly old snapshot (b13). It still
> applies perfectly to the latest 6 snapshot I downloaded. I wasn't able
> to retest it thought because I was unable to build the latest snapshot
> even after fiddling for an hour with all the undocumented environment
> variables.
>
> -Andi


From Igor.Veresov at Sun.COM  Mon Mar  3 04:19:13 2008
From: Igor.Veresov at Sun.COM (Igor Veresov)
Date: Mon, 03 Mar 2008 15:19:13 +0300
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <20080303113017.GC5085@one.firstfloor.org>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031252.40200.igor.veresov@sun.com>
	<20080303113017.GC5085@one.firstfloor.org>
Message-ID: <200803031519.13736.igor.veresov@sun.com>

On Monday 03 March 2008 14:30:17 Andi Kleen wrote:
> On Mon, Mar 03, 2008 at 12:52:40PM +0300, Igor Veresov wrote:
> > I haven't studied your changes in detail but I have a NUMA-aware
> > allocator for Linux in works
>
> Ok maybe you can do something with my patch then.
>
> > and I do see speedups, which are similar to what I was able to
> > get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So
> > it's
>
> Ok I only did micro benchmarks. Maybe they were not strong enough.
> For some simple allocations I didn't get any numa local placement
> at least according to the benchmark numbers.
>

Well, obviously on microbenchmarks you should see even more speedup.
To be exact, there is a 30% difference in latency in a 2-socket Opteron (1 HT 
hop) system and there's even more for 2 and 3-hop systems.

> > all working quite well, actually.
>
> One obvious issue I found in the Solaris code was that it neither
> binds threads to nodes (afaik), nor tries to keep up with a thread
> migrating to another node. It just assumes always the same thread:node
> mapping which surely cannot be correct?

It is however correct. Solaris assigns a home locality group (a node) to each 
lwp. And while lwps can be temporary migrated to a remote group, page 
allocation still happens on a home node and the lwp is predominantly 
scheduled to run in its home lgroup. For more information you could refer to 
the NUMA chapter of the "Solaris Internals" book or to blogs of Jonathan Chew 
and Alexander Kolbasov from Solaris CMT/NUMA team.

>
> On the Linux implementation I solved that by using getcpu() on
> each allocation (on recent Linux it is a special optimized fast path
> that is quite fast)

I doubt that executing a syscall on every allocation (even if it's a TLAB 
allocation) is a good idea. It's many times slower than the original "bump 
the pointer with the CAS spin" allocator. Linux scheduler is quite reluctant 
to move lwps between the nodes, so checking the lwp position every, say, 64 
TLAB allocations proved to be adequate. On Solaris even that is not 
necessary.

igor


From Igor.Veresov at Sun.COM  Mon Mar  3 05:25:53 2008
From: Igor.Veresov at Sun.COM (Igor Veresov)
Date: Mon, 03 Mar 2008 16:25:53 +0300
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <20080303125719.GA9853@one.firstfloor.org>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031519.13736.igor.veresov@sun.com>
	<20080303125719.GA9853@one.firstfloor.org>
Message-ID: <200803031625.53418.igor.veresov@sun.com>

On Monday 03 March 2008 15:57:19 Andi Kleen wrote:
> On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:
> > > > all working quite well, actually.
> > >
> > > One obvious issue I found in the Solaris code was that it neither
> > > binds threads to nodes (afaik), nor tries to keep up with a thread
> > > migrating to another node. It just assumes always the same thread:node
> > > mapping which surely cannot be correct?
> >
> > It is however correct. Solaris assigns a home locality group (a node) to
> > each lwp. And while lwps can be temporary migrated to a remote group,
> > page allocation still happens on a home node and the lwp is predominantly
> > scheduled to run in its home lgroup. For more information you could refer
> > to
>
> Interesting. I tried a similar scheduling algorithm on Linux a long time
> ago (it was called the "homenode scheduler") and it was a general loss on
> my testing on smaller AMD systems. But maybe Solaris does it all different.
>
> Anyways on Linux that won't work because it doesn't have the concept
> of a homenode.

Yes, but it has static memory binding instead, which alleviates this problem.

>
> The other problems is that it seemed to always assume all the threads
> will consume the whole system and set up for all nodes, which seemed dodgy.

You mean the allocator? Actually it is adaptive to the allocation rate on a 
node, which in effect makes the whole eden space usable for applications with 
asymmetric per-thread allocation rate. This of course also helps with the 
case when the number of threads is less than the number of nodes.

igor


From Igor.Veresov at Sun.COM  Mon Mar  3 06:20:26 2008
From: Igor.Veresov at Sun.COM (Igor Veresov)
Date: Mon, 03 Mar 2008 17:20:26 +0300
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <20080303133205.GB9853@one.firstfloor.org>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031625.53418.igor.veresov@sun.com>
	<20080303133205.GB9853@one.firstfloor.org>
Message-ID: <200803031720.26828.igor.veresov@sun.com>

On Monday 03 March 2008 16:32:05 Andi Kleen wrote:
> On Mon, Mar 03, 2008 at 04:25:53PM +0300, Igor Veresov wrote:
> > On Monday 03 March 2008 15:57:19 Andi Kleen wrote:
> > > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:
> > > > > > all working quite well, actually.
> > > > >
> > > > > One obvious issue I found in the Solaris code was that it neither
> > > > > binds threads to nodes (afaik), nor tries to keep up with a thread
> > > > > migrating to another node. It just assumes always the same
> > > > > thread:node mapping which surely cannot be correct?
> > > >
> > > > It is however correct. Solaris assigns a home locality group (a node)
> > > > to each lwp. And while lwps can be temporary migrated to a remote
> > > > group, page allocation still happens on a home node and the lwp is
> > > > predominantly scheduled to run in its home lgroup. For more
> > > > information you could refer to
> > >
> > > Interesting. I tried a similar scheduling algorithm on Linux a long
> > > time ago (it was called the "homenode scheduler") and it was a general
> > > loss on my testing on smaller AMD systems. But maybe Solaris does it
> > > all different.
> > >
> > > Anyways on Linux that won't work because it doesn't have the concept
> > > of a homenode.
> >
> > Yes, but it has static memory binding instead, which alleviates this
> > problem.
>
> That would require statically binding the threads too which is by default
> not a good idea without explicit user configuration

Not necessarily. It works fine without the static cpu binding. Keep in mind, 
that most data we have in young generation is short-lived anyway and if the 
scheduler is reluctant enough to move threads between the nodes the 
application will have enough time to manipulate the data locally. 

For long-living data, yes, this won't work.

>
> The reasoning is that not using a CPU is always worse than using
> remote memory at least on systems with reasonable small NUMA factor.
>
> (that is what killed the homenode scheduler too)

As I've mentioned before, Solaris will run the thread remotely if there is a 
significant load imbalance. Because indeed, it's better to run remotely than 
not to run at all. But this thread will return to its home node at first 
opportunity. 

>
> > > The other problems is that it seemed to always assume all the threads
> > > will consume the whole system and set up for all nodes, which seemed
> > > dodgy.
> >
> > You mean the allocator? Actually it is adaptive to the allocation rate on
> > a node, which in effect makes the whole eden space usable for
> > applications with asymmetric per-thread allocation rate. This of course
> > also helps with the case when the number of threads is less than the
> > number of nodes.
>
> It didn't seem to adapt though. Or maybe I'm misremembering the code,
> it was some time ago.

It will start adapting after 5 minor GCs or so, after it has enough statistics 
to make a decision. Try running with -XX:+UseNUMA and -XX:
+PrintGCDetails -XX:+PrintHeapAtGC on Solaris and you'll see how the heap is 
being reshaped.

igor


From andi at firstfloor.org  Mon Mar  3 00:53:20 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 03 Mar 2008 09:53:20 +0100
Subject: [PATCH] Fix a /tmp race in the linux code
In-Reply-To: <47CB887C.7020803@sun.com>
References: <20080303002628.GA28974@basil.nowhere.org>
	<47CB887C.7020803@sun.com>
Message-ID: <47CBBC80.10200@firstfloor.org>

David Holmes wrote:
> Andi,
> 
> Where is the race given the temp filename is unique?

pids tend to be easily predictable within a small range and I suspect 
the number is also not entirely unpredictable

-Andi


From andi at firstfloor.org  Mon Mar  3 03:30:17 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 12:30:17 +0100
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <200803031252.40200.igor.veresov@sun.com>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031252.40200.igor.veresov@sun.com>
Message-ID: <20080303113017.GC5085@one.firstfloor.org>

On Mon, Mar 03, 2008 at 12:52:40PM +0300, Igor Veresov wrote:
> I haven't studied your changes in detail but I have a NUMA-aware allocator for 
> Linux in works 

Ok maybe you can do something with my patch then.

> and I do see speedups, which are similar to what I was able to 
> get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So it's 

Ok I only did micro benchmarks. Maybe they were not strong enough.
For some simple allocations I didn't get any numa local placement 
at least according to the benchmark numbers.

> all working quite well, actually.

One obvious issue I found in the Solaris code was that it neither
binds threads to nodes (afaik), nor tries to keep up with a thread migrating
to another node. It just assumes always the same thread:node mapping
which surely cannot be correct?

On the Linux implementation I solved that by using getcpu() on
each allocation (on recent Linux it is a special optimized fast path
that is quite fast) 

But I think there were some other issues too.

-Andi


From andi at firstfloor.org  Mon Mar  3 04:57:19 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 13:57:19 +0100
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <200803031519.13736.igor.veresov@sun.com>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031252.40200.igor.veresov@sun.com>
	<20080303113017.GC5085@one.firstfloor.org>
	<200803031519.13736.igor.veresov@sun.com>
Message-ID: <20080303125719.GA9853@one.firstfloor.org>

On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:

> 
> > > all working quite well, actually.
> >
> > One obvious issue I found in the Solaris code was that it neither
> > binds threads to nodes (afaik), nor tries to keep up with a thread
> > migrating to another node. It just assumes always the same thread:node
> > mapping which surely cannot be correct?
> 
> It is however correct. Solaris assigns a home locality group (a node) to each 
> lwp. And while lwps can be temporary migrated to a remote group, page 
> allocation still happens on a home node and the lwp is predominantly 
> scheduled to run in its home lgroup. For more information you could refer to 

Interesting. I tried a similar scheduling algorithm on Linux a long time ago
(it was called the "homenode scheduler") and it was a general loss 
on my testing on smaller AMD systems. But maybe Solaris does it all different.

Anyways on Linux that won't work because it doesn't have the concept
of a homenode.

The other problems is that it seemed to always assume all the threads
will consume the whole system and set up for all nodes, which seemed dodgy.

> the NUMA chapter of the "Solaris Internals" book or to blogs of Jonathan Chew 
> and Alexander Kolbasov from Solaris CMT/NUMA team.
> 
> >
> > On the Linux implementation I solved that by using getcpu() on
> > each allocation (on recent Linux it is a special optimized fast path
> > that is quite fast)
> 
> I doubt that executing a syscall on every allocation (even if it's a TLAB 
> allocation) is a good idea. It's many times slower than the original "bump 

A vsyscall is not a real syscall. It keeps running in ring 3 and just
is some code the kernel maps into each user process. It's not more
expensive than any indirect function call. 

The getcpu() vsyscall was especially designed for use by such NUMA aware
allocators.


> the pointer with the CAS spin" allocator. Linux scheduler is quite reluctant 
> to move lwps between the nodes, so checking the lwp position every, say, 64 
> TLAB allocations proved to be adequate. On Solaris even that is not 
> necessary.

getcpu() does this already by keeping a cache and a time stamp to check
once per clock tick. This means it used to, in the latest kernels it is so fast 
now that even that was removed.

-Andi


From andi at firstfloor.org  Mon Mar  3 05:32:05 2008
From: andi at firstfloor.org (Andi Kleen)
Date: Mon, 3 Mar 2008 14:32:05 +0100
Subject: [PATCH] Linux NUMA support for HotSpot
In-Reply-To: <200803031625.53418.igor.veresov@sun.com>
References: <20080303002454.GA28952@basil.nowhere.org>
	<200803031519.13736.igor.veresov@sun.com>
	<20080303125719.GA9853@one.firstfloor.org>
	<200803031625.53418.igor.veresov@sun.com>
Message-ID: <20080303133205.GB9853@one.firstfloor.org>

On Mon, Mar 03, 2008 at 04:25:53PM +0300, Igor Veresov wrote:
> On Monday 03 March 2008 15:57:19 Andi Kleen wrote:
> > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:
> > > > > all working quite well, actually.
> > > >
> > > > One obvious issue I found in the Solaris code was that it neither
> > > > binds threads to nodes (afaik), nor tries to keep up with a thread
> > > > migrating to another node. It just assumes always the same thread:node
> > > > mapping which surely cannot be correct?
> > >
> > > It is however correct. Solaris assigns a home locality group (a node) to
> > > each lwp. And while lwps can be temporary migrated to a remote group,
> > > page allocation still happens on a home node and the lwp is predominantly
> > > scheduled to run in its home lgroup. For more information you could refer
> > > to
> >
> > Interesting. I tried a similar scheduling algorithm on Linux a long time
> > ago (it was called the "homenode scheduler") and it was a general loss on
> > my testing on smaller AMD systems. But maybe Solaris does it all different.
> >
> > Anyways on Linux that won't work because it doesn't have the concept
> > of a homenode.
> 
> Yes, but it has static memory binding instead, which alleviates this problem.

That would require statically binding the threads too which is by default not
a good idea without explicit user configuration

The reasoning is that not using a CPU is always worse than using 
remote memory at least on systems with reasonable small NUMA factor.

(that is what killed the homenode scheduler too) 

> 
> >
> > The other problems is that it seemed to always assume all the threads
> > will consume the whole system and set up for all nodes, which seemed dodgy.
> 
> You mean the allocator? Actually it is adaptive to the allocation rate on a 
> node, which in effect makes the whole eden space usable for applications with 
> asymmetric per-thread allocation rate. This of course also helps with the 
> case when the number of threads is less than the number of nodes.

It didn't seem to adapt though. Or maybe I'm misremembering the code,
it was some time ago.

-Andi 


From gbenson at redhat.com  Mon Mar 10 08:06:35 2008
From: gbenson at redhat.com (Gary Benson)
Date: Mon, 10 Mar 2008 15:06:35 +0000
Subject: Linux current_stack_region()
Message-ID: <20080310150634.GC3824@redhat.com>

Hi all,

Recently I've been investigating some stack-related failures on ppc,
and trying to figure out how to make the stack region code work on
ia64.

The first thing I discovered is that the current linux code is wrong
when there are guard pages.  The comment above current_stack_region
in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the
region reported by pthread_attr_getstack(), which is not the case.
It needs to use pthread_attr_getguardsize() and trim that many bytes
from the bottom of the region reported by pthread_attr_getstack().

I started modifying current_stack_region to do just that, but its
comments contain warnings that pthread_getattr_np() returns bogus
values for initial threads.  os::Linux::capture_initial_stack()
has more such warnings, though neither mentions exactly _what_ was
bogus.  Does anyone know?  Without a working pthread_getattr_np()
you can't use pthread_attr_getguardsize(), and without that it's
not possible to implement current_stack_region() in the form it's
currently defined.

I spoke with our glibc maintainer and he assured me that
pthread_getattr_np() returns good values for all threads, albeit more
slowly for the initial thread.  I rewrote current_stack_region() to
use it.

I attached what I wrote.  Does it look ok?

Cheers,
Gary

--
http://gbenson.net/
-------------- next part --------------
static void current_stack_region(address *bottom, size_t *size)
{
  pthread_attr_t attr;
  int res = pthread_getattr_np(pthread_self(), &attr);
  if (res != 0) {
    if (res == ENOMEM) {
      vm_exit_out_of_memory(0, "pthread_getattr_np");
    }
    else {
      fatal1("pthread_getattr_np failed with errno = %d", res);
    }
  }

  address stack_bottom;
  size_t stack_bytes;
  res = pthread_attr_getstack(&attr, (void **) &stack_bottom, &stack_bytes);
  if (res != 0) {
    fatal1("pthread_attr_getstack failed with errno = %d", res);
  }
  address stack_top = stack_bottom + stack_bytes;

  // The block of memory returned by pthread_attr_getstack() includes
  // guard pages where present.  We need to trim these off.
  size_t page_bytes = os::Linux::page_size();
  assert(((intptr_t) stack_bottom & (page_bytes - 1)) == 0, "unaligned stack");

  size_t guard_bytes;
  res = pthread_attr_getguardsize(&attr, &guard_bytes);
  if (res != 0) {
    fatal1("pthread_attr_getguardsize failed with errno = %d", res);
  }
  int guard_pages = align_size_up(guard_bytes, page_bytes) / page_bytes;
  assert(guard_bytes == guard_pages * page_bytes, "unaligned guard");

#ifdef IA64
  // IA64 has two stacks sharing the same area of memory, a normal
  // stack growing downwards and a register stack growing upwards.
  // Guard pages, if present, are in the centre.  This code splits
  // the stack in two even without guard pages, though in theory
  // there's nothing to stop us allocating more to the normal stack
  // or more to the register stack if one or the other were found
  // to grow faster.
  int total_pages = align_size_down(stack_bytes, page_bytes) / page_bytes;
  stack_bottom += (total_pages - guard_pages) / 2 * page_bytes;
#endif // IA64

  stack_bottom += guard_bytes;

  pthread_attr_destroy(&attr);

  // The initial thread has a growable stack, and the size reported
  // by pthread_attr_getstack is the maximum size it could possibly
  // be given what currently mapped.  This can be huge, so we cap it.
  if (os::Linux::is_initial_thread()) {
    stack_bytes = stack_top - stack_bottom;

    if (stack_bytes > JavaThread::stack_size_at_create())
      stack_bytes = JavaThread::stack_size_at_create();

    stack_bottom = stack_top - stack_bytes;
  }

  assert(os::current_stack_pointer() >= stack_bottom, "should do");
  assert(os::current_stack_pointer() < stack_top, "should do");

  *bottom = stack_bottom;
  *size = stack_top - stack_bottom;
}

From David.Holmes at Sun.COM  Mon Mar 10 17:34:38 2008
From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems)
Date: Tue, 11 Mar 2008 10:34:38 +1000
Subject: Linux current_stack_region()
In-Reply-To: <20080310150634.GC3824@redhat.com>
References: <20080310150634.GC3824@redhat.com>
Message-ID: <47D5D39E.3070402@sun.com>

Hi Gary,

Disclaimer: this isn't code I've worked with - though I did review the 
most recent changes - and stack management is a particularly confusing 
area. :)

Gary Benson said the following on 11/03/08 01:06 AM:
> The first thing I discovered is that the current linux code is wrong
> when there are guard pages.  The comment above current_stack_region
> in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the
> region reported by pthread_attr_getstack(), which is not the case.

Reading the POSIX specification I don't see anything that explicitly 
states this, but I would infer that the guard pages are not part of the 
region reported by pthread_attr_getstack from the statement:

"The stack attributes specify the area of storage to be used for the 
created thread's stack."

i.e. getstack reports the _usable_ stack for the thread. Hence any guard 
region is outside that.

> I started modifying current_stack_region to do just that, but its
> comments contain warnings that pthread_getattr_np() returns bogus
> values for initial threads.  os::Linux::capture_initial_stack()
> has more such warnings, though neither mentions exactly _what_ was
> bogus.  Does anyone know?  Without a working pthread_getattr_np()
> you can't use pthread_attr_getguardsize(), and without that it's
> not possible to implement current_stack_region() in the form it's
> currently defined.

The comment re pthread_getattr_np is about 5 years old and I couldn't 
find anything more specific than the inference that they discovered that 
it returned the wrong values on the initial thread on the distributions 
of the day (whatever they may have been). Hotspot is full of this kind 
of historical baggage with workarounds for a range of now defunct linux 
systems (and old Solaris versions too).


Cheers,
David Holmes


From David.Holmes at Sun.COM  Mon Mar 10 17:52:10 2008
From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems)
Date: Tue, 11 Mar 2008 10:52:10 +1000
Subject: Linux current_stack_region()
In-Reply-To: <47D5D39E.3070402@sun.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
Message-ID: <47D5D7BA.3080106@sun.com>

Note also that pthread_attr_setguardsize may internally round-up the 
guardsize to a multiple of page-size; but pthread_attr_getguardsize 
returns the original supplied value not the rounded one. So you would 
have a problem trying to adjust for the true guardsize in a portable way.

David

David Holmes - Sun Microsystems said the following on 11/03/08 10:34 AM:
> Hi Gary,
> 
> Disclaimer: this isn't code I've worked with - though I did review the 
> most recent changes - and stack management is a particularly confusing 
> area. :)
> 
> Gary Benson said the following on 11/03/08 01:06 AM:
>> The first thing I discovered is that the current linux code is wrong
>> when there are guard pages.  The comment above current_stack_region
>> in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the
>> region reported by pthread_attr_getstack(), which is not the case.
> 
> Reading the POSIX specification I don't see anything that explicitly 
> states this, but I would infer that the guard pages are not part of the 
> region reported by pthread_attr_getstack from the statement:
> 
> "The stack attributes specify the area of storage to be used for the 
> created thread's stack."
> 
> i.e. getstack reports the _usable_ stack for the thread. Hence any guard 
> region is outside that.
> 
>> I started modifying current_stack_region to do just that, but its
>> comments contain warnings that pthread_getattr_np() returns bogus
>> values for initial threads.  os::Linux::capture_initial_stack()
>> has more such warnings, though neither mentions exactly _what_ was
>> bogus.  Does anyone know?  Without a working pthread_getattr_np()
>> you can't use pthread_attr_getguardsize(), and without that it's
>> not possible to implement current_stack_region() in the form it's
>> currently defined.
> 
> The comment re pthread_getattr_np is about 5 years old and I couldn't 
> find anything more specific than the inference that they discovered that 
> it returned the wrong values on the initial thread on the distributions 
> of the day (whatever they may have been). Hotspot is full of this kind 
> of historical baggage with workarounds for a range of now defunct linux 
> systems (and old Solaris versions too).
> 
> 
> Cheers,
> David Holmes


From gbenson at redhat.com  Tue Mar 11 02:32:57 2008
From: gbenson at redhat.com (Gary Benson)
Date: Tue, 11 Mar 2008 09:32:57 +0000
Subject: Linux current_stack_region()
In-Reply-To: <47D5D7BA.3080106@sun.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
	<47D5D7BA.3080106@sun.com>
Message-ID: <20080311093257.GA3727@redhat.com>

David Holmes wrote:
> Gary Benson said the following on 11/03/08 01:06 AM:
> > The first thing I discovered is that the current linux code is wrong
> > when there are guard pages.  The comment above current_stack_region
> > in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the
> > region reported by pthread_attr_getstack(), which is not the case.
> 
> Reading the POSIX specification I don't see anything that
> explicitly states this, but I would infer that the guard pages are
> not part of the region reported by pthread_attr_getstack from the
> statement:
> 
> "The stack attributes specify the area of storage to be used for
> the created thread's stack."
> 
> i.e. getstack reports the _usable_ stack for the thread. Hence any
> guard region is outside that.

That was how I understood it at first, but this is very definitely not
the case with glibc.  The glibc implementation takes the view that if
a thread asks for X bytes of stack then it will get a region X bytes
in size -- regardless of how that is then parcelled out.

The ia64 case makes it a bit clearer, for me at least.  ia64 has two
stacks, both of which share the same region of memory.  There's a
normal stack which grows downwards, and a register stack which grows
upwards.  Guard pages, when specified, go in the middle.  You can't
really compensate for this without also compensating for the register
stack, so you'd have this extra, unreported region (and threads
silently allocating twice as much memory as you expected them to).
It's awkward whichever way you decide to interpret the spec.

> > I started modifying current_stack_region to do just that, but its
> > comments contain warnings that pthread_getattr_np() returns bogus
> > values for initial threads.  os::Linux::capture_initial_stack()
> > has more such warnings, though neither mentions exactly _what_ was
> > bogus.  Does anyone know?  Without a working pthread_getattr_np()
> > you can't use pthread_attr_getguardsize(), and without that it's
> > not possible to implement current_stack_region() in the form it's
> > currently defined.
>
> The comment re pthread_getattr_np is about 5 years old and I
> couldn't find anything more specific than the inference that they
> discovered that it returned the wrong values on the initial thread
> on the distributions of the day (whatever they may have
> been). Hotspot is full of this kind of historical baggage with
> workarounds for a range of now defunct linux systems (and old
> Solaris versions too).

Ok.

> Note also that pthread_attr_setguardsize may internally round-up the
> guardsize to a multiple of page-size; but pthread_attr_getguardsize
> returns the original supplied value not the rounded one. So you
> would have a problem trying to adjust for the true guardsize in a
> portable way.

No, this only matters if the pthread_attr_t you're using is the same.
If you create one using pthread_attr_init() then yes, the value
returned by pthread_attr_getguardsize() will be exactly the value you
set with pthread_attr_setguardsize().  The pthread_attr_t returned by
pthread_getattr_np(), however, is created from the actual properties
of the thread, so it contains the actual guard size in use.

Cheers,
Gary

-- 
http://gbenson.net/


From David.Holmes at Sun.COM  Tue Mar 11 07:07:17 2008
From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems)
Date: Wed, 12 Mar 2008 00:07:17 +1000
Subject: Linux current_stack_region()
In-Reply-To: <20080311093257.GA3727@redhat.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
	<47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com>
Message-ID: <47D69215.1090409@sun.com>

Hi Gary,

I'd argue that glibc is incorrect then.

But that aside what are the implications for hotspot? Does it mean we're 
  placing our guard pages on top of glibc's? (Fortunately this will only 
affect natively attached threads that happen to use glibc guards).

Thanks for clarifying my misreading regarding the rounding issue.

Cheers,
David Holmes

Gary Benson said the following on 11/03/08 07:32 PM:
> David Holmes wrote:
>> Gary Benson said the following on 11/03/08 01:06 AM:
>>> The first thing I discovered is that the current linux code is wrong
>>> when there are guard pages.  The comment above current_stack_region
>>> in os_linux_{i486,amd64,x86}.cpp puts the guard page outside the
>>> region reported by pthread_attr_getstack(), which is not the case.
>> Reading the POSIX specification I don't see anything that
>> explicitly states this, but I would infer that the guard pages are
>> not part of the region reported by pthread_attr_getstack from the
>> statement:
>>
>> "The stack attributes specify the area of storage to be used for
>> the created thread's stack."
>>
>> i.e. getstack reports the _usable_ stack for the thread. Hence any
>> guard region is outside that.
> 
> That was how I understood it at first, but this is very definitely not
> the case with glibc.  The glibc implementation takes the view that if
> a thread asks for X bytes of stack then it will get a region X bytes
> in size -- regardless of how that is then parcelled out.
> 
> The ia64 case makes it a bit clearer, for me at least.  ia64 has two
> stacks, both of which share the same region of memory.  There's a
> normal stack which grows downwards, and a register stack which grows
> upwards.  Guard pages, when specified, go in the middle.  You can't
> really compensate for this without also compensating for the register
> stack, so you'd have this extra, unreported region (and threads
> silently allocating twice as much memory as you expected them to).
> It's awkward whichever way you decide to interpret the spec.
> 
>>> I started modifying current_stack_region to do just that, but its
>>> comments contain warnings that pthread_getattr_np() returns bogus
>>> values for initial threads.  os::Linux::capture_initial_stack()
>>> has more such warnings, though neither mentions exactly _what_ was
>>> bogus.  Does anyone know?  Without a working pthread_getattr_np()
>>> you can't use pthread_attr_getguardsize(), and without that it's
>>> not possible to implement current_stack_region() in the form it's
>>> currently defined.
>> The comment re pthread_getattr_np is about 5 years old and I
>> couldn't find anything more specific than the inference that they
>> discovered that it returned the wrong values on the initial thread
>> on the distributions of the day (whatever they may have
>> been). Hotspot is full of this kind of historical baggage with
>> workarounds for a range of now defunct linux systems (and old
>> Solaris versions too).
> 
> Ok.
> 
>> Note also that pthread_attr_setguardsize may internally round-up the
>> guardsize to a multiple of page-size; but pthread_attr_getguardsize
>> returns the original supplied value not the rounded one. So you
>> would have a problem trying to adjust for the true guardsize in a
>> portable way.
> 
> No, this only matters if the pthread_attr_t you're using is the same.
> If you create one using pthread_attr_init() then yes, the value
> returned by pthread_attr_getguardsize() will be exactly the value you
> set with pthread_attr_setguardsize().  The pthread_attr_t returned by
> pthread_getattr_np(), however, is created from the actual properties
> of the thread, so it contains the actual guard size in use.
> 
> Cheers,
> Gary
> 


From gbenson at redhat.com  Thu Mar 13 09:05:03 2008
From: gbenson at redhat.com (Gary Benson)
Date: Thu, 13 Mar 2008 16:05:03 +0000
Subject: Linux current_stack_region()
In-Reply-To: <47D69215.1090409@sun.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
	<47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com>
	<47D69215.1090409@sun.com>
Message-ID: <20080313160503.GG4394@redhat.com>

David Holmes - Sun Microsystems wrote:
> I'd argue that glibc is incorrect then.

You'd not be the only one:
https://bugzilla.redhat.com/show_bug.cgi?id=435337

> But that aside what are the implications for hotspot? Does it mean
> we're placing our guard pages on top of glibc's? (Fortunately this
> will only affect natively attached threads that happen to use glibc
> guards).

In Java threads with guard pages (attached threads, basically)
HotSpot's red page will be in the same place as the glibc guard.
In non-Java threads there will be a guard page at the end of the
stack.

> Thanks for clarifying my misreading regarding the rounding issue.

Thanks for bringing it up.  Someone else had mentioned something
similar to me and I'd meant to check it but forgotten.

Cheers,
Gary

-- 
http://gbenson.net/


From David.Holmes at Sun.COM  Thu Mar 13 16:57:15 2008
From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems)
Date: Fri, 14 Mar 2008 09:57:15 +1000
Subject: Linux current_stack_region()
In-Reply-To: <20080313160503.GG4394@redhat.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
	<47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com>
	<47D69215.1090409@sun.com> <20080313160503.GG4394@redhat.com>
Message-ID: <47D9BF5B.4060806@sun.com>

Gary,

Thanks for the bugzilla pointer - I've added my opinion there.

I will file a bug against hotspot so that we can track this.

David

Gary Benson said the following on 14/03/08 02:05 AM:
> David Holmes - Sun Microsystems wrote:
>> I'd argue that glibc is incorrect then.
> 
> You'd not be the only one:
> https://bugzilla.redhat.com/show_bug.cgi?id=435337
> 
>> But that aside what are the implications for hotspot? Does it mean
>> we're placing our guard pages on top of glibc's? (Fortunately this
>> will only affect natively attached threads that happen to use glibc
>> guards).
> 
> In Java threads with guard pages (attached threads, basically)
> HotSpot's red page will be in the same place as the glibc guard.
> In non-Java threads there will be a guard page at the end of the
> stack.
> 
>> Thanks for clarifying my misreading regarding the rounding issue.
> 
> Thanks for bringing it up.  Someone else had mentioned something
> similar to me and I'd meant to check it but forgotten.
> 
> Cheers,
> Gary
> 


From David.Holmes at Sun.COM  Thu Mar 13 17:14:07 2008
From: David.Holmes at Sun.COM (David Holmes - Sun Microsystems)
Date: Fri, 14 Mar 2008 10:14:07 +1000
Subject: Linux current_stack_region()
In-Reply-To: <47D9BF5B.4060806@sun.com>
References: <20080310150634.GC3824@redhat.com> <47D5D39E.3070402@sun.com>
	<47D5D7BA.3080106@sun.com> <20080311093257.GA3727@redhat.com>
	<47D69215.1090409@sun.com> <20080313160503.GG4394@redhat.com>
	<47D9BF5B.4060806@sun.com>
Message-ID: <47D9C34F.3040908@sun.com>

Filed: 6675312 Linux glibc stack guard-pages can overlap with hotspot 
guard pages

David

David Holmes - Sun Microsystems said the following on 14/03/08 09:57 AM:
> Gary,
> 
> Thanks for the bugzilla pointer - I've added my opinion there.
> 
> I will file a bug against hotspot so that we can track this.
> 
> David
> 
> Gary Benson said the following on 14/03/08 02:05 AM:
>> David Holmes - Sun Microsystems wrote:
>>> I'd argue that glibc is incorrect then.
>>
>> You'd not be the only one:
>> https://bugzilla.redhat.com/show_bug.cgi?id=435337
>>
>>> But that aside what are the implications for hotspot? Does it mean
>>> we're placing our guard pages on top of glibc's? (Fortunately this
>>> will only affect natively attached threads that happen to use glibc
>>> guards).
>>
>> In Java threads with guard pages (attached threads, basically)
>> HotSpot's red page will be in the same place as the glibc guard.
>> In non-Java threads there will be a guard page at the end of the
>> stack.
>>
>>> Thanks for clarifying my misreading regarding the rounding issue.
>>
>> Thanks for bringing it up.  Someone else had mentioned something
>> similar to me and I'd meant to check it but forgotten.
>>
>> Cheers,
>> Gary
>>


From Scott.Oaks at Sun.COM  Thu Mar 20 09:54:01 2008
From: Scott.Oaks at Sun.COM (Scott Oaks)
Date: Thu, 20 Mar 2008 12:54:01 -0400
Subject: Infinite Looping code and jstack
Message-ID: <1206032041.64035.184.camel@sr1-unyc10-08>

We have an appserver installation where some set of threads get into an
infinite loop. jstack always reports that the threads in question are at
a specific line.

Here's a snippet of the jstack output:
"http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable
[0x26bdf000..0x26bdf9f0]
   java.lang.Thread.State: RUNNABLE
        at
org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805)
        at
org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607)
        at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667)

parseHeader (line 805) isn't really a loop at all; there's no way I can
account for the infinite loop being in that method. It's conceivable
somehow that parseHeaders (line 607) could be in a loop:

    while (parseHeader())
          doSomething

But in that case, wouldn't I get somewhat differing results from
successive invocations jstack (or even one jstack where multiple threads
are in the same method)? 

I guess what I'm really asking is what the granularity of jstack is --
if I were in an infinite loop over two methods and 40-50 lines of code,
is it really conceivable that jstack would always show me I was on the
very same line (because that line corresponds to a safepoint or
something)? Or is something else more likely going on?

This is with JDK 1.6.0_02 and the server compiler.

-Scott


From Thomas.Rodriguez at Sun.COM  Thu Mar 20 10:27:04 2008
From: Thomas.Rodriguez at Sun.COM (Tom Rodriguez)
Date: Thu, 20 Mar 2008 10:27:04 -0700
Subject: Infinite Looping code and jstack
In-Reply-To: <1206032041.64035.184.camel@sr1-unyc10-08>
References: <1206032041.64035.184.camel@sr1-unyc10-08>
Message-ID: <47E29E68.10706@sun.com>

I believe that jstack may induce a safepoint in the target VM so the 
stack trace will always occur at safepoints in the code.  This might 
cause the trace to tend to point at a particular locations.  A trick 
that might work is to gcore the process and use jstack on that.  That 
assumes your cores aren't huge though.

If you are on s10 or later, pstack on s10 will decode Java stack traces 
though there are sometimes issues with getting full traces.  It's pretty 
good at the top of the stack though which is where you're interested. 
What platform are you on?

tom

Scott Oaks wrote:
> We have an appserver installation where some set of threads get into an
> infinite loop. jstack always reports that the threads in question are at
> a specific line.
> 
> Here's a snippet of the jstack output:
> "http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable
> [0x26bdf000..0x26bdf9f0]
>    java.lang.Thread.State: RUNNABLE
>         at
> org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805)
>         at
> org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607)
>         at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667)
> 
> parseHeader (line 805) isn't really a loop at all; there's no way I can
> account for the infinite loop being in that method. It's conceivable
> somehow that parseHeaders (line 607) could be in a loop:
> 
>     while (parseHeader())
>           doSomething
> 
> But in that case, wouldn't I get somewhat differing results from
> successive invocations jstack (or even one jstack where multiple threads
> are in the same method)? 
> 
> I guess what I'm really asking is what the granularity of jstack is --
> if I were in an infinite loop over two methods and 40-50 lines of code,
> is it really conceivable that jstack would always show me I was on the
> very same line (because that line corresponds to a safepoint or
> something)? Or is something else more likely going on?
> 
> This is with JDK 1.6.0_02 and the server compiler.
> 
> -Scott
> 


From Scott.Oaks at Sun.COM  Thu Mar 20 10:34:08 2008
From: Scott.Oaks at Sun.COM (Scott Oaks)
Date: Thu, 20 Mar 2008 13:34:08 -0400
Subject: Infinite Looping code and jstack
In-Reply-To: <47E29E68.10706@sun.com>
References: <1206032041.64035.184.camel@sr1-unyc10-08> <47E29E68.10706@sun.com>
Message-ID: <1206034447.64035.210.camel@sr1-unyc10-08>

I am on S10 (S10 U3), but I've never had much luck with pstack and busy
java processes (and/or perhaps large -- the appserver has a few hundred
threads and a 3GB heap) -- pstack itself seems to loop infinitely, or
perhaps I'm just never patient enough...but after an hour or two, I
still never have output.

I'll try the gcore trick.

-Scott

On Thu, 2008-03-20 at 13:27, Tom Rodriguez wrote:
> I believe that jstack may induce a safepoint in the target VM so the 
> stack trace will always occur at safepoints in the code.  This might 
> cause the trace to tend to point at a particular locations.  A trick 
> that might work is to gcore the process and use jstack on that.  That 
> assumes your cores aren't huge though.
> 
> If you are on s10 or later, pstack on s10 will decode Java stack traces 
> though there are sometimes issues with getting full traces.  It's pretty 
> good at the top of the stack though which is where you're interested. 
> What platform are you on?
> 
> tom
> 
> Scott Oaks wrote:
> > We have an appserver installation where some set of threads get into an
> > infinite loop. jstack always reports that the threads in question are at
> > a specific line.
> > 
> > Here's a snippet of the jstack output:
> > "http80-Processor584" daemon prio=10 tid=0x015af400 nid=0x8d4 runnable
> > [0x26bdf000..0x26bdf9f0]
> >    java.lang.Thread.State: RUNNABLE
> >         at
> > org.apache.coyote.http11.InternalInputBuffer.parseHeader(InternalInputBuffer.java:805)
> >         at
> > org.apache.coyote.http11.InternalInputBuffer.parseHeaders(InternalInputBuffer.java:607)
> >         at
> > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:667)
> > 
> > parseHeader (line 805) isn't really a loop at all; there's no way I can
> > account for the infinite loop being in that method. It's conceivable
> > somehow that parseHeaders (line 607) could be in a loop:
> > 
> >     while (parseHeader())
> >           doSomething
> > 
> > But in that case, wouldn't I get somewhat differing results from
> > successive invocations jstack (or even one jstack where multiple threads
> > are in the same method)? 
> > 
> > I guess what I'm really asking is what the granularity of jstack is --
> > if I were in an infinite loop over two methods and 40-50 lines of code,
> > is it really conceivable that jstack would always show me I was on the
> > very same line (because that line corresponds to a safepoint or
> > something)? Or is something else more likely going on?
> > 
> > This is with JDK 1.6.0_02 and the server compiler.
> > 
> > -Scott
> > 


From David.Herron at Sun.COM  Mon Mar 24 20:20:43 2008
From: David.Herron at Sun.COM (David Herron)
Date: Mon, 24 Mar 2008 20:20:43 -0700
Subject: Compiling openjdk6 w/ gcc 4.2.x
Message-ID: <47E86F8B.1080008@sun.com>

I'm checking compiling OpenJDK 6 on Ubuntu 8.04 beta using gcc 4.2.x

I got some compile errors in hotspot files that are compiled with 
-Werror.  The warning that got upgraded to an error had to do with 
string constants being used with a "char *" of some kind.

In some cases there was a struct definition that had a "char *" which 
then was initialized with a string constant.  In other cases functions 
were called where the parameter was declared "char *" but a string 
constant given to it.

In the following patches I either changed the struct definition or made 
a cast on the string constant, depending on the context.

I also had an intermittent failure on 
hotspot/src/share/vm/opto/classes.cpp saying

   In file included from .../hotspot/src/share/vm/opto/classes.cpp:36:
   .../hotspot/src/share/vm/opto/classes.hpp: In member function 
'virtual const Type* PartialSubtypeCheckNode::bottom_type() const':
   .../hotspot/src/share/vm/opto/classes.hpp:309: internal compiler 
error: Segmentation fault

All the above issues do not happen on my Ubuntu 7.10 machine that has 
GCC 4.1.x installed.  They happen on my Ubuntu 8.04 beta machine that 
has GCC 4.2.x

- David Herron


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diffs2
Url: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20080324/804481ee/attachment.ksh 

From Alan.Bateman at Sun.COM  Fri Mar 28 04:42:39 2008
From: Alan.Bateman at Sun.COM (Alan Bateman)
Date: Fri, 28 Mar 2008 11:42:39 +0000
Subject: Infinite Looping code and jstack
In-Reply-To: <1206034447.64035.210.camel@sr1-unyc10-08>
References: <1206032041.64035.184.camel@sr1-unyc10-08> <47E29E68.10706@sun.com>
	<1206034447.64035.210.camel@sr1-unyc10-08>
Message-ID: <47ECD9AF.5090601@sun.com>

Scott Oaks wrote:
> I am on S10 (S10 U3), but I've never had much luck with pstack and busy
> java processes (and/or perhaps large -- the appserver has a few hundred
> threads and a 3GB heap) -- pstack itself seems to loop infinitely, or
> perhaps I'm just never patient enough...but after an hour or two, I
> still never have output.
>
> I'll try the gcore trick.
>   
or instead of gcore + jstack, try "jstack -F <pid>" (the -F option is 
used to "force" a thread dump by invoking SA).

-Alan.


From volker.simonis at gmail.com  Mon Mar 31 10:22:17 2008
From: volker.simonis at gmail.com (Volker Simonis)
Date: Mon, 31 Mar 2008 19:22:17 +0200
Subject: Using VTune for HotSpot profiling (is UseVTune still functional and
	required?)
Message-ID: <d65861880803311022n61bd1069hc674a7bf4032ec7a@mail.gmail.com>

Hi,

does anybody know if the UseVTune option is still supported and
functional. I browsed the code and saw that the vtune interface in
"src/share/vm/runtime/vtune.hpp" is only implemented for Windows, but
empty for Linux.

Does anybody have experience in profiling the HotSpot with VTune on
Linux? Do I need to implement the vtune interface mentioned above, or
does it work out of the box?

The VTune documentation states that it gets JIT information from the
VM but I couldn't find out until now how this exactly works. An older
VTune paper stated JVMPI, but I suppose this should be JVMTI  now. As
far as I know, it is possible to get JIT compilation information
trough JVMTI, but no information about generated template code, stubs
and adapters. This would line up with the VTune documentation which
states that there may be a fair amount of non-assignable PCs which
correspond to VM code that shouldn't bother the Java programmer:)

Any hints are greatly appreciated!

Regards,
Volker


From Alan.Bateman at Sun.COM  Mon Mar 31 10:39:16 2008
From: Alan.Bateman at Sun.COM (Alan Bateman)
Date: Mon, 31 Mar 2008 18:39:16 +0100
Subject: Using VTune for HotSpot profiling (is UseVTune still functional
	and	required?)
In-Reply-To: <d65861880803311022n61bd1069hc674a7bf4032ec7a@mail.gmail.com>
References: <d65861880803311022n61bd1069hc674a7bf4032ec7a@mail.gmail.com>
Message-ID: <47F121C4.1010507@sun.com>

Volker Simonis wrote:
> Hi,
>
> does anybody know if the UseVTune option is still supported and
> functional. I browsed the code and saw that the vtune interface in
> "src/share/vm/runtime/vtune.hpp" is only implemented for Windows, but
> empty for Linux.
>
> Does anybody have experience in profiling the HotSpot with VTune on
> Linux? Do I need to implement the vtune interface mentioned above, or
> does it work out of the box?
>
> The VTune documentation states that it gets JIT information from the
> VM but I couldn't find out until now how this exactly works. An older
> VTune paper stated JVMPI, but I suppose this should be JVMTI  now. As
> far as I know, it is possible to get JIT compilation information
> trough JVMTI, but no information about generated template code, stubs
> and adapters. This would line up with the VTune documentation which
> states that there may be a fair amount of non-assignable PCs which
> correspond to VM code that shouldn't bother the Java programmer:)
>
> Any hints are greatly appreciated!
>   
AFAIK, VTune uses JVM TI on Linux but still uses the older (and 
deprecated) JVMPI on Windows. In JVM TI the DynamicCodeGenerated event 
is invoked with the address range for the interpreter, stubs, and other 
generated code. It would be interesting to track down the 
"non-assignable PCs" in case there are any places where we are missing 
events.

-Alan.