From david.r.chase at oracle.com  Mon Apr 15 13:22:56 2013
From: david.r.chase at oracle.com (david.r.chase at oracle.com)
Date: Mon, 15 Apr 2013 20:22:56 +0000
Subject: hg: sumatra/sumatra-dev/scratch: Summary: Initial prototype for flat-storage marshalling and unmarshalling
Message-ID: <20130415202257.1254448307@hg.openjdk.java.net>

Changeset: 015a6c34a221
Author:    drchase
Date:      2013-04-15 16:18 -0400
URL:       http://hg.openjdk.java.net/sumatra/sumatra-dev/scratch/rev/015a6c34a221

Summary: Initial prototype for flat-storage marshalling and unmarshalling
Keywords: data, layout
Reviewed-by: jrose

+ .hgignore
+ .jcheck/conf
+ src/org/openjdk/sumatra/data/prototype/AliasedLocation.java
+ src/org/openjdk/sumatra/data/prototype/ArrayBitLayout.java
+ src/org/openjdk/sumatra/data/prototype/ArrayDefaultLayout.java
+ src/org/openjdk/sumatra/data/prototype/ArrayLayout.java
+ src/org/openjdk/sumatra/data/prototype/ArrayLocation.java
+ src/org/openjdk/sumatra/data/prototype/AtomBitLayout.java
+ src/org/openjdk/sumatra/data/prototype/AtomLayout.java
+ src/org/openjdk/sumatra/data/prototype/B.java
+ src/org/openjdk/sumatra/data/prototype/CompoundLayout.java
+ src/org/openjdk/sumatra/data/prototype/D.java
+ src/org/openjdk/sumatra/data/prototype/F.java
+ src/org/openjdk/sumatra/data/prototype/I.java
+ src/org/openjdk/sumatra/data/prototype/J.java
+ src/org/openjdk/sumatra/data/prototype/Layout.java
+ src/org/openjdk/sumatra/data/prototype/LayoutFactory.java
+ src/org/openjdk/sumatra/data/prototype/Location.java
+ src/org/openjdk/sumatra/data/prototype/Matrix.java
+ src/org/openjdk/sumatra/data/prototype/MatrixFactory.java
+ src/org/openjdk/sumatra/data/prototype/OpaquePointerLayout.java
+ src/org/openjdk/sumatra/data/prototype/PointerDomain.java
+ src/org/openjdk/sumatra/data/prototype/PointerLayout.java
+ src/org/openjdk/sumatra/data/prototype/PrivateUtil.java
+ src/org/openjdk/sumatra/data/prototype/README
+ src/org/openjdk/sumatra/data/prototype/README.security
+ src/org/openjdk/sumatra/data/prototype/S.java
+ src/org/openjdk/sumatra/data/prototype/TranslatedPointerLayout.java
+ src/org/openjdk/sumatra/data/prototype/TupleLayout.java
+ src/org/openjdk/sumatra/data/prototype/Z.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryErrors.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTest.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestBoolean.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestByte.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestDouble.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestFloat.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestLong.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestPointer.java
+ test/org/openjdk/sumatra/data/prototype_test/LayoutFactoryTestShort.java
+ test/org/openjdk/sumatra/data/prototype_test/TestCommon.java
+ test/org/openjdk/sumatra/data/prototype_test/TestMappedLocation.java

From john.r.rose at oracle.com  Thu Apr 18 17:23:29 2013
From: john.r.rose at oracle.com (John Rose)
Date: Thu, 18 Apr 2013 17:23:29 -0700
Subject: FYI: Dot Product Thoughts
References: 
Message-ID: 

This was just posted on lambda-dev:
http://mail.openjdk.java.net/pipermail/lambda-dev/2013-April/009512.html

The posting illuminates the challenges of rendering JDK 8 lambdas as
GPU-like vector operations.

(Bharadwaj, I know you're thinking about this!)

-- John

Begin forwarded message:

From: Richard Warburton
Subject: Dot Product Thoughts
Date: April 18, 2013 4:09:38 PM PDT
To: "lambda-dev at openjdk.java.net"

Hi,

Implementing a dot product between a pair of vectors brought up a few
observations about the library:

...

From bharadwaj.yadavalli at oracle.com  Fri Apr 19 08:06:18 2013
From: bharadwaj.yadavalli at oracle.com (Bharadwaj Yadavalli)
Date: Fri, 19 Apr 2013 11:06:18 -0400
Subject: FYI: Dot Product Thoughts
In-Reply-To: 
References: 
Message-ID: <51715D6A.8060207@oracle.com>

On 4/18/2013 8:23 PM, John Rose wrote:
> This was just posted on lambda-dev:
> http://mail.openjdk.java.net/pipermail/lambda-dev/2013-April/009512.html
>
> The posting illuminates the challenges of rendering JDK 8 lambdas as
> GPU-like vector operations.
>
> (Bharadwaj, I know you're thinking about this!)
>

Thanks for pointing out the thread on the lambda-dev ML, John.

Yes, I have also been reading it with interest and am (very) slowly inching
towards identifying the potential issues to be addressed, albeit (as you
know) using matrix multiply as my example.

Regards,

Bharadwaj

From eric.caspole at amd.com  Tue Apr 23 08:24:28 2013
From: eric.caspole at amd.com (Eric Caspole)
Date: Tue, 23 Apr 2013 11:24:28 -0400
Subject: Overview of our Sumatra demo JDK
Message-ID: <5176A7AC.2070909@amd.com>

Hello Sumatra readers,

We want to explain on the public list how our internal Sumatra demo JDK works as a platform for more discussion. Hopefully later we can push this JDK to the Sumatra scratch area but for the time being we can explain it.

With this JDK we can convert a variety of Stream API lambda functions to OpenCL kernels, where the stream is using parallel() and ends with forEach(), which is where we have inserted our code to do this.

Our current version is using a modified version of Aparapi

http://code.google.com/p/aparapi/

directly integrated into a demo JDK build to process the relevant bytecode and emit the gpu kernel.

We chose to operate on streams and arrays because this allowed us to work within Aparapi's constraints.

As an initial vector add example:

Streams.intRange(0, in.length).parallel().forEach( id -> {c[id]=a[id]+b[id];});

In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the OpenCL source, we use the Java iteration variable ("id" above) as the OpenCL gid. That means each OpenCL work item is working on one value of id.

Here is a more complex stream version of a Mandelbrot demo app:

static final int width = 768;
static final int height = 768;
final int[] rgb;
final int pallette[];

void getNextImage(float x, float y, float scale) {

    Streams.intRange(0, width*height).parallel().forEach( p -> {

        /** Translate the gid into an x and y value. */
        float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
        float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;

        int count = 0;
        {
            float zx = lx;
            float zy = ly;
            float new_zx = 0f;

            // Iterate until the algorithm converges or until
            // maxIterations are reached.
            while (count < maxIterations && zx * zx + zy * zy < 8) {
                new_zx = zx * zx - zy * zy + lx;
                zy = 2 * zx * zy + ly;
                zx = new_zx;
                count++;
            }
        }
        // Pull the value out of the palette for this iteration count.
        rgb[p] = pallette[count];
    });
}

In the code above, width, height, rgb and palette are fields in the containing class.

Again we create a kernel from the whole lambda in the forEach() block. Here we use the Java iteration variable ("p" above) as the OpenCL gid. That means each OpenCL work item is working on one value of p.
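For illustration (this is a hand-written sketch, not output captured from the demo JDK, and the kernel name and argument order are invented), the OpenCL kernel for the vector add lambda above would look roughly like this, with the Java iteration variable becoming the work item's global id:

__kernel void vector_add(__global const int *a,
                         __global const int *b,
                         __global int *c) {
    int id = get_global_id(0);   /* the lambda's "id" maps to the OpenCL gid */
    c[id] = a[id] + b[id];
}

Each work item runs the lambda body once for its own value of id, which is why parallel forEach() loops of this shape map so directly onto a data-parallel kernel.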
Whilst we tried to minimize our changes to the JDK, we found that we had to make java.lang.invoke.InnerClassLambdaMetafactory public so we could get at the bytecode of the dynamically created Consumer object; we hold the Consumers' byte streams in a hash table in InnerClassLambdaMetafactory.

We also modified java.util.stream.ForEachOps to be able to immediately try to compile the target lambda for the gpu, and we also have a related server compiler intrinsic to intercept compilation of ForEach.evaluateParallel.

You can turn on the immediate redirection with a -D property.

We have not been merging with Lambda JDK tip changes in the last 3-4 weeks, but that was how the Stream API code was structured when we last merged.

Either of those intercept points will call into modified Aparapi code.

The kernel is created by getting the bytecode of the Consumer object from the InnerClassLambdaMetafactory byte stream hash table we added. By looking at that bytecode for the accept() method we get to the target lambda.

By looking at the fields of the Consumer, we build the information about the parameters for the lambda/kernel which we will pass to OpenCL.

Next we produce the OpenCL source for the target lambda using the bytecode for the lambda method in the class file.

Once the kernel source is ready we use JNI code to call OpenCL to compile the kernel into the executable format, and use the parameter information we collected in the above steps to pass the parameters to OpenCL.

In our demo JDK, we keep a hash table of the generated kernels in our Java API that is called from the redirection points, and extract the new arguments from the Consumer object on each call. Then we call the OpenCL API to update the new parameters.

We also have a version that can combine a flow of stream API lambdas into one OpenCL kernel such as

Arrays.parallel(pArray).filter(/* predicate lambda */).
      peek(/* statement lambda and continue stream */).
      filter(/* predicate lambda */).
      forEach(/* statement lambda and terminate stream*/);

so all 4 lambdas in this kind of statement can be combined into one OpenCL kernel.

In the Graal version we will be working on next, there are a couple of things that come to mind that should be different from what we did here.

- How to add Graal into the JDK as a second compiler where the rest of the system is probably using server compiler as usual?

- How to store the Graal generated kernels for later use?

- Is it necessary to use Graal to extract any required parameter info that might be needed to pass to a gpu runtime?

- How to intercept/select the Stream API calls that are good gpu kernel candidates more automagically than we did here? In the demo JDK, we redirect to our own Java API that fixes up the parameters and then calls native code to execute the kernel.

Hopefully this explains what we have so far and the intent of how we want to proceed.

Regards,
Eric

From christian.thalinger at oracle.com  Tue Apr 23 11:59:05 2013
From: christian.thalinger at oracle.com (Christian Thalinger)
Date: Tue, 23 Apr 2013 11:59:05 -0700
Subject: Overview of our Sumatra demo JDK
In-Reply-To: <5176A7AC.2070909@amd.com>
References: <5176A7AC.2070909@amd.com>
Message-ID: 

On Apr 23, 2013, at 8:24 AM, Eric Caspole wrote:

> Hello Sumatra readers,
>
> We want to explain on the public list how our internal Sumatra demo JDK
> works as a platform for more discussion. Hopefully later we can push this
> JDK to the Sumatra scratch area but for the time being we can explain it.
> > With this JDK we can convert a variety of Stream API lambda functions to OpenCL kernels, where the stream is using parallel() and ends with forEach() which is where we have inserted our code to do this. > > Our current version is using a modified version of Aparapi > > http://code.google.com/p/aparapi/ > > directly integrated into a demo JDK build to process the relevant bytecode and emit the gpu kernel. > > We chose to operate on streams and arrays because this allowed us to work within Aparapi's constraints. > > As an initial vector add example > > Streams.intRange(0, in.length).parallel().forEach( id -> {c[id]=a[id]+b[id];}); > > In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the OpenCL source, we use the Java iteration variable ("id" above) as the OpenCL gid. That means each OpenCL work item is working on one value of id. > > Here is a more complex stream version of a mandelbrot demo app: > > static final int width = 768; > static final int height = 768; > final int[] rgb; > final int pallette[]; > > void getNextImage(float x, float y, float scale) { > > Streams.intRange(0, width*height).parallel().forEach( p -> { > > /** Translate the gid into an x an y value. */ > float lx = (((p % width * scale) - ((scale / 2) * width)) / > width) + x; > float ly = (((p / width * scale) - ((scale / 2) * height)) / > height) + y; > > int count = 0; > { > float zx = lx; > float zy = ly; > float new_zx = 0f; > > // Iterate until the algorithm converges or until > // maxIterations are reached. > while (count < maxIterations && zx * zx + zy * zy < 8) { > new_zx = zx * zx - zy * zy + lx; > zy = 2 * zx * zy + ly; > zx = new_zx; > count++; > } > } > // Pull the value out of the palette for this iteration count. > rgb[p] = pallette[count]; > }); > } > > In the code above, width, height, rgb and palette are fields in the containing class. > > Again we create a kernel from the whole lambda in the forEach() block. Here we use the Java iteration variable ("p" above) as the OpenCL gid. That means each OpenCL work item is working on one value of p. > > Whilst we tried to minimize our changes to the JDK, we found that we had to make java.lang.invoke.InnerClassLambdaMetafactory public so we could get at the bytecode of the dynamically created Consumer object; we hold the Consumers byte streams in a hash table in InnerClassLambdaMetafactory. > > We also modified java.util.stream.ForEachOps to be able to immediately try to compile the target lambda for gpu and also have a related server compiler intrinsic to intercept compilation of ForEach.evaluateParallel. > > You can turn on the immediate redirection with a -D property. > > We have not been merging with Lambda JDK tip changes in the last 3-4 weeks but that was how the Stream API code was structured when we last merged. > > Either of those intercept points will call into modified Aparapi code. > > The kernel is created by getting the bytecode of the Consumer object from InnerClassLambdaMetafactory byte stream hash table we added. By looking at that bytecode for the accept() method we get to the target lambda. > > By looking at the fields of the Consumer, we build the information about the parameters for the lambda/kernel which we will pass to OpenCL. > > Next we produce the OpenCL source for the target lambda using the bytecode for the lambda method in the class file. 
>
> Once the kernel source is ready we use JNI code to call OpenCL to compile
> the kernel into the executable format, and use the parameter information
> we collected in the above steps to pass the parameters to OpenCL.
>
> In our demo JDK, we keep a hash table of the generated kernels in our Java
> API that is called from the redirection points, and extract the new
> arguments from the Consumer object on each call. Then we call the OpenCL
> API to update the new parameters.
>
>
> We also have a version that can combine a flow of stream API lambdas into
> one OpenCL kernel such as
>
> Arrays.parallel(pArray).filter(/* predicate lambda */).
>       peek(/* statement lambda and continue stream */).
>       filter(/* predicate lambda */).
>       forEach(/* statement lambda and terminate stream*/);
>
> so all 4 lambdas in this kind of statement can be combined into one OpenCL
> kernel.

Thank you for this writeup. It helps us combine our efforts and go in the same direction.

>
>
> In a Graal version we will be working on next, there are a couple things
> that come to mind that should be different from what we did here.
>
> - How to add Graal into the JDK as a second compiler where the rest of
> the system is probably using server compiler as usual?

What you want is Graal in hosted mode (hosted on C1/C2). In that mode no compilation requests are sent to Graal by default; you have to request them.

See CompileBroker::compile_method_base in:

http://hg.openjdk.java.net/graal/graal/file/tip/src/share/vm/compiler/compileBroker.cpp

around line 1124. (Unfortunately our old Mercurial version doesn't support line number anchors.)

An early prototype of mine used an annotation to identify methods which should be compiled by Graal. Something like this might be helpful until we have better detection machinery.

>
> - How to store the Graal generated kernels for later use?

Presumably by kernels you mean PTX or HSAIL code? That could be stored in nmethods as compiled code (although you can't execute it directly). There are two remaining problems to solve:

1) We need some kind of trampoline code that can be called from Java code and redirects to the method on the GPU. I think some kind of generated Java bytecode would be best, but I haven't thought about this enough.

2) How do we map Java methods to GPU methods? Maybe the answer is that a Method* should support multiple nmethods (right now it can only have one):

  nmethod* volatile _code;  // Points to the corresponding piece of native code

Potentially a Method* can have an unlimited (well, not really) number of different compiled codes: host code (e.g. x86, SPARC, ...), PTX, HSAIL, ARM, etc. (depends on how many cards you can put into your machine).

>
> - Is it necessary to use Graal to extract any required parameter info
> that might be needed to pass to a gpu runtime?

I'm not sure I understand this question.

>
> - How to intercept/select the Stream API calls that are good gpu kernel
> candidates more automagically than we did here? In the demo JDK, we
> redirect to our own Java API that fixes up the parameters and then calls
> native code to execute the kernel.

As I've mentioned above one possible solution for now would be to annotate Lambdas a developer thinks are worth being GPU-compiled. For a more sophisticated algorithm I think we need running code first and do experiments.

-- Chris

>
>
> Hopefully this explains what we have so far and the intent of how we want to proceed.
> Regards, > Eric > > From eric.caspole at amd.com Tue Apr 23 13:57:19 2013 From: eric.caspole at amd.com (Eric Caspole) Date: Tue, 23 Apr 2013 16:57:19 -0400 Subject: Overview of our Sumatra demo JDK In-Reply-To: References: <5176A7AC.2070909@amd.com> Message-ID: <5176F5AF.8080707@amd.com> On 04/23/2013 02:59 PM, Christian Thalinger wrote: > > On Apr 23, 2013, at 8:24 AM, Eric Caspole wrote: > >> Hello Sumatra readers, >> >> We want to explain on the public list how our internal Sumatra demo JDK works as a platform for more discussion. Hopefully later we can push this JDK to the Sumatra scratch area but for the time being we can explain it. >> >> With this JDK we can convert a variety of Stream API lambda functions to OpenCL kernels, where the stream is using parallel() and ends with forEach() which is where we have inserted our code to do this. >> >> Our current version is using a modified version of Aparapi >> >> http://code.google.com/p/aparapi/ >> >> directly integrated into a demo JDK build to process the relevant bytecode and emit the gpu kernel. >> >> We chose to operate on streams and arrays because this allowed us to work within Aparapi's constraints. >> >> As an initial vector add example >> >> Streams.intRange(0, in.length).parallel().forEach( id -> {c[id]=a[id]+b[id];}); >> >> In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the OpenCL source, we use the Java iteration variable ("id" above) as the OpenCL gid. That means each OpenCL work item is working on one value of id. >> >> Here is a more complex stream version of a mandelbrot demo app: >> >> static final int width = 768; >> static final int height = 768; >> final int[] rgb; >> final int pallette[]; >> >> void getNextImage(float x, float y, float scale) { >> >> Streams.intRange(0, width*height).parallel().forEach( p -> { >> >> /** Translate the gid into an x an y value. */ >> float lx = (((p % width * scale) - ((scale / 2) * width)) / >> width) + x; >> float ly = (((p / width * scale) - ((scale / 2) * height)) / >> height) + y; >> >> int count = 0; >> { >> float zx = lx; >> float zy = ly; >> float new_zx = 0f; >> >> // Iterate until the algorithm converges or until >> // maxIterations are reached. >> while (count < maxIterations && zx * zx + zy * zy < 8) { >> new_zx = zx * zx - zy * zy + lx; >> zy = 2 * zx * zy + ly; >> zx = new_zx; >> count++; >> } >> } >> // Pull the value out of the palette for this iteration count. >> rgb[p] = pallette[count]; >> }); >> } >> >> In the code above, width, height, rgb and palette are fields in the containing class. >> >> Again we create a kernel from the whole lambda in the forEach() block. Here we use the Java iteration variable ("p" above) as the OpenCL gid. That means each OpenCL work item is working on one value of p. >> >> Whilst we tried to minimize our changes to the JDK, we found that we had to make java.lang.invoke.InnerClassLambdaMetafactory public so we could get at the bytecode of the dynamically created Consumer object; we hold the Consumers byte streams in a hash table in InnerClassLambdaMetafactory. >> >> We also modified java.util.stream.ForEachOps to be able to immediately try to compile the target lambda for gpu and also have a related server compiler intrinsic to intercept compilation of ForEach.evaluateParallel. >> >> You can turn on the immediate redirection with a -D property. 
>> >> We have not been merging with Lambda JDK tip changes in the last 3-4 weeks but that was how the Stream API code was structured when we last merged. >> >> Either of those intercept points will call into modified Aparapi code. >> >> The kernel is created by getting the bytecode of the Consumer object from InnerClassLambdaMetafactory byte stream hash table we added. By looking at that bytecode for the accept() method we get to the target lambda. >> >> By looking at the fields of the Consumer, we build the information about the parameters for the lambda/kernel which we will pass to OpenCL. >> >> Next we produce the OpenCL source for the target lambda using the bytecode for the lambda method in the class file. >> >> Once the kernel source is ready we use JNI code to call OpenCL to compile the kernel into the executable format, and use the parameter information we collected in the above steps to pass the parameters to OpenCL. >> >> In our demo JDK, we keep a hash table of the generated kernels in our Java API that is called from the redirection points, and extract the new arguments from the Consumer object on each call. Then we call the OpenCL API to update the new parameters. >> >> >> We also have a version that can combine a flow of stream API lambdas into one OpenCL kernel such as >> >> Arrays.parallel(pArray).filter(/* predicate lambda */). >> peek(/* statement lambda and continue stream */). >> filter(/* predicate lambda */). >> forEach(/* statement lambda and terminate stream*/); >> >> so all 4 lambdas in this kind of statement can be combined into one OpenCL kernel. > > Thank you for this writeup. It helps to combine our effort to go in the same direction. > >> >> >> In a Graal version we will be working on next, there are a couple things that come to mind that should be different from what we did here. >> >> - How to add Graal into the JDK as a second compiler where the rest of >> the system is probably using server compiler as usual? > > What you want is Graal in hosted mode (hosted on C1/C2). In that mode no compilation requests are sent to Graal by default; you have to request them. > > See CompileBroker::compile_method_base in: > > http://hg.openjdk.java.net/graal/graal/file/tip/src/share/vm/compiler/compileBroker.cpp > > around line 1124. > > (Unfortunately our old Mercurial version doesn't support line number anchors.) > > An early prototype I had used an annotation to identify methods which should be compiled by Graal. Something like this might be helpful until we have better detection machinery. > Thanks for that reference. >> >> - How to store the Graal generated kernels for later use? > > Presumably by kernels you mean PTX or HSAIL code? That could will be stored in nmethods as compiled code (although you can't execute it directly). There are two remaining problems to solve: > > 1) We need some kind of trampoline code that can be called from Java code and redirects to the method in the GPU. I think some kind of generated Java bytecode would be best but I haven't thought about this enough. > > 2) How do we map Java methods to GPU methods? Maybe the answer is that a Method* should support multiple nmethods (right now it only can have one): > > nmethod* volatile _code; // Points to the corresponding piece of native code > > Potentially a Method* can have an unlimited (well, not really) number of different compiled codes: host code (e.g. x86, SPARC, ...), PTX, HSAIL, ARM, ? (depends on how many cards you can put into your machine). 
>

We were just talking here about this multiple _code idea; maybe the new one could point to the bytecode invoker you mentioned? Once the gpu ISA is emitted by Graal and sent to the gpu runtime, it gets built into an executable kernel managed by the gpu runtime, so you might not need to store the Graal-compiled ISA.

>>
>> - Is it necessary to use Graal to extract any required parameter info
>> that might be needed to pass to a gpu runtime?
>
> I'm not sure I understand this question.

With the current OpenCL, or probably any other gpu runtime that copies memory to the card, you need to send all the Java heap args with the call. In particular, if there are Java object field references used in the kernel code, they must be extracted ahead of time to pass with the call. This might not be necessary with HSA but I am not sure yet.

>
>>
>> - How to intercept/select the Stream API calls that are good gpu kernel
>> candidates more automagically than we did here? In the demo JDK, we
>> redirect to our own Java API that fixes up the parameters and then calls
>> native code to execute the kernel.
>
> As I've mentioned above one possible solution for now would be to annotate
> Lambdas a developer thinks are worth being GPU-compiled. For a more
> sophisticated algorithm I think we need running code first and do experiments.
>
> -- Chris
>
>>
>>
>> Hopefully this explains what we have so far and the intent of how we want to proceed.
>> Regards,
>> Eric
>>
>>
>
>
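A note on the per-call argument fix-up described above: with a standard OpenCL host runtime, re-binding the values extracted from the lambda's captured fields before each launch corresponds to ordinary clSetKernelArg/clEnqueueNDRangeKernel calls. The sketch below is hand-written for illustration (the function and variable names are invented, error handling is omitted); it is not code from the demo JDK:

#include <CL/cl.h>

/* Re-bind the kernel arguments and launch one work item per stream index. */
static void launch_vector_add(cl_command_queue queue, cl_kernel kernel,
                              cl_mem a, cl_mem b, cl_mem c, size_t length) {
    /* Each captured array/field extracted from the Consumer becomes one kernel argument. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);

    /* One work item per value of the Java iteration variable ("id"). */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &length, NULL, 0, NULL, NULL);
}

This is the host-side counterpart of "extract the new arguments from the Consumer object on each call. Then we call the OpenCL API to update the new parameters" in Eric's earlier description.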