From eric.caspole at amd.com Wed Sep 3 01:11:26 2014
From: eric.caspole at amd.com (eric.caspole at amd.com)
Date: Wed, 03 Sep 2014 01:11:26 +0000
Subject: hg: sumatra/sumatra-dev/jdk: Roll back weak ref based unloading scheme, fix missing spliterator
Message-ID: <201409030111.s831BQHl010660@aojmv0008>

Changeset: 96be27e9e136
Author:    ecaspole
Date:      2014-09-02 21:10 -0400
URL:       http://hg.openjdk.java.net/sumatra/sumatra-dev/jdk/rev/96be27e9e136

Roll back weak ref based unloading scheme, fix missing spliterator

! src/share/classes/java/util/stream/PipelineInfo.java


From tom.deneau at amd.com Tue Sep 9 22:02:44 2014
From: tom.deneau at amd.com (Deneau, Tom)
Date: Tue, 9 Sep 2014 22:02:44 +0000
Subject: determining when to offload to the gpu
Message-ID:

The following is an issue we need to resolve in Sumatra.  We intend to
file this in the openjdk bugs system once we get the Sumatra project
set up as a project there.  Meanwhile, comments are welcome.

In the current prototype, a config flag enables offload, and if a
Stream API parallel().forEach call is encountered which meets the
other criteria for being offloaded, then on its first invocation it is
compiled for the HSA target and executed.  The compilation happens
once; the compiled kernel is saved and can be reused on subsequent
invocations of the same lambda.  (Note: if for any reason the lambda
cannot be compiled for an HSA target, offload is disabled for this
lambda and the usual CPU parallel path is used.)  The logic for
deciding whether to offload or not is all in the special
Sumatra-modified JDK classes in java/util/stream.

The above logic could be improved:

  a) instead of being offloaded on the first invocation, the lambda
     should first be executed thru the interpreter so that profiling
     information is gathered, which could then be useful in the
     eventual HSAIL compilation step.
  b) instead of being offloaded unconditionally, it would be good if
     the lambda were offloaded only if the offload is determined to be
     profitable when compared to running parallel on the CPU.  We
     assume that in general it is not possible to predict the
     profitability of GPU offload statically and that measurement will
     be necessary.

So how to meet the above needs?  Our current thought is that, at the
JDK level where we decide whether to offload, a particular parallel
lambda invocation would go thru a number of stages:

  * Interpreted (to gather profiling information)
  * Compiled and executed on Parallel CPU and timed
  * Compiled and executed on Parallel GPU and timed

And then at that point make some decision about which way is faster
and use that going forward.

Do people think making these measurements back at the JDK API level is
the right place?  (It seems to fit there since that is where we decide
whether or not to offload.)

Some concerns
-------------
This comparison works well if the work per stream call is similar for
all invocations.  However, even the range may not be the same from
invocation to invocation.  We should try to compare parCPU and parGPU
runs with the same range.  If we can't find runs with the same range,
we could derive a time-per-workitem measurement and compare those.
However, time per workitem for a small range may be quite different
from time per workitem for a large range, so the two would be
difficult to compare.  Even then the work per run may be different
(it might take different paths thru the lambda).

How do we detect that we are in the "Compiled" stage for the Parallel
CPU runs?  I guess knowing the range of each forEach call we should be
able to estimate this, or just see a reduction in the runtime.
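A rough sketch of the staged progression described above, with every
name invented for illustration (none of this is the actual prototype
code; the "interpreted" stage is simulated by plain execution, and the
timed stages just call whatever kernel the caller supplies):

```java
import java.util.function.IntConsumer;

public class OffloadDecider {
    public enum Stage { INTERPRETED, TIME_CPU, TIME_GPU, DECIDED }

    private Stage stage = Stage.INTERPRETED;
    private int interpretedRuns = 0;
    private double cpuNsPerItem, gpuNsPerItem;
    private boolean useGpu;

    public Stage currentStage() { return stage; }

    // Called once per parallel().forEach invocation of the same lambda.
    public void run(int range, IntConsumer lambda,
                    Runnable cpuKernel, Runnable gpuKernel) {
        switch (stage) {
            case INTERPRETED:
                // Stage 1: plain execution, standing in for interpreted
                // runs that let the JVM gather profiling information.
                for (int i = 0; i < range; i++) lambda.accept(i);
                if (++interpretedRuns >= 3) stage = Stage.TIME_CPU;
                break;
            case TIME_CPU:
                // Stage 2: time the parallel-CPU version.
                cpuNsPerItem = timePerWorkitem(cpuKernel, range);
                stage = Stage.TIME_GPU;
                break;
            case TIME_GPU:
                // Stage 3: time the GPU version, then decide.
                gpuNsPerItem = timePerWorkitem(gpuKernel, range);
                useGpu = gpuNsPerItem < cpuNsPerItem;
                stage = Stage.DECIDED;
                break;
            case DECIDED:
                (useGpu ? gpuKernel : cpuKernel).run();
                break;
        }
    }

    // Normalizing to time per workitem lets runs with different ranges
    // be compared, subject to the small-vs-large-range caveat above.
    static double timePerWorkitem(Runnable kernel, int range) {
        long start = System.nanoTime();
        kernel.run();
        return (System.nanoTime() - start) / (double) range;
    }
}
```

The per-workitem normalization is only a fallback for when no two runs
share the same range; it inherits the concern that small-range and
large-range timings may not be comparable.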
-- Tom Deneau

From Ryan.LaMothe at pnnl.gov Wed Sep 10 00:02:52 2014
From: Ryan.LaMothe at pnnl.gov (LaMothe, Ryan R)
Date: Wed, 10 Sep 2014 00:02:52 +0000
Subject: determining when to offload to the gpu
In-Reply-To:
References:
Message-ID:

Hi Tom,

I thought this may be a good point to jump in and make a quick comment
on some thoughts.

A question: at what level is it better to encapsulate this in the JVM,
and at what level is this better left to the user/utility functions?

For example, in the Aparapi project there is an example project named
correlation-matrix that gives a pretty good idea of what it takes to
realistically decide in code whether to run a specific matrix
computation on the CPU or GPU and how to split up the work.  This is a
very basic example and is only a sample of the real code base from
which it was derived, but it should help highlight the issue.

Instead of the JVM trying to figure out how to decompose the lambda
functions optimally and offload to HSA automatically for all possible
cases, might it be better to take the following approach:

- Implement the base functionality in the JVM for HSA offload and then
  search the entire JDK for places where offloading may be obvious or
  easily achieved (i.e. Matrix Math, etc.)?  Maybe this even means
  implementing new base classes for specific packages that are
  HSA-enabled.

- For non-obvious cases, allow the developer to somehow indicate in
  the lambda that they want the execution to occur via HSA/offload, if
  possible, and provide some form of annotations or other
  functionality to give the JVM hints about how they would like it
  done?

Maybe that seems like a step backwards, but I thought it was worth
mentioning.

-Ryan

On 9/9/14, 3:02 PM, "Deneau, Tom" wrote:

>The following is an issue we need to resolve in Sumatra.  We
>intend to file this in the openjdk bugs system once we get the Sumatra
>project set up as a project there.  Meanwhile, comments are welcome.
> > >In the current prototype, a config flag enables offload and if a >Stream API parallel().forEach call is encountered which meets the >other criteria for being offloaded, then on its first invocation it is >compiled for the HSA target and executed. The compilation happens >once, the compiled kernel is saved and can be reused on subsequent >invocations of the same lambda. (Note: if for any reason the lambda >cannot be compiled for an HSA target, offload is disabled for this >lambda and the usual CPU parallel path is used). The logic for >deciding whether to offload or not is all in the special >Sumatra-modified JDK classes in java/util/stream. > >The above logic could be improved: > > a) instead of being offloaded on the first invocation, the lambda > should first be executed thru the interpreter so that profiling > information is gathered which could then be useful in the > eventual HSAIL compilation step. > > b) instead of being offloaded unconditionally, it would be good if > the lambda would be offloaded only if the offload is determined > profitable when compared to running parallel on the CPU. We > assume that in general it is not possible to predict the > profitability of GPU offload statically and that measurement > will be necessary. > >So how to meet the above needs? Our current thoughts are that at the >JDK level where we decide to offload a particular parallel lambda >invocation would go thru a number of stages: > > * Interpreted (to gather profiling information) > * Compiled and executed on Parallel CPU and timed > * Compiled and executed on Parallel GPU and timed > >And then at that point make some decision about which way is faster >and use that going forward. > >Do people think making these measurements back at the JDK API level is >the right place? (It seems to fit there since that is where we decide >whether or not to offload) > >Some concerns >------------- >This comparison works well if the work per stream call is similar for >all invocations. 
However, even the range may not be the same from >invocation to invocation. We should try to compare parCPU and parGPU >runs with the same range. If we can't find runs with the same range, >we could derive a time per workitem measurement and compare those. >However, time per workitem for a small range may be quite different >for time per workitem for a large range so would be difficult to >compare. Even then the work per run may be different (might take >different paths thru the lambda). > >How to detect that we are in the "Compiled" stage for the Parallel CPU >runs? I guess knowing the range of each forEach call we should be >able to estimate this, or just see a reduction in the runtime. > >-- Tom Deneau > > From christian.thalinger at oracle.com Wed Sep 10 19:46:25 2014 From: christian.thalinger at oracle.com (Christian Thalinger) Date: Wed, 10 Sep 2014 12:46:25 -0700 Subject: determining when to offload to the gpu In-Reply-To: References: Message-ID: <601A2476-F617-4F74-BB23-8DF5695BB54D@oracle.com> On Sep 9, 2014, at 3:02 PM, Deneau, Tom wrote: > The following is an issue we need to resolve in Sumatra. We > intend to file this in the openjdk bugs system once we get the Sumatra > project set up as a project there. Meanwhile, comments are welcome. > > > In the current prototype, a config flag enables offload and if a > Stream API parallel().forEach call is encountered which meets the > other criteria for being offloaded, then on its first invocation it is > compiled for the HSA target and executed. The compilation happens > once, the compiled kernel is saved and can be reused on subsequent > invocations of the same lambda. (Note: if for any reason the lambda > cannot be compiled for an HSA target, offload is disabled for this > lambda and the usual CPU parallel path is used). The logic for > deciding whether to offload or not is all in the special > Sumatra-modified JDK classes in java/util/stream. 
>
> The above logic could be improved:
>
>  a) instead of being offloaded on the first invocation, the lambda
>     should first be executed thru the interpreter so that profiling
>     information is gathered which could then be useful in the
>     eventual HSAIL compilation step.
>
>  b) instead of being offloaded unconditionally, it would be good if
>     the lambda would be offloaded only if the offload is determined
>     profitable when compared to running parallel on the CPU.  We
>     assume that in general it is not possible to predict the
>     profitability of GPU offload statically and that measurement
>     will be necessary.
>
> So how to meet the above needs?  Our current thoughts are that at the
> JDK level where we decide to offload a particular parallel lambda
> invocation would go thru a number of stages:
>
>  * Interpreted (to gather profiling information)
>  * Compiled and executed on Parallel CPU and timed
>  * Compiled and executed on Parallel GPU and timed
>
> And then at that point make some decision about which way is faster
> and use that going forward.
>
> Do people think making these measurements back at the JDK API level is
> the right place?  (It seems to fit there since that is where we decide
> whether or not to offload)

The API level doesn't seem to be the right place; the JIT compiler
should make these decisions.  Eventually we want unmodified Java code
to be offloaded, too.

> Some concerns
> -------------
> This comparison works well if the work per stream call is similar for
> all invocations.  However, even the range may not be the same from
> invocation to invocation.  We should try to compare parCPU and parGPU
> runs with the same range.  If we can't find runs with the same range,
> we could derive a time per workitem measurement and compare those.
> However, time per workitem for a small range may be quite different
> for time per workitem for a large range so would be difficult to
> compare.
> Even then the work per run may be different (might take
> different paths thru the lambda).
>
> How to detect that we are in the "Compiled" stage for the Parallel CPU
> runs?  I guess knowing the range of each forEach call we should be
> able to estimate this, or just see a reduction in the runtime.
>
> -- Tom Deneau

From tom.deneau at amd.com Thu Sep 18 15:04:55 2014
From: tom.deneau at amd.com (Deneau, Tom)
Date: Thu, 18 Sep 2014 15:04:55 +0000
Subject: determining when to offload to the gpu
In-Reply-To:
References:
Message-ID:

Ryan --

So I believe you are saying:

 a) Given a lambda marked parallel to execute across a range, the
    decision of where to run it does not have to be an all-CPU or
    all-GPU decision.  It may be possible to subdivide the problem
    and run part of it on the GPU and part on the CPU.  This
    subdividing could be part of the framework.

 b) There should be an API that allows the expert user to break up
    the problem and control which parts run on the CPU and GPU.

I think solving part a) in the JVM or JDK is an interesting (and
difficult) problem for the future but may be beyond the scope of the
current Sumatra.  I will definitely open an issue on this once we get
the Sumatra project in place on bugs.openjdk.java.net.

Meanwhile, for now I think we will limit the automatic decision of
where to run to all-GPU or all-CPU.  I think there is a middle ground
of problems that may or may not gain thru offloading (for example,
depending on GPU or CPU hardware capabilities) and where the
programmer wants to leave that decision up to the framework.

I will also enter an issue for part b).  I agree this is something
that an expert user might want.
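A rough sketch of what the part b) expert API might look like.
SplitOffload, Device, and forEachSplit are all invented names, not
anything in Sumatra, and the "GPU" path here is simulated with an
ordinary parallel stream:

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class SplitOffload {
    public enum Device { CPU, GPU }

    // Run indices [0, range): the first gpuShare fraction is pinned to
    // the "GPU", the remainder to the CPU.  The expert caller, not the
    // framework, chooses the split.
    public static void forEachSplit(int range, double gpuShare,
                                    IntConsumer body) {
        int split = (int) (range * gpuShare);
        dispatch(Device.GPU, 0, split, body);
        dispatch(Device.CPU, split, range, body);
    }

    static void dispatch(Device device, int from, int to,
                         IntConsumer body) {
        // A real implementation would hand the GPU partition to an
        // HSAIL kernel; in this sketch both devices fall back to an
        // ordinary parallel CPU stream.
        IntStream.range(from, to).parallel().forEach(body);
    }
}
```

A gpuShare of 0.0 or 1.0 degenerates to the all-CPU or all-GPU case,
so the automatic decision above could be expressed as a special case
of such an API.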
-- Tom ------------------------------------------------- -----Original Message----- From: LaMothe, Ryan R [mailto:Ryan.LaMothe at pnnl.gov] Sent: Tuesday, September 09, 2014 7:03 PM To: Deneau, Tom; sumatra-dev at openjdk.java.net; graal-dev at openjdk.java.net Subject: Re: determining when to offload to the gpu Hi Tom, I thought this may be a good point to jump in and make a quick comment on some thoughts. A question: At what level is it better to encapsulate this in the JVM and at what level is this better left to the user/utility functions? For example, in the Aparapi project there is an example project named correlation-matrix that gives a pretty good idea about what it takes to realistically decide in code whether to run a specific matrix computation on CPU or GPU and how to split up the work. This is a very basic example and is only a sample of the real code base from which it was derived, but should help highlight the issue. Instead of the JVM trying to figure out how to decompose the lambda functions optimally and offload to HSA automatically for all possible cases, might it be better to take the following approach: - Implement the base functionality in the JVM for HSA offload and then search the entire JDK for places where offloading may be obvious or easily achieved (i.e. Matrix Math, etc.)? Maybe this even means implementing new base classes for specific packages that are HSA-enabled. - For non-obvious cases, allow the developer to somehow indicate in the lambda that they want the execution to occur via HSA/offload, if possible, and provide some form of annotations or other functionality to give the JVM hints about how they would like it done? Maybe that seems like steps backwards, but thought it was worth mentioning. -Ryan On 9/9/14, 3:02 PM, "Deneau, Tom" wrote: >The following is an issue we need to resolve in Sumatra. We intend to >file this in the openjdk bugs system once we get the Sumatra project >set up as a project there. 
Meanwhile, comments are welcome. > > >In the current prototype, a config flag enables offload and if a Stream >API parallel().forEach call is encountered which meets the other >criteria for being offloaded, then on its first invocation it is >compiled for the HSA target and executed. The compilation happens >once, the compiled kernel is saved and can be reused on subsequent >invocations of the same lambda. (Note: if for any reason the lambda >cannot be compiled for an HSA target, offload is disabled for this >lambda and the usual CPU parallel path is used). The logic for >deciding whether to offload or not is all in the special >Sumatra-modified JDK classes in java/util/stream. > >The above logic could be improved: > > a) instead of being offloaded on the first invocation, the lambda > should first be executed thru the interpreter so that profiling > information is gathered which could then be useful in the > eventual HSAIL compilation step. > > b) instead of being offloaded unconditionally, it would be good if > the lambda would be offloaded only if the offload is determined > profitable when compared to running parallel on the CPU. We > assume that in general it is not possible to predict the > profitability of GPU offload statically and that measurement > will be necessary. > >So how to meet the above needs? Our current thoughts are that at the >JDK level where we decide to offload a particular parallel lambda >invocation would go thru a number of stages: > > * Interpreted (to gather profiling information) > * Compiled and executed on Parallel CPU and timed > * Compiled and executed on Parallel GPU and timed > >And then at that point make some decision about which way is faster and >use that going forward. > >Do people think making these measurements back at the JDK API level is >the right place? 
(It seems to fit there since that is where we decide
>whether or not to offload)
>
>Some concerns
>-------------
>This comparison works well if the work per stream call is similar for
>all invocations.  However, even the range may not be the same from
>invocation to invocation.  We should try to compare parCPU and parGPU
>runs with the same range.  If we can't find runs with the same range,
>we could derive a time per workitem measurement and compare those.
>However, time per workitem for a small range may be quite different for
>time per workitem for a large range so would be difficult to compare.
>Even then the work per run may be different (might take different paths
>thru the lambda).
>
>How to detect that we are in the "Compiled" stage for the Parallel CPU
>runs?  I guess knowing the range of each forEach call we should be able
>to estimate this, or just see a reduction in the runtime.
>
>-- Tom Deneau

From jani.vainola at gmail.com Fri Sep 19 08:22:21 2014
From: jani.vainola at gmail.com (Jani Väinölä)
Date: Fri, 19 Sep 2014 10:22:21 +0200
Subject: determining when to offload to the gpu
In-Reply-To:
References:
Message-ID:

Hi

(I am new here, so please bear with me.  I just couldn't stay silent.)

About part b): as a Java developer, I would love it if there were some
kind of expert API that gives me control of this feature.  I would
like to be able to control where my code is executed (if that is
possible).

For instance, say I were writing a custom collection class: I would
want to write a member function that does a few operations on the data
in a data-parallel manner and therefore always (if possible) runs
those parts of it on the GPU.  That is, I want to make the decisions
inside my collection but hide them from the users.  In my opinion,
they should only call my collection and not have any clue about where
or how the execution is done.
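A rough sketch of the kind of collection I mean (MyVector is made up,
and the GPU pinning is exactly the part such an expert API would have
to supply; lacking one, this is just a parallel CPU loop):

```java
import java.util.Arrays;

public class MyVector {
    private final double[] data;

    public MyVector(double[] data) { this.data = data.clone(); }

    // Users just call scale(); where it runs is my decision, not theirs.
    public MyVector scale(double factor) {
        double[] out = new double[data.length];
        // With an expert API this data-parallel step would be pinned
        // to the GPU when possible; here it runs on the CPU.
        Arrays.parallelSetAll(out, i -> data[i] * factor);
        return new MyVector(out);
    }

    public double get(int i) { return data[i]; }
}
```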
I guess I could slice up the code into a few private functions that use the parallel functionality on the data and use those in the public function but it would be very neat to have a good API for this instead. Regards Jani 2014-09-18 17:04 GMT+02:00 Deneau, Tom : > Ryan -- > > So I believe you are saying: > > a) Given a lambda marked parallel to execute across a range, the > decision of where to run it does not have to be an all-CPU or > all-GPU decision. It may be possible to subdivide the problem > and run part of it on the GPU and part on the CPU. This > subdividing could be part of the framework. > > b) There should be an API that allows the expert user to break up > the problem and control which parts run on the CPU and GPU. > > > I think solving part a) in the JVM or JDK is an interesting (and > difficult) problem for the future but may be beyond the scope of the > current Sumatra. I will definitely open an issue on this once we get > the Sumatra project in place on bugs.openjdk.java.net. > > Meanwhile, for now I think we will limit the automatic decision of > where to run to all-GPU or all-CPU. I think there is a middle ground > of problems that either may or may not gain thru offloading (for > example depending on GPU or CPU hardware capabilities) and where the > programmer wants to leave that decision up to the framework. > > I will also enter an issue for Part b). I agree this is something > that an expert user might want. > > -- Tom > > ------------------------------------------------- > -----Original Message----- > From: LaMothe, Ryan R [mailto:Ryan.LaMothe at pnnl.gov] > Sent: Tuesday, September 09, 2014 7:03 PM > To: Deneau, Tom; sumatra-dev at openjdk.java.net; graal-dev at openjdk.java.net > Subject: Re: determining when to offload to the gpu > > Hi Tom, > > I thought this may be a good point to jump in and make a quick comment on > some thoughts. 
> > A question: At what level is it better to encapsulate this in the JVM and > at what level is this better left to the user/utility functions? > > > For example, in the Aparapi project there is an example project named > correlation-matrix that gives a pretty good idea about what it takes to > realistically decide in code whether to run a specific matrix computation > on CPU or GPU and how to split up the work. This is a very basic example > and is only a sample of the real code base from which it was derived, but > should help highlight the issue. > > Instead of the JVM trying to figure out how to decompose the lambda > functions optimally and offload to HSA automatically for all possible > cases, might it be better to take the following approach: > > - Implement the base functionality in the JVM for HSA offload and then > search the entire JDK for places where offloading may be obvious or easily > achieved (i.e. Matrix Math, etc.)? Maybe this even means implementing new > base classes for specific packages that are HSA-enabled. > > - For non-obvious cases, allow the developer to somehow indicate in the > lambda that they want the execution to occur via HSA/offload, if possible, > and provide some form of annotations or other functionality to give the JVM > hints about how they would like it done? > > > Maybe that seems like steps backwards, but thought it was worth mentioning. > > > -Ryan > > > On 9/9/14, 3:02 PM, "Deneau, Tom" wrote: > > >The following is an issue we need to resolve in Sumatra. We intend to > >file this in the openjdk bugs system once we get the Sumatra project > >set up as a project there. Meanwhile, comments are welcome. > > > > > >In the current prototype, a config flag enables offload and if a Stream > >API parallel().forEach call is encountered which meets the other > >criteria for being offloaded, then on its first invocation it is > >compiled for the HSA target and executed. 
The compilation happens > >once, the compiled kernel is saved and can be reused on subsequent > >invocations of the same lambda. (Note: if for any reason the lambda > >cannot be compiled for an HSA target, offload is disabled for this > >lambda and the usual CPU parallel path is used). The logic for > >deciding whether to offload or not is all in the special > >Sumatra-modified JDK classes in java/util/stream. > > > >The above logic could be improved: > > > > a) instead of being offloaded on the first invocation, the lambda > > should first be executed thru the interpreter so that profiling > > information is gathered which could then be useful in the > > eventual HSAIL compilation step. > > > > b) instead of being offloaded unconditionally, it would be good if > > the lambda would be offloaded only if the offload is determined > > profitable when compared to running parallel on the CPU. We > > assume that in general it is not possible to predict the > > profitability of GPU offload statically and that measurement > > will be necessary. > > > >So how to meet the above needs? Our current thoughts are that at the > >JDK level where we decide to offload a particular parallel lambda > >invocation would go thru a number of stages: > > > > * Interpreted (to gather profiling information) > > * Compiled and executed on Parallel CPU and timed > > * Compiled and executed on Parallel GPU and timed > > > >And then at that point make some decision about which way is faster and > >use that going forward. > > > >Do people think making these measurements back at the JDK API level is > >the right place? (It seems to fit there since that is where we decide > >whether or not to offload) > > > >Some concerns > >------------- > >This comparison works well if the work per stream call is similar for > >all invocations. However, even the range may not be the same from > >invocation to invocation. We should try to compare parCPU and parGPU > >runs with the same range. 
If we can't find runs with the same range, > >we could derive a time per workitem measurement and compare those. > >However, time per workitem for a small range may be quite different for > >time per workitem for a large range so would be difficult to compare. > >Even then the work per run may be different (might take different paths > >thru the lambda). > > > >How to detect that we are in the "Compiled" stage for the Parallel CPU > >runs? I guess knowing the range of each forEach call we should be able > >to estimate this, or just see a reduction in the runtime. > > > >-- Tom Deneau > > > > > >