[concurrency-interest] Numerical Stream code

Howard Lovatt howard.lovatt at gmail.com
Fri Feb 15 12:26:09 PST 2013


Hi,

Thanks for all the replies. This is largely a holding email. I am travelling with work and don't have my laptop. When I get home I will post some more code.

@Jin: I did warm up the code, but I do agree that benchmarks are tricky. As I said I was expecting some overhead but was surprised at how much. 

@Brian: The reason I factored t0 and tg0 out into methods is that they are common between the serial and parallel versions and I thought the code read better. I don't think it makes any difference, but I will check. 

@Others: To avoid writing over an old array I will have to allocate each time round the t loop. I will give this a try and see if it helps. The discussion about the parallel problems is interesting, but how come the serial version is so slow? Could a problem with the Stream code in general be the underlying problem with the parallel version?
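A minimal sketch of the "allocate each time round the t loop" idea, under stated assumptions. The class name StepSketch, the method names, and the neighbour-averaging rule are all hypothetical stand-ins, not the t0/tg0 methods from my actual code. The point is only that each step reads the old array and returns a fresh one, so no task ever writes into an array shared with another task.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class StepSketch {
    // Hypothetical update rule (an assumption, not the real t0/tg0):
    // each interior point becomes the mean of its two neighbours.
    static double update(double[] prev, int i) {
        if (i == 0 || i == prev.length - 1) {
            return prev[i]; // boundary values are held fixed
        }
        return 0.5 * (prev[i - 1] + prev[i + 1]);
    }

    // Returns a freshly allocated array; prev is only read, never written.
    static double[] step(double[] prev, boolean parallel) {
        IntStream indices = IntStream.range(0, prev.length);
        if (parallel) {
            indices = indices.parallel();
        }
        return indices.mapToDouble(i -> update(prev, i)).toArray();
    }

    public static void main(String[] args) {
        double[] u = {0.0, 0.0, 4.0, 0.0, 0.0};
        for (int t = 0; t < 3; t++) {
            u = step(u, true); // new array each time round the t loop
        }
        System.out.println(Arrays.toString(u));
    }
}
```

The serial and parallel paths share the same update method, so any timing difference between them should come from the stream machinery and the allocation rather than from the arithmetic.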

Sent from my iPad

On 15/02/2013, at 3:48 AM, Stanimir Simeonoff <stanimir at riflexo.com> wrote:

> 
>> > Do element sizes matter (byte vs. short vs. int  vs. long)? 
>> 
>> I don't think so.  All of this assumes that the proper instruction is used.  For example, if 2 threads are writing to adjacent bytes, then the "mov" instruction has to write only the byte.  If the compiler decides to read 32 bits, mask in the 8 bits and write 32 bits back, then the data will be corrupted.
> JLS mandates no corruption for neighbor writes.
>  
>> I believe that HotSpot will only generate the write byte mov instruction.
> That would be the correct one. The case affects only boolean[]/byte[]/short[]/char[], as simple primitive fields are always at least 32 bits.
> 
> Stanimir
> 
>  
>> Nathan Reynolds | Architect | 602.333.9091
>> Oracle PSR Engineering | Server Technology
>> On 2/14/2013 8:56 AM, Peter Levart wrote:
>>> On 02/14/2013 03:45 PM, Brian Goetz wrote: 
>>>>> The parallel version is almost certainly suffering false cache line 
>>>>> sharing when adjacent tasks are writing to the shared arrays u0, etc. 
>>>>> Nothing to do with streams, just a standard parallelism gotcha.
>>>> Cure: don't write to shared arrays from parallel tasks.
>>> Hi, 
>>> 
>>> I would like to discuss this a little bit (hence the cc: concurrency-interest - the conversation can continue on this list only). 
>>> 
>>> Is it really important to avoid writing to shared arrays from multiple threads (of course without synchronization, not even volatile writes/reads) when indexes are not shared (each thread writes/reads its own disjoint subset)?
>>> 
>>> Do element sizes matter (byte vs. short vs. int  vs. long)? 
>>> 
>>> I had a (false?) feeling that cache lines are not invalidated when writes are performed without fences. 
>>> 
>>> Also I don't know how short (byte, char) writes are combined into memory words on the hardware when they come from different cores and whether this is connected to any performance issues. 
>>> 
>>> Thanks, 
>>> 
>>> Peter 
>>> 
>>> _______________________________________________ 
>>> Concurrency-interest mailing list 
>>> Concurrency-interest at cs.oswego.edu 
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>> 
>> 
> 

