A question about bytecodes + unsigned load performance ./. add performace

Tue Jan 20 04:30:22 PST 2009

Hi,

I'm much impressed, how sophisticated HotSpot optimizer is, and that 
there are some guys looking deeper to find the source of the unsigned 
load deceleration.
I'll keep my fingers crossed for you finding out the mystery, so we can 
get rid of the +128 trick.

What does "NP-hard" mean?

-Ulf

Am 20.01.2009 09:13, John Rose schrieb:
> On Jan 19, 2009, at 9:37 AM, Christian Thalinger wrote:
>
>> Maybe someone can explain to me why the generated code is so different?
>
> You have observed a common traffic jam of loads at the top of an 
> unrolled loop.  That happens because, when generating a CFG from the 
> sea of nodes, we allow load operations to float upward in the graph 
> (toward earlier points).  That means they all get scheduled at the top 
> of the loop body.  This is valid, and it usually makes sense on 
> machines with lots of registers, but it is kind of dumb when the 
> number of loaded values is larger than the number of available temp 
> registers.  But we can't measure register pressure at the point the 
> CFG is built.  It's the typical NP-hard chicken-and-egg problem, of 
> needing a CFG to build a register allocation, but needing a register 
> allocation to make a good CFG assignment.  Years ago we decided to led 
> the loads float up, and assume that modern hardware would supply the 
> temp. registers and/or stack caches.  It would be more graceful to 
> throttle the load placement a little more (e.g., to try to reduce 
> load-to-last-use distance).  But this seems to be a medium-large 
> project, and there have always been more pressing matters to deal with.
>
> I can't say off-hand why the x+0x80 version schedules the loads 
> better. The +0x80 part probably gets subsumed into the index ranges 
> which are split in complicated ways (as I mentioned earlier).  I have 
> no idea why the signed loads (feeding the +0x80 ops) would trickle 
> downward in the graph compared with the unsigned loads bouncing up to 
> the loop top.  This might be a case where PrintOptoAssembly would 
> supply a missing clue; the ideal graph visualizer would be even 
> better.  Maybe there are "invisible" (non-coding) nodes which 
> participate in the CFG and accidentally prevent the signed loads from 
> rising.  Or maybe the heuristics are just rolling opposite ways on 
> similar but slightly different inputs; this also is typical of NP-hard 
> algorithms.
>
> -- John
>
>