A question about bytecodes + unsigned load performance ./. add performance

Tue Jan 20 00:13:27 PST 2009

On Jan 19, 2009, at 9:37 AM, Christian Thalinger wrote:

> Maybe someone can explain to me why the generated code is so  
> different?

You have observed a common traffic jam of loads at the top of an  
unrolled loop.  That happens because, when generating a CFG from the  
sea of nodes, we allow load operations to float upward in the graph  
(toward earlier points).  That means they all get scheduled at the  
top of the loop body.  This is valid, and it usually makes sense on  
machines with lots of registers, but it is kind of dumb when the  
number of loaded values is larger than the number of available temp  
registers.  But we can't measure register pressure at the point the  
CFG is built.  It's the typical NP-hard chicken-and-egg problem, of  
needing a CFG to build a register allocation, but needing a register  
allocation to make a good CFG assignment.  Years ago we decided to
let the loads float up, and assume that modern hardware would supply
the temporary registers and/or stack caches.  It would be more graceful
to throttle the load placement a little more (e.g., to try to reduce  
load-to-last-use distance).  But this seems to be a medium-large  
project, and there have always been more pressing matters to deal with.
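
To make the register-pressure point concrete, here is a toy model in
Java.  It is only a sketch under stated assumptions, not code from the
compiler: the class LoadPressure, its methods, and the encoding of a
schedule as an int array are all made up for illustration.  A schedule
is a sequence of operations, where +k means "load value k" and -k
means "last use of value k".  The peak number of values that are live
at once stands in for the number of temp registers the schedule needs.

```java
public class LoadPressure {
    // Hypothetical encoding: +k loads value k, -k is the last use of value k.

    /** All loads first, then all uses: the "traffic jam" at the loop top. */
    static int[] hoisted(int n) {
        int[] ops = new int[2 * n];
        for (int k = 1; k <= n; k++) {
            ops[k - 1] = k;        // load k
            ops[n + k - 1] = -k;   // last use of k
        }
        return ops;
    }

    /** Each load placed directly before its use (short load-to-use distance). */
    static int[] interleaved(int n) {
        int[] ops = new int[2 * n];
        for (int k = 1; k <= n; k++) {
            ops[2 * (k - 1)] = k;       // load k
            ops[2 * (k - 1) + 1] = -k;  // last use of k
        }
        return ops;
    }

    /** Peak number of simultaneously live values in the schedule. */
    static int maxLive(int[] ops) {
        int live = 0;
        int peak = 0;
        for (int op : ops) {
            if (op > 0) {
                live++;                      // a load makes a value live
                peak = Math.max(peak, live);
            } else {
                live--;                      // the last use frees it
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("hoisted peak     = " + maxLive(hoisted(8)));      // 8
        System.out.println("interleaved peak = " + maxLive(interleaved(8)));  // 1
    }
}
```

With eight loads in the unrolled body, the hoisted schedule needs eight
values live at once, while the interleaved one needs a single value.
If the machine has fewer free temp registers than the hoisted peak, the
difference has to be spilled.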

I can't say off-hand why the x+0x80 version schedules the loads  
better. The +0x80 part probably gets subsumed into the index ranges  
which are split in complicated ways (as I mentioned earlier).  I have  
no idea why the signed loads (feeding the +0x80 ops) would trickle  
downward in the graph compared with the unsigned loads bouncing up to  
the loop top.  This might be a case where PrintOptoAssembly would  
supply a missing clue; the ideal graph visualizer would be even  
better.  Maybe there are "invisible" (non-coding) nodes which  
participate in the CFG and accidentally prevent the signed loads from  
rising.  Or maybe the heuristics are just rolling opposite ways on
similar but slightly different inputs; this also is typical of
NP-hard algorithms.
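
For readers without the original benchmark at hand, the two loop
shapes under discussion might look roughly like the following.  This
is a hypothetical reconstruction, not the code from the earlier
message: the class ByteLoadShapes and both method names are
assumptions.  One loop masks each byte (an unsigned load), the other
adds 0x80 to the sign-extended byte (a signed load feeding an add).

```java
public class ByteLoadShapes {
    /** Unsigned interpretation: mask the sign-extended byte to 0..255. */
    static int sumUnsigned(byte[] b) {
        int sum = 0;
        for (int i = 0; i < b.length; i++) {
            sum += b[i] & 0xFF;
        }
        return sum;
    }

    /** Signed load plus bias: maps -128..127 onto 0..255. */
    static int sumBiased(byte[] b) {
        int sum = 0;
        for (int i = 0; i < b.length; i++) {
            sum += b[i] + 0x80;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, -1, -128, 127};
        System.out.println(sumUnsigned(data)); // 511
        System.out.println(sumBiased(data));   // 639
    }
}
```

The two loops compute different sums, but each does one byte load and
one cheap arithmetic op per element, which is why a large difference
in the scheduled code is surprising.  Comparing the PrintOptoAssembly
output for the two methods (the flag may need a debug build or
unlocked diagnostic options, depending on the JVM version) is one way
to see where the loads end up.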

-- John