
Very inefficient code generation: FUNC_ALWAYS_INLINE pragma ignored and overhead from unexpected conditional code


In the artificial test case below, the compiler (cl6x 7.4.2 with options -mv6600 -os -k -o3) refuses to inline ffswap() into test_ffswap(), even with a FUNC_ALWAYS_INLINE pragma.

#pragma FUNC_ALWAYS_INLINE(ffswap);

static inline DFF ffswap(DFF dff) {
    float tmp = dff.x; dff.x = dff.y; dff.y = tmp;
    return dff;
}

DFF test_ffswap(DFF dff) { return ffswap(dff); }
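For anyone trying to reproduce this, here is a minimal self-contained sketch of the test case. The definition of DFF is not shown above, so the two-float struct here is only an assumption, and test_ffswap_expected() is a hypothetical hand-inlined version added to show what the inlined result should be equivalent to. The pragma is guarded so the file also builds with non-TI compilers.

```c
/* ASSUMPTION: DFF is a plain struct of two floats; the real definition
   is not shown in the post. */
typedef struct { float x; float y; } DFF;

/* The pragma is TI-specific, so only emit it when building with a TI compiler. */
#ifdef __TI_COMPILER_VERSION__
#pragma FUNC_ALWAYS_INLINE(ffswap);
#endif

/* Swap the two members of a DFF passed and returned by value. */
static inline DFF ffswap(DFF dff) {
    float tmp = dff.x;
    dff.x = dff.y;
    dff.y = tmp;
    return dff;
}

/* Wrapper from the test case: ideally ffswap() is inlined here. */
DFF test_ffswap(DFF dff) { return ffswap(dff); }

/* HYPOTHETICAL reference: what test_ffswap() should reduce to once
   ffswap() is inlined, i.e. just two member moves and a return. */
DFF test_ffswap_expected(DFF dff) {
    DFF r;
    r.x = dff.y;
    r.y = dff.x;
    return r;
}
```

Comparing the generated assembly (-k) for test_ffswap() and test_ffswap_expected() should make the extra call and the extra loads and stores easy to spot.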

 

The result is that test_ffswap() takes 30 CPU cycles (19 within test_ffswap() and then another 21 within ffswap()), which is 5 times longer than it should take. That is, with inlining and proper optimization, test_ffswap() should reduce to 3 moves (to do the swap), a return instruction and perhaps a NOP, for a total of 6 cycles.

(Please visit the site to view this file)

There are two sources of inefficiency. First, the function is not being inlined. Second, ffswap() contains compiler-generated conditional code that performs additional loads and stores, which significantly increases the execution time of the function.

1) Why isn't the function ffswap() being inlined?  How do I get the function to be inlined?

2) What is the conditional code in ffswap() doing?  It's obviously coming from the backend of the compiler because it does not show up in the optimizer comments.  Is this alignment related, something else?  How can I avoid this code?

I noticed that these two problems show up together.  When I come across a function that cannot be inlined it usually contains this conditional code, so I assume that the problems are related.

I can force inlining by passing the structure in pieces (see below), but this is very ugly and then the unexpected conditional code is moved to test_ffswap().  The resulting function test_ffswap() is now twice as fast as the original code (15 cycles compared to 30 cycles) but it still takes 2.5 times longer than it should.  If I write the ffswap() function in assembly, then it cannot be inlined, so this does not solve the problem.
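A rough sketch of what "passing the structure in pieces" can look like, in case the attachment is not visible. This is not the attached file; the DFF layout and the names ffswap_pieces() and test_ffswap_pieces() are assumptions made up for illustration.

```c
/* ASSUMPTION: DFF is a plain struct of two floats; the real definition
   is not shown in the post. */
typedef struct { float x; float y; } DFF;

/* HYPOTHETICAL helper: takes the members as scalars instead of the
   struct by value, and builds the swapped struct for the result. */
static inline DFF ffswap_pieces(float x, float y) {
    DFF r;
    r.x = y;   /* swapped: new x is the old y */
    r.y = x;   /* swapped: new y is the old x */
    return r;
}

/* The caller unpacks the struct itself, which is the ugly part. */
DFF test_ffswap_pieces(DFF dff) {
    return ffswap_pieces(dff.x, dff.y);
}
```

The helper only sees scalar arguments, which is what lets it be inlined in my case, but every call site has to unpack the structure by hand.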

(Please visit the site to view this file)

For this case, I could avoid the problem by converting back and forth to the x128_t type for each call, but I need a solution that will work for structures that are not exactly 128 bits as well.

