- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
Uop retirement
The retirement station can handle 3 uops per clock cycle. Taken branches can only be handled by the first of the three slots in the retirement station.
If you aim at an average throughput of 3 uops per clock cycle then avoid an excessive number of jumps, calls and branches. Small critical loops should preferably have a number of uops divisible by 3 (see page 88).
Instruction decoding
If the critical part of your code doesn't fit into the trace cache, then the limiting stage may be instruction decoding. The decoder can handle one instruction per clock cycle, provided that the instruction generates no more than 4 uops and no microcode, and has no more than
one prefix (see page 80). If decoding is a bottleneck, then you may try to minimize the number of instructions rather than the number of uops.
Branch prediction
The calculations of latencies and throughputs are only valid if all branches are predicted. Branch mispredictions can seriously slow down performance when latency or throughput is the limiting factor.
Avoid poorly predictable branches in critical parts of your code, unless the alternative (e.g. conditional moves) adds so much extra dependence and latency that it outweighs the advantage of avoiding the branch. See page 45 for details.
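To illustrate the tradeoff in high-level terms, here is a minimal C++ sketch (not taken from the manual; the function names are only illustrative). The branchy version risks mispredictions when the condition is unpredictable, while the branch-free version avoids the branch at the cost of an extra dependency, which a compiler may implement with a conditional move:

int ClampBranchy(int x) {
    if (x < 0) x = 0;          // may compile to a conditional jump
    return x;
}

int ClampBranchless(int x) {
    return (x < 0) ? 0 : x;    // candidate for CMOV; no branch to mispredict
}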
16 Loop optimization (all processors)
When analyzing a program, you often find that most of the time consumption lies in the innermost loop. The way to improve the speed is to carefully optimize the most time-consuming loop using assembly language. The rest of the program may be left in high-level language.
In all the following examples it is assumed that all data are likely to be in the cache most of the time. If the speed is limited by cache misses then there is no reason to optimize the instructions. Rather, you should concentrate on organizing your data in a way that minimizes cache misses (see page 29).
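As a general illustration of that point (this sketch is not part of the original example, and the function names are hypothetical), touching memory in sequential order is usually the most cache-friendly choice. Both functions below compute the same sum over an n*n row-major matrix, but the first walks consecutive addresses while the second strides by n elements per access and is likely to miss the cache for large n:

double SumSequential(const double * m, int n) {
    double s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += m[i*n + j];           // consecutive addresses
    return s;
}

double SumStrided(const double * m, int n) {
    double s = 0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += m[i*n + j];           // stride of n elements
    return s;
}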
A loop generally contains a counter controlling how many times to iterate, and often an array access that reads or writes one array element for each iteration. As an example, I have chosen a procedure that reads integers from an array, changes the sign of each integer, and stores the results in another array. The C++ code for this procedure could look like this:
void ChangeSign (int * A, int * B, int N) {
    for (int i = 0; i < N; i++) B[i] = -A[i];
}
Translating to assembly, we might write the procedure like this:
; Example 16.1:
_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]      ; addresses of parameters on stack
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2                      ; skip if N = 0
        MOV     ESI, [A]                ; pointer to source A
        MOV     EDI, [B]                ; pointer to destination B
        CLD
L1:     LODSD                           ; read
        NEG     EAX                     ; change sign
        STOSD                           ; write
        LOOP    L1                      ; repeat
L2:     POP     EDI
        POP     ESI
        RET                             ; (no extra pop if _cdecl calling convention)
_ChangeSign ENDP
This looks like a nice solution, but it is not optimal because it uses the complex instructions JECXZ, CLD, LODSD, STOSD and LOOP, which are inefficient on most processors. It can be improved by avoiding these instructions:
; Example 16.2:
        MOV     ECX, [N]                ; ECX = counter
        MOV     ESI, [A]
        TEST    ECX, ECX
        JZ      SHORT L2                ; skip if N = 0
        MOV     EDI, [B]
L1:     MOV     EAX, [ESI]              ; read
        ADD     ESI, 4                  ; increment source pointer
        NEG     EAX                     ; change sign
        MOV     [EDI], EAX              ; write
        ADD     EDI, 4                  ; increment destination pointer
        SUB     ECX, 1                  ; decrement loop counter
        JNZ     L1                      ; loop
L2:
Here I am using SUB ECX,1 instead of DEC ECX because the latter instruction uses one extra uop on the P4. It is an advantage to have the loop control branch at the bottom of the loop, because if the branch were at the top of the loop, we would need an extra jump at the bottom of the loop.
Using the same register for counter and index reduces the number of instructions:
; Example 16.3:
        MOV     ESI, [A]
        MOV     EDI, [B]
        MOV     ECX, [N]
        SUB     EDX, EDX                ; set counter EDX = 0
        TEST    ECX, ECX
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*EDX]        ; use base pointer and index
        NEG     EAX
        MOV     [EDI+4*EDX], EAX
        ADD     EDX, 1                  ; increment loop counter
        CMP     EDX, ECX
        JB      L1
L2:
We can get rid of the CMP instruction in example 16.3 by letting the loop counter end at zero and using the zero flag to detect when the loop is finished, as we did in example 16.2. One way of doing this would be to execute the loop backwards, taking the last array elements first. However, data caches are optimized for accessing data forwards, not backwards, so if cache misses are likely then you should rather start the counter at -N and count through negative values up to zero. This is possible if you let the base registers point to the end of the arrays rather than the beginning:
; Example 16.4:
        MOV     ECX, [N]
        MOV     ESI, [A]
        MOV     EDI, [B]
        LEA     ESI, [ESI+4*ECX]        ; point to end of array A
        LEA     EDI, [EDI+4*ECX]        ; point to end of array B
        NEG     ECX                     ; -N
        JZ      SHORT L2                ; skip if N = 0
L1:     MOV     EAX, [ESI+4*ECX]
        NEG     EAX
        MOV     [EDI+4*ECX], EAX
        ADD     ECX, 1
        JNZ     L1
L2:
Now we are down to two simple instructions for loop overhead, which is as low as you can get.
On processors with out-of-order execution, it is likely that the second calculation will start before the first calculation is finished. In some situations, the processor may start several iterations ahead, until the maximum throughput of the execution units is reached.
The out-of-order capabilities cannot be utilized, however, in the case where each calculation depends on the result of the previous one. Such a continued dependence chain is the worst situation you can have, and you should definitely try to find a way to break down the dependence chain. Assume, for example, that we have to multiply a long series of integers. The C++ code looks like this:
int MultiplyList (int * List, int N) {
    int product = 1, i;
    for (i = 0; i < N; i++) product *= List[i];
    return product;
}
The best thing you can do here is to roll out the loop and use two accumulators:
; Example 16.5:
_MultiplyList PROC NEAR
        PUSH    ESI
        PUSH    EBX
List    EQU     DWORD PTR [ESP+12]      ; addresses of parameters on stack
N       EQU     DWORD PTR [ESP+16]
        MOV     ESI, [List]
        MOV     ECX, [N]
        MOV     EAX, 1                  ; accumulator one
        MOV     EBX, EAX                ; accumulator two
        SUB     EDX, EDX                ; counter starts at 0
        SUB     ECX, 1                  ; N-1
        JS      SHORT L3                ; N-1 < 0 means N = 0
        SHL     ECX, 2                  ; 4*(N-1)
L1:     IMUL    EAX, [ESI+EDX]          ; multiply first accumulator
        IMUL    EBX, [ESI+EDX+4]        ; multiply second accumulator
        ADD     EDX, 8                  ; add 2*(data size) to counter
        CMP     EDX, ECX                ; do we have at least 2 more?
        JB      L1
        JA      SHORT L2                ; finished if counter > N-1
        IMUL    EAX, [ESI+EDX]          ; N is odd, do one more
L2:     IMUL    EAX, EBX                ; combine the two accumulators
L3:     POP     EBX
        POP     ESI
        RET
_MultiplyList ENDP
Now we are doing two operations in parallel, and the long dependence chain is split into two parallel dependence chains of half the length. This will reduce the calculation time by almost 50% if the multiplication unit is pipelined.
When we roll out the loop by two, we have to check if the number of factors in List is odd. If N is odd, then we have to do the odd multiplication outside the loop. We can do the odd one either before or after the main loop. It may be more efficient to do it after the loop if List is aligned by 8.
In general, if you roll out a loop by R, i.e. if you do R calculations per loop iteration, then the number of extra calculations to do outside the loop is E = N modulo R. If you want to do the extra E calculations before the main loop, then you have to calculate E, which requires a division if R is not a power of 2. This can be avoided by doing the extra calculations after the main loop, as shown in example 16.5. Of course it is an advantage to choose R so that N is divisible by R, if possible.
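For reference, the same idea can be expressed in C++ (this is only a sketch of the technique with R = 2, mirroring example 16.5; it is not necessarily what a compiler would generate). The main loop handles two factors per iteration in two independent dependence chains, and the E = N modulo R leftover factor is multiplied in after the loop:

int MultiplyList2(const int * List, int N) {
    int product1 = 1, product2 = 1;     // two accumulators
    int i = 0;
    for (; i + 1 < N; i += 2) {         // main loop, rolled out by R = 2
        product1 *= List[i];
        product2 *= List[i+1];
    }
    if (i < N) product1 *= List[i];     // N odd: one extra multiplication
    return product1 * product2;         // combine the two accumulators
}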
A suitable roll-out factor R can be found by dividing the latency of the most critical calculation instruction by the reciprocal throughput of the same instruction. For example, if the critical instruction has a latency of 6 clock cycles and a reciprocal throughput of 2 clock cycles, then R = 6/2 = 3 would be a reasonable choice. Remember that it is not necessary to roll out the loop if there is no continued dependence chain. Excessive loop unrolling will only fill up the code cache or trace cache without any significant speed advantage. However, you may choose to roll out a loop if this improves the prediction of the loop control branch.
In example 16.5, we are using three instructions for counter and loop control. It is possible to reduce this to two instructions, as in example 16.4, but this will make the code quite awkward. In most cases, the loop control instructions can execute in parallel with the calculations so you don't have to care about minimizing the loop control overhead.
16.1 Loops in P1 and PMMX
The P1 and PMMX processors have no capabilities for out-of-order execution. Instead you have to care about pairing opportunities. The code in example 16.4 may be changed to:
; Example 16.6:
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L3
        XOR     EBX, EBX                ; start first calculation
        MOV     EAX, [ESI+4*ECX]
        INC     ECX
        JZ      SHORT L2
L1:     SUB     EBX, EAX                ; u
        MOV     EAX, [ESI+4*ECX]        ; v (pairs)
        MOV     [EDI+4*ECX-4], EBX      ; u
        INC     ECX                     ; v (pairs)
        MOV     EBX, 0                  ; u
        JNZ     L1                      ; v (pairs)
L2:     SUB     EBX, EAX                ; end last calculation
        MOV     [EDI+4*ECX-4], EBX
L3:
Here the iterations are overlapped in order to improve pairing opportunities. We begin reading the second value before we have stored the first one. The MOV EBX,0 instruction has been put in between INC ECX and JNZ L1, not to improve pairing, but to avoid the AGI stall that would result from using ECX as address index in the first instruction pair after it has been incremented.
Loops with floating-point operations are somewhat different because the floating-point instructions are overlapping rather than pairing. Consider the C++ language code: