- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •Integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
; (Only for processors with SSE2)
INTPOWER MACRO X, Y, N
         LOCAL I, YUSED         ; define local identifiers
I = N                           ; I used for shifting N
YUSED = 0                       ; remember if Y contains valid data
         REPT 32                ; maximum repeat count is 32
           IF I AND 1           ; test bit 0
             IF YUSED           ; if Y already contains data
               MULPD Y, X       ; multiply Y with a power of X
             ELSE               ; if this is the first time Y is used:
               MOVAPD Y, X      ; copy data to Y
               YUSED = 1        ; remember that Y now contains data
             ENDIF              ; end of IF YUSED
           ENDIF                ; end of IF I AND 1
           I = I SHR 1          ; shift right I one place
           IF I EQ 0            ; stop when I = 0
             EXITM              ; exit REPT 32 loop prematurely
           ENDIF                ; end of IF I EQ 0
           MULPD X, X           ; square X
         ENDM                   ; end of REPT 32 loop
ENDM                            ; end of INTPOWER macro definition
This macro generates the minimum number of instructions needed to do the job. There is no loop overhead, prolog or epilog in the final code. And, most importantly, no branches. All branches have been resolved by the macro preprocessor. To calculate XMM0 to the power of 12, you write:
INTPOWER XMM0, XMM1, 12
This will be resolved to:
MULPD  XMM0, XMM0    ; x^2
MULPD  XMM0, XMM0    ; x^4
MOVAPD XMM1, XMM0    ; save x^4
MULPD  XMM0, XMM0    ; x^8
MULPD  XMM1, XMM0    ; x^4 * x^8 = x^12
This has even fewer instructions than the optimized loop (example 16.13). The expanded macro takes 25 clock cycles in this example.
17 Single-Instruction-Multiple-Data programming
Since there are technological limits to the maximum clock frequency of microprocessors, the trend is towards increasing processor throughput by handling multiple data elements in parallel.
When optimizing code, it is important to consider if there are data that can be handled in parallel. The principle of Single-Instruction-Multiple-Data (SIMD) programming is that a vector or set of data are packed together in one large register and handled together in one operation.
Multiple data can be packed into 64-bit or 128-bit registers in the following ways:
data type      | data per pack | register size | instruction set | microprocessor
---------------|---------------|---------------|-----------------|----------------
8-bit integer  | 8             | 64 bit (MMX)  | MMX             | PMMX and later
16-bit integer | 4             | 64 bit (MMX)  | MMX             | PMMX and later
32-bit integer | 2             | 64 bit (MMX)  | MMX             | PMMX and later
64-bit integer | 1             | 64 bit (MMX)  | SSE2            | P4 and later
32-bit float   | 2             | 64 bit (MMX)  | 3DNow           | AMD only
8-bit integer  | 16            | 128 bit (XMM) | SSE2            | P4 and later
16-bit integer | 8             | 128 bit (XMM) | SSE2            | P4 and later
32-bit integer | 4             | 128 bit (XMM) | SSE2            | P4 and later
64-bit integer | 2             | 128 bit (XMM) | SSE2            | P4 and later
32-bit float   | 4             | 128 bit (XMM) | SSE             | P3 and later
64-bit float   | 2             | 128 bit (XMM) | SSE2            | P4 and later
All these packing modes are available on the latest microprocessors from Intel and AMD, except for the 3DNow mode, which is available only on AMD processors. Whether the different instruction sets are supported on a particular microprocessor can be determined with the CPUID instruction, as explained on page 26. The 64-bit MMX registers cannot be used together with the floating-point registers. The 128-bit XMM registers can only be used if supported by the operating system. See page 27 for how to check if the use of XMM registers is enabled.
You may make two or more versions of the critical part of your code: one that will run on old microprocessors, and one that uses the most advantageous packing mode and instruction set. The program should automatically select the version of the code that is appropriate for the system on which it is running.
Choose the smallest data size that fits your purpose in order to pack as many data as possible into one register. Mathematical computations may require double precision (64-bit) floats in order to avoid loss of precision in the intermediate calculations, even if single precision is sufficient for the final result.
Before you choose to use SIMD instructions for integer operations, you have to consider whether the resulting code will be faster than simple integer instructions in 32-bit registers. Simple operations such as integer additions take four times as long in SIMD registers as in 32-bit registers on the P4. The SIMD instructions are therefore only advantageous for integer additions if they can handle at least four data elements in parallel. Loading and storing memory operands take no longer for 64-bit and 128-bit registers than for 32-bit registers. Integer shifts and multiplications are faster in 64-bit and 128-bit registers than in 32-bit registers on the P4. With SIMD code, you may spend more instructions on trivial things, such as moving data into the right positions in the registers and emulating conditional moves, than on the actual calculations. Example 17.1 below illustrates this.
For floating-point calculations on the P4, it is often advantageous to use XMM registers, even if there are no opportunities for handling data in parallel. The latency of floating-point operations is shorter in XMM registers than in floating-point registers, and you can make conversions between integers and floating-point numbers without using a memory intermediate. Furthermore, you get rid of the annoying floating-point register stack.
Memory operands for SIMD instructions have to be properly aligned. See page 28 for how to align data in memory. The alignment requirement makes it complicated to pass 64-bit and 128-bit function parameters on the stack (see Intel's optimization reference manual). It is therefore recommended to pass 64-bit and 128-bit parameters in registers or through a pointer, rather than on the stack.
Conditional moves in SIMD registers
Let's repeat example 16.13 (page 107) with two double-precision floats in XMM registers. This enables us to compute x0^n0 and x1^n1 in parallel:
; Example 17.1 (P4)
DATA    SEGMENT PARA PUBLIC 'DATA'
ONE     DQ      1.0, 1.0
X       DQ      ?, ?            ; X0, X1
N       DD      ?, ?            ; N0, N1
DATA    ENDS

CODE    SEGMENT BYTE PUBLIC 'CODE'
; register use:
; XMM0 = xp
; XMM1 = power
; XMM2 = i (i0 and i1 each stored twice as DWORD integers)
; XMM3 = 1.0 if not(i & 1)
; XMM4 = xp if (i & 1)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        MOVAPD    XMM4, XMM0    ; copy of xp
        ANDPD     XMM4, XMM3    ; xp if bit = 1
        ANDNPD    XMM3, [ONE]   ; 1.0 if bit = 0
        ORPD      XMM3, XMM4    ; (i & 1) ? xp : 1.0
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times
Conditional moves in SIMD registers are implemented by generating a mask of all 1's representing the condition; AND'ing the first operand with the mask; AND'ing the second operand with the inverted mask; and OR'ing the two together. The mask can be generated by a compare instruction or, as here, by shifting a bit into the most significant position and then shifting it arithmetically to copy it into all positions. The arithmetic shift does not exist in a 64-bit version, so we have to use the 32-bit version with two identical copies of the condition operand.
The repeat count of the loop is calculated separately outside the loop in order to reduce the number of instructions inside the loop.
Timing analysis for example 17.1 on the P4: There are four continued dependence chains: XMM0: 6 clocks, XMM1: 6 clocks, XMM2: 2 clocks, ECX: ½ clock. Throughput for the different execution units: MMX-SHIFT: 3 uops, 6 clocks. MMX-ALU: 3 uops, 6 clocks. FP-MUL: 2 uops, 4 clocks. Throughput for port 1: 8 uops, 8 clocks. Thus, the loop appears to be limited by port 1 throughput. The best timing we can hope for is 8 clocks per iteration, which is the number of uops that must go to port 1. However, three of the continued dependence chains are interconnected by two broken, but quite long, dependence chains involving XMM3 and XMM4, which take 22 and 18 clocks, respectively. This tends to hinder the optimal reordering of uops. The measured time is approximately 10 clocks per iteration. This timing actually requires a quite impressive reordering capability, considering that several iterations must be overlapped and several dependence chains interwoven in order to satisfy the restrictions on all ports and execution units.
In situations like this where it is difficult to obtain the optimal reordering of uops, it may require some experimentation to find the optimal solution. By experimentation I found that the code in example 17.1 can be made approximately 16 clocks faster in total by modifying it to the following:
; Example 17.2 (P4)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        MOVDQA    XMM4, [ONE]   ; temporary 1.0
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        PXOR      XMM4, XMM0    ; get bits that differ between xp and 1.0
        PANDN     XMM3, XMM4    ; mask out if (i & 1)
        XORPD     XMM3, XMM0    ; (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times
Here I have used PXOR and PANDN, rather than XORPD and ANDNPD, where the result of the operation may not be a valid floating-point number.
These examples show that conditional moves in SIMD registers are quite complicated and involve considerable dependence chains. However, conditional moves are unavoidable in SIMD programming if the conditions are not the same for all operands in a pack. In the above example, the conditional moves can only be replaced by branches if n0 and n1 have the same value.
Conditional moves in 32-bit registers are no faster than in SIMD registers. The reason why conditional moves are so complex and inefficient is that they have three dependencies, while the hardware design does not allow any uop to have more than two dependencies.
Packing operands in 32-bit registers
Sometimes it is possible to handle packed data in 32-bit registers. You may use this method to take advantage of the fact that 32-bit operations are fast, or to make the code compatible with old microprocessors.
A 32-bit register can hold two 16-bit integers, four 8-bit integers, or 32 Booleans. When doing calculations on packed integers in 32-bit registers, you have to take special care to avoid carries from one operand going into the next operand if overflow is possible. The following example adds 2 to all four bytes in EAX:
; Example 17.3
        MOV   EAX, [ESI]        ; read 4-byte operand
        MOV   EBX, EAX          ; copy into EBX
        AND   EAX, 7F7F7F7FH    ; get lower 7 bits of each byte in EAX
        XOR   EBX, EAX          ; get the highest bit of each byte
        ADD   EAX, 02020202H    ; add desired value to all four bytes
        XOR   EAX, EBX          ; combine bits again
        MOV   [EDI], EAX        ; store result