- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
PSRAD XMM0,31 ; copy sign bit into all bit positions
Manipulating the exponent
You can multiply a non-zero number by a power of 2 by simply adding to the exponent:
MOVAPS XMM0, [X]   ; four single-precision floats
MOVDQA XMM1, [N]   ; four 32-bit integers
PSLLD  XMM1, 23    ; shift integers into exponent field
PADDD  XMM0, XMM1  ; X * 2^N
Likewise, you can divide by a power of 2 by subtracting from the exponent. Note that this code does not work if X is zero or if overflow or underflow is possible.
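The same exponent arithmetic can be illustrated on a single value in portable C. This is only a minimal scalar sketch, not the SIMD code above: the function name scale_by_pow2 is hypothetical, and it assumes that float is a 32-bit IEEE 754 single-precision number. Like the assembly version, it makes no check for overflow or underflow of the exponent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply a non-zero float by 2^n by adding n to the exponent field.
   Assumes 32-bit IEEE 754 floats; no overflow or underflow check. */
float scale_by_pow2(float x, int n)
{
    uint32_t bits;
    assert(x != 0.0f);                /* the method does not work for zero */
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float as an integer */
    bits += (uint32_t)n << 23;        /* the exponent field starts at bit 23 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A negative n divides by a power of 2, because the unsigned addition wraps around and thereby subtracts from the exponent field.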
Manipulating the mantissa
You can convert an integer to a floating-point number in an interval of length 1.0 by putting bits into the mantissa field. The following code computes x = n / 2^32, where n is an unsigned integer in the interval 0 ≤ n < 2^32, and the resulting x is in the interval 0 ≤ x < 1.0.
DATA SEGMENT PARA PUBLIC 'DATA'
ONE  DQ 1.0
X    DQ ?
N    DD ?
DATA ENDS

CODE SEGMENT BYTE PUBLIC 'CODE'
MOVSD XMM0, [ONE]  ; 1.0, double precision
MOVD  XMM1, [N]    ; N, 32-bit unsigned integer
PSLLQ XMM1, 20     ; align N left in mantissa field
POR   XMM1, XMM0   ; combine mantissa and exponent
SUBSD XMM1, XMM0   ; subtract 1.0
MOVSD [X], XMM1    ; store result
In the above code, the exponent from 1.0 is combined with a mantissa containing the bits of n. This gives a double-precision float in the interval 1.0 ≤ x < 2.0. The SUBSD instruction subtracts 1.0 to get x into the desired interval.
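For readers who want to check the bit layout, the steps above can be sketched in portable C. The function name unit_interval is hypothetical, and the sketch assumes that double is a 64-bit IEEE 754 number with a 52-bit mantissa field.

```c
#include <stdint.h>
#include <string.h>

/* Compute n / 2^32 by placing n in the top 32 bits of the mantissa of 1.0.
   Assumes 64-bit IEEE 754 doubles. */
double unit_interval(uint32_t n)
{
    double one = 1.0, x;
    uint64_t bits;
    memcpy(&bits, &one, sizeof bits);  /* exponent of 1.0, mantissa all zero */
    bits |= (uint64_t)n << 20;         /* align n left in the 52-bit mantissa */
    memcpy(&x, &bits, sizeof x);       /* now 1.0 <= x < 2.0 */
    return x - 1.0;                    /* move into 0 <= x < 1.0 */
}
```

The shift count is 20 because the mantissa field is 52 bits wide and n occupies 32 of them.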
Comparing numbers
Because the exponent is stored in biased format and to the left of the mantissa, it is possible to use integer instructions for comparing positive floating-point numbers. Example (single precision):
FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB
can be replaced by:
MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
This method works only if you are certain that none of the numbers have the sign bit set. You may compare absolute values by shifting out the sign bit of both numbers. For double-precision numbers, you can make an approximate comparison by comparing the upper 32 bits using integer instructions.
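Both comparisons described here can be tried out in portable C. The following is a minimal sketch with hypothetical function names, assuming 32-bit IEEE 754 floats. It does not handle NaN values.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a single-precision float as an unsigned integer */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* a < b for floats that are known not to have the sign bit set:
   an unsigned integer compare, like CMP followed by JB */
int less_positive(float a, float b)
{
    return float_bits(a) < float_bits(b);
}

/* |a| < |b|: shift the sign bit out of both numbers before comparing */
int less_abs(float a, float b)
{
    return (uint32_t)(float_bits(a) << 1) < (uint32_t)(float_bits(b) << 1);
}
```

The comparison must be unsigned, corresponding to JB rather than JL in the assembly version.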
19.5 Using floating-point instructions for integer operations
While there are no problems using integer instructions for moving floating-point data, it is not always safe to use floating-point instructions for moving integer data. For example, you may be tempted to use FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI] to move 8 bytes at a time. However, this method may fail if the data do not represent valid floating-point numbers. The FLD instruction may generate an exception and it may even change the value of the data. If you want your code to be compatible with processors that don't have MMX and XMM registers then you can only use the slower FILD and FISTP for moving 8 bytes at a time.
However, some floating-point instructions can handle integer data without generating exceptions or modifying data. For example, the MOVAPS instruction can be used for moving 16 bytes at a time on the P3 processor that doesn't have the MOVDQA instruction. You can determine whether a floating-point instruction can handle integer data by looking at the documentation in the "IA-32 Intel Architecture Software Developer's Manual" Volume 2. If the instruction can generate any floating-point exception, then it cannot be used for integer data. If the documentation says "none" for floating-point exceptions, then this instruction can be used for integer data. It is reasonable to assume that such code will work correctly on future processors, but there is no guarantee that it will work equally fast on future processors.
Most SIMD instructions are "typed", in the sense that they are intended for one type of data only. It seems quite odd, for example, that the P4 has three different instructions for OR'ing 128-bit registers. The instructions POR, ORPS and ORPD are doing exactly the same thing. Replacing one with another has absolutely no consequence on the P4 processor. However, in some cases there is a performance penalty for using the wrong type on the P3. It is unknown whether future processors and processors from other vendors also have a penalty for using the wrong type. The reason for this performance penalty is, I guess, that the processor may do certain kinds of optimizations on a chain of dependent floating-point instructions, which is only possible when the instructions are dedicated to floating-point data only. It is therefore recommended to always use the right type of instruction if such an instruction is available.
There are certain floating-point instructions that have no integer equivalent and which may be useful for handling integer data. This includes the MOVAPS and MOVNTPS instructions which are useful for moving 16 bytes of data from and to memory on the P3. The P4 has integer versions of the same instructions, MOVDQA and MOVNTDQ.
The MOVSS instruction may be useful for moving 32 bits of data from one XMM register to another while leaving the rest of the destination register unchanged. This does not apply if the source is a memory operand, because MOVSS then clears the rest of the destination register.
Most other movements of integer data within and between registers can be done with the various shuffle, pack, unpack and shift instructions.
Converting binary to decimal numbers
The FBSTP instruction provides a simple and convenient way of converting a binary number to decimal, although not necessarily the fastest method.
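FBSTP stores its result as a 10-byte packed BCD number: bytes 0 to 8 hold 18 decimal digits, two per byte with the least significant digits first, and bit 7 of byte 9 holds the sign. The following C sketch only emulates this layout for an integer input and unpacks it to text. It does not execute FBSTP, the function names are hypothetical, and values with more than 18 digits are simply truncated.

```c
#include <stdint.h>

/* Emulate the 10-byte packed BCD layout stored by FBSTP:
   bytes 0..8 = 18 digits, least significant first; bit 7 of byte 9 = sign. */
void to_packed_bcd(int64_t value, uint8_t out[10])
{
    uint64_t mag = value < 0 ? 0 - (uint64_t)value : (uint64_t)value;
    for (int i = 0; i < 9; i++) {
        uint8_t lo = (uint8_t)(mag % 10); mag /= 10;
        uint8_t hi = (uint8_t)(mag % 10); mag /= 10;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    out[9] = value < 0 ? 0x80 : 0x00;
}

/* Unpack to ASCII without leading zeros; buf must hold at least 20 chars
   (sign, 18 digits, terminating zero). */
void bcd_to_ascii(const uint8_t bcd[10], char *buf)
{
    int started = 0;
    if (bcd[9] & 0x80) *buf++ = '-';
    for (int i = 8; i >= 0; i--) {
        int digits[2] = { bcd[i] >> 4, bcd[i] & 0x0F };
        for (int k = 0; k < 2; k++) {
            /* always emit the last digit so that zero prints as "0" */
            if (digits[k] != 0 || started || (i == 0 && k == 1)) {
                *buf++ = (char)('0' + digits[k]);
                started = 1;
            }
        }
    }
    *buf = '\0';
}
```

The unpacking step is what remains to be done in software after FBSTP has written its 10 bytes to memory.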
19.6 Moving blocks of data (All processors)
There are several ways to move large blocks of data. The most common method is REP MOVSD. See page 114 about the speed of this instruction.
In many cases it is faster to use instructions that move more than 4 bytes at a time. Make sure that both source and destination are aligned by 8 if you are moving 8 bytes at a time, and aligned by 16 if you are moving 16 bytes at a time. If the size of the block you want to move is not a multiple of 8 or 16, respectively, then it is better to pad the buffers with extra space at the end and move a little more data than needed than to move the remaining bytes by other methods.
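The padding idea can be sketched in C as follows. The names padded_size and copy_blocks8 are hypothetical, and the sketch assumes that the caller has allocated both buffers with padded_size(n) bytes and aligned them by 8. It shows only the bookkeeping and makes no claim about speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a byte count up to the next multiple of 8 */
size_t padded_size(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Copy n bytes in whole 8-byte units. Both buffers must be aligned by 8
   and padded to padded_size(n) bytes, so copying a little more than n
   bytes is harmless. */
void copy_blocks8(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t words = padded_size(n) / 8;
    assert(((uintptr_t)dst & 7) == 0 && ((uintptr_t)src & 7) == 0);
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}
```

With n = 13, for example, two 8-byte units are moved and no separate code is needed for the last 5 bytes.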
On P1 and PMMX it is advantageous to use FILD and FISTP with 8 byte operands if the destination is not in the cache. You may roll out the loop by two (FILD / FILD / FXCH / FISTP / FISTP).