- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
function that works under both Windows and Linux, if only you take these minor differences into account.
4.7 Making compiler-independent code
Functions
By default, C++ compilers use a method called name mangling for distinguishing between different versions of overloaded functions. A code that defines the number and type of function parameters, and possibly the calling convention and return type, is appended to the function name. These name mangling codes are compiler-specific. It is therefore recommended to turn off name mangling when linking C++ and assembly code together. You can avoid the mangling of a function name by adding extern "C" to the function prototype in the C++ file. extern "C" indicates that the function should be linked according to the conventions of the C language, rather than C++. Therefore, extern "C" cannot be used for constructs that don't exist in the C language, such as member functions. You may also add __cdecl to the function prototype to make the calling convention explicit.
The assembly code must have the function name prefixed by an underscore (_) if called under Windows or DOS. The underscore is not required in newer versions of other operating systems such as Linux. Your C++ compiler may have an option for adding or removing the underscore, but you have to recompile all function libraries if you change this option (and risk name clashes).
Thus, the best way of linking C++ and assembly code together is to add extern "C" to the function prototype in C++, and prefix the function name with an underscore in the assembly file. You must assemble with case sensitivity on externals. Example:
; Example 4.2
; extern "C" int square (int x);
_square PROC NEAR                   ; integer square function
PUBLIC  _square
        MOV     EAX, [ESP+4]        ; read x from stack
        IMUL    EAX                 ; x * x
        RET                         ; return value in EAX
_square ENDP
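On the C++ side, Example 4.2 needs nothing more than the extern "C" prototype. The following is a minimal sketch of a caller. The helper sum_of_squares and the function example_usage are hypothetical names introduced for illustration. So that the sketch links without an assembler, square is given a C++ body that merely stands in for the assembly routine; in a real build that body would be removed and the assembled object file linked instead.

```cpp
#include <cassert>

// Prototype as given in the comment of Example 4.2. extern "C" turns off
// C++ name mangling, so the linker looks for the plain C symbol
// (with a leading underscore on platforms that add one).
extern "C" int square(int x);

// ASSUMPTION: stand-in body so that this sketch links on its own.
// In a real project this definition is deleted and the symbol is
// supplied by the assembled .asm module.
extern "C" int square(int x) {
    return x * x;
}

// Hypothetical caller: the call site looks like any other C++ call.
int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

// Hypothetical usage check.
void example_usage() {
    assert(square(7) == 49);
    assert(sum_of_squares(3, 4) == 25);
}
```

Because the prototype is the only thing the C++ compiler sees, the same source file can be linked against either implementation without change.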
Global objects
Some C++ compilers also mangle the names of global objects. Add extern "C" to the declaration of global variables and objects if you want them to be accessible from assembly modules. Even better, avoid global objects if possible.
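As a small sketch of the extern "C" declaration for a global variable (the names g_item_count and BumpItemCount are hypothetical, and BumpItemCount is a C++ stand-in for code that would normally live in the assembly module): note that the one-line form extern "C" int x; is only a declaration, whereas a variable placed inside an extern "C" { } block is a definition.

```cpp
#include <cassert>

// Declaration only, as it might appear in a shared header.
// The one-line form implies 'extern', so no storage is allocated here.
extern "C" int g_item_count;

// Definition: the brace form does not imply 'extern', so this line
// allocates the variable under its unmangled C name.
extern "C" {
    int g_item_count = 0;
}

// ASSUMPTION: C++ stand-in for an assembly routine that refers to the
// variable by its unmangled name.
extern "C" void BumpItemCount() {
    g_item_count++;
}

// Hypothetical usage check.
void example_usage() {
    int before = g_item_count;
    BumpItemCount();
    assert(g_item_count == before + 1);
}
```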
Member functions
Let's take as an example a simple C++ class containing a list of integers:
// Example 4.3
// define C++ class
class MyList {
   int length;                       // number of items in list
   int buffer[100];                  // store items
public:
   MyList();                         // constructor
   void AddItem(int item);           // add item to list
   int Sum();};                      // compute sum of items

MyList::MyList() {                   // constructor in C++
   length = 0;}

void MyList::AddItem(int item) {     // member function AddItem in C++
   if (length < 100) buffer[length++] = item;}

int MyList::Sum() {                  // member function Sum in C++
   int i, s = 0;
   for (i=0; i<length; i++) s += buffer[i];
   return s;}
The implementation of this class is compiler-specific because different C++ compilers use different calling conventions and name mangling methods for member functions. If you want to place one or more of the member functions in an assembly module and you don't want to mess with compiler-specific intricacies, then you may replace the member function by a friend function. The 'this' pointer is not transferred automatically to a friend function, so you have to make it an explicit parameter to the function. The C++ class definition is then changed as follows:
// Example 4.3 with friend function
// predefine class name:
class MyList;
// define external friend function:
extern "C" void MyListAddItem(MyList * p_this, int item);
// changed definition of class MyList:
class MyList {
   int length;                                // number of items in list
   int buffer[100];                           // store items
public:
   MyList();                                  // constructor
   // make external function a friend:
   friend void MyListAddItem(MyList *, int);
   void AddItem(int item) {                   // wrap MyListAddItem in AddItem
      MyListAddItem(this,item);}              // transfer 'this' explicitly to function
   int Sum();};                               // compute sum of items
The friend function MyListAddItem can be coded in assembly without name mangling:
; define data members of class MyList:
MyList  STRUC
LENGTH_ DD      ?
BUFFER  DD      100 DUP (?)
MyList  ENDS

; define friend function MyListAddItem
_MyListAddItem PROC NEAR
PUBLIC _MyListAddItem
        MOV     ECX, [ESP+4]                     ; p_this
        MOV     EAX, [ESP+8]                     ; item
        MOV     EDX, [ECX].MyList.LENGTH_        ; length
        CMP     EDX, 100
        JNB     ADDITEM9
        MOV     [ECX+4*EDX].MyList.BUFFER, EAX
        ADD     EDX, 1
        MOV     [ECX].MyList.LENGTH_, EDX
ADDITEM9: RET
_MyListAddItem ENDP
Wrapping MyListAddItem in AddItem does not slow down execution because a member function that is defined inside the class definition is implicitly inline. An optimizing compiler will simply call MyListAddItem directly wherever AddItem is called. The extern "C" linking makes this solution compatible with all compilers.
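The friend-function arrangement can be tried out entirely in C++ before any assembly is written. The sketch below assembles the pieces of Example 4.3 into one translation unit. The C++ body of MyListAddItem is an assumption that stands in for the assembly routine, Sum is defined inline for brevity, and example_usage is a hypothetical name.

```cpp
#include <cassert>

class MyList;                                             // predefine class name
extern "C" void MyListAddItem(MyList* p_this, int item);  // external friend function

class MyList {
    int length;                               // number of items in list
    int buffer[100];                          // store items
public:
    MyList() { length = 0; }                  // constructor
    friend void MyListAddItem(MyList*, int);  // give the C function access
    void AddItem(int item) {                  // implicitly inline wrapper
        MyListAddItem(this, item);            // pass 'this' explicitly
    }
    int Sum() {                               // compute sum of items
        int s = 0;
        for (int i = 0; i < length; i++) s += buffer[i];
        return s;
    }
};

// ASSUMPTION: C++ stand-in for the assembly implementation, so that the
// sketch links on its own. It performs the same bounds check and store.
extern "C" void MyListAddItem(MyList* p_this, int item) {
    if (p_this->length < 100) p_this->buffer[p_this->length++] = item;
}

// Hypothetical usage check: callers still use the member-function syntax.
void example_usage() {
    MyList list;
    list.AddItem(3);
    list.AddItem(4);
    assert(list.Sum() == 7);
}
```

Once the behaviour is verified, the stand-in body can be deleted and the assembly module linked in its place without touching the class definition or its callers.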
Member function pointers
The implementation of member function pointers is compiler-specific. If your assembly code needs a pointer to member functions then replace the member functions by friend functions as explained above, and replace the member function pointer by a friend function pointer.
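A minimal sketch of this replacement, under the assumption of a hypothetical class Counter with two operations: instead of a pointer of type void (Counter::*)(int), the code passes an ordinary function pointer to a friend function that takes the object as its first parameter. All names are invented for illustration, and the C++ function bodies stand in for routines that could be written in assembly.

```cpp
#include <cassert>

class Counter;

extern "C" {
    // Ordinary function pointer type; its layout is a plain code address,
    // unlike a member function pointer.
    typedef void (*CounterOp)(Counter* p_this, int step);
    void CounterIncrement(Counter* p_this, int step);
    void CounterDecrement(Counter* p_this, int step);
}

class Counter {
    int value;
public:
    Counter() : value(0) {}
    friend void CounterIncrement(Counter*, int);
    friend void CounterDecrement(Counter*, int);
    int Value() const { return value; }
};

// ASSUMPTION: C++ stand-ins for functions that might live in an .asm module.
extern "C" void CounterIncrement(Counter* p_this, int step) { p_this->value += step; }
extern "C" void CounterDecrement(Counter* p_this, int step) { p_this->value -= step; }

// Code that would otherwise take a member function pointer takes the
// friend function pointer and the object explicitly.
void Apply(Counter* c, CounterOp op, int step) {
    op(c, step);
}

// Hypothetical usage check.
void example_usage() {
    Counter c;
    Apply(&c, CounterIncrement, 5);
    Apply(&c, CounterDecrement, 2);
    assert(c.Value() == 3);
}
```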
Data member pointers
The implementation of data member pointers is compiler-specific. Some compilers add 1 to the offset in order to distinguish a pointer to the first data member from a null pointer. Furthermore, some compilers use 64 bits for the member pointer in order to handle more complicated constructs.
If your assembly code needs a pointer to structure or class data members (corresponding to the C++ operators .* or ->*), then check whether it can be replaced by a simple pointer or an array index. If this is not possible, then replace the data member pointer by an integer which contains the offset of the member relative to the class object. You need to typecast addresses to integers in order to use such a member pointer in C++ code.
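The offset technique might look as follows in C++. This is a sketch under stated assumptions: the struct Point3 and the helper functions are hypothetical, the offsets are obtained with the standard offsetof macro, and the address arithmetic is done through a char pointer rather than through an integer cast, which has the same effect.

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t

// Hypothetical plain structure with three int members.
struct Point3 {
    int x;
    int y;
    int z;
};

// Read the int member located 'offset' bytes from the start of the object.
int ReadIntMember(const void* object, std::size_t offset) {
    const char* base = static_cast<const char*>(object);
    return *reinterpret_cast<const int*>(base + offset);
}

// Write the int member located 'offset' bytes from the start of the object.
void WriteIntMember(void* object, std::size_t offset, int value) {
    char* base = static_cast<char*>(object);
    *reinterpret_cast<int*>(base + offset) = value;
}

// Hypothetical usage check: the offset plays the role of 'int Point3::*'.
void example_usage() {
    Point3 p = {1, 2, 3};
    std::size_t member = offsetof(Point3, y);   // "pointer" to member y
    assert(ReadIntMember(&p, member) == 2);
    WriteIntMember(&p, offsetof(Point3, z), 9);
    assert(p.z == 9);
}
```

An offset stored this way is a plain integer, so an assembly routine can add it to the object address in a single addressing mode.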
Virtual functions and polymorphous classes
Each object of a polymorphous class contains a pointer to a table of pointers to the virtual functions. Borland and Microsoft compilers place the pointer to this so-called virtual table at the beginning of the object, while the Gnu compiler places it at the end of the object. This makes the code compiler-specific, even if only non-virtual member functions are coded in assembly. If you want to make the code compiler-independent then you have to replace all virtual member functions by friend functions. Insert one or more friend function pointers as members of the class and initialize these pointers in the constructors to emulate the virtual tables.
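One way to emulate a virtual table with friend function pointers is sketched below. The classes Shape, Rect and Square, the pointer type AreaFunc, and the area functions are all hypothetical, and the C++ bodies stand in for functions that could be coded in assembly. Each constructor stores the function that a virtual call would have selected.

```cpp
#include <cassert>

class Shape;

extern "C" {
    typedef int (*AreaFunc)(const Shape* p_this);
    int RectArea(const Shape* p_this);
    int SquareArea(const Shape* p_this);
}

class Shape {
    int width;
    int height;
    AreaFunc area_func;   // emulates a one-entry virtual table
protected:
    // Derived constructors choose the function, as a vtable would.
    Shape(int w, int h, AreaFunc f) : width(w), height(h), area_func(f) {}
public:
    friend int RectArea(const Shape*);
    friend int SquareArea(const Shape*);
    int Area() const { return area_func(this); }   // dispatch through pointer
};

class Rect : public Shape {
public:
    Rect(int w, int h) : Shape(w, h, RectArea) {}
};

class Square : public Shape {
public:
    explicit Square(int side) : Shape(side, side, SquareArea) {}
};

// ASSUMPTION: C++ stand-ins for functions that might live in an .asm module.
extern "C" int RectArea(const Shape* p_this)   { return p_this->width * p_this->height; }
extern "C" int SquareArea(const Shape* p_this) { return p_this->width * p_this->width; }

// Hypothetical usage check: the call resolves per object at run time.
void example_usage() {
    Rect r(3, 4);
    Square s(5);
    const Shape* shapes[2] = { &r, &s };
    assert(shapes[0]->Area() == 12);
    assert(shapes[1]->Area() == 25);
}
```

Because the class has no virtual functions, the compiler inserts no hidden table pointer, and the position of area_func in the object is determined only by the declared member order.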
Long double
A long double in C++ corresponds to a TBYTE in assembly, using 10 bytes. The Microsoft C++ compiler does not support this type, but replaces it with a double, using 8 bytes. The Borland compiler allocates 10 bytes for a long double, while the Gnu compiler allocates 12 (or 16) bytes for the sake of alignment. If you want a structure or class containing long double members to be compiler-independent (and properly aligned) then use a union:
union ld {
   long double a;
   char filler[16];};
The Microsoft compiler still cannot access the long double without using assembly language.
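A compile-time check can confirm that the union has the intended size on whichever compiler is in use. The sketch below assumes a compiler with static_assert (C++11 or later) and a long double of at most 16 bytes; example_usage is a hypothetical name.

```cpp
#include <cassert>

union ld {
    long double a;
    char filler[16];
};

// ASSUMPTION: long double occupies 8, 10, 12 or 16 bytes, as on the
// compilers discussed in the text. The filler then fixes the size at 16.
static_assert(sizeof(ld) == 16, "union ld is expected to occupy 16 bytes");

// Hypothetical usage check: the value is accessed through member 'a'.
void example_usage() {
    ld v;
    v.a = 2.5L;
    assert(v.a == 2.5L);
}
```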
Thread local objects
Thread-local data or objects should be defined or allocated in C++, but they can be accessed in assembly code through pointers.
Assembly language functions can be made reentrant (thread safe) without the need for thread-local storage when all variables are stored on the stack.
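Both points can be illustrated with a short sketch. It assumes the C++11 thread_local keyword (older compilers use vendor-specific keywords instead), and the names call_count, AccumulateSquares and SumSquares are hypothetical. The body of AccumulateSquares is a C++ stand-in for an assembly routine: it keeps its working variables local and reaches the thread-local counter only through the pointer it is given.

```cpp
#include <cassert>

// Thread-local data is defined in C++ ...
thread_local int call_count = 0;

// ... and handed to the low-level routine as a plain pointer.
// ASSUMPTION: C++ stand-in for a function that might be written in assembly.
// It is reentrant because 'sum' and 'i' are locals (stack or registers)
// and no static or global data is touched directly.
extern "C" int AccumulateSquares(const int* data, int n, int* p_counter) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += data[i] * data[i];
    ++*p_counter;            // per-thread counter, accessed via pointer
    return sum;
}

// C++ wrapper that takes the address of the calling thread's counter.
int SumSquares(const int* data, int n) {
    return AccumulateSquares(data, n, &call_count);
}

// Hypothetical usage check (single thread).
void example_usage() {
    int data[3] = {1, 2, 3};
    int before = call_count;
    assert(SumSquares(data, 3) == 14);
    assert(call_count == before + 1);
}
```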
4.8 Adding support for multiple compilers in .asm modules
An alternative to making compiler-independent code is to use the proper name mangling, calling conventions, etc. for the C++ compiler in question. To do this, you have to write the code in C++ first and make the compiler translate the C++ to assembly. Use the assembly code produced by the C++ compiler to get the right mangled names, calling conventions, data formats, etc.
Functions
In many cases, it is possible to make assembly libraries containing functions that are compatible with more than one C++ compiler by adding the proper mangled names for each compiler.
For example, overloaded functions cannot be made without name mangling, but by adding several mangled names, the function can be made compatible with several different compilers. The following example shows a square function with two overloaded versions:
; Example 4.4: Overloaded function

; int square (int x);                 // C++ prototype
SQUARE_I PROC NEAR                    ; integer square function
@square$qi LABEL NEAR                 ; link name for Borland compiler
?square@@YAHH@Z LABEL NEAR            ; link name for Microsoft compiler
_square__Fi LABEL NEAR                ; link name for Gnu compiler (Windows)
square__Fi LABEL NEAR                 ; link for Gnu (Redhat, Debian, BSD)
_Z6squarei LABEL NEAR                 ; link for Gnu (Mandrake, UNIX)
PUBLIC @square$qi,?square@@YAHH@Z,_square__Fi,square__Fi,_Z6squarei
        MOV     EAX, [ESP+4]          ; x
        IMUL    EAX
        RET
SQUARE_I ENDP

; double square (double x);           // C++ prototype
SQUARE_D PROC NEAR                    ; double precision float square funct.
@square$qd LABEL NEAR                 ; link name for Borland compiler
?square@@YANN@Z LABEL NEAR            ; link name for Microsoft compiler
_square__Fd LABEL NEAR                ; link name for Gnu compiler (Windows)
square__Fd LABEL NEAR                 ; link for Gnu (Redhat, Debian, BSD)
_Z6squared LABEL NEAR                 ; link for Gnu (Mandrake, UNIX)
PUBLIC @square$qd,?square@@YANN@Z,_square__Fd,square__Fd,_Z6squared
        FLD     QWORD PTR [ESP+4]     ; x
        FMUL    ST(0), ST(0)
        RET
SQUARE_D ENDP
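On the C++ side of Example 4.4, the two prototypes are declared without extern "C", so each compiler generates its own mangled name for each overload and resolves it against the matching label in the assembly module. The sketch below shows such a caller. The C++ bodies are an assumption that stands in for the assembly routines so that the sketch links on its own, and example_usage is a hypothetical name.

```cpp
#include <cassert>

// Prototypes as in Example 4.4. No extern "C": the overloads must keep
// their mangled names so that the linker can tell them apart.
int    square(int x);
double square(double x);

// ASSUMPTION: stand-in bodies. In a real build these are removed and the
// mangled symbols are supplied by the .asm module.
int    square(int x)    { return x * x; }
double square(double x) { return x * x; }

// Hypothetical usage check: the compiler selects the overload from the
// argument type, then emits a call to the corresponding mangled name.
void example_usage() {
    assert(square(3) == 9);        // calls square(int)
    assert(square(1.5) == 2.25);   // calls square(double)
}
```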
Member functions
The above method works because all the compilers use the __cdecl calling convention by default for overloaded functions. For member functions (methods), however, the compilers differ. Borland and most Gnu compilers use the __stdcall convention; Gnu for Mandrake uses __cdecl; and Microsoft compilers use a hybrid of __stdcall and __fastcall, with the 'this' pointer in ECX. To obtain binary compatibility, we want to force the compilers to use the same calling convention. Most compilers allow you to explicitly specify the calling convention for member functions, but a few compilers (e.g. Gnu for Mandrake) do not allow this specification. However, you can force all compilers to use the __cdecl method for member functions by specifying a variable number of parameters. Returning to example 4.3 (page 17), we can provide support for almost all compilers in this way:
// Example 4.3 with support for multiple compilers
// define C++ class
class MyList {
   int length;                       // number of items in list
   int buffer[100];                  // store items
public:
   MyList();                         // constructor
   void AddItem(int item, ...);      // add item to list
   int Sum(...);};                   // compute sum of items
;Assembly code for MyList::AddItem with mangled names for several
;compilers: