Fog A.How to optimize for the Pentium family of microprocessors.2004.pdf

PSRAD XMM0,31 ; copy sign bit into all bit positions

Manipulating the exponent

You can multiply a non-zero number by a power of 2 simply by adding to the exponent:

MOVAPS XMM0, [X]     ; four single-precision floats
MOVDQA XMM1, [N]     ; four 32-bit integers
PSLLD  XMM1, 23      ; shift integers into exponent field
PADDD  XMM0, XMM1    ; X * 2^N

Likewise, you can divide by a power of 2 by subtracting from the exponent. Note that this code does not work if X is zero or if overflow or underflow is possible.
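The same idea can be sketched in portable C for a single float. This is only an illustration of the technique, not the SIMD code above: the function name scale_pow2 is hypothetical, and the sketch assumes the IEEE 754 single-precision layout with the exponent field starting at bit 23.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: multiply x by 2^n by adding n to the exponent field.
   Valid only for non-zero x, and only if the result neither overflows
   nor underflows. */
static float scale_pow2(float x, int32_t n)
{
    uint32_t bits;

    assert(x != 0.0f);                  /* the trick does not work for zero */
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the float as an integer */
    bits += (uint32_t)n << 23;          /* shift n into the exponent field and add */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A negative n divides by a power of 2, because the unsigned addition wraps around and effectively subtracts from the exponent.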

Manipulating the mantissa

You can convert an integer to a floating-point number in an interval of length 1.0 by putting bits into the mantissa field. The following code computes x = n / 2^32, where n is an unsigned integer in the interval 0 ≤ n < 2^32, and the resulting x is in the interval 0 ≤ x < 1.0.

DATA SEGMENT PARA PUBLIC 'DATA'
ONE  DQ 1.0
X    DQ ?
N    DD ?
DATA ENDS

CODE SEGMENT BYTE PUBLIC 'CODE'
MOVSD XMM0, [ONE]    ; 1.0, double precision
MOVD  XMM1, [N]      ; N, 32-bit unsigned integer
PSLLQ XMM1, 20       ; align N left in mantissa field
POR   XMM1, XMM0     ; combine mantissa and exponent
SUBSD XMM1, XMM0     ; subtract 1.0
MOVSD [X], XMM1      ; store result

In the above code, the exponent from 1.0 is combined with a mantissa containing the bits of n. This gives a double-precision float in the interval 1.0 ≤ x < 2.0. The SUBSD instruction subtracts 1.0 to get x into the desired interval.
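For readers who want to check the arithmetic, here is a minimal C sketch of the same steps. The function name uint32_to_unit_interval is hypothetical, and the sketch assumes that double is an IEEE 754 double-precision number with a 52-bit mantissa field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compute n / 2^32 by placing the bits of n in the
   mantissa of 1.0 and then subtracting 1.0. */
static double uint32_to_unit_interval(uint32_t n)
{
    const double one = 1.0;
    uint64_t bits;
    double x;

    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &one, sizeof bits);   /* bit pattern of 1.0: exponent only */
    bits |= (uint64_t)n << 20;          /* align n left in the 52-bit mantissa */
    memcpy(&x, &bits, sizeof x);        /* now 1.0 <= x < 2.0 */
    return x - 1.0;                     /* move into 0 <= x < 1.0 */
}
```

The shift count is 20 because the mantissa field is 52 bits wide and n has 32 bits, so 52 - 32 = 20 bits remain below n.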

Comparing numbers

Thanks to the fact that the exponent is stored in the biased format and to the left of the mantissa, it is possible to use integer instructions for comparing positive floating-point numbers. Example (single precision):

FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB

can be replaced by:

MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB

This method works only if you are certain that none of the numbers have the sign bit set. You may compare absolute values by shifting out the sign bit of both numbers. For double-precision numbers, you can make an approximate comparison by comparing the upper 32 bits using integer instructions.
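The integer comparison can also be sketched in C. Both function names below are hypothetical, and the sketch assumes IEEE 754 single precision. NaNs are not handled, since their bit patterns simply compare as large integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a < b for floats that do not have the sign bit set,
   using an unsigned integer compare on the bit patterns. */
static int less_than_positive(float a, float b)
{
    uint32_t ia, ib;

    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    assert((ia >> 31) == 0 && (ib >> 31) == 0);  /* sign bits must be clear */
    return ia < ib;                              /* corresponds to CMP / JB */
}

/* Hypothetical helper: |a| < |b|, shifting out the sign bit of both numbers. */
static int abs_less_than(float a, float b)
{
    uint32_t ia, ib;

    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return (uint32_t)(ia << 1) < (uint32_t)(ib << 1);
}
```

The comparison is unsigned, like JB in the assembly version, because the biased exponent occupies the most significant bits below the sign.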

19.5 Using floating-point instructions for integer operations

While there are no problems using integer instructions for moving floating-point data, it is not always safe to use floating-point instructions for moving integer data. For example, you may be tempted to use FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI] to move 8 bytes at a time. However, this method may fail if the data do not represent valid floating-point numbers. The FLD instruction may generate an exception and it may even change the value of the data. If you want your code to be compatible with processors that don't have MMX and XMM registers, then you can only use the slower FILD and FISTP for moving 8 bytes at a time.

However, some floating-point instructions can handle integer data without generating exceptions or modifying data. For example, the MOVAPS instruction can be used for moving 16 bytes at a time on the P3 processor that doesn't have the MOVDQA instruction. You can determine whether a floating-point instruction can handle integer data by looking at the documentation in the "IA-32 Intel Architecture Software Developer's Manual" Volume 2. If the instruction can generate any floating-point exception, then it cannot be used for integer data. If the documentation says "none" for floating-point exceptions, then this instruction can be used for integer data. It is reasonable to assume that such code will work correctly on future processors, but there is no guarantee that it will work equally fast on future processors.

Most SIMD instructions are "typed", in the sense that they are intended for one type of data only. It seems quite odd, for example, that the P4 has three different instructions for OR'ing 128-bit registers. The instructions POR, ORPS and ORPD do exactly the same thing. Replacing one with another has absolutely no consequence on the P4 processor. However, in some cases there is a performance penalty for using the wrong type on the P3. It is unknown whether future processors and processors from other vendors also have a penalty for using the wrong type. The reason for this performance penalty is, I guess, that the processor may do certain kinds of optimizations on a chain of dependent floating-point instructions, which are only possible when the instructions are dedicated to floating-point data. It is therefore recommended to always use the right type of instruction if such an instruction is available.

There are certain floating-point instructions that have no integer equivalent and which may be useful for handling integer data. This includes the MOVAPS and MOVNTPS instructions which are useful for moving 16 bytes of data from and to memory on the P3. The P4 has integer versions of the same instructions, MOVDQA and MOVNTDQ.

The MOVSS instruction may be useful for moving 32 bits of data from one XMM register to another, while leaving the rest of the destination register unchanged. This does not apply if the source is a memory operand, because the rest of the destination register is then set to zero.

Most other movements of integer data within and between registers can be done with the various shuffle, pack, unpack and shift instructions.

Converting binary to decimal numbers

The FBSTP instruction provides a simple and convenient way of converting a binary number to decimal, although not necessarily the fastest method.
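FBSTP stores its result as an 18-digit packed BCD number in 10 bytes, with two decimal digits per byte, the least significant digits first, and the sign in the top bit of the last byte. The following C sketch produces the same layout from an integer, so that you can see what the output looks like. It is a portable illustration only, not a replacement for the instruction, and the function name to_packed_bcd is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write value as 18 packed BCD digits plus a sign byte,
   in the byte layout that FBSTP stores. */
static void to_packed_bcd(int64_t value, uint8_t out[10])
{
    /* magnitude, computed without signed overflow */
    uint64_t mag = value < 0 ? (uint64_t)0 - (uint64_t)value : (uint64_t)value;
    int i;

    assert(mag <= 999999999999999999ULL);   /* at most 18 decimal digits */
    for (i = 0; i < 9; i++) {
        uint8_t lo = (uint8_t)(mag % 10);   /* lower digit goes in the low nibble */
        mag /= 10;
        uint8_t hi = (uint8_t)(mag % 10);   /* next digit goes in the high nibble */
        mag /= 10;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    out[9] = value < 0 ? 0x80 : 0x00;       /* sign bit in the last byte */
}
```

Each nibble can be turned into an ASCII digit by adding '0', starting from the highest byte if you want the most significant digit first.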

19.6 Moving blocks of data (All processors)

There are several ways to move large blocks of data. The most common method is REP MOVSD. See page 114 about the speed of this instruction.

In many cases it is faster to use instructions that move more than 4 bytes at a time. Make sure that both source and destination are aligned by 8 if you are moving 8 bytes at a time, and aligned by 16 if you are moving 16 bytes at a time. If the size of the block you want to move is not a multiple of 8, respectively 16, then it is better to pad the buffers with extra space in the end and move a little more data than needed, than to move the extra data using other methods.
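The padding advice can be illustrated with a small C sketch that moves 8 bytes at a time and rounds the block size up, on the assumption that both buffers have been allocated with enough extra space at the end. The function name copy_qwords is hypothetical, and a compiler is free to turn the loop into something else entirely, so this only shows the logic, not the speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy nbytes, rounded up to a multiple of 8, in 8-byte
   units. Both buffers must be aligned by 8 and padded to the rounded-up size. */
static void copy_qwords(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    size_t nq = (nbytes + 7) / 8;           /* move a little more than needed */
    size_t i;

    assert((uintptr_t)dst % 8 == 0 && (uintptr_t)src % 8 == 0);
    for (i = 0; i < nq; i++) {
        dst[i] = src[i];
    }
}
```

With an 11-byte block, for example, the loop moves 16 bytes, which is why the buffers need the extra space.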

On P1 and PMMX it is advantageous to use FILD and FISTP with 8 byte operands if the destination is not in the cache. You may roll out the loop by two (FILD / FILD / FXCH / FISTP / FISTP).