
; (Only for processors with SSE2)
INTPOWER MACRO X, Y, N
LOCAL  I, YUSED          ; define local identifiers
I = N                    ; I used for shifting N
YUSED = 0                ; remember if Y contains valid data
REPT 32                  ; maximum repeat count is 32
IF I AND 1               ; test bit 0
IF YUSED                 ; if Y already contains data
MULPD  Y, X              ; multiply Y with a power of X
ELSE                     ; if this is first time Y is used:
MOVAPD Y, X              ; copy data to Y
YUSED = 1                ; remember that Y now contains data
ENDIF                    ; end of IF YUSED
ENDIF                    ; end of IF I AND 1
I = I SHR 1              ; shift right I one place
IF I EQ 0                ; stop when I = 0
EXITM                    ; exit REPT 32 loop prematurely
ENDIF                    ; end of IF I EQ 0
MULPD  X, X              ; square X
ENDM                     ; end of REPT 32 loop
ENDM                     ; end of INTPOWER macro definition

This macro generates the minimum number of instructions needed to do the job. There is no loop overhead, prolog or epilog in the final code. And, most importantly, no branches. All branches have been resolved by the macro preprocessor. To calculate XMM0 to the power of 12, you write:

INTPOWER XMM0, XMM1, 12

This will be resolved to:

MULPD  XMM0, XMM0        ; x^2
MULPD  XMM0, XMM0        ; x^4
MOVAPD XMM1, XMM0        ; save x^4
MULPD  XMM0, XMM0        ; x^8
MULPD  XMM1, XMM0        ; x^4 * x^8 = x^12

This even has fewer instructions than the optimized loop (example 16.13). The expanded macro takes 25 clock cycles in this example.
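The square-and-multiply scheme that the macro unrolls can be sketched in a few lines of Python. This is an illustration only; the function name `intpower` and the loop structure are mine, chosen to mirror the macro's use of X, Y and YUSED:

```python
def intpower(x, n):
    """Compute x**n by the square-and-multiply scheme the macro unrolls (n >= 1)."""
    y = None                 # the Y register; YUSED corresponds to (y is not None)
    i = n
    while True:
        if i & 1:                          # IF I AND 1: test bit 0 of the exponent
            y = x if y is None else y * x  # MOVAPD the first time, MULPD afterwards
        i >>= 1                            # I = I SHR 1
        if i == 0:                         # stop when no exponent bits remain
            break
        x = x * x                          # MULPD X, X: square X
    return y

# x^12 follows the same sequence as the expanded macro: (x^4) * (x^8)
print(intpower(2.0, 12))   # 4096.0
```

Tracing `intpower(2.0, 12)` reproduces the expanded instruction sequence exactly: two squarings before the first bit of 12 is set, then the saved x^4 is multiplied by x^8.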

17 Single-Instruction-Multiple-Data programming

Since there are technological limits to the maximum clock frequency of microprocessors, the trend goes towards increasing processor throughput by handling multiple data in parallel.

When optimizing code, it is important to consider whether there are data that can be handled in parallel. The principle of Single-Instruction-Multiple-Data (SIMD) programming is that a vector or set of data is packed together in one large register and handled together in one operation.

Multiple data can be packed into 64-bit or 128-bit registers in the following ways:

data type       data per pack  register size   instruction set  microprocessor
8-bit integer   8              64 bit (MMX)    MMX              PMMX and later
16-bit integer  4              64 bit (MMX)    MMX              PMMX and later
32-bit integer  2              64 bit (MMX)    MMX              PMMX and later
64-bit integer  1              64 bit (MMX)    SSE2             P4 and later
32-bit float    2              64 bit (MMX)    3DNow            AMD only
8-bit integer   16             128 bit (XMM)   SSE2             P4 and later
16-bit integer  8              128 bit (XMM)   SSE2             P4 and later
32-bit integer  4              128 bit (XMM)   SSE2             P4 and later
64-bit integer  2              128 bit (XMM)   SSE2             P4 and later
32-bit float    4              128 bit (XMM)   SSE              P3 and later
64-bit float    2              128 bit (XMM)   SSE2             P4 and later

All these packing modes are available on the latest microprocessors from Intel and AMD, except for the 3DNow mode, which is available only on AMD processors. Whether the different instruction sets are supported on a particular microprocessor can be determined with the CPUID instruction, as explained on page 26. The 64-bit MMX registers cannot be used together with the floating-point registers. The 128-bit XMM registers can only be used if supported by the operating system. See page 27 for how to check if the use of XMM registers is enabled.

You may make two or more versions of the critical part of your code: one that will run on old microprocessors, and one that uses the most advantageous packing mode and instruction set. The program should automatically select the version of the code that is appropriate for the system on which it is running.

Choose the smallest data size that fits your purpose in order to pack as many data as possible into one register. Mathematical computations may require double precision (64-bit) floats in order to avoid loss of precision in the intermediate calculations, even if single precision is sufficient for the final result.

Before you choose to use SIMD instructions for integer operations, you have to consider whether the resulting code will be faster than the simple integer instructions in 32-bit registers. Simple operations such as integer additions take four times as long in SIMD registers as in 32-bit registers on the P4. The SIMD instructions are therefore only advantageous for integer additions if they can handle at least four data in parallel. Loading and storing memory operands take no longer for 64-bit and 128-bit registers than for 32-bit registers. Integer shift and multiplication are faster in 64-bit and 128-bit registers than in 32-bit registers on the P4. With SIMD code, you may spend more instructions on trivial things such as moving data into the right positions in the registers and emulating conditional moves than on the actual calculations. Example 17.1 below illustrates this.

For floating-point calculations on the P4, it is often advantageous to use XMM registers, even if there are no opportunities for handling data in parallel. The latency of floating-point operations is shorter in XMM registers than in floating-point registers, and you can make conversions between integers and floating-point numbers without using a memory intermediate. Furthermore, you get rid of the annoying floating-point register stack.

Memory operands for SIMD instructions have to be properly aligned. See page 28 for how to align data in memory. The alignment requirement makes it complicated to pass 64-bit and 128-bit function parameters on the stack (see Intel's optimization reference manual). It is therefore recommended to pass 64-bit and 128-bit parameters in registers or through a pointer, rather than on the stack.

Conditional moves in SIMD registers

Let's repeat example 16.13 (page 107) with two double-precision floats in XMM registers. This enables us to compute x0^n0 and x1^n1 in parallel:

; Example 17.1 (P4)
DATA SEGMENT PARA PUBLIC 'DATA'
ONE     DQ      1.0, 1.0
X       DQ      ?, ?            ; X0, X1
N       DD      ?, ?            ; N0, N1
DATA ENDS

CODE SEGMENT BYTE PUBLIC 'CODE'
; register use:
; XMM0 = xp
; XMM1 = power
; XMM2 = i (i0 and i1 each stored twice as DWORD integers)
; XMM3 = 1.0 if not(i & 1)
; XMM4 = xp if (i & 1)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        MOVAPD    XMM4, XMM0    ; copy of xp
        ANDPD     XMM4, XMM3    ; xp if bit = 1
        ANDNPD    XMM3, [ONE]   ; 1.0 if bit = 0
        ORPD      XMM3, XMM4    ; (i & 1) ? xp : 1.0
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times

Conditional moves in SIMD registers are implemented by generating a mask of all 1's representing the condition; AND'ing the first operand with the mask; AND'ing the second operand with the inverted mask; and OR'ing the two together. The mask can be generated by a compare instruction or, as here, by shifting a bit into the most significant position and then shifting it arithmetically to copy it into all positions. The arithmetic shift does not exist in a 64-bit version, so we have to use the 32-bit version with two identical copies of the condition operand.
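The mask-and-combine scheme can be checked on plain 32-bit integers. This is a sketch of the principle only; the function name `select` and the use of Python integers as stand-ins for SIMD lanes are mine:

```python
MASK32 = 0xFFFFFFFF

def select(bit, a, b):
    """Branch-free (bit != 0) ? a : b, the way ANDPD/ANDNPD/ORPD combine it."""
    # All 1's if the bit is set, else all 0's -- the effect of the
    # PSLLD/PSRAD pair that replicates the condition bit to every position:
    mask = -(bit & 1) & MASK32
    return (a & mask) | (b & ~mask & MASK32)   # (a AND mask) OR (b AND NOT mask)

print(hex(select(1, 0xAAAAAAAA, 0x55555555)))  # 0xaaaaaaaa
print(hex(select(0, 0xAAAAAAAA, 0x55555555)))  # 0x55555555
```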

The repeat count of the loop is calculated separately outside the loop in order to reduce the number of instructions inside the loop.
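In Python terms, the overall structure of example 17.1, including the shared repeat count, is roughly the following. The function name `dual_intpower` and the explicit two-element lists are my own illustration of what the packed registers hold:

```python
def dual_intpower(x0, n0, x1, n1):
    """Compute x0**n0 and x1**n1 with one shared loop, as in example 17.1."""
    xp = [x0, x1]              # XMM0: both bases, squared every iteration
    power = [1.0, 1.0]         # XMM1: both accumulators, initialized to 1.0
    i = [n0, n1]               # XMM2: both exponents
    both = n0 | n1             # OR of the exponents, as in the EAX code
    count = both.bit_length() - 1 if both else 0   # BSR: repeat count for max(n0, n1)
    for _ in range(count + 1):                     # repeat ECX+1 times
        for lane in range(2):
            if i[lane] & 1:                        # the masked conditional multiply
                power[lane] *= xp[lane]
            i[lane] >>= 1                          # i >>= 1
            xp[lane] *= xp[lane]                   # xp *= xp
    return power

print(dual_intpower(2.0, 12, 3.0, 5))  # [4096.0, 243.0]
```

Both lanes run the same number of iterations; a lane whose exponent is exhausted simply multiplies its accumulator by 1.0 in the real code, which is exactly what the conditional move supplies.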

Timing analysis for example 17.1 in P4: There are four continued dependence chains: XMM0: 6 clocks, XMM1: 6 clocks, XMM2: 2 clocks, ECX: ½ clock. Throughput for the different execution units: MMX-SHIFT: 3 uops, 6 clocks. MMX-ALU: 3 uops, 6 clocks. FP-MUL: 2 uops, 4 clocks. Throughput for port 1: 8 uops, 8 clocks. Thus, the loop appears to be limited by port 1 throughput. The best timing we can hope for is 8 clocks per iteration, which is the number of uops that must go to port 1. However, three of the continued dependence chains are interconnected by two broken, but quite long, dependence chains involving XMM3 and XMM4, which take 22 and 18 clocks, respectively. This tends to hinder the optimal reordering of uops. The measured time is approximately 10 clocks per iteration. This timing actually requires a quite impressive reordering capability, considering that several iterations must be overlapped and several dependence chains interwoven in order to satisfy the restrictions on all ports and execution units.

In situations like this where it is difficult to obtain the optimal reordering of uops, it may require some experimentation to find the optimal solution. By experimentation I found that the code in example 17.1 can be made approximately 16 clocks faster in total by modifying it to the following:

; Example 17.2 (P4)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        MOVDQA    XMM4, [ONE]   ; temporary 1.0
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        PXOR      XMM4, XMM0    ; get bits that differ between xp and 1.0
        PANDN     XMM3, XMM4    ; mask out if (i & 1)
        XORPD     XMM3, XMM0    ; (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times

Here I have used PXOR and PANDN, rather than XORPD and ANDNPD, where the result of the operation may not be a valid floating-point number.
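The XOR-based blend of example 17.2 can be verified on 32-bit integers standing in for the XMM lanes. A minimal sketch (the function name `xor_blend` is mine, and the hex constants are arbitrary test patterns, not IEEE encodings from the manual):

```python
MASK32 = 0xFFFFFFFF

def xor_blend(mask, xp, one):
    """Example 17.2's three-instruction blend: returns xp if mask is all 1's, else one."""
    diff = one ^ xp                # PXOR: bits that differ between xp and 1.0
    t = ~mask & MASK32 & diff      # PANDN: keep the difference only where mask is 0
    return t ^ xp                  # XORPD: xp where mask is all 1's, 1.0 where it is 0

# mask of all 1's selects xp; mask of 0 selects the 1.0 operand
assert xor_blend(MASK32, 0x40000000, 0x3FF00000) == 0x40000000
assert xor_blend(0,      0x40000000, 0x3FF00000) == 0x3FF00000
```

The trick is that `xp ^ (xp ^ one) == one`, so XOR'ing the masked difference back into xp yields whichever operand the mask selects, with one fewer register copy than the AND/ANDN/OR version.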

These examples show that conditional moves in SIMD registers are quite complicated and involve considerable dependence chains. However, conditional moves are unavoidable in SIMD programming if the conditions are not the same for all operands in a pack. In the above example, the conditional moves can only be replaced by branches if n0 and n1 have the same value.

Conditional moves in 32-bit registers are no faster than in SIMD registers. The reason why conditional moves are so complex and inefficient is that they have three dependencies, while the hardware design does not allow any uop to have more than two dependencies.

Packing operands in 32-bit registers

Sometimes it is possible to handle packed data in 32-bit registers. You may use this method to take advantage of the fact that 32-bit operations are fast, or to make the code compatible with old microprocessors.

A 32-bit register can hold two 16-bit integers, four 8-bit integers, or 32 Booleans. When doing calculations on packed integers in 32-bit registers, you have to take special care to avoid carries from one operand going into the next operand if overflow is possible. The following example adds 2 to all four bytes in EAX:

; Example 17.3
        MOV  EAX, [ESI]         ; read 4-byte operand
        MOV  EBX, EAX           ; copy into EBX
        AND  EAX, 7F7F7F7FH     ; get lower 7 bits of each byte in EAX
        XOR  EBX, EAX           ; get the highest bit of each byte
        ADD  EAX, 02020202H     ; add desired value to all four bytes
        XOR  EAX, EBX           ; combine bits again
        MOV  [EDI], EAX         ; store result