- •Introduction
- •Assembly language syntax
- •Microprocessor versions covered by this manual
- •Getting started with optimization
- •Speed versus program clarity and security
- •Choice of programming language
- •Choice of algorithm
- •Memory model
- •Finding the hot spots
- •Literature
- •Optimizing in C++
- •Use optimization options
- •Identify the most critical parts of your code
- •Break dependence chains
- •Use local variables
- •Use array of structures rather than structure of arrays
- •Alignment of data
- •Division
- •Function calls
- •Conversion from floating-point numbers to integers
- •Character arrays versus string objects
- •Combining assembly and high level language
- •Inline assembly
- •Calling conventions
- •Data storage in C++
- •Register usage in 16 bit mode DOS or Windows
- •Register usage in 32 bit Windows
- •Register usage in Linux
- •Making compiler-independent code
- •Adding support for multiple compilers in .asm modules
- •Further compiler incompatibilities
- •Object file formats
- •Using MASM under Linux
- •Object oriented programming
- •Other high level languages
- •Debugging and verifying assembly code
- •Reducing code size
- •Detecting processor type
- •Checking for operating system support for XMM registers
- •Alignment
- •Cache
- •First time versus repeated execution
- •Out-of-order execution (PPro, P2, P3, P4)
- •Instructions are split into uops
- •Register renaming
- •Dependence chains
- •Branch prediction (all processors)
- •Prediction methods for conditional jumps
- •Branch prediction in P1
- •Branch prediction in PMMX, PPro, P2, and P3
- •Branch prediction in P4
- •Indirect jumps (all processors)
- •Returns (all processors except P1)
- •Static prediction
- •Close jumps
- •Avoiding jumps (all processors)
- •Optimizing for P1 and PMMX
- •Pairing integer instructions
- •Address generation interlock
- •Splitting complex instructions into simpler ones
- •Prefixes
- •Scheduling floating-point code
- •Optimizing for PPro, P2, and P3
- •The pipeline in PPro, P2 and P3
- •Register renaming
- •Register read stalls
- •Out of order execution
- •Retirement
- •Partial register stalls
- •Partial memory stalls
- •Bottlenecks in PPro, P2, P3
- •Optimizing for P4
- •Trace cache
- •Instruction decoding
- •Execution units
- •Do the floating-point and MMX units run at half speed?
- •Transfer of data between execution units
- •Retirement
- •Partial registers and partial flags
- •Partial memory access
- •Memory intermediates in dependencies
- •Breaking dependencies
- •Choosing the optimal instructions
- •Bottlenecks in P4
- •Loop optimization (all processors)
- •Loops in P1 and PMMX
- •Loops in PPro, P2, and P3
- •Loops in P4
- •Macro loops (all processors)
- •Single-Instruction-Multiple-Data programming
- •Problematic Instructions
- •XCHG (all processors)
- •Shifts and rotates (P4)
- •Rotates through carry (all processors)
- •String instructions (all processors)
- •Bit test (all processors)
- •Integer multiplication (all processors)
- •Division (all processors)
- •LEA instruction (all processors)
- •WAIT instruction (all processors)
- •FCOM + FSTSW AX (all processors)
- •FPREM (all processors)
- •FRNDINT (all processors)
- •FSCALE and exponential function (all processors)
- •FPTAN (all processors)
- •FSQRT (P3 and P4)
- •FLDCW (PPro, P2, P3, P4)
- •Bit scan (P1 and PMMX)
- •Special topics
- •Freeing floating-point registers (all processors)
- •Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- •Converting from floating-point to integer (All processors)
- •Using integer instructions for floating-point operations
- •Using floating-point instructions for integer operations
- •Moving blocks of data (All processors)
- •Self-modifying code (All processors)
- •Testing speed
- •List of instruction timings for P1 and PMMX
- •Integer instructions (P1 and PMMX)
- •Floating-point instructions (P1 and PMMX)
- •MMX instructions (PMMX)
- •List of instruction timings and uop breakdown for PPro, P2 and P3
- •Integer instructions (PPro, P2 and P3)
- •Floating-point instructions (PPro, P2 and P3)
- •MMX instructions (P2 and P3)
- •List of instruction timings and uop breakdown for P4
- •Integer instructions
- •Floating-point instructions
- •SIMD integer instructions
- •SIMD floating-point instructions
- •Comparison of the different microprocessors
; (Only for processors with SSE2)
INTPOWER MACRO X, Y, N
         LOCAL I, YUSED         ; define local identifiers
I = N                           ; I used for shifting N
YUSED = 0                       ; remember if Y contains valid data
         REPT 32                ; maximum repeat count is 32
           IF I AND 1           ; test bit 0
             IF YUSED           ; if Y already contains data
               MULPD Y, X       ; multiply Y with a power of X
             ELSE               ; if this is the first time Y is used:
               MOVAPD Y, X      ; copy data to Y
               YUSED = 1        ; remember that Y now contains data
             ENDIF              ; end of IF YUSED
           ENDIF                ; end of IF I AND 1
           I = I SHR 1          ; shift right I one place
           IF I EQ 0            ; stop when I = 0
             EXITM              ; exit REPT 32 loop prematurely
           ENDIF                ; end of IF I EQ 0
           MULPD X, X           ; square X
         ENDM                   ; end of REPT 32 loop
ENDM                            ; end of INTPOWER macro definition
This macro generates the minimum number of instructions needed to do the job. There is no loop overhead, prolog or epilog in the final code. And, most importantly, no branches. All branches have been resolved by the macro preprocessor. To calculate XMM0 to the power of 12, you write:
INTPOWER XMM0, XMM1, 12
This will be resolved to:
MULPD  XMM0, XMM0    ; x^2
MULPD  XMM0, XMM0    ; x^4
MOVAPD XMM1, XMM0    ; save x^4
MULPD  XMM0, XMM0    ; x^8
MULPD  XMM1, XMM0    ; x^4 * x^8 = x^12
This has even fewer instructions than the optimized loop (example 16.13). The expanded macro takes 25 clock cycles in this example.
17 Single-Instruction-Multiple-Data programming
Since there are technological limits to the maximum clock frequency of microprocessors, the trend is towards increasing processor throughput by handling multiple data elements in parallel.
When optimizing code, it is important to consider if there are data that can be handled in parallel. The principle of Single-Instruction-Multiple-Data (SIMD) programming is that a vector or set of data are packed together in one large register and handled together in one operation.
Multiple data can be packed into 64-bit or 128-bit registers in the following ways:
data type      | data per pack | register size | instruction set | microprocessor
---------------|---------------|---------------|-----------------|----------------
8-bit integer  | 8             | 64 bit (MMX)  | MMX             | PMMX and later
16-bit integer | 4             | 64 bit (MMX)  | MMX             | PMMX and later
32-bit integer | 2             | 64 bit (MMX)  | MMX             | PMMX and later
64-bit integer | 1             | 64 bit (MMX)  | SSE2            | P4 and later
32-bit float   | 2             | 64 bit (MMX)  | 3DNow           | AMD only
8-bit integer  | 16            | 128 bit (XMM) | SSE2            | P4 and later
16-bit integer | 8             | 128 bit (XMM) | SSE2            | P4 and later
32-bit integer | 4             | 128 bit (XMM) | SSE2            | P4 and later
64-bit integer | 2             | 128 bit (XMM) | SSE2            | P4 and later
32-bit float   | 4             | 128 bit (XMM) | SSE             | P3 and later
64-bit float   | 2             | 128 bit (XMM) | SSE2            | P4 and later
All these packing modes are available on the latest microprocessors from Intel and AMD, except for the 3DNow mode, which is available only on AMD processors. Whether the different instruction sets are supported on a particular microprocessor can be determined with the CPUID instruction, as explained on page 26. The 64-bit MMX registers cannot be used together with the floating-point registers. The 128-bit XMM registers can only be used if supported by the operating system. See page 27 for how to check if the use of XMM registers is enabled.
You may make two or more versions of the critical part of your code: one that will run on old microprocessors, and one that uses the most advantageous packing mode and instruction set. The program should automatically select the version of the code that is appropriate for the system on which it is running.
Choose the smallest data size that fits your purpose in order to pack as many data as possible into one register. Mathematical computations may require double precision (64-bit) floats in order to avoid loss of precision in the intermediate calculations, even if single precision is sufficient for the final result.
Before you choose to use SIMD instructions for integer operations, you have to consider whether the resulting code will be faster than simple integer instructions in 32-bit registers. Simple operations such as integer additions take four times as long in SIMD registers as in 32-bit registers on the P4. The SIMD instructions are therefore only advantageous for integer additions if they can handle at least four data elements in parallel. Loading and storing memory operands take no longer for 64-bit and 128-bit registers than for 32-bit registers. Integer shifts and multiplications are faster in 64-bit and 128-bit registers than in 32-bit registers on the P4. With SIMD code, you may spend more instructions on trivial things, such as moving data into the right positions in the registers and emulating conditional moves, than on the actual calculations. Example 17.1 below illustrates this.
For floating-point calculations on the P4, it is often advantageous to use XMM registers, even if there are no opportunities for handling data in parallel. The latency of floating-point operations is shorter in XMM registers than in floating-point registers, and you can make conversions between integers and floating-point numbers without using a memory intermediate. Furthermore, you get rid of the annoying floating-point register stack.
Memory operands for SIMD instructions have to be properly aligned. See page 28 for how to align data in memory. The alignment requirement makes it complicated to pass 64-bit and 128-bit function parameters on the stack (see Intel's optimization reference manual). It is therefore recommended to pass 64-bit and 128-bit parameters in registers or through a pointer, rather than on the stack.
Conditional moves in SIMD registers
Let's repeat example 16.13 (page 107) with two double-precision floats in XMM registers. This enables us to compute x0^n0 and x1^n1 in parallel:
; Example 17.1 (P4)
DATA    SEGMENT PARA PUBLIC 'DATA'
ONE     DQ      1.0, 1.0
X       DQ      ?, ?            ; X0, X1
N       DD      ?, ?            ; N0, N1
DATA    ENDS

CODE    SEGMENT BYTE PUBLIC 'CODE'
; register use:
; XMM0 = xp
; XMM1 = power
; XMM2 = i (i0 and i1 each stored twice as DWORD integers)
; XMM3 = 1.0 if not(i & 1)
; XMM4 = xp if (i & 1)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        MOVAPD    XMM4, XMM0    ; copy of xp
        ANDPD     XMM4, XMM3    ; xp if bit = 1
        ANDNPD    XMM3, [ONE]   ; 1.0 if bit = 0
        ORPD      XMM3, XMM4    ; (i & 1) ? xp : 1.0
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times
Conditional moves in SIMD registers are implemented by generating a mask of all 1's representing the condition; AND'ing the first operand with the mask; AND'ing the second operand with the inverted mask; and OR'ing the two together. The mask can be generated by a compare instruction or, as here, by shifting a bit into the most significant position and then shifting it arithmetically to copy it into all positions. The arithmetic shift does not exist in a 64-bit version, so we have to use the 32-bit version with two identical copies of the condition operand.
The repeat count of the loop is calculated separately outside the loop in order to reduce the number of instructions inside the loop.
Timing analysis for example 17.1 on the P4: There are four continued dependence chains: XMM0: 6 clocks, XMM1: 6 clocks, XMM2: 2 clocks, ECX: ½ clock. Throughput for the different execution units: MMX-SHIFT: 3 uops, 6 clocks. MMX-ALU: 3 uops, 6 clocks. FP-MUL: 2 uops, 4 clocks. Throughput for port 1: 8 uops, 8 clocks. Thus, the loop appears to be limited by port 1 throughput. The best timing we can hope for is 8 clocks per iteration, which is the number of uops that must go to port 1. However, three of the continued dependence chains are interconnected by two broken, but quite long, dependence chains involving XMM3 and XMM4, which take 22 and 18 clocks, respectively. This tends to hinder the optimal reordering of uops. The measured time is approximately 10 clocks per iteration. This timing actually requires a quite impressive reordering capability, considering that several iterations must be overlapped and several dependence chains interwoven in order to satisfy the restrictions on all ports and execution units.
In situations like this where it is difficult to obtain the optimal reordering of uops, it may require some experimentation to find the optimal solution. By experimentation I found that the code in example 17.1 can be made approximately 16 clocks faster in total by modifying it to the following:
; Example 17.2 (P4)
        MOVQ      XMM2, [N]     ; load N0, N1
        PUNPCKLDQ XMM2, XMM2    ; copy to get N0, N0, N1, N1
        MOVAPD    XMM1, [ONE]   ; power init. = 1.0
        MOVAPD    XMM0, [X]     ; load X0, X1
        MOV       EAX, [N]      ; N0
        OR        EAX, [N+4]    ; N0 OR N1 to get highest significant bit
        XOR       ECX, ECX      ; 0 if N0 and N1 both zero
        BSR       ECX, EAX      ; compute repeat count for max(N0,N1)
A:      MOVDQA    XMM3, XMM2    ; copy i
        MOVDQA    XMM4, [ONE]   ; temporary 1.0
        PSLLD     XMM3, 31      ; get least significant bit of i
        PSRAD     XMM3, 31      ; copy to all bit positions
        PSRLD     XMM2, 1       ; i >>= 1
        PXOR      XMM4, XMM0    ; get bits that differ between xp and 1.0
        PANDN     XMM3, XMM4    ; mask out if (i & 1)
        XORPD     XMM3, XMM0    ; (i & 1) ? xp : 1.0
        MULPD     XMM0, XMM0    ; xp *= xp
        MULPD     XMM1, XMM3    ; power *= (i & 1) ? xp : 1.0
        SUB       ECX, 1
        JNS       A             ; repeat ECX+1 times
Here I have used PXOR and PANDN, rather than XORPD and ANDNPD, where the result of the operation may not be a valid floating-point number.
These examples show that conditional moves in SIMD registers are quite complicated and involve considerable dependence chains. However, conditional moves are unavoidable in SIMD programming if the conditions are not the same for all operands in a pack. In the above example, the conditional moves can only be replaced by branches if n0 and n1 have the same value.
Conditional moves in 32-bit registers are no faster than in SIMD registers. The reason why conditional moves are so complex and inefficient is that they have three dependencies, while the hardware design does not allow any uop to have more than two dependencies.
Packing operands in 32-bit registers
Sometimes it is possible to handle packed data in 32-bit registers. You may use this method to take advantage of the fact that 32-bit operations are fast, or to make the code compatible with old microprocessors.
A 32-bit register can hold two 16-bit integers, four 8-bit integers, or 32 Booleans. When doing calculations on packed integers in 32-bit registers, you have to take special care to avoid carries from one operand going into the next operand if overflow is possible. The following example adds 2 to all four bytes in EAX:
; Example 17.3
        MOV   EAX, [ESI]        ; read 4-byte operand
        MOV   EBX, EAX          ; copy into EBX
        AND   EAX, 7F7F7F7FH    ; get lower 7 bits of each byte in EAX
        XOR   EBX, EAX          ; get the highest bit of each byte
        ADD   EAX, 02020202H    ; add desired value to all four bytes
        XOR   EAX, EBX          ; combine bits again
        MOV   [EDI], EAX        ; store result