- Introduction
- Assembly language syntax
- Microprocessor versions covered by this manual
- Getting started with optimization
- Speed versus program clarity and security
- Choice of programming language
- Choice of algorithm
- Memory model
- Finding the hot spots
- Literature
- Optimizing in C++
- Use optimization options
- Identify the most critical parts of your code
- Break dependence chains
- Use local variables
- Use array of structures rather than structure of arrays
- Alignment of data
- Division
- Function calls
- Conversion from floating-point numbers to integers
- Character arrays versus string objects
- Combining assembly and high level language
- Inline assembly
- Calling conventions
- Data storage in C++
- Register usage in 16 bit mode DOS or Windows
- Register usage in 32 bit Windows
- Register usage in Linux
- Making compiler-independent code
- Adding support for multiple compilers in .asm modules
- Further compiler incompatibilities
- Object file formats
- Using MASM under Linux
- Object oriented programming
- Other high level languages
- Debugging and verifying assembly code
- Reducing code size
- Detecting processor type
- Checking for operating system support for XMM registers
- Alignment
- Cache
- First time versus repeated execution
- Out-of-order execution (PPro, P2, P3, P4)
- Instructions are split into uops
- Register renaming
- Dependence chains
- Branch prediction (all processors)
- Prediction methods for conditional jumps
- Branch prediction in P1
- Branch prediction in PMMX, PPro, P2, and P3
- Branch prediction in P4
- Indirect jumps (all processors)
- Returns (all processors except P1)
- Static prediction
- Close jumps
- Avoiding jumps (all processors)
- Optimizing for P1 and PMMX
- Pairing integer instructions
- Address generation interlock
- Splitting complex instructions into simpler ones
- Prefixes
- Scheduling floating-point code
- Optimizing for PPro, P2, and P3
- The pipeline in PPro, P2 and P3
- Register renaming
- Register read stalls
- Out of order execution
- Retirement
- Partial register stalls
- Partial memory stalls
- Bottlenecks in PPro, P2, P3
- Optimizing for P4
- Trace cache
- Instruction decoding
- Execution units
- Do the floating-point and MMX units run at half speed?
- Transfer of data between execution units
- Retirement
- Partial registers and partial flags
- Partial memory access
- Memory intermediates in dependencies
- Breaking dependencies
- Choosing the optimal instructions
- Bottlenecks in P4
- Loop optimization (all processors)
- Loops in P1 and PMMX
- Loops in PPro, P2, and P3
- Loops in P4
- Macro loops (all processors)
- Single-Instruction-Multiple-Data programming
- Problematic Instructions
- XCHG (all processors)
- Shifts and rotates (P4)
- Rotates through carry (all processors)
- String instructions (all processors)
- Bit test (all processors)
- Integer multiplication (all processors)
- Division (all processors)
- LEA instruction (all processors)
- WAIT instruction (all processors)
- FCOM + FSTSW AX (all processors)
- FPREM (all processors)
- FRNDINT (all processors)
- FSCALE and exponential function (all processors)
- FPTAN (all processors)
- FSQRT (P3 and P4)
- FLDCW (PPro, P2, P3, P4)
- Bit scan (P1 and PMMX)
- Special topics
- Freeing floating-point registers (all processors)
- Transitions between floating-point and MMX instructions (PMMX, P2, P3, P4)
- Converting from floating-point to integer (All processors)
- Using integer instructions for floating-point operations
- Using floating-point instructions for integer operations
- Moving blocks of data (All processors)
- Self-modifying code (All processors)
- Testing speed
- List of instruction timings for P1 and PMMX
- Integer instructions (P1 and PMMX)
- Floating-point instructions (P1 and PMMX)
- MMX instructions (PMMX)
- List of instruction timings and uop breakdown for PPro, P2 and P3
- Integer instructions (PPro, P2 and P3)
- Floating-point instructions (PPro, P2 and P3)
- MMX instructions (P2 and P3)
- List of instruction timings and uop breakdown for P4
- Integer instructions
- Floating-point instructions
- SIMD integer instructions
- SIMD floating-point instructions
- Comparison of the different microprocessors
Uop retirement
The retirement station can handle 3 uops per clock cycle. Taken branches can only be handled by the first of the three slots in the retirement station.
If you aim at an average throughput of 3 uops per clock cycle then avoid an excessive number of jumps, calls and branches. Small critical loops should preferably have a number of uops divisible by 3 (see page 88).
Instruction decoding
If the critical part of your code doesn't fit into the trace cache, then the limiting stage may be instruction decoding. The decoder can handle one instruction per clock cycle, provided that the instruction generates no more than 4 uops and no microcode, and has no more than
one prefix (see page 80). If decoding is a bottleneck, then you may try to minimize the number of instructions rather than the number of uops.
Branch prediction
The calculations of latencies and throughputs are only valid if all branches are predicted. Branch mispredictions can seriously slow down performance when latency or throughput is the limiting factor.
Avoid poorly predictable branches in critical parts of your code, unless the alternative (e.g. conditional moves) adds so much extra dependence and latency that it outweighs the advantage of avoiding the branch. See page 45 for details.
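To illustrate the tradeoff in high-level terms, here is a minimal C++ sketch (not taken from the manual; the function names are only illustrative). The branchy version risks mispredictions when the condition is unpredictable, while the branch-free version avoids the branch at the cost of an extra dependency, which a compiler may implement with a conditional move:

int ClampBranchy(int x) {
    if (x < 0) x = 0;          // may compile to a conditional jump
    return x;
}

int ClampBranchless(int x) {
    return (x < 0) ? 0 : x;    // candidate for CMOV; no branch to mispredict
}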
16 Loop optimization (all processors)
When analyzing a program, you often find that most of the time consumption lies in the innermost loop. The way to improve the speed is to carefully optimize the most time-consuming loop using assembly language. The rest of the program may be left in high-level language.
In all the following examples it is assumed that all data are likely to be in the cache most of the time. If the speed is limited by cache misses then there is no reason to optimize the instructions. Rather, you should concentrate on organizing your data in a way that minimizes cache misses (see page 29).
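As a general illustration of that point (this sketch is not part of the original example, and the function names are hypothetical), touching memory in sequential order is usually the most cache-friendly choice. Both functions below compute the same sum over an n*n row-major matrix, but the first walks consecutive addresses while the second strides by n elements per access and is likely to miss the cache for large n:

double SumSequential(const double * m, int n) {
    double s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += m[i*n + j];           // consecutive addresses
    return s;
}

double SumStrided(const double * m, int n) {
    double s = 0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += m[i*n + j];           // stride of n elements
    return s;
}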
A loop generally contains a counter controlling how many times to iterate, and often an array access that reads or writes one array element for each iteration. As an example, I have chosen a procedure that reads integers from an array, changes the sign of each integer, and stores the results in another array. The C++ code for this procedure could look like this:
void ChangeSign (int * A, int * B, int N) {
    for (int i = 0; i < N; i++) B[i] = -A[i];
}
Translating to assembly, we might write the procedure like this:
; Example 16.1:
_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]      ; addresses of parameters on stack
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2                      ; skip if N = 0
        MOV     ESI, [A]                ; pointer to source A
        MOV     EDI, [B]                ; pointer to destination B
        CLD
L1:     LODSD                           ; read
        NEG     EAX                     ; change sign
        STOSD                           ; write
        LOOP    L1                      ; repeat
L2:     POP     EDI
        POP     ESI
        RET                             ; (no extra pop if _cdecl calling convention)
_ChangeSign ENDP
This looks like a nice solution, but it is not optimal because it uses the complex instructions JECXZ, CLD, LODSD, STOSD and LOOP, which are inefficient on most processors. It can be improved by avoiding these instructions:
; Example 16.2:
        MOV     ECX, [N]                ; ECX = counter
        MOV     ESI, [A]
        TEST    ECX, ECX
        JZ      SHORT L2                ; skip if N = 0
        MOV     EDI, [B]
L1:     MOV     EAX, [ESI]              ; read
        ADD     ESI, 4                  ; increment source pointer
        NEG     EAX                     ; change sign
        MOV     [EDI], EAX              ; write
        ADD     EDI, 4                  ; increment destination pointer
        SUB     ECX, 1                  ; decrement loop counter
        JNZ     L1                      ; loop
L2:
Here I am using SUB ECX,1 instead of DEC ECX because the latter instruction uses one extra uop on the P4. It is an advantage to have the loop control branch at the bottom of the loop, because if the branch were at the top of the loop, we would need an extra jump at the bottom of the loop.
Using the same register for counter and index reduces the number of instructions:
; Example 16.3:
        MOV     ESI, [A]
        MOV     EDI, [B]
        MOV     ECX, [N]
        SUB     EDX, EDX                ; set counter EDX = 0
        TEST    ECX, ECX
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*EDX]        ; use base pointer and index
        NEG     EAX
        MOV     [EDI+4*EDX], EAX
        ADD     EDX, 1                  ; increment loop counter
        CMP     EDX, ECX
        JB      L1
L2:
We can get rid of the CMP instruction in example 16.3 by letting the loop counter end at zero and using the zero flag to detect when the loop is finished, as we did in example 16.2. One way of doing this would be to execute the loop backwards, taking the last array elements first. However, data caches are optimized for accessing data forwards, not backwards, so if cache misses are likely then you should rather start the counter at -N and count through negative values up to zero. This is possible if you let the base registers point to the end of the arrays rather than the beginning:
; Example 16.4:
        MOV     ECX, [N]
        MOV     ESI, [A]
        MOV     EDI, [B]
        LEA     ESI, [ESI+4*ECX]        ; point to end of array A
        LEA     EDI, [EDI+4*ECX]        ; point to end of array B
        NEG     ECX                     ; -N
        JZ      SHORT L2                ; skip if N = 0
L1:     MOV     EAX, [ESI+4*ECX]
        NEG     EAX
        MOV     [EDI+4*ECX], EAX
        ADD     ECX, 1
        JNZ     L1
L2:
Now we are down to two simple instructions for loop overhead, which is as low as you can get.
On processors with out-of-order execution, it is likely that the second calculation will start before the first calculation is finished. In some situations, the processor may start several iterations ahead, until the maximum throughput of the execution units is reached.
The out-of-order capabilities cannot be utilized, however, in the case where each calculation depends on the result of the previous one. Such a continued dependence chain is the worst situation you can have, and you should definitely try to find a way to break down the dependence chain. Assume, for example, that we have to multiply a long series of integers. The C++ code looks like this:
int MultiplyList (int * List, int N) {
    int product = 1, i;
    for (i = 0; i < N; i++) product *= List[i];
    return product;
}
The best thing you can do here is to roll out the loop and use two accumulators:
; Example 16.5:
_MultiplyList PROC NEAR
        PUSH    ESI
        PUSH    EBX
List    EQU     DWORD PTR [ESP+12]      ; addresses of parameters on stack
N       EQU     DWORD PTR [ESP+16]
        MOV     ESI, [List]
        MOV     ECX, [N]
        MOV     EAX, 1                  ; accumulator one
        MOV     EBX, EAX                ; accumulator two
        SUB     EDX, EDX                ; counter starts at 0
        SUB     ECX, 1                  ; N-1
        JS      SHORT L3                ; N-1 < 0 means N = 0
        SHL     ECX, 2                  ; 4*(N-1)
L1:     IMUL    EAX, [ESI+EDX]          ; multiply first accumulator
        IMUL    EBX, [ESI+EDX+4]        ; multiply second accumulator
        ADD     EDX, 8                  ; add 2*(data size) to counter
        CMP     EDX, ECX                ; do we have at least 2 more?
        JB      L1
        JA      SHORT L2                ; finished if counter > N-1
        IMUL    EAX, [ESI+EDX]          ; N is odd, do one more
L2:     IMUL    EAX, EBX                ; combine the two accumulators
L3:     POP     EBX
        POP     ESI
        RET
_MultiplyList ENDP
Now we are doing two operations in parallel, and the long dependence chain is split into two parallel dependence chains of half the length. This will reduce the calculation time by almost 50% if the multiplication unit is pipelined.
When we roll out the loop by two, we have to check if the number of factors in List is odd. If N is odd, then we have to do the odd multiplication outside the loop. We can do the odd one either before or after the main loop. It may be more efficient to do it after the loop if List is aligned by 8.
In general, if you roll out a loop by R, i.e. if you do R calculations per loop iteration, then the number of extra calculations to do outside the loop is E = N modulo R. If you want to do the extra E calculations before the main loop, then you have to calculate E, which requires a division if R is not a power of 2. This can be avoided by doing the extra calculations after the main loop, as shown in example 16.5. Of course it is an advantage to choose R so that N is divisible by R, if possible.
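For reference, the same idea can be expressed in C++ (this is only a sketch of the technique with R = 2, mirroring example 16.5; it is not necessarily what a compiler would generate). The main loop handles two factors per iteration in two independent dependence chains, and the E = N modulo R leftover factor is multiplied in after the loop:

int MultiplyList2(const int * List, int N) {
    int product1 = 1, product2 = 1;     // two accumulators
    int i = 0;
    for (; i + 1 < N; i += 2) {         // main loop, rolled out by R = 2
        product1 *= List[i];
        product2 *= List[i+1];
    }
    if (i < N) product1 *= List[i];     // N odd: one extra multiplication
    return product1 * product2;         // combine the two accumulators
}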
A suitable roll-out factor R can be found by dividing the latency of the most critical calculation instruction by the reciprocal throughput of the same instruction. For example, if the critical instruction has a latency of 6 clock cycles and a reciprocal throughput of 2 clock cycles, then R = 6/2 = 3 would be a reasonable choice. Remember that it is not necessary to roll out the loop if there is no continued dependence chain. Excessive loop unrolling will only fill up the code cache or trace cache without any significant speed advantage. However, you may choose to roll out a loop if this improves the prediction of the loop control branch.
In example 16.5, we are using three instructions for counter and loop control. It is possible to reduce this to two instructions, as in example 16.4, but this will make the code quite awkward. In most cases, the loop control instructions can execute in parallel with the calculations so you don't have to care about minimizing the loop control overhead.
16.1 Loops in P1 and PMMX
The P1 and PMMX processors have no capabilities for out-of-order execution. Instead you have to care about pairing opportunities. The code in example 16.4 may be changed to:
; Example 16.6:
        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]        ; point to end of array A
        SUB     ECX, EAX                ; -N
        LEA     EDI, [EDI+4*EAX]        ; point to end of array B
        JZ      SHORT L3
        XOR     EBX, EBX                ; start first calculation
        MOV     EAX, [ESI+4*ECX]
        INC     ECX
        JZ      SHORT L2
L1:     SUB     EBX, EAX                ; u
        MOV     EAX, [ESI+4*ECX]        ; v (pairs)
        MOV     [EDI+4*ECX-4], EBX      ; u
        INC     ECX                     ; v (pairs)
        MOV     EBX, 0                  ; u
        JNZ     L1                      ; v (pairs)
L2:     SUB     EBX, EAX                ; end last calculation
        MOV     [EDI+4*ECX-4], EBX
L3:
Here the iterations are overlapped in order to improve pairing opportunities. We begin reading the second value before we have stored the first one. The MOV EBX,0 instruction has been put in between INC ECX and JNZ L1, not to improve pairing, but to avoid the AGI stall that would result from using ECX as address index in the first instruction pair after it has been incremented.
Loops with floating-point operations are somewhat different because the floating-point instructions are overlapping rather than pairing. Consider the C++ language code: