程式電羊: 9月 2015

2015年9月28日星期一

Chapter 4 The Processor

這講這章開始有幾個名詞必須弄懂combination element與sequential element的差別,前者僅有輸入值會影響輸出值,而後者除了輸入值之外還必須考量到目前的狀態(stored information in registers),舉可口可樂投幣機為例,一罐可樂賣18元且能投進去的硬幣面額只有分別1,5,10元,想像一個情境先是投進三個一元接著三個五元,我們都會很自然的認為應該會有一罐飲料會掉下來,但如果這個投幣機是combinational element他不會"記得"你投了前三個一元與2個五元,可想而知,我們都會認定這台機器會吃錢,但若使用sequential elenemt來設計,投幣機會記錄已經投了1元並升起一個flag,接著等待到下一個positive clock edge,進入到已投一元的state,並將所有flag歸零.

note: positive clock edge 為clock從零到一的過程,兩個positive clock edge間理論上為一個cycle

而這次的課程就是借用這兩項原理來設計處理器,並整理出一下的重點

Combinational logic transforms data during clock cycles

Between clock edges
Input from state elements, output to state element
Longest delay determines clock period

換句話說,我們設計的CPU一個cycle的時間即combinational logics所需的運算時間

辛苦的莘莘學子,就必須老老實實的劃出與接好data path

note:

1.why alu operation need 4 bit?

我當時在ALU使用了12個情況,便可應付MIPS的指令

Load/Store: F = add

Branch: F = subtract

R-type: F depends on funct field

2.branch instruction steps

Read register operands
Compare operands (Use ALU, subtract and check Zero output)
Calculate target address

Sign-extend displacement()
Shift left 2 places (word displacement)
Add to PC + 4 (Already calculated by instruction fetch )

綜合以上 FULL DATAPATH and control

note: memTOreg 有的人設計的mux一和零是反過來的

從single cycle到pipeline

終於完成這部分後,可以發現每一個指令必須走完必經的critical path之後下一個指令才能進來,這種single cycle的設計簡單,但卻沒有效率

note: load instruction critical path

Instruction memory-->register file-->ALU-->data memory-->register file

同時也違反MIPS的原則,而解決辦法就是pipeline,我們發現當指令正在ALU的部分進行運算時,其他原件都是閒置的! 所以pipeline的中心思想就是利用這些閒置的原件

single cycle的設計因為CPI=1,每個cycle的時間必須足夠長,即使beq僅花500ps,cycle time能必須為800ps.

而pipeline的cycle time則必須等足夠長的stage time,如上圖每個stage最大值為200ps.

理論上計算piepline cycle time

800(cycle time)/5(stages)

=160,

理論與實際差不多~

同時這種設計的難題就是還必須處理Hazard,有三種

Structure hazards

A required resource is busy (pipeline有兩個memory分別負責instruction和data)

Data hazard

Need to wait for previous instruction to complete its data read/write

Control hazard

Deciding on control action depends on previous instruction

Data hazard

case1: solved by forwarding

case2: solved by stall

Load-Use Data Hazard

Control hazard

Branch determines flow of control

Fetching next instruction depends on branch outcome

In MIPS pipeline Need to compare registers and compute target early in the pipeline

Add hardware to do it in ID stage

Branch prediction

Longer pipelines can’t readily determine branch outcome early

Stall penalty becomes unacceptable

Predict outcome of branch

Only stall if prediction is wrong (flush)

How to implement pipeline?

Need registers between stages

To hold information produced in previous cycle

再來進階

detect Hazard

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs

1b. EX/MEM.RegisterRd = ID/EX.RegisterRt

2a. MEM/WB.RegisterRd = ID/EX.RegisterRs

2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Detect the need to forward

But only if forwarding instruction will write to a register!

EX/MEM.RegWrite, MEM/WB.RegWrite

And only if Rd for that instruction is not $zero

EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0

然後再加上前面Data hazard的判斷式

Double data hazard

原則就是"使用最近的值",例如在上面的例子就是得forward line2的值,而非line1

Load-Use Hazard Detection

Load-use hazard when (在ID stage時檢查)

ID/EX.MemRead and

((ID/EX.RegisterRt = IF/ID.RegisterRs) or

(ID/EX.RegisterRt = IF/ID.RegisterRt))

stall the pipeline!

note:

how to stall?

Force control values in ID/EX register to 0
Prevent update of PC and IF/ID register

Using instruction is decoded again
Following instruction is fetched again
1-cycle stall allows MEM to read data for lw

Branch/Control Hazards

If branch outcome determined in MEM

PCSrc (AND-ing Branch and Zero) is generated in MEM
branch delay and need to flush 3 instructions

solution to reduce the delay:

Move hardware to determine outcome to ID stage

add

Target address adder
Register comparator

case1 for branch hazard

case2

note: 簡而言之 R-format需要兩個cycle, lw須要兩個cycle

Dynamic Branch Prediction

Use dynamic prediction
Branch prediction buffer (aka branch history table)
Indexed by recent branch instruction addresses
Stores outcome (taken/not taken)
To execute a branch

Check table, expect the same outcome
Start fetching from fall-through or target
If wrong, flush pipeline and flip prediction

1-Bit Predictor

在程式裡常常會遇到loop,若使用1bit predictor,當判斷錯誤時,就反轉猜測值,反而會連續猜錯.若使用two-bit,就必須連續猜錯兩次,才會反轉測值,這樣可以降低錯誤率

Instruction-Level Parallelism (ILP)(可直接想成是pipeline)

Pipelining: executing multiple instructions in parallel
To increase ILP

Deeper pipeline

Less work per stage shorter clock cycle

multiple issues(增加洗衣機!)

Replicate pipeline stages multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)

Static Multiple Issue

Compiler groups instructions to be issued together(can be issued on a single cycle)
Packages them into “issue slots”(Very Long Instruction Word (VLIW) )
Compiler must remove some/all hazards

介紹兩種增加效能的方法

scheduling

2.Loop Unrolling(循序展開)

借由犧牲程式的尺寸來增加程式執行速度的方法

Replicate loop body to expose more parallelism
Use different registers per replication(must register renaming to avoid anti-dependency)

e.g

for (i = 1; i <= 60; i++)
a[i] = a[i] * b + c;

展開兩次

for (i = 1; i <= 60; i+=3)
{
a[i] = a[i] * b + c;
a[i+1] = a[i+1] * b + c;
a[i+2] = a[i+2] * b + c;
}

Dynamic Multiple Issue

“Superscalar” processors

An advanced pipelining techniques that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.

參考:

Difference between Combinational logic

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Seq/diff.html

吳凱強交大計組課本講義

wiki

loop unrolling

https://zh.wikipedia.org/wiki/%E5%BE%AA%E7%8E%AF%E5%B1%95%E5%BC%80

dual-issue processor

http://stackoverflow.com/questions/8014739/what-exactly-is-a-dual-issue-processor

2015年9月19日星期六

Chapter 3 Arithmetic for Computers

一切都是從這張圖開始,想當初修計組這門課對於前面的組語概念都不甚了解,最有印象的內容大概就是這章節,因為他實作起來比較簡單,總之32bit加法器,非常適合新手入門!

在實作32bit加法器之前,我想大家都有一個必經的過程先個別實作出1bit的加法器,如下圖,還記得大家在實作實都會照著一張表來作,當時也沒有想太多就照著寫就對了

Function	Ainvert	Binvert	CarryIn	Operation1	Operatopn2
and	0	0	X	0	0
or	0	0	X	0	1
add	0	0	0	1	0
sub	0	1	1	1	0
nor	1	1	X	0	0
slt	0	1	1	1	1

bit operation比較直觀,我想大家會有疑問的應該會是對sub與slt,這個問題也是每次遇到都要再查一遍,屢試不爽.

Sub

A - B = A + (2's comp B)

= A + (B' + 1). note: +1的部分可以從carry in借來

Slt 中心思想為A,B兩數相減

Result reg is 1 if a < b

Result reg is 0 if a >= b

最後還必須判斷是否溢位(overflow)

Recall the simpler overflow detection:

An overflow occurs if and only if the carry in to the HOB differs from the carry out of the HOB. (直接檢查最高位的carry_in與carry_out是否相同,不同的話則溢位)

額外補充:

Implementing Zero Detection

To see if all bits are zero just need the NOR of all the bits
Conceptually trivially but does require some wiring

note: 這可以在branch condition時,快速判斷兩暫存器內容是否相同

花了一些篇幅介紹簡單的加法器,然而往往最簡單的想法可能步驟是簡單而多的.

這種加法器又稱為漣波進位加法器(Ripple Carry Adder ,RCA),漣波進位加法器必須從最小有效位傳輸至最大有效位,因此會有很長的電路延遲,就N位元漣波進位加法器,最長延遲路徑(Critical path delay)為2N個邏輯閘延遲(must wait for input A and B),而得到正確結果須2N+1

這種加法器相較於其他加法器算是延遲時間最長的一種.

前詹進位加法器 (Carry Lookahead Adder, CLA)

先定義兩個名詞

propagate is that p is true if the current bit will propagate a carry from its input to its output. note: 是否會傳遞進位?

It is easy to see that p = (a OR b), i.e.

if and only if (a OR b)

then if there is a carry in

then there is a carry out

generate is that g is true if the current bit will generate a carry out (independent of the carry in). note:是否會產生進位?

It is easy to see that g = (a AND b), i.e.

if and only if (a AND b)

then the must be a carry-out independent of the carry-in

to generate a carry: gi = ai bi
to propagate a carry: pi = ai+bi

四位元前瞻進位加法器

c1 = g0 + p0 c0

c2 = g1 + p1 c1 = g1 + p1 g0 + p1 p0 c0

c3 = g2 + p2 c2 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 c0

c4 = g3 + p3 c3 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0 + p3 p2 p1 p0 c0

無號數乘法 Multiplication

第一次看這邊總是搞不懂到底是multiplicand 還是multiplier該往右,在這之前必須知道multiplicand是被乘數,multiplier是乘數,在進行乘法時,都會先選定乘數的最小位乘上一個被乘數,在選乘數的第二位乘上一個被乘數,依序下去,所以從這個過程可以看出直覺得想法該如何時作乘法器.

首先要知道(n bit)乘上(m bit),所乘結果,最多只有n+m bit,所以product與被乘數的部分為64bit,傳統的乘法為每次檢查multipler的LSB如果為1就在product累加並將multiplicand左移,muliplier右移(試想乘法的過程每次都會對齊目前的乘數位置),這個方法的缺點是被乘數總有一半的位元數為零(wasted),所以有改良版

Hardware-friendly multiplication

之前提到有一半浪費的空間,加強版的的要旨就是要好好的利用這部分,首先固定不動被乘數,並將乘數放在product這個register(意味著不需要額外的空間來儲存multiplier),基本步驟與傳統乘法相似,差別在判斷乘數的LSB為1時,將multiplicand累加至product的左半邊,反之為0則no operation,最後右移product,這舉動相當漂亮,將傳統的一左一右合併成一個動作.

有號數乘法

第一種方法為直接將乘數與被乘數用二的補數方式改成正數

第二種方法為

Booth’s Alogrithm

這個演算法的硬體與硬體最佳化的硬體大致相同,這個演算法多了一個mythiccal bit的register

布斯演算法的核心在於他對數字中間部分的1字串進行頭,中,尾三段的處理. 0字串的部分由於本身就無須進行運算,所以不必理會,布斯演算法一次檢查乘數的連續兩個位元,所以依據這兩個位元的值會有四種狀況

依照目前與先前位元(右邊)的不同，執行下面工作：

00：字串0的中間部份，不需要算術運算
01：字串1的結尾，所以將被乘數加到乘積的左半部
10：字串1的開端，所以從乘積的左半部減去被乘數
11：字串1的中間部份，所以不需要算術運算

如同前面的演算法，將乘積暫存器右移1位元

note:

1.一開始 mythical bit值為0

2.將乘積右移時，因為我們是處理有號數字，必須保留中間過程結果的正負符號，所以當乘積向右移時，擴展其符號。因此第一次重覆迴圈的乘積暫存器右移1位元時，將(111001110)2轉換成(111100111)2，不是(011100111)2，這種移位稱為算術右移(arithemetic right shift)，有別於邏輯右移。

重覆次數	步驟	被乘數暫存器	乘積暫存器
0	起始值	0010	0000 1101 0
1	10=>乘積暫存器=乘積暫存器-被乘數暫存器	0010	1110 1101 0
1	乘積暫存器右移	0010	1111 0110 1
2	01=>乘積暫存器=乘積暫存器+被乘數暫存器	0010	0001 0110 1
2	乘積暫存器右移	0010	1111 1011 0
3	10=>乘積暫存器=乘積暫存器-被乘數暫存器	0010	1110 1011 0
3	乘積暫存器右移	0010	1111 0101 1
4	11=>不做任何事	0010	1111 0101 1
4	乘積暫存器右移	0010	1111 1010 1