Monday, September 28, 2015

Chapter 4 The Processor



Starting from this chapter, a few terms have to be understood, in particular the difference between a combinational element and a sequential element: the output of the former depends only on its inputs, while the latter also depends on its current state (information stored in registers). Take a Coca-Cola vending machine as an example: a can costs 18 dollars and the machine accepts only 1-, 5-, and 10-dollar coins. Imagine inserting three 1-dollar coins followed by three 5-dollar coins; we naturally expect a can to drop out. But a machine built only from combinational elements would not "remember" the three 1-dollar coins and two 5-dollar coins already inserted, so, as you can guess, it would simply eat the money. Built with sequential elements, the machine instead records that a 1-dollar coin was inserted by raising a flag, waits for the next positive clock edge, moves into the "one dollar inserted" state, and clears the flags.


note: a positive clock edge is the transition of the clock from 0 to 1; the interval between two positive clock edges is, in principle, one cycle
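
To make the difference concrete, here is a minimal C sketch of the vending-machine example as a state machine; the 18-dollar price and the 1/5/10 coin values come from the example above, while the struct and function names are made up for illustration.

#include <stdbool.h>

/* Sequential element: the accumulated amount is stored state that
   survives between clock edges. */
typedef struct {
    int amount;                        /* state: total money inserted so far */
} VendingMachine;

/* Called once per positive clock edge with the coin inserted during
   that cycle (0 if none). Returns true when a can should drop. */
bool vending_step(VendingMachine *vm, int coin)
{
    vm->amount += coin;                /* next state = f(current state, input) */
    if (vm->amount >= 18) {            /* one can costs 18 dollars */
        vm->amount -= 18;              /* dispense; keep any remainder as credit */
        return true;
    }
    return false;
}

/* A purely combinational "machine" could only look at the coin being
   inserted right now, so it would never reach 18 and would eat the money. */
bool combinational_vending(int coin)
{
    return coin >= 18;                 /* output depends on the input alone */
}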


This lecture uses these two kinds of elements to design the processor. The key points are summarized below.


Combinational logic transforms data during clock cycles
  1. Between clock edges
  2. Input from state elements, output to state element
  3. Longest delay determines clock period
In other words, one cycle of the CPU we design is the time the combinational logic needs to finish its computation.
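
As a formula (the register overhead terms are standard assumptions added here, not something stated in the notes above): the slowest combinational path between two state elements sets the lower bound on the clock period,

T_{cycle} \ge t_{clk \to Q} + t_{comb,\,max} + t_{setup}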


Hard-working students then have to faithfully draw and wire up the datapath.


note:
1. Why does the ALU operation control need 4 bits?


In my ALU I used 12 operations, which was enough to handle the MIPS instructions; 3 bits can encode only 8 operations, so a 4-bit control field (up to 16 operations) is needed.


Load/Store: F = add
Branch: F = subtract
R-type: F depends on funct field
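
A minimal C sketch of the ALU control decoder implied by the table above, using the textbook's standard 4-bit codes and ALUOp encoding (00 = lw/sw, 01 = beq, 10 = R-type); it covers only the basic textbook cases, not the 12-operation ALU mentioned earlier, and the function name is made up.

#include <stdint.h>

/* 4-bit ALU control codes: AND=0000, OR=0001, add=0010,
   subtract=0110, slt=0111, NOR=1100 */
enum { ALU_AND = 0x0, ALU_OR = 0x1, ALU_ADD = 0x2,
       ALU_SUB = 0x6, ALU_SLT = 0x7, ALU_NOR = 0xC };

uint8_t alu_control(uint8_t alu_op, uint8_t funct)
{
    if (alu_op == 0x0) return ALU_ADD;     /* Load/Store: F = add      */
    if (alu_op == 0x1) return ALU_SUB;     /* Branch:     F = subtract */

    switch (funct & 0x3F) {                /* R-type: F depends on funct */
    case 0x20: return ALU_ADD;             /* add */
    case 0x22: return ALU_SUB;             /* sub */
    case 0x24: return ALU_AND;             /* and */
    case 0x25: return ALU_OR;              /* or  */
    case 0x2A: return ALU_SLT;             /* slt */
    default:   return ALU_ADD;             /* unhandled funct: placeholder */
    }
}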


2.branch instruction steps
  1. Read register operands
  2. Compare operands (Use ALU, subtract and check Zero output)
  3. Calculate target address
    1. Sign-extend displacement
    2. Shift left 2 places (word displacement)
    3. Add to PC + 4   (Already calculated by instruction fetch )
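
The three steps above can be written as a small C sketch (function and variable names are made up; PC + 4 is assumed to come from the fetch stage):

#include <stdint.h>

uint32_t branch_target(uint32_t pc, uint32_t instruction)
{
    uint32_t pc_plus_4 = pc + 4;                            /* already computed by instruction fetch   */
    int32_t  imm       = (int16_t)(instruction & 0xFFFF);   /* 1. sign-extend the 16-bit displacement  */
    int32_t  offset    = imm << 2;                          /* 2. shift left 2 (word displacement)     */
    return pc_plus_4 + offset;                              /* 3. add to PC + 4                        */
}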


Putting it all together: the full datapath and control



note: MemToReg — in some designs the mux's 1 and 0 inputs are the other way around


From single cycle to pipeline
After finally finishing this part, you can see that each instruction has to traverse the whole critical path before the next instruction can enter. This single-cycle design is simple, but inefficient.
note: load instruction critical path
Instruction memory-->register file-->ALU-->data memory-->register file
It also violates the MIPS design principle of making the common case fast. The solution is pipelining: while an instruction is doing its computation in the ALU, the other components sit idle! The central idea of pipelining is to put these idle components to work.
Because the single-cycle design has CPI = 1, every cycle must be long enough for the slowest instruction: even though beq takes only 500 ps, the cycle time still has to be 800 ps.
A pipelined design's cycle time only has to cover the longest stage; in the figure above the largest stage delay is 200 ps.
In theory, the pipelined cycle time would be

800 ps (single-cycle time) / 5 (stages) = 160 ps,

which is close to what we actually get (200 ps).
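
Working the numbers quoted above into formulas (800 ps single-cycle time, 5 stages, 200 ps slowest stage):

T_{ideal} = \frac{800\,\text{ps}}{5} = 160\,\text{ps}, \qquad
T_{pipelined} = \max_i t_{stage,i} = 200\,\text{ps}

\text{throughput speedup} \approx \frac{800\,\text{ps}}{200\,\text{ps}} = 4\times \quad (\text{instead of the ideal } 5\times)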




At the same time, the hard part of this design is that hazards must also be handled. There are three kinds:
Structural hazards
A required resource is busy (the MIPS pipeline avoids this by using two memories, one for instructions and one for data)
Data hazard
Need to wait for previous instruction to complete its data read/write
Control hazard
Deciding on control action depends on previous instruction


Data hazard
case1: solved by forwarding


case 2: load-use data hazard, solved by a stall
Control hazard
  1. Branch determines flow of control
    1. Fetching next instruction depends on branch outcome
  2. In the MIPS pipeline, we need to compare registers and compute the target early in the pipeline
    1. Add hardware to do it in ID stage


Branch prediction
  • Longer pipelines can’t readily determine branch outcome early
    • Stall penalty becomes unacceptable
  • Predict outcome of branch
    • Only stall if prediction is wrong (flush)




How to implement pipeline?


  • Need registers between stages
    • To hold information produced in previous cycle




Now for the more advanced part.


Hazard detection


1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Detect the need to forward
  1. But only if forwarding instruction will write to a register!
    1. EX/MEM.RegWrite, MEM/WB.RegWrite
  2. And only if Rd for that instruction is not $zero
    1. EX/MEM.RegisterRd ≠ 0,
    2. MEM/WB.RegisterRd ≠ 0
Then combine these with the data-hazard conditions above (see the forwarding sketch after the double-data-hazard note below).


Double data hazard


The rule is to "use the most recent value": in the example above, that means forwarding the value produced by line 2, not line 1.
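
Putting the forwarding conditions and the most-recent-value rule together, here is a C sketch of the forwarding unit for the ALU's first operand (ForwardA); struct and function names are illustrative, and ForwardB is the same with Rt in place of Rs.

#include <stdbool.h>
#include <stdint.h>

/* Forward codes: 00 = value from register file,
                  10 = forward from EX/MEM, 01 = forward from MEM/WB */
typedef struct {
    bool    ex_mem_regwrite, mem_wb_regwrite;
    uint8_t ex_mem_rd, mem_wb_rd;          /* destination registers */
    uint8_t id_ex_rs, id_ex_rt;            /* source registers of the instruction in EX */
} PipelineRegs;

uint8_t forward_a(const PipelineRegs *p)
{
    /* EX hazard (1a): forwarding instruction writes a register and Rd != $zero */
    if (p->ex_mem_regwrite && p->ex_mem_rd != 0 &&
        p->ex_mem_rd == p->id_ex_rs)
        return 0x2;                        /* 10: most recent value, from EX/MEM */

    /* MEM hazard (2a): checked only if the EX hazard did not apply,
       so a double data hazard forwards the newer EX/MEM value */
    if (p->mem_wb_regwrite && p->mem_wb_rd != 0 &&
        p->mem_wb_rd == p->id_ex_rs)
        return 0x1;                        /* 01: forward from MEM/WB */

    return 0x0;                            /* 00: no hazard, use the register file */
}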


Load-Use Hazard Detection


Load-use hazard when (checked in the ID stage):
ID/EX.MemRead and
 ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
  (ID/EX.RegisterRt = IF/ID.RegisterRt))
   stall the pipeline!


note:
how to stall?
  • Force control values in ID/EX register to 0
  • Prevent update of PC and IF/ID register
    • Using instruction is decoded again
    • Following instruction is fetched again
    • 1-cycle stall allows MEM to read data for lw
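
A C sketch of the hazard detection unit, combining the load-use condition above with the stall actions in this note (struct and field names are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    id_ex_memread;                 /* instruction in EX is a load */
    uint8_t id_ex_rt;
    uint8_t if_id_rs, if_id_rt;            /* sources of the instruction in ID */
} HazardInputs;

typedef struct {
    bool pc_write;                         /* false -> prevent update of PC    */
    bool if_id_write;                      /* false -> prevent update of IF/ID */
    bool zero_id_ex_controls;              /* true  -> insert a bubble         */
} StallSignals;

StallSignals detect_load_use(const HazardInputs *h)
{
    StallSignals s = { true, true, false };
    if (h->id_ex_memread &&
        (h->id_ex_rt == h->if_id_rs || h->id_ex_rt == h->if_id_rt)) {
        s.pc_write = false;                /* following instruction is fetched again   */
        s.if_id_write = false;             /* using instruction is decoded again       */
        s.zero_id_ex_controls = true;      /* 1-cycle bubble lets MEM read the lw data */
    }
    return s;
}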
Branch/Control Hazards
  • If branch outcome determined in MEM
    • PCSrc (AND-ing Branch and Zero) is generated in MEM
    • this causes a branch delay, and 3 instructions need to be flushed


Solution to reduce the delay:
Move the hardware that determines the branch outcome to the ID stage, adding
  • Target address adder
  • Register comparator
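
A C sketch of what that added hardware does in ID: a register comparator plus a target-address adder, so only the one instruction already fetched needs to be flushed (names and the beq-only handling are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool flush_if; uint32_t next_pc; } BranchDecision;

BranchDecision resolve_branch_in_id(uint32_t pc_plus_4, uint32_t rs_val,
                                    uint32_t rt_val, uint32_t instruction)
{
    BranchDecision d = { false, pc_plus_4 };
    int32_t offset = ((int16_t)(instruction & 0xFFFF)) << 2;  /* sign-extend, shift left 2 */
    if (rs_val == rt_val) {               /* register comparator (beq taken)              */
        d.flush_if = true;                /* flush the instruction fetched after the branch */
        d.next_pc  = pc_plus_4 + offset;  /* target address adder                         */
    }
    return d;
}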


case1 for branch hazard






case2


note: in short, once the branch is resolved in ID, a branch that depends on the immediately preceding R-format instruction needs one stall cycle, and one that depends on an immediately preceding lw needs two stall cycles

Dynamic Branch Prediction
  • Use dynamic prediction
  • Branch prediction buffer (aka branch history table)
  • Indexed by recent branch instruction addresses
  • Stores outcome (taken/not taken)
  • To execute a branch
    • Check table, expect the same outcome
    • Start fetching from fall-through or target
    • If wrong, flush pipeline and flip prediction


1-Bit Predictor
Loops come up all the time in programs. With a 1-bit predictor, a single misprediction flips the guess, which actually causes back-to-back mispredictions (at the last iteration of a loop and again when the loop is re-entered). With a 2-bit predictor, the guess only flips after two consecutive mispredictions, which lowers the misprediction rate.
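
A C sketch of a 2-bit saturating-counter predictor indexed by the branch address, as described above; the prediction only flips after two consecutive mispredictions (table size and names are made up):

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024                      /* branch history table size (illustrative) */

static uint8_t bht[BHT_ENTRIES];              /* states 0..3; >= 2 means "predict taken" */

bool predict_taken(uint32_t branch_pc)
{
    return bht[(branch_pc >> 2) % BHT_ENTRIES] >= 2;
}

void update_predictor(uint32_t branch_pc, bool taken)
{
    uint8_t *s = &bht[(branch_pc >> 2) % BHT_ENTRIES];
    if (taken)  { if (*s < 3) (*s)++; }       /* saturate at "strongly taken"     */
    else        { if (*s > 0) (*s)--; }       /* saturate at "strongly not taken" */
}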

Instruction-Level Parallelism (ILP) (you can think of this simply as pipelining)
  • Pipelining: executing multiple instructions in parallel
  • To increase ILP
    • Deeper pipeline
      • Less work per stage → shorter clock cycle
    • Multiple issue (add more washing machines!)
      • Replicate pipeline stages → multiple pipelines
      • Start multiple instructions per clock cycle
      • CPI < 1, so use Instructions Per Cycle (IPC)
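
For example (hypothetical numbers, just to illustrate the IPC metric): a 2-issue pipeline completes at most two instructions per cycle, so

IPC_{peak} = 2 \quad\Longleftrightarrow\quad CPI_{min} = \frac{1}{IPC_{peak}} = 0.5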


Static Multiple Issue
  • Compiler groups instructions to be issued together (i.e., that can be issued in a single cycle)
  • Packages them into “issue slots” (Very Long Instruction Word, VLIW)
  • Compiler must remove some/all hazards


Two ways to improve performance are introduced:


1. Scheduling


2. Loop unrolling
A technique that trades program size for faster execution.
  1. Replicate loop body to expose more parallelism
  2. Use different registers per replication (register renaming is needed to avoid anti-dependences)
e.g.
for (i = 1; i <= 60; i++)
  a[i] = a[i] * b + c;
Unrolled two more times (three copies of the body per loop iteration):
for (i = 1; i <= 60; i+=3)
{
 a[i] = a[i] * b + c;
 a[i+1] = a[i+1] * b + c;
 a[i+2] = a[i+2] * b + c;
}


Dynamic Multiple Issue
“Superscalar” processors
  • An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting instructions during execution.

