Chapter 4 The Processor
Starting with this chapter, a few terms must be understood, in particular the difference between a combinational element and a sequential element: the output of the former depends only on its inputs, while the latter also depends on the current state (information stored in registers). Take a Coca-Cola vending machine as an example. A can costs 18 dollars, and the only coins accepted are 1-, 5-, and 10-dollar coins. Imagine inserting three 1-dollar coins and then three 5-dollar coins; we would naturally expect a can to drop out. But if the machine were a purely combinational element, it would not "remember" the three 1-dollar coins and two 5-dollar coins already inserted, so, as you can imagine, the machine would simply eat your money. Designed with sequential elements instead, the machine records that 1 dollar has been inserted by raising a flag, waits for the next positive clock edge, enters the "1 dollar inserted" state, and clears all the flags (a small sketch of this idea follows the note below).
note: a positive clock edge is the transition of the clock from 0 to 1; the interval between two positive clock edges is, in principle, one cycle.
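As a minimal illustration (my own sketch in C, not from the course slides), the stored state is simply the accumulated credit; a purely combinational circuit could not hold it between coins, while a sequential design updates it only on the (simulated) clock edge:

#include <stdio.h>

/* Stored information: how much money has been inserted so far. */
static int credit = 0;

/* Combinational part: from the current state and input, decide whether a can drops. */
static int can_drops(int credit_now, int coin, int price)
{
    return credit_now + coin >= price;
}

/* Sequential part: the state only changes on the (simulated) positive clock edge. */
static void clock_edge(int coin, int price)
{
    if (can_drops(credit, coin, price)) {
        printf("drop a can\n");
        credit = credit + coin - price;   /* clear the accumulated state */
    } else {
        credit += coin;                   /* remember this coin for later */
    }
}

int main(void)
{
    int coins[] = { 1, 1, 1, 5, 5, 5 };   /* three 1-dollar coins, then three 5-dollar coins */
    for (int i = 0; i < 6; i++)
        clock_edge(coins[i], 18);         /* a can costs 18 dollars */
    return 0;
}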
This chapter uses these two principles to design the processor, and the key points are summarized below.
Combinational logic transforms data during clock cycles
- Between clock edges
- Input from state elements, output to state element
- Longest delay determines clock period
In other words, one cycle of the CPU we design is the time the combinational logic needs to finish its computation.
Hard-working students then have to faithfully draw and wire up the datapath.
note:
1. Why does the ALU operation (ALU control) signal need 4 bits?
In my ALU I used 12 cases, which was enough to handle the MIPS instructions; a sketch of the mapping follows this note.
Load/Store: F = add
Branch: F = subtract
R-type: F depends on funct field
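A hedged C sketch of how the 4-bit ALU control can be derived from the 2-bit ALUOp and the funct field (the encodings follow the textbook's usual table; the function name is my own):

/* ALUOp from the main control: 0 = lw/sw (add), 1 = beq (subtract),
 * 2 = R-type (decode the funct field). */
unsigned alu_control(unsigned aluop, unsigned funct)
{
    if (aluop == 0) return 0x2;           /* add      (0010) */
    if (aluop == 1) return 0x6;           /* subtract (0110) */
    switch (funct) {                      /* R-type */
    case 0x20: return 0x2;                /* add */
    case 0x22: return 0x6;                /* sub */
    case 0x24: return 0x0;                /* and */
    case 0x25: return 0x1;                /* or  */
    case 0x2a: return 0x7;                /* slt */
    default:   return 0xF;                /* unused / don't care */
    }
}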
2. Branch instruction steps
- Read register operands
- Compare operands (Use ALU, subtract and check Zero output)
- Calculate target address
- Sign-extend displacement
- Shift left 2 places (word displacement)
- Add to PC + 4 (already calculated by instruction fetch; a small sketch follows)
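A hedged C sketch of the target computation and the taken check under the usual MIPS encoding (variable and function names are mine):

#include <stdint.h>

/* Branch target = (PC + 4) + sign_extend(imm16) << 2.
 * The 16-bit field is a word displacement, hence the shift by 2. */
uint32_t branch_target(uint32_t pc_plus_4, uint16_t imm16)
{
    int32_t disp = (int16_t)imm16;             /* sign-extend to 32 bits */
    return pc_plus_4 + ((uint32_t)disp << 2);  /* shift left 2 and add to PC + 4 */
}

/* The branch is taken when the ALU subtract of the two operands is zero. */
int branch_taken(uint32_t rs_val, uint32_t rt_val)
{
    return (rs_val - rt_val) == 0;             /* the ALU's Zero output */
}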
3. Putting all of the above together gives the full datapath and control.
note: for MemtoReg, some designs label the mux's 1 and 0 inputs the other way around.
From single cycle to pipeline
Once this part is finally done, notice that each instruction must traverse the whole critical path before the next instruction can enter. This single-cycle design is simple, but inefficient.
note: load instruction critical path
Instruction memory-->register file-->ALU-->data memory-->register file
It also goes against the MIPS design principle of making the common case fast. The solution is pipelining: while an instruction is busy in the ALU, the other components are sitting idle! The central idea of pipelining is to put these idle components to work.
Because the single-cycle design has CPI = 1, every cycle must be long enough for the slowest instruction: even though beq takes only 500 ps, the cycle time still has to be 800 ps.
With pipelining, the cycle time only needs to cover the longest stage, which in the figure above is 200 ps.
The theoretical pipelined cycle time would be
800 ps (single-cycle time) / 5 (stages) = 160 ps,
which is close to the actual 200 ps; the difference comes from the stages not being perfectly balanced.
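As a rough worked comparison of throughput (a hedged sketch: the 1000-instruction program is my own assumption, while the 800 ps and 200 ps figures come from above):

#include <stdio.h>

int main(void)
{
    long n = 1000;                         /* instructions executed (assumed) */
    long single_cycle_ps = n * 800;        /* one 800 ps cycle per instruction */
    long pipelined_ps = (n + 4) * 200;     /* 4 extra cycles to fill the 5 stages */
    printf("single-cycle: %ld ps, pipelined: %ld ps, speedup ~%.2fx\n",
           single_cycle_ps, pipelined_ps,
           (double)single_cycle_ps / (double)pipelined_ps);
    return 0;
}

The speedup comes out near 4x rather than the ideal 5x precisely because the 200 ps stage time is longer than the balanced 160 ps.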
At the same time, the difficulty of this design is that hazards must also be handled. There are three kinds:
Structural hazards
A required resource is busy (the MIPS pipeline avoids this by using two memories, one for instructions and one for data)
Data hazard
Need to wait for previous instruction to complete its data read/write
Control hazard
Deciding on control action depends on previous instruction
Data hazard
Case 1: solved by forwarding
Case 2: solved by a stall
Load-Use Data Hazard
Control hazard
- Branch determines flow of control
- Fetching next instruction depends on branch outcome
- In the MIPS pipeline, we need to compare registers and compute the target early in the pipeline
- Add hardware to do it in ID stage
Branch prediction
- Longer pipelines can’t readily determine branch outcome early
- Stall penalty becomes unacceptable
- Predict outcome of branch
- Only stall if prediction is wrong (flush)
How to implement pipeline?
- Need registers between stages
- To hold information produced in previous cycle
Now for the more advanced details.
Detecting hazards
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Detect the need to forward
- But only if forwarding instruction will write to a register!
- EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
- EX/MEM.RegisterRd ≠ 0,
- MEM/WB.RegisterRd ≠ 0
Then combine these with the data-hazard conditions above.
Double data hazard
The rule is to use the most recent value: in the example above, the value from line 2 must be forwarded, not the one from line 1. (Equivalently, the MEM-hazard forwarding should not fire when the EX-hazard condition already applies; in the sketch below this is handled by checking the EX hazard first.)
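A hedged C sketch of the ForwardA half of this logic (the struct and function names are mine; ForwardB is symmetric, with Rt in place of Rs):

/* Pipeline-register fields the forwarding unit looks at. */
struct fwd_inputs {
    int ex_mem_regwrite, ex_mem_rd;   /* instruction currently in MEM */
    int mem_wb_regwrite, mem_wb_rd;   /* instruction currently in WB  */
    int id_ex_rs;                     /* source register of the instruction in EX */
};

/* Returns the ForwardA mux select: 2 = take the EX/MEM ALU result,
 * 1 = take the MEM/WB value, 0 = take the register-file value.
 * Checking the EX hazard first is what makes the most recent value win
 * in the double-data-hazard case. */
int forward_a(const struct fwd_inputs *p)
{
    if (p->ex_mem_regwrite && p->ex_mem_rd != 0 && p->ex_mem_rd == p->id_ex_rs)
        return 2;
    if (p->mem_wb_regwrite && p->mem_wb_rd != 0 && p->mem_wb_rd == p->id_ex_rs)
        return 1;
    return 0;
}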
Load-Use Hazard Detection
Load-use hazard when (checked in the ID stage)
ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
stall the pipeline!
note:
how to stall? (see the sketch after this list)
- Force control values in ID/EX register to 0
- Prevent update of PC and IF/ID register
- The instruction that uses the loaded value is decoded again
- The following instruction is fetched again
- 1-cycle stall allows MEM to read data for lw
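A hedged C sketch of the detection check and the stall actions it triggers (the names are mine; the real hazard-detection unit is of course hardware):

/* Load-use hazard check done in the ID stage: the instruction in EX is a
 * load (MemRead set) and its destination Rt is a source of the instruction
 * currently being decoded. */
int need_stall(int id_ex_memread, int id_ex_rt, int if_id_rs, int if_id_rt)
{
    return id_ex_memread &&
           (id_ex_rt == if_id_rs || id_ex_rt == if_id_rt);
}

/* When need_stall() is true for this cycle: zero the control values going
 * into ID/EX (insert a bubble) and hold both the PC and the IF/ID register,
 * so the using instruction is decoded again and the following instruction
 * is fetched again. */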
Branch/Control Hazards
- If branch outcome determined in MEM
- PCSrc (AND-ing Branch and Zero) is generated in MEM
- Branch delay: three instructions would need to be flushed
Solution to reduce the delay:
Move the hardware that determines the outcome to the ID stage by adding (a small sketch follows the list below):
- Target address adder
- Register comparator
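A hedged C sketch of the ID-stage comparator idea (names are mine); the point is that an equality check does not need the ALU's full subtract:

#include <stdint.h>

/* Equality can be decided with a bitwise XOR followed by a zero test, which
 * is much shallower logic than a 32-bit subtract, so it fits in the ID stage. */
int regs_equal(uint32_t rs_val, uint32_t rt_val)
{
    return (rs_val ^ rt_val) == 0;
}

/* With the decision made in ID, a taken branch only has to flush the single
 * instruction fetched after it (turn it into a nop) instead of three. */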
Case 1 for branch hazards
Case 2
note: in short, with the comparison moved to ID, a beq that depends on the immediately preceding R-format instruction needs one stall cycle, and one that depends on an immediately preceding lw needs two.
Dynamic Branch Prediction
- Use dynamic prediction
- Branch prediction buffer (aka branch history table)
- Indexed by recent branch instruction addresses
- Stores outcome (taken/not taken)
- To execute a branch
- Check table, expect the same outcome
- Start fetching from fall-through or target
- If wrong, flush pipeline and flip prediction
1-Bit Predictor
Programs often contain loops. With a 1-bit predictor, a single misprediction immediately flips the prediction, so around a loop it ends up mispredicting twice in a row (on the last iteration and again on the first iteration of the next run). With a two-bit predictor, the prediction only flips after two consecutive mispredictions, which lowers the misprediction rate (a small sketch follows).
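A hedged C sketch of one 2-bit saturating-counter entry of the branch history table (a 1-bit predictor would simply store the last outcome):

/* Counter values 0-1 predict not taken, 2-3 predict taken.  It takes two
 * consecutive mispredictions before the prediction flips. */
int predict_taken(int counter)
{
    return counter >= 2;
}

int update_counter(int counter, int actually_taken)
{
    if (actually_taken)
        return counter < 3 ? counter + 1 : 3;   /* saturate at strongly taken */
    else
        return counter > 0 ? counter - 1 : 0;   /* saturate at strongly not taken */
}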
Instruction-Level Parallelism (ILP) (you can roughly think of it as pipelining)
- Pipelining: executing multiple instructions in parallel
- To increase ILP
- Deeper pipeline
- Less work per stage => shorter clock cycle
- Multiple issue (add more washing machines!)
- Replicate pipeline stages => multiple pipelines
- Start multiple instructions per clock cycle
- CPI < 1, so use Instructions Per Cycle (IPC)
Static Multiple Issue
- Compiler groups instructions to be issued together (i.e., that can be issued in a single cycle)
- Packages them into "issue slots" (Very Long Instruction Word, VLIW)
- Compiler must remove some/all hazards
Two techniques for increasing performance are introduced:
1. Scheduling
2. Loop unrolling
A way of trading larger program size for faster execution:
- Replicate loop body to expose more parallelism
- Use different registers per replication (register renaming is required to avoid anti-dependences)
e.g.
for (i = 1; i <= 60; i++)
    a[i] = a[i] * b + c;

Unrolled by a factor of three (the body is replicated for i, i+1 and i+2; since 60 is divisible by 3, no leftover iterations remain):

for (i = 1; i <= 60; i += 3)
{
    a[i]   = a[i]   * b + c;
    a[i+1] = a[i+1] * b + c;
    a[i+2] = a[i+2] * b + c;
}
Dynamic Multiple Issue
“Superscalar” processors
- An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution.
References:
- Difference between Combinational logic
- 吳凱強, 交大 (NCTU) computer organization lecture notes
- wiki: loop unrolling
- dual-issue processor