Monday, September 28, 2015

Chapter 4 The Processor



Starting from this chapter, a few terms have to be understood, in particular the difference between a combinational element and a sequential element: the output of the former depends only on its inputs, while the latter also depends on its current state (information stored in registers). Take a Coca-Cola vending machine as an example: a can costs 18 dollars and the machine accepts only 1-, 5-, and 10-dollar coins. Imagine inserting three 1-dollar coins followed by three 5-dollar coins; we naturally expect a can to drop out. But a machine built only from combinational elements would not "remember" the three 1-dollar coins and two 5-dollar coins already inserted, so, as you can guess, it would simply eat the money. Built with sequential elements, the machine instead records that a 1-dollar coin was inserted by raising a flag, waits for the next positive clock edge, moves into the "one dollar inserted" state, and clears the flags.


note: a positive clock edge is the transition of the clock from 0 to 1; the interval between two positive clock edges is, in principle, one cycle
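
To make the difference concrete, here is a minimal C sketch of the vending-machine example as a state machine; the 18-dollar price and the 1/5/10 coin values come from the example above, while the struct and function names are made up for illustration.

#include <stdbool.h>

/* Sequential element: the accumulated amount is stored state that
   survives between clock edges. */
typedef struct {
    int amount;                        /* state: total money inserted so far */
} VendingMachine;

/* Called once per positive clock edge with the coin inserted during
   that cycle (0 if none). Returns true when a can should drop. */
bool vending_step(VendingMachine *vm, int coin)
{
    vm->amount += coin;                /* next state = f(current state, input) */
    if (vm->amount >= 18) {            /* one can costs 18 dollars */
        vm->amount -= 18;              /* dispense; keep any remainder as credit */
        return true;
    }
    return false;
}

/* A purely combinational "machine" could only look at the coin being
   inserted right now, so it would never reach 18 and would eat the money. */
bool combinational_vending(int coin)
{
    return coin >= 18;                 /* output depends on the input alone */
}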


This lecture uses these two kinds of elements to design the processor. The key points are summarized below.


Combinational logic transforms data during clock cycles
  1. Between clock edges
  2. Input from state elements, output to state element
  3. Longest delay determines clock period
In other words, one cycle of the CPU we design is the time the combinational logic needs to finish its computation.
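
As a formula (the register overhead terms are standard assumptions added here, not something stated in the notes above): the slowest combinational path between two state elements sets the lower bound on the clock period,

T_{cycle} \ge t_{clk \to Q} + t_{comb,\,max} + t_{setup}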


Hard-working students then have to faithfully draw and wire up the datapath.


note:
1. Why does the ALU operation control need 4 bits?


In my ALU I used 12 operations, which was enough to handle the MIPS instructions; 3 bits can encode only 8 operations, so a 4-bit control field (up to 16 operations) is needed.


Load/Store: F = add
Branch: F = subtract
R-type: F depends on funct field
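
A minimal C sketch of the ALU control decoder implied by the table above, using the textbook's standard 4-bit codes and ALUOp encoding (00 = lw/sw, 01 = beq, 10 = R-type); it covers only the basic textbook cases, not the 12-operation ALU mentioned earlier, and the function name is made up.

#include <stdint.h>

/* 4-bit ALU control codes: AND=0000, OR=0001, add=0010,
   subtract=0110, slt=0111, NOR=1100 */
enum { ALU_AND = 0x0, ALU_OR = 0x1, ALU_ADD = 0x2,
       ALU_SUB = 0x6, ALU_SLT = 0x7, ALU_NOR = 0xC };

uint8_t alu_control(uint8_t alu_op, uint8_t funct)
{
    if (alu_op == 0x0) return ALU_ADD;     /* Load/Store: F = add      */
    if (alu_op == 0x1) return ALU_SUB;     /* Branch:     F = subtract */

    switch (funct & 0x3F) {                /* R-type: F depends on funct */
    case 0x20: return ALU_ADD;             /* add */
    case 0x22: return ALU_SUB;             /* sub */
    case 0x24: return ALU_AND;             /* and */
    case 0x25: return ALU_OR;              /* or  */
    case 0x2A: return ALU_SLT;             /* slt */
    default:   return ALU_ADD;             /* unhandled funct: placeholder */
    }
}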


2.branch instruction steps
  1. Read register operands
  2. Compare operands (Use ALU, subtract and check Zero output)
  3. Calculate target address
    1. Sign-extend displacement
    2. Shift left 2 places (word displacement)
    3. Add to PC + 4   (Already calculated by instruction fetch )
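
The three steps above can be written as a small C sketch (function and variable names are made up; PC + 4 is assumed to come from the fetch stage):

#include <stdint.h>

uint32_t branch_target(uint32_t pc, uint32_t instruction)
{
    uint32_t pc_plus_4 = pc + 4;                            /* already computed by instruction fetch   */
    int32_t  imm       = (int16_t)(instruction & 0xFFFF);   /* 1. sign-extend the 16-bit displacement  */
    int32_t  offset    = imm << 2;                          /* 2. shift left 2 (word displacement)     */
    return pc_plus_4 + offset;                              /* 3. add to PC + 4                        */
}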


Putting it all together: the full datapath and control



note: MemToReg — in some designs the mux's 1 and 0 inputs are the other way around


From single cycle to pipeline
After finally finishing this part, you can see that each instruction has to traverse the whole critical path before the next instruction can enter. This single-cycle design is simple, but inefficient.
note: load instruction critical path
Instruction memory-->register file-->ALU-->data memory-->register file
It also violates the MIPS design principle of making the common case fast. The solution is pipelining: while an instruction is doing its computation in the ALU, the other components sit idle! The central idea of pipelining is to put these idle components to work.
Because the single-cycle design has CPI = 1, every cycle must be long enough for the slowest instruction: even though beq takes only 500 ps, the cycle time still has to be 800 ps.
A pipelined design's cycle time only has to cover the longest stage; in the figure above the largest stage delay is 200 ps.
In theory, the pipelined cycle time would be

800 ps (single-cycle time) / 5 (stages) = 160 ps,

which is close to what we actually get (200 ps).
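
Working the numbers quoted above into formulas (800 ps single-cycle time, 5 stages, 200 ps slowest stage):

T_{ideal} = \frac{800\,\text{ps}}{5} = 160\,\text{ps}, \qquad
T_{pipelined} = \max_i t_{stage,i} = 200\,\text{ps}

\text{throughput speedup} \approx \frac{800\,\text{ps}}{200\,\text{ps}} = 4\times \quad (\text{instead of the ideal } 5\times)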




At the same time, the hard part of this design is that hazards must also be handled. There are three kinds:
Structural hazards
A required resource is busy (the MIPS pipeline avoids this by using two memories, one for instructions and one for data)
Data hazard
Need to wait for previous instruction to complete its data read/write
Control hazard
Deciding on control action depends on previous instruction


Data hazard
case1: solved by forwarding


case 2: load-use data hazard, solved by a stall
Control hazard
  1. Branch determines flow of control
    1. Fetching next instruction depends on branch outcome
  2. In the MIPS pipeline, we need to compare registers and compute the target early in the pipeline
    1. Add hardware to do it in ID stage


Branch prediction
  • Longer pipelines can’t readily determine branch outcome early
    • Stall penalty becomes unacceptable
  • Predict outcome of branch
    • Only stall if prediction is wrong (flush)




How to implement pipeline?


  • Need registers between stages
    • To hold information produced in previous cycle




Now for the more advanced part.


Hazard detection


1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Detect the need to forward
  1. But only if forwarding instruction will write to a register!
    1. EX/MEM.RegWrite, MEM/WB.RegWrite
  2. And only if Rd for that instruction is not $zero
    1. EX/MEM.RegisterRd ≠ 0,
    2. MEM/WB.RegisterRd ≠ 0
Then combine these with the data-hazard conditions above (see the forwarding sketch after the double-data-hazard note below).


Double data hazard


The rule is to "use the most recent value": in the example above, that means forwarding the value produced by line 2, not line 1.
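
Putting the forwarding conditions and the most-recent-value rule together, here is a C sketch of the forwarding unit for the ALU's first operand (ForwardA); struct and function names are illustrative, and ForwardB is the same with Rt in place of Rs.

#include <stdbool.h>
#include <stdint.h>

/* Forward codes: 00 = value from register file,
                  10 = forward from EX/MEM, 01 = forward from MEM/WB */
typedef struct {
    bool    ex_mem_regwrite, mem_wb_regwrite;
    uint8_t ex_mem_rd, mem_wb_rd;          /* destination registers */
    uint8_t id_ex_rs, id_ex_rt;            /* source registers of the instruction in EX */
} PipelineRegs;

uint8_t forward_a(const PipelineRegs *p)
{
    /* EX hazard (1a): forwarding instruction writes a register and Rd != $zero */
    if (p->ex_mem_regwrite && p->ex_mem_rd != 0 &&
        p->ex_mem_rd == p->id_ex_rs)
        return 0x2;                        /* 10: most recent value, from EX/MEM */

    /* MEM hazard (2a): checked only if the EX hazard did not apply,
       so a double data hazard forwards the newer EX/MEM value */
    if (p->mem_wb_regwrite && p->mem_wb_rd != 0 &&
        p->mem_wb_rd == p->id_ex_rs)
        return 0x1;                        /* 01: forward from MEM/WB */

    return 0x0;                            /* 00: no hazard, use the register file */
}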


Load-Use Hazard Detection


Load-use hazard when (checked in the ID stage):
ID/EX.MemRead and
 ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
  (ID/EX.RegisterRt = IF/ID.RegisterRt))
   stall the pipeline!


note:
how to stall?
  • Force control values in ID/EX register to 0
  • Prevent update of PC and IF/ID register
    • Using instruction is decoded again
    • Following instruction is fetched again
    • 1-cycle stall allows MEM to read data for lw
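
A C sketch of the hazard detection unit, combining the load-use condition above with the stall actions in this note (struct and field names are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    id_ex_memread;                 /* instruction in EX is a load */
    uint8_t id_ex_rt;
    uint8_t if_id_rs, if_id_rt;            /* sources of the instruction in ID */
} HazardInputs;

typedef struct {
    bool pc_write;                         /* false -> prevent update of PC    */
    bool if_id_write;                      /* false -> prevent update of IF/ID */
    bool zero_id_ex_controls;              /* true  -> insert a bubble         */
} StallSignals;

StallSignals detect_load_use(const HazardInputs *h)
{
    StallSignals s = { true, true, false };
    if (h->id_ex_memread &&
        (h->id_ex_rt == h->if_id_rs || h->id_ex_rt == h->if_id_rt)) {
        s.pc_write = false;                /* following instruction is fetched again   */
        s.if_id_write = false;             /* using instruction is decoded again       */
        s.zero_id_ex_controls = true;      /* 1-cycle bubble lets MEM read the lw data */
    }
    return s;
}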
Branch/Control Hazards
  • If branch outcome determined in MEM
    • PCSrc (AND-ing Branch and Zero) is generated in MEM
    • this causes a branch delay, and 3 instructions need to be flushed


Solution to reduce the delay:
Move the hardware that determines the branch outcome to the ID stage, adding
  • Target address adder
  • Register comparator
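
A C sketch of what that added hardware does in ID: a register comparator plus a target-address adder, so only the one instruction already fetched needs to be flushed (names and the beq-only handling are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool flush_if; uint32_t next_pc; } BranchDecision;

BranchDecision resolve_branch_in_id(uint32_t pc_plus_4, uint32_t rs_val,
                                    uint32_t rt_val, uint32_t instruction)
{
    BranchDecision d = { false, pc_plus_4 };
    int32_t offset = ((int16_t)(instruction & 0xFFFF)) << 2;  /* sign-extend, shift left 2 */
    if (rs_val == rt_val) {               /* register comparator (beq taken)              */
        d.flush_if = true;                /* flush the instruction fetched after the branch */
        d.next_pc  = pc_plus_4 + offset;  /* target address adder                         */
    }
    return d;
}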


case1 for branch hazard






case2


note: in short, once the branch is resolved in ID, a branch that depends on the immediately preceding R-format instruction needs one stall cycle, and one that depends on an immediately preceding lw needs two stall cycles

Dynamic Branch Prediction
  • Use dynamic prediction
  • Branch prediction buffer (aka branch history table)
  • Indexed by recent branch instruction addresses
  • Stores outcome (taken/not taken)
  • To execute a branch
    • Check table, expect the same outcome
    • Start fetching from fall-through or target
    • If wrong, flush pipeline and flip prediction


1-Bit Predictor
Loops come up all the time in programs. With a 1-bit predictor, a single misprediction flips the guess, which actually causes back-to-back mispredictions (at the last iteration of a loop and again when the loop is re-entered). With a 2-bit predictor, the guess only flips after two consecutive mispredictions, which lowers the misprediction rate.
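
A C sketch of a 2-bit saturating-counter predictor indexed by the branch address, as described above; the prediction only flips after two consecutive mispredictions (table size and names are made up):

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024                      /* branch history table size (illustrative) */

static uint8_t bht[BHT_ENTRIES];              /* states 0..3; >= 2 means "predict taken" */

bool predict_taken(uint32_t branch_pc)
{
    return bht[(branch_pc >> 2) % BHT_ENTRIES] >= 2;
}

void update_predictor(uint32_t branch_pc, bool taken)
{
    uint8_t *s = &bht[(branch_pc >> 2) % BHT_ENTRIES];
    if (taken)  { if (*s < 3) (*s)++; }       /* saturate at "strongly taken"     */
    else        { if (*s > 0) (*s)--; }       /* saturate at "strongly not taken" */
}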

Instruction-Level Parallelism (ILP) (you can think of this simply as pipelining)
  • Pipelining: executing multiple instructions in parallel
  • To increase ILP
    • Deeper pipeline
      • Less work per stage → shorter clock cycle
    • Multiple issue (add more washing machines!)
      • Replicate pipeline stages → multiple pipelines
      • Start multiple instructions per clock cycle
      • CPI < 1, so use Instructions Per Cycle (IPC)
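
For example (hypothetical numbers, just to illustrate the IPC metric): a 2-issue pipeline completes at most two instructions per cycle, so

IPC_{peak} = 2 \quad\Longleftrightarrow\quad CPI_{min} = \frac{1}{IPC_{peak}} = 0.5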


Static Multiple Issue
  • Compiler groups instructions to be issued together (i.e., that can be issued in a single cycle)
  • Packages them into “issue slots” (Very Long Instruction Word, VLIW)
  • Compiler must remove some/all hazards


Two ways to improve performance are introduced:


1. Scheduling


2. Loop unrolling
A technique that trades program size for faster execution.
  1. Replicate loop body to expose more parallelism
  2. Use different registers per replication (register renaming is needed to avoid anti-dependences)
e.g.
for (i = 1; i <= 60; i++)
  a[i] = a[i] * b + c;
Unrolled two more times (three copies of the body per loop iteration):
for (i = 1; i <= 60; i+=3)
{
 a[i] = a[i] * b + c;
 a[i+1] = a[i+1] * b + c;
 a[i+2] = a[i+2] * b + c;
}


Dynamic Multiple Issue
“Superscalar” processors
  • An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting instructions during execution.

