NVIDIA CUDA Learning Note 1

1) CPU Architecture

  • Pipelining
  • Branch Prediction
  • Superscalar
  • Out-of-Order Execution
  • Memory Hierarchy
  • Vector Operation
  • Multi-core

What is a CPU?

  • Executes instructions and processes data
  • Provides additional complex functionality
  • Contains many transistors

What is an instruction?

For example:
arithmetic: add r3, r4 -> r4
memory access: load [r4] -> r7
control: jz end

Optimization objective:

time/program = instructions/program * cycles/instruction * seconds/cycle

CPI (clock cycles per instruction) and the clock cycle time are the two factors a designer tunes. They are not independent of instruction count: sometimes an increase in CPI comes with a decrease in the number of instructions executed, so the product must be optimized as a whole.
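As a worked example with illustrative numbers (not from the note): a program that executes 10^9 instructions with CPI = 1.5 on a 2 GHz clock (0.5 ns per cycle) takes

10^9 * 1.5 * 0.5 ns = 0.75 s

Lowering any one factor (instruction count, CPI, or cycle time) reduces run time, which is why the techniques below attack all three.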

Desktop Programs

  • Lightly threaded
  • Lots of branches
  • Lots of memory accesses

Most desktop programs are dominated by data movement rather than numeric computation.

Moore’s Law

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?


An 8-core processor contains about 2.2 billion transistors; most of the die is devoted to I/O and storage (caches) rather than computation.

Pipelining

Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
These steps can be split into separate pipeline stages, so several instructions are in flight at once.
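To see the benefit: with k pipeline stages and n instructions, ideal execution time falls from k * n cycles (unpipelined) to k + (n - 1) cycles, since a new instruction completes every cycle once the pipeline is full. For the five stages above and 1000 instructions: 1004 cycles instead of 5000, a speedup approaching k.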


Pros

  • Instruction level parallelism (ILP)
  • Significantly reduced clock period.

Cons

  • Slight latency & area increase (pipeline latches)
  • Data dependencies (hazards)
  • Handling branches
  • Unequal pipeline stage lengths (the slowest stage sets the clock)

Bypassing


If two instructions are dependent (for example, an ADD that needs R7 would otherwise have to wait for the preceding SUB to traverse the whole pipeline and write back R7), bypassing forwards R7 from the SUB's execute stage directly to the ADD, avoiding the wait.

Stalls


If a load has not yet finished, the pipeline must stall and wait for the data.

Branch


Branch Prediction

Guess which instruction comes next, based on branch history.
Example: a two-level predictor with global history

  • Keep the outcomes of the last N branches in a global history register
  • Use that history to index a table of per-pattern predictions
  • Sandy Bridge employs a 32-bit history register

Modern predictors achieve accuracy above 90%.

Pros:
  • Raises performance and energy efficiency
Cons:
  • Area increase
  • Potential fetch-stage latency increase
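Below is a minimal C++ sketch of such a two-level global-history predictor. The history length and the 2-bit saturating counters are illustrative assumptions, not a description of any real CPU:

#include <cstdint>
#include <vector>

struct TwoLevelPredictor {
    static constexpr unsigned N = 12;   // history length in bits (assumed)
    std::uint32_t history = 0;          // outcomes of the last N branches
    std::vector<std::uint8_t> counters =
        std::vector<std::uint8_t>(1u << N, 1);  // 2-bit saturating counters

    // Predict taken when the counter for the current history pattern is >= 2.
    bool predict() const { return counters[history] >= 2; }

    // After the branch resolves, train the counter and shift the outcome
    // into the history register.
    void update(bool taken) {
        std::uint8_t &c = counters[history];
        if (taken && c < 3) ++c;        // saturate toward strongly taken
        if (!taken && c > 0) --c;       // saturate toward strongly not taken
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
    }
};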

Predication

Replace branches with conditional instructions
Avoids the branch predictor

  • Avoids the area penalty and the misprediction penalty

GPUs also use predication.
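A minimal C++ illustration of the idea (function names are hypothetical):

// Branch version: control flow depends on x, so the hardware must
// predict the branch and pays a penalty on a mispredict.
int select_branch(int x, int a, int b) {
    int y;
    if (x > 0)
        y = a;
    else
        y = b;
    return y;
}

// Predicated, branch-free version: both operands are computed and the
// result is selected; compilers often emit a conditional move (cmov).
int select_predicated(int x, int a, int b) {
    return (x > 0) ? a : b;
}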

Increase IPC

  • A scalar pipeline's IPC is limited to 1 instruction per clock
  • Superscalar: increase the width of the pipeline

Superscalar

Peak IPC is N (for an N-way superscalar)


Scheduling

xor r1, r2 -> r3
add r3, r4 -> r4

sub r5, r2 -> r3
addi r3, 1 -> r1

xor and add: Read-After-Write (RAW)
sub and addi: Read-After-Write (RAW)
xor and sub: Write-After-Write (WAW), since both write r3

Register Renaming

xor r1, r2 -> r6
add r6, r4 -> r7

sub r5, r2 -> r8
addi r8, 1 -> r9

After renaming, the xor and sub no longer share r3, so they can execute in parallel.
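A minimal C++ sketch of a rename table; register counts and the p-register naming are illustrative:

#include <array>
#include <cstdio>

// Each architectural destination gets a fresh physical register,
// which removes WAW/WAR hazards.
struct RenameTable {
    static constexpr int NUM_ARCH = 16;
    std::array<int, NUM_ARCH> map;   // architectural -> physical
    int next_phys = NUM_ARCH;        // physical regs beyond the initial set

    RenameTable() { for (int i = 0; i < NUM_ARCH; ++i) map[i] = i; }

    int read(int arch) const { return map[arch]; }           // rename a source
    int write(int arch) { return map[arch] = next_phys++; }  // fresh destination
};

int main() {
    RenameTable rt;
    int d1 = rt.write(3);  // xor ... -> r3 becomes ... -> p16
    int d2 = rt.write(3);  // sub ... -> r3 becomes ... -> p17: no WAW hazard
    std::printf("r3 -> p%d, then r3 -> p%d\n", d1, d2);
}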

Out-of-Order (OoO) Execution

The hardware fetches in order, executes instructions out of order, and retires them in order:
Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register-Read -> Execute -> Memory -> Writeback -> Commit

  • Reorder Buffer (ROB)
  • Issue Queue / Scheduler
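A minimal, heavily simplified C++ sketch of a reorder buffer; the entry fields are illustrative:

#include <cstddef>
#include <deque>

// Instructions enter in program order, may finish out of order, but
// commit strictly in order from the head.
struct RobEntry {
    int dest;    // destination register of the instruction
    bool done;   // set when execution completes
};

struct ReorderBuffer {
    std::deque<RobEntry> rob;

    void dispatch(int dest) { rob.push_back({dest, false}); }  // program order
    void complete(std::size_t idx) { rob[idx].done = true; }   // any order

    // Retire finished instructions from the head only, so architectural
    // state is always updated in program order.
    void commit() {
        while (!rob.empty() && rob.front().done)
            rob.pop_front();
    }
};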

Pros:
  • IPC close to the ideal peak
Cons:
  • Area increase
  • Power cost

Modern Desktop/Mobile In-order CPUs
  • Intel Atom
  • ARM Cortex-A8
  • Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
  • Intel Pentium Pro and onwards
  • ARM Cortex-A9
  • Qualcomm Krait

Memory Hierarchy


Caching

Keep data as close to the processor as possible.

  • Temporal locality
  • Spatial locality
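A small C++ illustration of spatial locality; the matrix size is illustrative:

#include <cstddef>

constexpr std::size_t N = 1024;   // matrix dimension (illustrative)

// Row-major traversal: consecutive addresses, good spatial locality.
long long sum_row_major(const int (&a)[N][N]) {
    long long s = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += a[i][j];   // next element is adjacent in memory
    return s;
}

// Column-major traversal: stride of N ints, far more cache misses.
long long sum_col_major(const int (&a)[N][N]) {
    long long s = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += a[i][j];   // jumps a whole row between accesses
    return s;
}

With a typical 64-byte cache line, the row-major loop uses all 16 ints of each fetched line, while the column-major loop touches only one int per line before moving on.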

CPU Parallelism

  • Instruction-Level Parallelism (ILP)
  • Data-Level Parallelism (vectors)
  • Thread-Level Parallelism (TLP)

Vector Motivation

for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];

Single Instruction, Multiple Data (SIMD):
// in parallel, one vector instruction covers four elements
A[i]   = B[i]   + C[i];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];

x86 Vector Extensions

  • SSE2
  • AVX
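As a sketch, the loop above written with 128-bit SSE intrinsics from <immintrin.h>; one _mm_add_ps handles four floats, and the function name is hypothetical:

#include <immintrin.h>

// Assumes N is a multiple of 4; remainder handling is omitted.
void vec_add(float *A, const float *B, const float *C, int N) {
    for (int i = 0; i < N; i += 4) {
        __m128 b = _mm_loadu_ps(B + i);            // load B[i..i+3]
        __m128 c = _mm_loadu_ps(C + i);            // load C[i..i+3]
        _mm_storeu_ps(A + i, _mm_add_ps(b, c));    // A[i..i+3] = B + C
    }
}

AVX widens the registers to 256 bits (eight floats per instruction) with the analogous _mm256_* intrinsics.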

Thread-Level Parallelism

Programmers can create and destroy threads.
The programmer or the OS schedules them onto cores.
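A minimal C++ sketch using std::thread; the thread count is illustrative:

#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)   // programmer creates threads
        workers.emplace_back([t] { std::printf("thread %d running\n", t); });
    for (auto &w : workers)       // and destroys (joins) them
        w.join();
}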

Multicore

Locks, Coherence and Consistency

  • Multiple threads access the same data
  • Coherence: which copy of a single memory location holds the correct value
  • Consistency: in what order memory operations become visible to other threads
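A minimal C++ sketch of a lock protecting shared data; the counts are illustrative:

#include <cstdio>
#include <mutex>
#include <thread>

int counter = 0;
std::mutex m;

// Without the lock, the two read-modify-write sequences race and
// increments can be lost.
void work() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // serialize the update
        ++counter;
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::printf("%d\n", counter);   // always 200000 with the lock
}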

Power Wall

Raising the CPU's clock frequency increases its power consumption, and therefore its power density, so frequency cannot be increased without limit.
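A standard first-order model for this (not stated in the note): dynamic power is roughly

P_dynamic ≈ a * C * V^2 * f

where a is the activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Raising f usually also requires raising V, so power grows faster than linearly with frequency; this is the power wall.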

CPUs are optimized for serial programs.

    Original author: 戬杨Jason
    Original link: https://www.jianshu.com/p/863224b38d5a
    This article is reposted from the web to share knowledge; if there is any infringement, please contact the blogger for removal.