NVIDIA CUDA Learning Note 1

1) CPU Architecture

  • Pipelining
  • Branch Prediction
  • Superscalar
  • Out-of-Order Execution
  • Memory Hierarchy
  • Vector Operation
  • Multi-core

What is a CPU?

  • Executes instructions and processes data
  • Provides additional complex functionality
  • Contains many transistors

What is an instruction?

For example:
arithmetic: add r3, r4 -> r4
memory access: load [r4] -> r7
control: jz end

Optimization objective:

time/program = instructions/program * cycles/instruction * seconds/cycle

CPI (clock cycles per instruction) and the clock cycle time are the two factors a designer tunes. They are not independent of instruction count: sometimes an increase in CPI comes with a decrease in the number of instructions executed, so the product must be optimized as a whole.
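As a worked example with illustrative numbers (not from the note): a program that executes 10^9 instructions with CPI = 1.5 on a 2 GHz clock (0.5 ns per cycle) takes

10^9 * 1.5 * 0.5 ns = 0.75 s

Lowering any one factor (instruction count, CPI, or cycle time) reduces run time, which is why the techniques below attack all three.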

Desktop Programs

  • Lightly threaded
  • Lots of branches
  • Lots of memory accesses

Most desktop programs are dominated by data movement rather than numeric computation.

Moore’s Law

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?


An 8-core processor contains about 2.2 billion transistors; most of the die is devoted to I/O and storage (caches) rather than computation.

Pipelining

Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
These steps can be split into separate pipeline stages, so several instructions are in flight at once.
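To see the benefit: with k pipeline stages and n instructions, ideal execution time falls from k * n cycles (unpipelined) to k + (n - 1) cycles, since a new instruction completes every cycle once the pipeline is full. For the five stages above and 1000 instructions: 1004 cycles instead of 5000, a speedup approaching k.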


Pros

  • Instruction level parallelism (ILP)
  • Significantly reduced clock period.

Cons

  • Slight latency & area increase (pipeline latches)
  • Data dependencies (hazards)
  • Handling branches
  • Unequal pipeline stage lengths (the slowest stage sets the clock)

Bypassing


If two instructions are dependent (for example, an ADD that needs R7 would otherwise have to wait for the preceding SUB to traverse the whole pipeline and write back R7), bypassing forwards R7 from the SUB's execute stage directly to the ADD, avoiding the wait.

Stalls


If a load has not yet finished, the pipeline must stall and wait for the data.

Branch


Branch Prediction

Guess which instruction comes next, based on branch history.
Example: a two-level predictor with global history

  • Keep the outcomes of the last N branches in a global history register
  • Use that history to index a table of per-pattern predictions
  • Sandy Bridge employs a 32-bit history register

Modern predictors achieve accuracy above 90%.

Pros:
  • Raises performance and energy efficiency
Cons:
  • Area increase
  • Potential fetch-stage latency increase
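Below is a minimal C++ sketch of such a two-level global-history predictor. The history length and the 2-bit saturating counters are illustrative assumptions, not a description of any real CPU:

#include <cstdint>
#include <vector>

struct TwoLevelPredictor {
    static constexpr unsigned N = 12;   // history length in bits (assumed)
    std::uint32_t history = 0;          // outcomes of the last N branches
    std::vector<std::uint8_t> counters =
        std::vector<std::uint8_t>(1u << N, 1);  // 2-bit saturating counters

    // Predict taken when the counter for the current history pattern is >= 2.
    bool predict() const { return counters[history] >= 2; }

    // After the branch resolves, train the counter and shift the outcome
    // into the history register.
    void update(bool taken) {
        std::uint8_t &c = counters[history];
        if (taken && c < 3) ++c;        // saturate toward strongly taken
        if (!taken && c > 0) --c;       // saturate toward strongly not taken
        history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
    }
};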

Predication

Replace branches with conditional instructions
Avoids the branch predictor

  • Avoids the area penalty and the misprediction penalty

GPUs also use predication.
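A minimal C++ illustration of the idea (function names are hypothetical):

// Branch version: control flow depends on x, so the hardware must
// predict the branch and pays a penalty on a mispredict.
int select_branch(int x, int a, int b) {
    int y;
    if (x > 0)
        y = a;
    else
        y = b;
    return y;
}

// Predicated, branch-free version: both operands are computed and the
// result is selected; compilers often emit a conditional move (cmov).
int select_predicated(int x, int a, int b) {
    return (x > 0) ? a : b;
}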

Increase IPC

  • A scalar pipeline's IPC is limited to 1 instruction per clock
  • Superscalar: increase the width of the pipeline

Superscalar

Peak IPC is N (for an N-way superscalar)


Scheduling

xor r1, r2 -> r3
add r3, r4 -> r4

sub r5, r2 -> r3
addi r3, 1 -> r1

xor and add: Read-After-Write (RAW)
sub and addi: Read-After-Write (RAW)
xor and sub: Write-After-Write (WAW), since both write r3

Register Renaming

xor r1, r2 -> r6
add r6, r4 -> r7

sub r5, r2 -> r8
addi r8, 1 -> r9

After renaming, the xor and sub no longer share r3, so they can execute in parallel.
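A minimal C++ sketch of a rename table; register counts and the p-register naming are illustrative:

#include <array>
#include <cstdio>

// Each architectural destination gets a fresh physical register,
// which removes WAW/WAR hazards.
struct RenameTable {
    static constexpr int NUM_ARCH = 16;
    std::array<int, NUM_ARCH> map;   // architectural -> physical
    int next_phys = NUM_ARCH;        // physical regs beyond the initial set

    RenameTable() { for (int i = 0; i < NUM_ARCH; ++i) map[i] = i; }

    int read(int arch) const { return map[arch]; }           // rename a source
    int write(int arch) { return map[arch] = next_phys++; }  // fresh destination
};

int main() {
    RenameTable rt;
    int d1 = rt.write(3);  // xor ... -> r3 becomes ... -> p16
    int d2 = rt.write(3);  // sub ... -> r3 becomes ... -> p17: no WAW hazard
    std::printf("r3 -> p%d, then r3 -> p%d\n", d1, d2);
}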

Out-of-Order (OoO) Execution

The hardware fetches in order, executes instructions out of order, and retires them in order:
Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register-Read -> Execute -> Memory -> Writeback -> Commit

  • Reorder Buffer (ROB)
  • Issue Queue / Scheduler
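A minimal, heavily simplified C++ sketch of a reorder buffer; the entry fields are illustrative:

#include <cstddef>
#include <deque>

// Instructions enter in program order, may finish out of order, but
// commit strictly in order from the head.
struct RobEntry {
    int dest;    // destination register of the instruction
    bool done;   // set when execution completes
};

struct ReorderBuffer {
    std::deque<RobEntry> rob;

    void dispatch(int dest) { rob.push_back({dest, false}); }  // program order
    void complete(std::size_t idx) { rob[idx].done = true; }   // any order

    // Retire finished instructions from the head only, so architectural
    // state is always updated in program order.
    void commit() {
        while (!rob.empty() && rob.front().done)
            rob.pop_front();
    }
};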

Pros:
  • IPC close to the ideal peak
Cons:
  • Area increase
  • Power cost

Modern Desktop/Mobile In-order CPUs
  • Intel Atom
  • ARM Cortex-A8
  • Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
  • Intel Pentium Pro and onwards
  • ARM Cortex-A9
  • Qualcomm Krait

Memory Hierarchy


Caching

Keep data as close to the processor as possible.

  • Temporal locality
  • Spatial locality
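A small C++ illustration of spatial locality; the matrix size is illustrative:

#include <cstddef>

constexpr std::size_t N = 1024;   // matrix dimension (illustrative)

// Row-major traversal: consecutive addresses, good spatial locality.
long long sum_row_major(const int (&a)[N][N]) {
    long long s = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += a[i][j];   // next element is adjacent in memory
    return s;
}

// Column-major traversal: stride of N ints, far more cache misses.
long long sum_col_major(const int (&a)[N][N]) {
    long long s = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += a[i][j];   // jumps a whole row between accesses
    return s;
}

With a typical 64-byte cache line, the row-major loop uses all 16 ints of each fetched line, while the column-major loop touches only one int per line before moving on.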

CPU Parallelism

  • Instruction-Level Parallelism (ILP)
  • Data-Level Parallelism (vectors)
  • Thread-Level Parallelism (TLP)

Vector Motivation

for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];

Single Instruction, Multiple Data (SIMD):
// in parallel, one vector instruction covers four elements
A[i]   = B[i]   + C[i];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];

x86 Vector Extensions

  • SSE2
  • AVX
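As a sketch, the loop above written with 128-bit SSE intrinsics from <immintrin.h>; one _mm_add_ps handles four floats, and the function name is hypothetical:

#include <immintrin.h>

// Assumes N is a multiple of 4; remainder handling is omitted.
void vec_add(float *A, const float *B, const float *C, int N) {
    for (int i = 0; i < N; i += 4) {
        __m128 b = _mm_loadu_ps(B + i);            // load B[i..i+3]
        __m128 c = _mm_loadu_ps(C + i);            // load C[i..i+3]
        _mm_storeu_ps(A + i, _mm_add_ps(b, c));    // A[i..i+3] = B + C
    }
}

AVX widens the registers to 256 bits (eight floats per instruction) with the analogous _mm256_* intrinsics.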

Thread-Level Parallelism

Programmers can create and destroy threads.
The programmer or the OS schedules them onto cores.
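A minimal C++ sketch using std::thread; the thread count is illustrative:

#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)   // programmer creates threads
        workers.emplace_back([t] { std::printf("thread %d running\n", t); });
    for (auto &w : workers)       // and destroys (joins) them
        w.join();
}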

Multicore

Locks, Coherence and Consistency

  • Multiple threads access the same data
  • Coherence: which copy of a single memory location holds the correct value
  • Consistency: in what order memory operations become visible to other threads
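A minimal C++ sketch of a lock protecting shared data; the counts are illustrative:

#include <cstdio>
#include <mutex>
#include <thread>

int counter = 0;
std::mutex m;

// Without the lock, the two read-modify-write sequences race and
// increments can be lost.
void work() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // serialize the update
        ++counter;
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::printf("%d\n", counter);   // always 200000 with the lock
}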

Power Wall

Raising the CPU's clock frequency increases its power consumption, and therefore its power density, so frequency cannot be increased without limit.
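A standard first-order model for this (not stated in the note): dynamic power is roughly

P_dynamic ≈ a * C * V^2 * f

where a is the activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Raising f usually also requires raising V, so power grows faster than linearly with frequency; this is the power wall.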

CPUs are optimized for serial programs.

    Original author: 戬杨Jason
    Original link: https://www.jianshu.com/p/863224b38d5a
    This article is reposted from the web to share knowledge; if there is any infringement, please contact the blogger for removal.