# **Master Informatics Eng.**

# 2018/19 *A.J.Proença*

### Data Parallelism 3 (GPU/CUDA, Neural Nets, ...) (most slides are borrowed)

AJProença, Advanced Architectures, MiEl, UMinho, 2018/19

XX

## The CUDA programming model

| Compute Unified Device Architecture                                                  |         |
|--------------------------------------------------------------------------------------|---------|
| CUDA is a recent programming model, designed for                                     | ,       |
| <ul> <li>a multicore CPU <i>host</i> coupled to a many-core <i>device</i></li> </ul> | , where |
| <ul> <li>devices have wide SIMD/SIMT parallelism, and</li> </ul>                     |         |
| <ul> <li>the host and the device do not share memory</li> </ul>                      |         |
| CUDA provides:                                                                       |         |
| <ul> <li>– a thread abstraction to deal with SIMD</li> </ul>                         |         |
| <ul> <li>– synchr. &amp; data sharing between small groups of threa</li> </ul>       | ds      |
| <ul> <li>CUDA programs are written in C with extensions</li> </ul>                   |         |
| <ul> <li>OpenCL inspired by CUDA, but hw &amp; sw vendor net</li> </ul>              | utral   |
| <ul> <li>programming model essentially identical</li> </ul>                          |         |

1

#### XX

- A compute device
  - is a coprocessor to the CPU or host
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - very little creation overhead, requires LARGE register bank •
  - GPU needs 1000s of threads for full efficiency
    - multi-core CPU needs only a few

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

CUDA basic model: Single-Program Multiple-Data (SPMD)



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

3

### Programming Model: SPMD + SIMT/SIMD

#### XX

- Hierarchy
  - Device => Grids
  - Grid => Blocks
  - Block => Warps
  - Warp => Threads
- Single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed • in a lock-step way called singleinstruction multiple-thread (SIMT)
- Single instruction are executed on multiple threads (SIMD)
  - Warp size defines SIMD granularity (32 threads)
- Synchronization within a block uses shared memory



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

### The Computational Grid: Block IDs and Thread IDs



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

<sup>6</sup> 

## Example

#### 公

### • Multiply two vectors of length 8192

- Code that works over all elements is the grid
- Thread blocks break this down into manageable sizes
  - 512 threads per block
- SIMD instruction executes 32 elements at a time
- Thus grid size = 16 blocks
- Block is analogous to a strip-mined vector loop with vector length of 32
- Block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-16 multithreaded SIMD processors

Copyright © 2012, Elsevier Inc. All rights reserved.

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

XX C with CUDA Extensions: C with a few keywords void saxpy\_serial(int n, float a, float \*x, float \*y) for (int i = 0; i < n; ++i) y[i] = a\*x[i] + y[i];Standard C Code // Invoke serial SAXPY kernel saxpy\_serial(n, 2.0, x, y); \_global\_\_\_ void saxpy\_parallel(int n, float a, float \*x, float \*y) { int i = blockIdx.x\*blockDim.x + threadIdx.x; Parallel C Code if (i < n) y[i] = a\*x[i] + y[i]; // Invoke **parallel** SAXPY kernel with 256 threads/block int nblocks = (n + 255) / 256; saxpy\_para]]e]<<<nb]ocks, 256>>>(n, 2.0, x, y);

7

9

# Terminology (and in NVidia)

#### XX

- Threads of SIMD instructions (warps)
  - Each has its own IP (up to 48/64 per SIMD processor, Fermi/Kepler)
  - Thread scheduler uses scoreboard to dispatch
  - No data dependencies between threads!
  - Threads are organized into blocks & executed in groups of 32 threads (thread block)
    - · Blocks are organized into a grid
- The thread block scheduler schedules blocks to SIMD • processors (Streaming Multiprocessors)
- Within each SIMD processor:
  - 32 SIMD lanes (thread processors)
  - Wide and shallow compared to vector processors

Copyright © 2012, Elsevier Inc. All rights reserved. AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

# **CUDA Thread Block**

| <ul> <li>Programmer declares (Thread) Block:         <ul> <li>Block size 1 to 512 concurrent<br/>threads</li> <li>CUDA Thread Block</li> </ul> </li> </ul>                                                        |    |                                                                |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----------------------------------------------------------------|
| <ul> <li>Block shape 1D, 2D, or 3D</li> <li>Block dimensions in threads</li> <li>threadID 0 1 2 3 4 5 6 7</li> </ul>                                                                                              |    |                                                                |
| All threads in a Block execute the same thread program                                                                                                                                                            |    | ı, 2007-2009<br>Iampaign                                       |
| <ul> <li>Threads share data and synchronize<br/>while doing their share of the work</li> <li> <sup>"float x = input[threadID];</sup> <sup>"float y = func(x);</sup> <sup>"output[threadID] = y;</sup> </li> </ul> |    | and Wen-mei W. Hwu, 2007-2009<br>of Illinois, Urbana-Champaign |
| Threads have thread id numbers     within Block                                                                                                                                                                   |    |                                                                |
| <ul> <li>Thread program uses thread id to<br/>select work and address shared data</li> </ul>                                                                                                                      |    | avid Kirk/NVIDIA<br>E498AL, University                         |
| AJProença, Advanced Architectures, MiEI, UMinho, 2018/19                                                                                                                                                          | 10 | © David<br>ECE 498                                             |

## Parallel Memory Sharing



# **CUDA Memory Model Overview**



### Hardware Implementation: **Memory Architecture**

- Device memory (DRAM) •
  - Slow (2~300 cycles)
  - Local, global, constant, and texture memory
- On-chip memory •

1

- Fast (1 cycle)
- Registers, shared memory, constant/texture cache



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

Copyright © 2012, Elsevier Inc. All rights reserved.

Example

# Vector Processor versus CUDA core



Copyright © 2012, Elsevier Inc. All rights reserved. AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

15

**Graphical Processing Units** 

# **Conditional Branching**

|        | Conditional Branching                                                                                                                                                                                                                                                                                                                                                                                           | Graphical F |
|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| $\sim$ |                                                                                                                                                                                                                                                                                                                                                                                                                 | ธ           |
| •      | Like vector architectures, GPU branch hardware uses internal masks                                                                                                                                                                                                                                                                                                                                              | Processing  |
| •      | <ul> <li>Also uses <ul> <li>Branch synchronization stack</li> <li>Entries consist of masks for each SIMD lane</li> <li>I.e. which threads commit their results (all threads execute)</li> </ul> </li> <li>Instruction markers to manage when a branch diverges into multiple execution paths <ul> <li>Push on divergent branch</li> <li>and when paths converge</li> <li>Act as barriers</li> </ul> </li> </ul> | Units       |

- · Pops stack
- Per-thread-lane 1-bit predicate register, specified by • programmer

## **Beyond Vector/SIMD architectures**

#### ~~

#### Vector/SIMD-extended architectures are hybrid approaches

- mix (super)scalar + vector op capabilities on a single device
- highly pipelined approach to reduce memory access penalty
- tightly-closed access to shared memory: lower latency

### Evolution of Vector/SIMD-extended architectures

### - PU (Processing Unit) cores with wider vector units

- x86 many-core: Intel MIC / Xeon KNL (more slides later)
- other many-core: IBM Power BlueGene/Q Compute, ShenWay 260

#### - coprocessors (require a host scalar processor): accelerator devices

- on disjoint physical memories (e.g., **Xeon KNC** with PCI-Expr, **PEZY-SC**)
- ISA-free architectures, code compiled to silica: FPGA
- focus on SIMT/SIMD to hide memory latency: GPU-type approach
- focus on tensor/neural nets cores: NVidia, IBM, Intel NNP, Google TPU
- heterogeneous PUs in a SoC: multicore PUs with GPU-cores

• ...

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

17

### Machine learning w/ neural nets & deep learning...



Key algorithms to train & classify use matrix products, but require lower precision numbers!

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

### NVidia Volta Architecture: the new Tensor Cores



### NVidia competitors with neural net features: IBM TrueNorth chip array (August'2014)



### NVidia competitors with neural net features: the IBM TrueNorth architecture



### NVidia competitors with neural net features: Intel Nervana Neural Network Processor, NNP



### NVidia competitors with neural net features: Google Tensor Processing Unit, TPU (April'17)

#### XX

- The Matrix Unit: 65,536 (256x256) . 8-bit multiply-accumulate units
- 700 MHz clock rate
- Peak: 92T operations/second 65,536 \* 2 \* 700M
- >25X as many MACs vs GPU
- >100X as many MACs vs CPU
- 4 MiB of on-chip Accumulator memory
- 24 MiB of on-chip Unified Buffer, (activation memory)
- 3.5X as much on-chip memory vs GPU
- Two 2133MHz DDR3 DRAM channels
- 8 GiB of off-chip weight DRAM memory

# **TPU: High-level Chip** Architecture



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

### NVidia competitors with neural net features: Google Tensor Processing Unit, TPU (April'17)



### NVidia competitors with neural net features: Google TPUv2 (September'17)



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

25

# **Beyond Vector/SIMD architectures**

#### ~~

- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-closed access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - PU (Processing Unit) cores with wider vector units
    - x86 many-core: Intel MIC / Xeon KNL
    - other many-core: IBM Power BlueGene/Q Compute, ShenWay 260
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., Xeon KNC with PCI-Expr, PEZY-SC)
    - ISA-free architectures, code compiled to silica: FPGA
    - · focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - focus on tensor/neural nets cores: NVidia, IBM, Intel NNP, Google TPU
  - heterogeneous PUs in a SoC: multicore PUs with GPU-cores
    - x86 multicore coupled with SIMT/SIMD cores: Intel i5/i7
    - ARMv8 cores coupled with SIMT/SIMD cores: NVidia Tegra