#### **Parallel Computing**



## **Master Informatics Eng.**

2021/22 *A.J.Proença* 

#### Computing accelerators: GPU & CUDA (most slides are borrowed)



#### **Compute accelerators**

Best accelerator for number crunching, namely intensive vector/matrix computing: **GPU** 

#### Other common compute accelerators:

- DSP: Digital Signal Processor, mostly used in telecommunication equipment, from cell phones to radio systems and TVs
- TPU: Tensor Processing Unit, optimized for operations with tensors (vectors and n-dimensional matrices), popular in AI applications, namely in autonomous driving
- FPGA: Field Programmable Gate Array, reconfigurable hardware; can be configured at run time to behave according to a given specification

#### Data Parallelism: SIMD CPU vs. GPU



# **Graphics Processing Units**

#### **SIMD Parallelism**


- Vector architectures
- SIMD & extensions
- Graphics Processor Units (GPUs)

Copyright © 2012, Elsevier Inc. All rights reserved.

- Question to GPU architects:
  - Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
- Key ideas:
  - Heterogeneous execution model
    - CPU is the *host*, GPU is the *device*
  - Develop a C-like programming language for GPU
  - Unify all forms of GPU parallelism as CUDA threads
  - Programming model follows SIMT: "Single Instruction Multiple Thread"



## *# cores/processing element in several computing devices*




## Theoretical peak performance in several computing devices (DP)



AJProença, Parallel Computing, MEI, UMinho, 2021/22

# **NVIDIA GPU Architecture**

- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor



#### **Early NVidia GPU Computing Modules**

*[Figure: early NVidia Streaming Multiprocessor (SM), with I-Cache, MT Issue, C-Cache, SP cores, SFUs, a DP unit and Shared Memory]*

# **NVIDIA GPU Memory Structures**

- Each SIMD lane has a private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains the stack frame, spilled registers, and private variables
- Each multithreaded SIMD processor (SM) also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors (SM) is GPU Memory, off-chip DRAM (Global Memory)
  - Host can read and write GPU memory



#### The NVidia Fermi architecture



# **Fermi Architecture Innovations**

#### Each SIMD processor has

- Two SIMD thread schedulers, two instruction dispatch units
- 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
- Thus, two threads of SIMD instructions are scheduled every two clock cycles
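
The chime quoted above follows from the lane count: a SIMD instruction that is 32 elements wide, issued on 16 lanes, occupies those lanes for 32/16 = 2 clock cycles. A minimal Python sketch of that arithmetic (the function name `chime_cycles` is illustrative, not an NVIDIA term):

```python
def chime_cycles(simd_width: int, lanes: int) -> int:
    """Clock cycles one SIMD instruction occupies the lanes (ceiling division)."""
    if simd_width <= 0 or lanes <= 0:
        raise ValueError("simd_width and lanes must be positive")
    return -(-simd_width // lanes)


# Fermi figures from the slide: SIMD width 32, 16 SIMD lanes.
fermi_chime = chime_cycles(simd_width=32, lanes=16)
print(fermi_chime)  # 2
```

With two SIMD thread schedulers per processor, each issuing one instruction per chime, this matches the statement that two threads of SIMD instructions are scheduled every two clock cycles.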



- Fast double precision
- Caches for GPU memory (16/64 KiB L1 per SM and a global 768 KiB L2)
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions



#### Fermi: Multithreading and Memory Hierarchy




### TOP500 list in November 2010: 3 systems in the top4 use Fermi GPUs



## HIGHLIGHTS: NOVEMBER 2010

- The Chinese Tianhe-1A system is the new No. 1 on the TOP500 and clearly in the lead with 2.57 petaflop/s performance.
- No. 3 is also a Chinese system called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs
- There are seven petaflop/s systems in the TOP10
- The U.S. is tops in petaflop/s with three systems performing at the petaflop/s level
- The two Chinese systems and the new Japanese Tsubame 2.0 system at No. 4 are all using NVIDIA GPUs to accelerate computation, and a total of 28 systems on the list are using GPU technology.

#### Families in NVidia Tesla GPUs (up to 2018)



#### From Fermi into Kepler: the Memory Hierarchy






Fermi: 16 SM, 512 CUDA-cores, *July'11*

#### From the GF110 to the **GK110 Kepler Architecture**






#### From Fermi to Kepler core: SM and the SMX Architecture

- SMX (Kepler streaming multiprocessor) block diagram, top to bottom:
  - Instruction Cache
  - 4 Warp Schedulers, each with 2 Dispatch units
  - Register File (65,536 x 32-bit)
  - 192 Cores, 64 DP Units, 32 LD/ST units, 32 SFUs
  - Interconnect Network
  - 64 KB Shared Memory / L1 Cache
  - 48 KB Read-Only Data Cache
  - 16 Tex units
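
The unit counts in the SMX diagram can be cross-checked with simple arithmetic. The sketch below assumes the Tesla K40 configuration from the Tesla product table (15 SMX, 192 FP32 cores per SMX) and the 65,536 x 32-bit register file shown in the diagram; the constant and function names are illustrative.

```python
SMX_PER_GPU = 15            # Tesla K40 (Kepler), from the Tesla product table
FP32_CORES_PER_SMX = 192    # cores drawn in one SMX
REGISTERS_PER_SMX = 65_536  # 32-bit registers in one SMX
BYTES_PER_REGISTER = 4      # 32 bits


def register_file_kib(registers: int, bytes_per_register: int = BYTES_PER_REGISTER) -> int:
    """Register file capacity in KiB."""
    return registers * bytes_per_register // 1024


cores_per_gpu = SMX_PER_GPU * FP32_CORES_PER_SMX
rf_per_smx_kib = register_file_kib(REGISTERS_PER_SMX)
rf_per_gpu_kib = rf_per_smx_kib * SMX_PER_GPU

print(cores_per_gpu, rf_per_smx_kib, rf_per_gpu_kib)  # 2880 256 3840
```

The three results agree with the K40 column of the product table: 2880 FP32 cores, 256 KB of registers per SM and 3840 KB per GPU.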



### The move from Kepler to Maxwell: from 15 SMXs to 48 SMMs in 6 GPCs










# From the GP100 to the GV100 Volta Architecture

Pascal: 60 SM, 3840 CUDA-cores, *November'15*

Volta: 84 SM (80 enabled in the Tesla V100, giving 5120 CUDA-cores), HBM on-package, June'17







#### From GV100 to Ampere: up to 8 GPC, 128 SMs total

Ampere: NVidia GA100, 8192 FP32 CUDA cores, 512 3<sup>rd</sup> generation Tensor Cores, 5 HBM2e, 10 <u>512-bit</u> mem controllers, *May'20*





#### **Tensor cores in Ampere**



#### Pascal vs. Turing tensor cores (animation)



| Tesla Product            | Tesla K40           | Tesla M40           | Tesla P100          | Tesla V100               |
|--------------------------|---------------------|---------------------|---------------------|--------------------------|
| GPU                      | GK180 (Kepler)      | GM200 (Maxwell)     | GP100 (Pascal)      | GV100 (Volta)            |
| SMs                      | 15                  | 24                  | 56                  | 80                       |
| TPCs                     | 15                  | 24                  | 28                  | 40                       |
| FP32 Cores / SM          | 192                 | 128                 | 64                  | 64                       |
| FP32 Cores / GPU         | 2880                | 3072                | 3584                | 5120                     |
| FP64 Cores / SM          | 64                  | 4                   | 32                  | 32                       |
| FP64 Cores / GPU         | 960                 | 96                  | 1792                | 2560                     |
| Tensor Cores / SM        | NA                  | NA                  | NA                  | 8                        |
| Tensor Cores / GPU       | NA                  | NA                  | NA                  | 640                      |
| GPU Boost Clock          | 810/875 MHz         | 1114 MHz            | 1480 MHz            | 1530 MHz                 |
| Peak FP32 TFLOP/s        | 5.04                | 6.8                 | 10.6                | 15.7                     |
| Peak FP64 TFLOP/s        | 1.68                | 0.21                | 5.3                 | 7.8                      |
| Peak Tensor Core TFLOP/s | NA                  | NA                  | NA                  | 125                      |
| Texture Units            | 240                 | 192                 | 224                 | 320                      |
| Memory Interface         | 384-bit GDDR5       | 384-bit GDDR5       | 4096-bit HBM2       | 4096-bit HBM2            |
| Memory Size              | Up to 12 GB         | Up to 24 GB         | 16 GB               | 16 GB                    |
| L2 Cache Size            | 1536 KB             | 3072 KB             | 4096 KB             | 6144 KB                  |
| Shared Memory Size / SM  | 16 KB/32 KB/48 KB   | 96 KB               | 64 KB               | Configurable up to 96 KB |
| Register File Size / SM  | 256 KB              | 256 KB              | 256 KB              | 256 KB                   |
| Register File Size / GPU | 3840 KB             | 6144 KB             | 14336 KB            | 20480 KB                 |
| TDP                      | 235 Watts           | 250 Watts           | 300 Watts           | 300 Watts                |
| Transistors              | 7.1 billion         | 8 billion           | 15.3 billion        | 21.1 billion             |
| GPU Die Size             | 551 mm<sup>2</sup>  | 601 mm<sup>2</sup>  | 610 mm<sup>2</sup>  | 815 mm<sup>2</sup>       |
| Manufacturing Process    | 28 nm               | 28 nm               | 16 nm FinFET+       | 12 nm FFN                |
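
The peak figures in the table are consistent with a common rule of thumb: peak FLOP/s = cores x 2 x boost clock. The factor 2 assumes that each core completes one fused multiply-add (two floating-point operations) per cycle; that factor is an assumption of this sketch, not something stated in the table. Core counts and clocks are copied from the table (using the 875 MHz boost clock for the K40).

```python
def peak_tflops(cores: int, boost_clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in TFLOP/s, assuming `flops_per_cycle` per core per cycle."""
    return cores * flops_per_cycle * boost_clock_mhz * 1e6 / 1e12


# Per-GPU core counts and boost clocks taken from the table.
gpus = {
    "K40":  {"fp32": 2880, "fp64": 960,  "mhz": 875},
    "M40":  {"fp32": 3072, "fp64": 96,   "mhz": 1114},
    "P100": {"fp32": 3584, "fp64": 1792, "mhz": 1480},
    "V100": {"fp32": 5120, "fp64": 2560, "mhz": 1530},
}

for name, gpu in gpus.items():
    fp32 = peak_tflops(gpu["fp32"], gpu["mhz"])
    fp64 = peak_tflops(gpu["fp64"], gpu["mhz"])
    print(f"{name}: FP32 {fp32:.2f} TFLOP/s, FP64 {fp64:.2f} TFLOP/s")
```

Rounded to the precision used in the table, the computed values reproduce its Peak FP32 and Peak FP64 rows.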

### Tesla accelerators: evolution

#### Ampere

#### SYSTEM SPECIFICATIONS (PEAK PERFORMANCE)

|                              | NVIDIA A100 for NVIDIA HGX™ / NVIDIA A100 for PCIe                    |
|------------------------------|-----------------------------------------------------------------------|
| GPU Architecture             | NVIDIA Ampere                                                         |
| Double-Precision Performance | FP64: 9.7 TFLOPS<br>FP64 Tensor Core: 19.5 TFLOPS                     |
| Single-Precision Performance | FP32: 19.5 TFLOPS<br>Tensor Float 32 (TF32): 156 TFLOPS \| 312 TFLOPS* |
| Half-Precision Performance   | 312 TFLOPS \| 624 TFLOPS*                                              |
| Bfloat16                     | 312 TFLOPS \| 624 TFLOPS*                                              |
| Integer Performance          | INT8: 624 TOPS \| 1,248 TOPS*<br>INT4: 1,248 TOPS \| 2,496 TOPS*       |
| GPU Memory                   | 40 GB HBM2                                                            |
| Memory Bandwidth             | 1.6 TB/sec                                                            |

Source of the Tesla product comparison (K40, M40, P100, V100): https://devblogs.nvidia.com/parallelforall/inside-volta/

#### **Tesla** evolution (2)

| Nvidia Datacenter GPU       | Nvidia Tesla V100 | Nvidia A100                  |
|-----------------------------|-------------------|------------------------------|
| GPU codename                | GV100             | GA100                        |
| GPU architecture            | Volta             | Ampere                       |
| Launch date                 | May 2017          | May 2020                     |
| GPU process                 | TSMC 12nm         | TSMC 7nm                     |
| Die size                    | 815 mm<sup>2</sup> | 826 mm<sup>2</sup>          |
| Transistor Count            | 21.1 billion      | 54 billion                   |
| FP64 CUDA cores             | 2,560             | 3,456                        |
| FP32 CUDA cores             | 5,120             | 6,912                        |
| Tensor Cores                | 640               | 432                          |
| Streaming Multiprocessors   | 80                | 108                          |
| Peak FP64                   | 7.8 teraflops     | 9.7 teraflops                |
| Peak FP64 Tensor Core       |                   | 19.5 teraflops               |
| Peak FP32                   | 15.7 teraflops    | 19.5 teraflops               |
| Peak FP32 Tensor Core       | -                 | 156 teraflops/312 teraflops* |
| Peak BFLOAT16 Tensor Core   | -                 | 312 teraflops/624 teraflops* |
| Peak FP16 Tensor Core       |                   | 312 teraflops/624 teraflops* |
| Peak INT8 Tensor Core       | 15.4              | 624 TOPS/1,248 TOPS*         |
| Peak INT4 Tensor Core       | -                 | 1,248 TOPS/2,496 TOPS*       |
| Mixed-precision Tensor Core | 125 teraflops     | 312 teraflops/624 teraflops* |
| Max TDP                     | 300 watts         | 400 watts                    |

\*Effective TOPS / TFLOPS using the new Sparsity feature

## The CUDA programming model



- Compute Unified Device Architecture
- CUDA is a recent programming model, designed for
  - a multicore CPU *host* coupled to a many-core *device*, where
  - devices have wide SIMD/SIMT parallelism, and
  - the *host* and the *device* do not share memory
- CUDA provides:
  - a thread abstraction to deal with SIMD
  - synchronization and data sharing between small groups of threads
- CUDA programs are written in C with extensions
- OpenCL is inspired by CUDA, but is hardware and software vendor neutral
  - its programming model is essentially identical

## **CUDA Devices and Threads**


- A compute device
  - is a coprocessor to the CPU or host
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - very little creation overhead, requires LARGE register bank
  - GPU needs 1000s of threads for full efficiency
    - multi-core CPU needs only a few


### CUDA basic model: Single-Program Multiple-Data (SPMD)

- CUDA: an integrated CPU + GPU application C program
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on the GPU, in thread blocks
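
The split between serial host code and a parallel kernel can be mimicked in plain Python. The sketch below is an emulation for illustration only, not CUDA code: `launch`, `vec_add_kernel` and their parameters are hypothetical names, and the loops run one after another where a GPU would run the threads in parallel. Separate `d_*` lists stand in for device memory, since the host and the device do not share memory.

```python
def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, c, n):
    """Kernel body: the same program is run by every thread (SPMD)."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # the last block may be partly idle
        c[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Emulated kernel launch: visit every thread of every block in turn."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def vec_add(h_a, h_b, block_dim=4):
    """Serial host code: copy in, launch the kernel, copy out."""
    if len(h_a) != len(h_b):
        raise ValueError("input vectors must have the same length")
    n = len(h_a)
    d_a, d_b = list(h_a), list(h_b)              # host -> device copies
    d_c = [0] * n                                # device output buffer
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
    launch(vec_add_kernel, grid_dim, block_dim, d_a, d_b, d_c, n)
    return list(d_c)                             # device -> host copy


print(vec_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

With 5 elements and blocks of 4 threads, two blocks are launched and the guard `i < n` keeps the three surplus threads of the second block from touching memory.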



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

## Programming Model: SPMD + SIMT/SIMD


- Hierarchy
  - Device => Grids
  - Grid => Blocks
  - Block => Warps
  - Warp => Threads
- Single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed in lock-step, called single-instruction multiple-thread (SIMT)
- A single instruction is executed on multiple threads (SIMD)
  - Warp size defines SIMD granularity (32 threads)
- Synchronization within a block uses shared memory
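
With a warp size of 32, the position of a thread inside its block determines both its warp and its lane within that warp. A small Python sketch of this bookkeeping, assuming threads are numbered linearly within the block and warps are formed from consecutive thread ids (the helper names are illustrative):

```python
WARP_SIZE = 32  # SIMD granularity quoted in the list above


def warps_per_block(block_size: int) -> int:
    """Number of warps needed to hold a block (the last one may be partly full)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def warp_and_lane(thread_id: int) -> tuple:
    """(warp index within the block, lane index within the warp)."""
    if thread_id < 0:
        raise ValueError("thread_id must be non-negative")
    return divmod(thread_id, WARP_SIZE)


print(warps_per_block(256))  # 8
print(warps_per_block(100))  # 4, the last warp has only 4 active threads
print(warp_and_lane(70))     # (2, 6)
```

Block sizes that are multiples of 32 avoid a partially filled last warp.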



Courtesy NVIDIA

### The Computational Grid: Block IDs and Thread IDs




#### Code example



## Terminology (and in NVidia)


- Threads of SIMD instructions (warps)
  - Each has its own IP (up to 48/64 per SIMD processor, Fermi/Kepler)
  - Thread scheduler uses scoreboard to dispatch
  - No data dependencies between threads!
  - Threads are organized into blocks (*thread blocks*) & executed in groups of 32 threads (warps)
    - Blocks are organized into a grid
- The <u>thread block scheduler</u> schedules blocks to SIMD processors (Streaming Multiprocessors)
- Within each SIMD processor:
  - 32 SIMD lanes (thread processors)
  - Wide and shallow compared to vector processors
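
The per-processor warp limits quoted above (48 for Fermi, 64 for Kepler) give an upper bound on resident threads and on how many blocks of a given size can be resident at once. The sketch below considers the warp limit only; real devices also cap blocks per processor, registers and shared memory, which are ignored here. Names are illustrative.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = {"Fermi": 48, "Kepler": 64}  # limits quoted in the list above


def max_resident_threads(arch: str) -> int:
    """Upper bound on threads resident on one SIMD processor."""
    return MAX_WARPS_PER_SM[arch] * WARP_SIZE


def max_resident_blocks(arch: str, block_size: int) -> int:
    """Blocks of `block_size` threads that fit under the warp limit alone."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    return MAX_WARPS_PER_SM[arch] // warps_per_block


print(max_resident_threads("Fermi"), max_resident_threads("Kepler"))  # 1536 2048
print(max_resident_blocks("Fermi", 256))                              # 6
```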


## **CUDA Thread Block**


- Programmer declares (Thread) Block:
  - Block size: 1 to 512 concurrent threads (1024 on devices of compute capability 2.0 and later)
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a Block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread id numbers within the Block
- Thread program uses thread id to select work and address shared data
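
For 2D and 3D blocks, the thread id used to select work is commonly derived by linearizing the (x, y, z) index, with x varying fastest. A Python sketch of that mapping and its inverse follows; the function names are hypothetical, and the tuples play the role of CUDA's `threadIdx` and `blockDim`.

```python
def linear_thread_id(idx, dim):
    """Flatten a 3D thread index (x, y, z) inside a block of shape (Dx, Dy, Dz)."""
    x, y, z = idx
    dx, dy, dz = dim
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread index outside block dimensions")
    return x + y * dx + z * dx * dy


def thread_index(tid, dim):
    """Inverse mapping: recover (x, y, z) from a linear thread id."""
    dx, dy, dz = dim
    if not 0 <= tid < dx * dy * dz:
        raise ValueError("thread id outside block")
    z, rest = divmod(tid, dx * dy)
    y, x = divmod(rest, dx)
    return (x, y, z)


block_dim = (8, 4, 2)                          # 64 threads, 3D shape
print(linear_thread_id((3, 2, 1), block_dim))  # 51
print(thread_index(51, block_dim))             # (3, 2, 1)
```

A 1D block is the special case `(Dx, 1, 1)`, where the linear id is simply x.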

#### CUDA Thread Block



#### **Parallel Memory Sharing**




### **CUDA Memory Model Overview**




- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
- The host can R/W global, constant, and texture memories
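
Per-block shared memory is what lets the threads of a block cooperate. The following Python emulation sketches a block-level tree reduction: each block copies its slice of global memory into a block-private `shared` list, then halves the number of active threads at each step. It is an illustrative model, not CUDA code; the end of each step stands in for a barrier (such as `__syncthreads()` in CUDA), and `block_sum` and `grid_sum` are hypothetical names.

```python
def block_sum(global_in, block_idx, block_dim):
    """Sum one block's slice of global memory using a block-private buffer."""
    base = block_idx * block_dim
    # Each thread loads one element from global into shared memory (0 past the end).
    shared = [global_in[base + t] if base + t < len(global_in) else 0
              for t in range(block_dim)]
    stride = block_dim // 2
    while stride > 0:
        for t in range(stride):            # only threads 0..stride-1 are active
            shared[t] += shared[t + stride]
        stride //= 2                       # barrier between steps
    return shared[0]


def grid_sum(global_in, block_dim=8):
    """One partial sum per block, then a final serial sum on the host."""
    if block_dim <= 0 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a positive power of two")
    grid_dim = (len(global_in) + block_dim - 1) // block_dim
    partial = [block_sum(global_in, b, block_dim) for b in range(grid_dim)]
    return sum(partial)


print(grid_sum(list(range(20))))  # 190
```

Only the initial load reads per-grid global memory; every later step works on the per-block buffer.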



#### Hardware Implementation: Memory Architecture


- Device memory (DRAM)
  - Slow (200~300 cycles)
  - <u>Local</u>, global, constant, and texture memory
- On-chip memory
  - Fast (1 cycle)
  - Registers, shared memory, constant/texture cache
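
The gap between the two latencies quoted above shows why on-chip reuse matters. As a rough illustration, the sketch below computes an average access cost for a given fraction of accesses served on chip. The 1-cycle and 300-cycle values are the slide's round figures (taking the upper end of the DRAM range), and the weighted-average model is a simplifying assumption that ignores latency hiding by multithreading.

```python
ON_CHIP_CYCLES = 1        # registers, shared memory, constant/texture cache
DEVICE_DRAM_CYCLES = 300  # upper end of the device-memory range on the slide


def average_access_cycles(on_chip_fraction: float) -> float:
    """Weighted average latency for a given fraction of on-chip accesses."""
    if not 0.0 <= on_chip_fraction <= 1.0:
        raise ValueError("on_chip_fraction must be between 0 and 1")
    off_chip_fraction = 1.0 - on_chip_fraction
    return on_chip_fraction * ON_CHIP_CYCLES + off_chip_fraction * DEVICE_DRAM_CYCLES


for fraction in (0.0, 0.5, 0.9, 1.0):
    print(fraction, average_access_cycles(fraction))
```

Even with 90% of accesses served on chip, the average stays near 31 cycles in this model, which is why GPUs also rely on many resident threads to hide memory latency.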



#### Terminology: CUDA and OpenCL


