### Advanced Architectures



## Master Informatics Eng.

2020/21

A.J.Proença

### **Data Parallelism with GPUs**

(most slides are borrowed)

### Data Parallelism: SIMD CPU vs. GPU





## **Graphics Processing Units**

### SIMD Parallelism


- Vector architectures
- SIMD & extensions
- Graphics Processing Units (GPUs)

Copyright © 2012, Elsevier Inc. All rights reserved.

- Question to GPU architects:
  - Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Key ideas:

- Heterogeneous execution model
  - CPU is the host, GPU is the device
- Develop a C-like programming language for GPU
- Unify all forms of GPU parallelism as CUDA threads
- Programming model follows SIMT: "Single Instruction Multiple Thread"

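As a rough illustration of the SIMT idea in the key ideas above, the sketch below emulates a kernel launch sequentially in plain Python. It is not CUDA: `saxpy_kernel` and `launch` are hypothetical names chosen for this example, and a real GPU would run the threads in parallel rather than in a loop.

```python
# Illustrative emulation only: real CUDA threads run in parallel on the GPU.
def saxpy_kernel(thread_id, a, x, y):
    """Code executed by every thread; each thread handles one element."""
    if thread_id < len(x):  # threads beyond the data size do nothing
        y[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_blocks, threads_per_block, *args):
    """Hypothetical host-side launcher: runs the same kernel once per thread."""
    for block_id in range(num_blocks):
        for local_id in range(threads_per_block):
            kernel(block_id * threads_per_block + local_id, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
launch(saxpy_kernel, 2, 4, 2.0, x, y)  # 2 blocks x 4 threads cover 5 elements
print(y)
```

Every thread executes the same instructions and differs only in its index, which is the essence of "Single Instruction Multiple Thread".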


## Number of cores/processing elements in several computing devices


Key question: what is a core?

- a) IU+FPU? GPU-type...
- b) A SIMD processor? CPU-type...

This updated slide and in this course: b)

Note: the web link with these plots was updated in Aug'16



# Theoretical peak performance (DP) in several computing devices



### **NVIDIA GPU Architecture**

- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor
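
Two of the similarities listed above, scatter-gather transfers and mask registers, can be sketched in a few lines of Python. This is only a sequential model with hypothetical helper names (`gather`, `scatter`, `masked_add`); on real hardware each list element would correspond to one SIMD lane working in parallel.

```python
def gather(memory, indices):
    """Load non-contiguous elements into one vector register."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store a vector register back to non-contiguous locations."""
    for i, v in zip(indices, values):
        memory[i] = v


def masked_add(dst, a, b, mask):
    """Lane i writes a[i] + b[i] only where mask[i] is set; other lanes keep dst[i]."""
    return [ai + bi if m else di for di, ai, bi, m in zip(dst, a, b, mask)]


memory = [0, 10, 20, 30, 40, 50, 60, 70]
indices = [6, 0, 3, 5]
vec = gather(memory, indices)
mask = [v > 20 for v in vec]               # models "if (v > 20) v = v + 1"
vec = masked_add(vec, vec, [1, 1, 1, 1], mask)
scatter(memory, indices, vec)
print(memory)
```

The mask lets all lanes follow the same instruction stream while only some of them commit a result, which is how a conditional is handled without a per-lane branch.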



## Early NVidia GPU Computing Modules





## **NVIDIA GPU Memory Structure**

- Each SIMD lane has a private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor (SM) also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors (SM) is GPU Memory, off-chip DRAM (Global Memory)
  - Host can read and write GPU memory



### The NVidia Fermi architecture



### Fermi Architecture Innovations

- Each SIMD processor has
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles



- Fast double precision
- Caches for GPU memory (16/64 KiB L1 per SM and a global 768 KiB L2)
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions
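
The chime quoted for Fermi follows directly from the SIMD width and the lane count. The sketch below reproduces that arithmetic; the function name is hypothetical, and the throughput line assumes that each of the two schedulers drives its own set of 16 lanes, which the slide does not state explicitly.

```python
def chime_cycles(simd_width, simd_lanes):
    """Clock cycles needed to push one SIMD instruction through the lanes."""
    if simd_width % simd_lanes != 0:
        raise ValueError("SIMD width must be a multiple of the lane count")
    return simd_width // simd_lanes


SIMD_WIDTH = 32   # elements per SIMD instruction (from the slide)
SIMD_LANES = 16   # lanes available to one instruction (from the slide)
SCHEDULERS = 2    # SIMD thread schedulers per SIMD processor (from the slide)

chime = chime_cycles(SIMD_WIDTH, SIMD_LANES)
# ASSUMPTION: each scheduler issues to its own 16 lanes, so two SIMD
# instructions complete every chime.
elements_per_cycle = SCHEDULERS * SIMD_WIDTH // chime
print(chime, elements_per_cycle)
```

Under that assumption the result is 32 element operations per cycle per SIMD processor, which agrees with 512 CUDA cores spread over 16 SMs.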



# Fermi: Multithreading and Memory Hierarchy





# TOP500 list in November 2010: 3 systems in the top 4 use Fermi GPUs





### **HIGHLIGHTS: NOVEMBER 2010**

- The Chinese Tianhe-1A system is the new No. 1 on the TOP500 and clearly in the lead with 2.57 petaflop/s performance.
- No. 3 is also a Chinese system called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs.
- There are seven petaflop/s systems in the TOP10.
- The U.S. is tops in petaflop/s with three systems performing at the petaflop/s level.
- The two Chinese systems and the new Japanese Tsubame 2.0 system at No. 4 are all using NVIDIA GPUs to accelerate computation, and a total of 28 systems on the list are using GPU technology.

### Families in NVidia Tesla GPUs



## From Fermi into Kepler: The Memory Hierarchy





### **Kepler Memory Hierarchy**




# From the GF110 to the GK110 Kepler Architecture

Fermi: 16 SM, 512 CUDA-cores, *July'11*

Kepler: 15 SMX, 2880 CUDA-cores, *October'13*



[Figure: Fermi SM block diagram, showing the instruction cache, two warp schedulers with their dispatch units, the register file (32,768 x 32-bit), CUDA cores, LD/ST units, SFUs, the interconnect network, 64 KB shared memory / L1 cache, and the uniform cache]

## From Fermi to Kepler core: SM and the SMX Architecture



Kepler SMX: 192 CUDA-cores

Ratio **DP** unit : **SP** unit -> 1 : 3

AJProença, Advanced Architectures, MiEI, UMinho, 2020/21



# From the GK110 to the GM200 Maxwell Architecture



Maxwell: 24 SMM, 3072 CUDA-cores, *November'15*






# From Kepler to Maxwell core: SMX and the SMM Architecture

Maxwell SMM: 128 CUDA-cores

Ratio **DP** unit : **SP** unit -> 1 : 32
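
The DP:SP ratios translate into a concrete number of double-precision units per multiprocessor. The sketch below works this out for the Kepler SMX (192 cores, ratio 1:3, 15 SMX) and the Maxwell SMM (128 cores, ratio 1:32, 24 SMM), using only figures quoted in these slides; the dictionary layout and function name are hypothetical.

```python
def dp_units_per_sm(sp_cores_per_sm, sp_per_dp):
    """Number of DP units in one multiprocessor for a DP:SP ratio of 1:sp_per_dp."""
    return sp_cores_per_sm // sp_per_dp


chips = {
    "Kepler GK110": {"sms": 15, "sp_per_sm": 192, "sp_per_dp": 3},
    "Maxwell GM200": {"sms": 24, "sp_per_sm": 128, "sp_per_dp": 32},
}

summary = {}
for name, c in chips.items():
    dp = dp_units_per_sm(c["sp_per_sm"], c["sp_per_dp"])
    summary[name] = {
        "sp_total": c["sms"] * c["sp_per_sm"],  # CUDA cores on the chip
        "dp_per_sm": dp,
        "dp_total": c["sms"] * dp,
    }
    print(name, summary[name])
```

The drop from 64 to 4 DP units per multiprocessor shows why Maxwell, despite having more CUDA cores, has a much lower double-precision peak than Kepler.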







# From the GM200 to the GP100 Pascal Architecture

Maxwell: 24 SMM, 3072 CUDA-cores, *November'15*

Pascal: 60 SM, 3840 CUDA-cores, 4 HBM on-package, *September'16*








# From the GP100 to the GV100 Volta Architecture

Pascal: 60 SM, 3840 CUDA-cores, *September'16*

Volta: 84 SM (80 enabled in the Tesla V100, giving 5120 CUDA-cores), HBM on-package, *June'17*








# From GV100 to Ampere: up to 8 GPC, 128 SMs total

Volta: 84 SM, 5120 CUDA-cores, *June'17*

Ampere: NVIDIA GA100, 128 SM, 8192 FP32 CUDA Cores, 512 3<sup>rd</sup> generation Tensor Cores, 6 HBM2, 12 512-bit mem controllers, *May'20*

Ampere:

- <u>GA100</u> for graphics w/ 8 GPC
- <u>A100</u> for HPC & AI w/ 7 GPC





### Ampere Architecture




Ampere SM:

- 64x FP32 CUDA Cores/SM
- 32x FP64 CUDA Cores/SM
- 4x 3<sup>rd</sup> generation Tensor Cores
- Tensor Cores support FP64, FP32, TF32, FP16, BF16, INT8...
- 1024 dense FP16/FP32 FMA ops/cycle
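
The per-SM FMA rate can be turned into a chip-level peak with a short calculation. The sketch below does so for the A100, taking the 108 SMs from the specification table later in this document. The 1.41 GHz boost clock is an assumption that does not appear in these slides, and the function name is hypothetical.

```python
FMA_PER_CYCLE_PER_SM = 1024  # from the slide: dense FP16/FP32 FMA ops/cycle
TENSOR_CORES_PER_SM = 4      # from the slide
SMS_IN_A100 = 108            # from the A100 specification table
BOOST_CLOCK_GHZ = 1.41       # ASSUMPTION: not stated in these slides


def tensor_peak_tflops(sms, fma_per_cycle_per_sm, clock_ghz):
    """Peak TFLOP/s, counting each fused multiply-add as 2 floating-point ops."""
    flops_per_cycle = sms * fma_per_cycle_per_sm * 2
    return flops_per_cycle * clock_ghz * 1e9 / 1e12


fma_per_tensor_core = FMA_PER_CYCLE_PER_SM // TENSOR_CORES_PER_SM
peak = tensor_peak_tflops(SMS_IN_A100, FMA_PER_CYCLE_PER_SM, BOOST_CLOCK_GHZ)
print(fma_per_tensor_core, round(peak, 1))
```

With the assumed clock, the result lands close to the 312 teraflops listed for the A100 FP16 Tensor Core peak.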




## Tensor cores in Ampere

### **Tensor**: a multidimensional array
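
A minimal Python sketch of this definition, using nested lists as tensors of increasing rank. The `matmul_add` function illustrates the matrix fused multiply-add (D = A x B + C) that the FMA rates quoted for tensor cores refer to; both function names are hypothetical, and the code says nothing about how the hardware implements the operation.

```python
def shape(tensor):
    """Return the dimensions of a regular nested-list tensor."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)


def matmul_add(a, b, c):
    """D = A x B + C for rank-2 tensors (matrices) stored as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]


scalar = 3.0                                                # rank 0
vector = [1.0, 2.0, 3.0]                                    # rank 1
matrix = [[1, 2], [3, 4]]                                   # rank 2
cube = [[[0] * 4 for _ in range(3)] for _ in range(2)]      # rank 3

print(shape(scalar), shape(vector), shape(matrix), shape(cube))
print(matmul_add(matrix, [[5, 6], [7, 8]], [[1, 1], [1, 1]]))
```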











## Pascal vs. Turing tensor cores (animation)





## Volta and Ampere specifications



| Nvidia Datacenter GPU       | Nvidia Tesla V100 | Nvidia A100                  |
|-----------------------------|-------------------|------------------------------|
| GPU codename                | GV100             | GA100                        |
| GPU architecture            | Volta             | Ampere                       |
| Launch date                 | May 2017          | May 2020                     |
| GPU process                 | TSMC 12nm         | TSMC 7nm                     |
| Die size                    | 815 mm²           | 826 mm²                      |
| Transistor Count            | 21.1 billion      | 54 billion                   |
| FP64 CUDA cores             | 2,560             | 3,456                        |
| FP32 CUDA cores             | 5,120             | 6,912                        |
| Tensor Cores                | 640               | 432                          |
| Streaming Multiprocessors   | 80                | 108                          |
| Peak FP64                   | 7.8 teraflops     | 9.7 teraflops                |
| Peak FP64 Tensor Core       | -                 | 19.5 teraflops               |
| Peak FP32                   | 15.7 teraflops    | 19.5 teraflops               |
| Peak TF32 Tensor Core       | -                 | 156 teraflops/312 teraflops* |
| Peak BFLOAT16 Tensor Core   | -                 | 312 teraflops/624 teraflops* |
| Peak FP16 Tensor Core       | -                 | 312 teraflops/624 teraflops* |
| Peak INT8 Tensor Core       | -                 | 624 TOPS/1,248 TOPS*         |
| Peak INT4 Tensor Core       | -                 | 1,248 TOPS/2,496 TOPS*       |
| Mixed-precision Tensor Core | 125 teraflops     | 312 teraflops/624 teraflops* |
| Max TDP                     | 300 watts         | 400 watts                    |

| Tesla Product             | Tesla K40         | Tesla M40       | Tesla P100     | Tesla V100               |
|---------------------------|-------------------|-----------------|----------------|--------------------------|
| GPU                       | GK180 (Kepler)    | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta)            |
| SMs                       | 15                | 24              | 56             | 80                       |
| TPCs                      | 15                | 24              | 28             | 40                       |
| FP32 Cores / SM           | 192               | 128             | 64             | 64                       |
| FP32 Cores / GPU          | 2880              | 3072            | 3584           | 5120                     |
| FP64 Cores / SM           | 64                | 4               | 32             | 32                       |
| FP64 Cores / GPU          | 960               | 96              | 1792           | 2560                     |
| Tensor Cores / SM         | NA                | NA              | NA             | 8                        |
| Tensor Cores / GPU        | NA                | NA              | NA             | 640                      |
| GPU Boost Clock           | 810/875 MHz       | 1114 MHz        | 1480 MHz       | 1530 MHz                 |
| Peak FP32 TFLOP/s*        | 5.04              | 6.8             | 10.6           | 15.7                     |
| Peak FP64 TFLOP/s*        | 1.68              | 0.21            | 5.3            | 7.8                      |
| Peak Tensor Core TFLOP/s* | NA                | NA              | NA             | 125                      |
| Texture Units             | 240               | 192             | 224            | 320                      |
| Memory Interface          | 384-bit GDDR5     | 384-bit GDDR5   | 4096-bit HBM2  | 4096-bit HBM2            |
| Memory Size               | Up to 12 GB       | Up to 24 GB     | 16 GB          | 16 GB                    |
| L2 Cache Size             | 1536 KB           | 3072 KB         | 4096 KB        | 6144 KB                  |
| Shared Memory Size / SM   | 16 KB/32 KB/48 KB | 96 KB           | 64 KB          | Configurable up to 96 KB |
| Register File Size / SM   | 256 KB            | 256 KB          | 256 KB         | 256 KB                   |
| Register File Size / GPU  | 3840 KB           | 6144 KB         | 14336 KB       | 20480 KB                 |
| TDP                       | 235 Watts         | 250 Watts       | 300 Watts      | 300 Watts                |
| Transistors               | 7.1 billion       | 8 billion       | 15.3 billion   | 21.1 billion             |
| GPU Die Size              | 551 mm²           | 601 mm²         | 610 mm²        | 815 mm²                  |
| Manufacturing Process     | 28 nm             | 28 nm           | 16 nm FinFET+  | 12 nm FFN                |

Source: https://devblogs.nvidia.com/parallelforall/inside-volta
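
The peak TFLOP/s rows in this table can be cross-checked from the core counts and boost clocks in the same table, assuming each core completes one fused multiply-add (2 floating-point operations) per cycle. That per-core rate is an assumption of this sketch, as are the names used; for the K40 the higher of the two listed boost clocks (875 MHz) is taken.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_cycle=2):
    """Theoretical peak in TFLOP/s: cores x FLOPs per cycle x clock."""
    return cores * flops_per_core_per_cycle * boost_mhz * 1e6 / 1e12


# (FP32 cores, FP64 cores, boost clock in MHz), all taken from the table
gpus = {
    "K40": (2880, 960, 875),
    "M40": (3072, 96, 1114),
    "P100": (3584, 1792, 1480),
    "V100": (5120, 2560, 1530),
}

results = {
    name: (peak_tflops(fp32, mhz), peak_tflops(fp64, mhz))
    for name, (fp32, fp64, mhz) in gpus.items()
}
for name, (sp, dp) in results.items():
    print(f"{name}: FP32 {sp:.2f} TFLOP/s, FP64 {dp:.2f} TFLOP/s")
```

The computed values agree with the tabulated peaks to within rounding, which suggests the table's peak figures are derived in this way.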

# GPU accelerators: evolution

### Ampere SYSTEM SPECIFICATIONS (PEAK PERFORMANCE)

|                              | NVIDIA A100 for NVIDIA HGX™ / NVIDIA A100 for PCIe                    |
|------------------------------|-----------------------------------------------------------------------|
| GPU Architecture             | NVIDIA Ampere                                                         |
| Double-Precision Performance | FP64: 9.7 TFLOPS<br>FP64 Tensor Core: 19.5 TFLOPS                     |
| Single-Precision Performance | FP32: 19.5 TFLOPS<br>Tensor Float 32 (TF32): 156 TFLOPS / 312 TFLOPS* |
| Half-Precision Performance   | 312 TFLOPS / 624 TFLOPS*                                              |
| Bfloat16                     | 312 TFLOPS / 624 TFLOPS*                                              |
| Integer Performance          | INT8: 624 TOPS / 1,248 TOPS*<br>INT4: 1,248 TOPS / 2,496 TOPS*        |
| GPU Memory                   | 40 GB HBM2                                                            |
| Memory Bandwidth             | 1.6 TB/sec                                                            |
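
One way to read the peak performance and memory bandwidth rows together is to divide one by the other, giving the number of floating-point operations the GPU can perform per byte fetched from memory. The sketch below does this for the FP64 figure; it is a back-of-the-envelope calculation with hypothetical names, not a measured result.

```python
PEAK_FP64_TFLOPS = 9.7      # from the table: FP64 (non-Tensor Core)
MEM_BANDWIDTH_TBPS = 1.6    # from the table: memory bandwidth in TB/sec
BYTES_PER_DOUBLE = 8        # size of one FP64 value


def flops_per_byte(peak_tflops, bandwidth_tbps):
    """Peak floating-point operations available per byte of memory traffic."""
    return peak_tflops / bandwidth_tbps


balance = flops_per_byte(PEAK_FP64_TFLOPS, MEM_BANDWIDTH_TBPS)
per_double = balance * BYTES_PER_DOUBLE
print(f"{balance:.2f} FLOP/byte, {per_double:.1f} FLOP per double loaded")
```

A kernel that performs fewer operations per loaded value than this ratio would be limited by memory bandwidth rather than by the peak compute rate.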