### **Master Informatics Eng.**

# 2018/19 A.J.Proença

# Beyond traditional PUs (GPU/CUDA, Tensor Cores, ...) (most slides are borrowed)

AJProença, Parallel Computing, MiEI, UMinho, 2018/19


### **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - CPU cores with wider vector units
    - x86 many-core: Intel MIC / Xeon KNL
    - IBM Power cores with SIMD extensions: BlueGene/Q Compute
    - other many-core: **ShenWay** 260
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., **Xeon KNC** with PCI-Express, **PEZY-SC**)
    - ISA-free architectures, code compiled to silicon: FPGA
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
  - heterogeneous processors (multicore with GPU-cores, SoC)

• ...

### **Graphical Processing Units**


- Question to GPU architects:
  - Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
- Key ideas:
  - Heterogeneous execution model
    - CPU is the host, GPU is the device
  - Develop a C-like programming language for GPU
  - Unify all forms of GPU parallelism as CUDA threads
  - Programming model follows SIMT: "Single Instruction Multiple Thread"

Copyright © 2012, Elsevier Inc. All rights reserved.


### #cores/processing element in several devices




http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

# Theoretical peak performance in several computing devices (DP)




# Theoretical peak FP ops per clock cycle in several computing devices (DP)



- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor


### Early NVidia GPU Computing Modules




### **NVIDIA GPU Memory Structures**

- Each SIMD Lane has private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor (SM) also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors (SM) is GPU Memory, off-chip DRAM (Global Memory)
  - Host can read and write GPU memory


### The NVidia Fermi architecture




### Fermi Architecture Innovations


- Each SIMD processor has
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
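
The lane arithmetic in the list above can be sketched in plain C. This is only an illustration: the function names are hypothetical, and it assumes that a SIMD instruction occupies its lanes for `ceil(width / lanes)` clock cycles and that each scheduler drives its own group of lanes.

```c
#include <assert.h>

/* Clock cycles needed to issue one SIMD instruction for a whole warp
   (the "chime"), when simd_width threads share `lanes` physical lanes. */
int chime_cycles(int simd_width, int lanes)
{
    assert(simd_width > 0 && lanes > 0);
    return (simd_width + lanes - 1) / lanes;   /* ceiling division */
}

/* SIMD instructions completed per clock cycle by `schedulers` schedulers,
   assuming each one feeds its own group of `lanes` lanes. */
double simd_instr_per_cycle(int schedulers, int simd_width, int lanes)
{
    return (double)schedulers / chime_cycles(simd_width, lanes);
}
```

With the Fermi figures from the slide, `chime_cycles(32, 16)` is 2, and `simd_instr_per_cycle(2, 32, 16)` is 1.0, i.e. two threads of SIMD instructions every two clock cycles.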



- Fast double precision
- Caches for GPU memory (16/64 KB L1 per SM and a global 768 KB L2)
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions


### **Beyond Vector/SIMD architectures**

- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - PU (Processing Unit) cores with wider vector units
    - x86 many-core: Intel MIC / Xeon KNL
    - other many-core: IBM BlueGene/Q Compute, ShenWay 260
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., **Xeon KNC** with PCI-Express, **PEZY-SC**)
    - ISA-free architectures, code compiled to silicon: FPGA
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - ...
  - heterogeneous PUs in a SoC: multicore PUs with GPU-cores
    - ...

# Fermi: Multithreading and Memory Hierarchy





# TOP500 list in November 2010: 3 systems in the top4 use Fermi GPUs




### HIGHLIGHTS: NOVEMBER 2010

- The Chinese Tianhe-1A system is the new No. 1 on the TOP500 and clearly in the lead with 2.57 petaflop/s performance.
- No. 3 is also a Chinese system called Nebulae, built from a Dawning TC3600 Blade system with Intel X5650 processors and NVIDIA Tesla C2050 GPUs
- There are seven petaflop/s systems in the TOP10
- The U.S. is tops in petaflop/s with three systems performing at the petaflop/s level
- The two Chinese systems and the new Japanese Tsubame 2.0 system at No. 4 are all using NVIDIA GPUs to accelerate computation and a total of 28 systems on the list are using GPU technology.

### Families in NVidia Tesla GPUs




# From Fermi into Kepler: The Memory Hierarchy








# Tesla accelerators: recent evolution

| Tesla Product             | Tesla K40         | Tesla M40       | Tesla P100     | Tesla V100               |
|---------------------------|-------------------|-----------------|----------------|--------------------------|
| GPU                       | GK180 (Kepler)    | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta)            |
| SMs                       | 15                | 24              | 56             | 80                       |
| TPCs                      | 15                | 24              | 28             | 40                       |
| FP32 Cores / SM           | 192               | 128             | 64             | 64                       |
| FP32 Cores / GPU          | 2880              | 3072            | 3584           | 5120                     |
| FP64 Cores / SM           | 64                | 4               | 32             | 32                       |
| FP64 Cores / GPU          | 960               | 96              | 1792           | 2560                     |
| Tensor Cores / SM         | NA                | NA              | NA             | 8                        |
| Tensor Cores / GPU        | NA                | NA              | NA             | 640                      |
| GPU Boost Clock           | 810/875 MHz       | 1114 MHz        | 1480 MHz       | 1530 MHz                 |
| Peak FP32 TFLOP/s*        | 5.04              | 6.8             | 10.6           | 15.7                     |
| Peak FP64 TFLOP/s*        | 1.68              | 0.21            | 5.3            | 7.8                      |
| Peak Tensor Core TFLOP/s* | NA                | NA              | NA             | 125                      |
| Texture Units             | 240               | 192             | 224            | 320                      |
| Memory Interface          | 384-bit GDDR5     | 384-bit GDDR5   | 4096-bit HBM2  | 4096-bit HBM2            |
| Memory Size               | Up to 12 GB       | Up to 24 GB     | 16 GB          | 16 GB                    |
| L2 Cache Size             | 1536 KB           | 3072 KB         | 4096 KB        | 6144 KB                  |
| Shared Memory Size / SM   | 16 KB/32 KB/48 KB | 96 KB           | 64 KB          | Configurable up to 96 KB |
| Register File Size / SM   | 256 KB            | 256 KB          | 256 KB         | 256 KB                   |
| Register File Size / GPU  | 3840 KB           | 6144 KB         | 14336 KB       | 20480 KB                 |
| TDP                       | 235 Watts         | 250 Watts       | 300 Watts      | 300 Watts                |
| Transistors               | 7.1 billion       | 8 billion       | 15.3 billion   | 21.1 billion             |
| GPU Die Size              | 551 mm²           | 601 mm²         | 610 mm²        | 815 mm²                  |
| Manufacturing Process     | 28 nm             | 28 nm           | 16 nm FinFET+  | 12 nm FFN                |

https://devblogs.nvidia.com/parallelforall/inside-volta/
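
Several rows of the Tesla table can be cross-checked with simple arithmetic. The sketch below is not taken from the slides: it assumes that each core retires one fused multiply-add (counted as 2 floating-point operations) per clock cycle at the boost clock, and the function names are hypothetical.

```c
#include <assert.h>

/* Peak TFLOP/s, assuming one fused multiply-add (2 floating-point
   operations) per core per clock cycle at the given boost clock. */
double peak_tflops(int cores, double boost_clock_mhz)
{
    assert(cores > 0 && boost_clock_mhz > 0.0);
    return cores * 2.0 * boost_clock_mhz * 1e6 / 1e12;
}

/* Per-GPU total of a per-SM quantity (cores, KB of registers, ...). */
int per_gpu(int per_sm, int sms)
{
    assert(per_sm >= 0 && sms > 0);
    return per_sm * sms;
}
```

For the V100 column, `per_gpu(64, 80)` gives 5120 FP32 cores and `per_gpu(256, 80)` gives 20480 KB of registers; `peak_tflops(5120, 1530.0)` evaluates to about 15.67, which the table rounds to 15.7, and `peak_tflops(2880, 875.0)` gives the 5.04 listed for the K40.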




### Top 10 greenest HPC systems: Green500, Nov'17

| Rank | TOP500 Rank | System | Cores | Rmax (TFlop/s) | Power (kW) | Power Efficiency (GFlops/watt) |
|------|-------------|--------|-------|----------------|------------|--------------------------------|
| 1 | 259 | Shoubu system B - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.; Advanced Center for Computing and Communication, RIKEN, Japan | 794,400 | 842.0 | 50 | 17.009 |
| 2 | 307 | Suiren2 - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.; High Energy Accelerator Research Organization / KEK, Japan | 762,624 | 788.2 | 47 | 16.759 |
| 3 | 276 | Sakura - ZettaScaler-2.2, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.; PEZY Computing K.K., Japan | 794,400 | 824.7 | 50 | 16.657 |
| 4 | 149 | DGX SaturnV Volta - NVIDIA DGX-1 Volta36, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR; NVIDIA Corporation, United States | 22,440 | 1,070.0 | 97 | 15.113 |
| 5 | 4 | | 19,860,000 | 19,135.8 | 1,350 | 14.173 |
| 6 | 13 | TSUBAME3.0 - SGI ICE XA, IP139-SXM2, Xeon E5-2680v4 14C 2.4GHz, Intel Omni-Path, NVIDIA Tesla P100 SXM2, HPE; GSIC Center, Tokyo Institute of Technology, Japan | | 8,125.0 | 792 | 13.704 |
| 7 | 195 | AIST AI Cloud - NEC 4U-8GPU Server, Xeon E5-2630Lv4 10C 1.8GHz, Infiniband EDR, NVIDIA Tesla P100 SXM2, NEC; National Institute of Advanced Industrial Science and Technology, Japan | 23,400 | 961.0 | 76 | 12.681 |
| 8 | 419 | RAIDEN GPU subsystem - NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100, Fujitsu; Center for Advanced Intelligence Project, RIKEN, Japan | 11,712 | 635.1 | 60 | 10.603 |
| 9 | 115 | Wilkes-2 - Dell C4130, Xeon E5-2650v4 12C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100, Dell EMC; University of Cambridge, United Kingdom | 21,240 | 1,193.0 | 114 | 10.428 |
| 10 | 3 | Piz Daint - Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100, Cray Inc.; Swiss National Supercomputing Centre (CSCS), Switzerland | 361,760 | 19,590.0 | 2,272 | 10.398 |



### The CUDA programming model

- Compute Unified Device Architecture
- CUDA is a recent programming model, designed for
  - a multicore CPU *host* coupled to a many-core *device*, where
  - devices have wide SIMD/SIMT parallelism, and
  - the host and the device do not share memory
- CUDA provides:
  - a thread abstraction to deal with SIMD
  - synchronization & data sharing between small groups of threads
- CUDA programs are written in C with extensions
- OpenCL inspired by CUDA, but hw & sw vendor neutral
  - programming model essentially identical

### **CUDA Devices and Threads**


- A compute device
  - is a coprocessor to the CPU or host
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - very little creation overhead, requires LARGE register bank
  - GPU needs 1000s of threads for full efficiency
    - multi-core CPU needs only a few


# CUDA basic model: Single-Program Multiple-Data (SPMD)


- CUDA integrated CPU + GPU application C program
  - Serial C code executes on CPU
  - Parallel Kernel C code executes on GPU thread blocks



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

### Programming Model: SPMD + SIMT/SIMD


- Hierarchy
  - Device => Grids
  - Grid => Blocks
  - Block => Warps
  - Warp => Threads
- Single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed in lock-step, called single-instruction multiple-thread (SIMT)
- A single instruction is executed on multiple threads (SIMD)
  - Warp size defines SIMD granularity (32 threads)
- Synchronization within a block uses shared memory

Figure (courtesy NVIDIA): serial code runs on the CPU and alternates with kernels on the GPU; each kernel runs as a grid of blocks (Grid 1, Grid 2), and each block is made of threads.

# The Computational Grid: Block IDs and Thread IDs


- A kernel runs on a computational grid of thread blocks
  - Threads share global memory
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- A thread block is a batch of threads that can cooperate by:
  - Synchronizing their execution with a barrier
  - Efficiently sharing data through a low latency shared memory
  - Two threads from two different blocks cannot cooperate
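
How a thread could turn its block ID and thread ID into the data element it works on can be sketched in plain C. The `idx3` type and both functions are hypothetical stand-ins for CUDA's built-in index variables, and the row-major matrix layout is an assumption made for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in index/extent triples. */
typedef struct { int x, y, z; } idx3;

/* Global (x, y) coordinate of a thread in a 2D grid of 2D blocks,
   mirroring blockIdx * blockDim + threadIdx in each dimension. */
idx3 global_xy(idx3 block_id, idx3 block_dim, idx3 thread_id)
{
    assert(thread_id.x < block_dim.x && thread_id.y < block_dim.y);
    idx3 g = { block_id.x * block_dim.x + thread_id.x,
               block_id.y * block_dim.y + thread_id.y,
               0 };
    return g;
}

/* Row-major element offset of that coordinate in a matrix `width` wide. */
int element_offset(idx3 g, int width)
{
    return g.y * width + g.x;
}
```

For example, thread (3, 5) of block (1, 2) with 16x16 blocks maps to global coordinate (19, 37), which is element 2387 of a matrix that is 64 elements wide.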




- Multiply two vectors of length 8192
  - Code that works over all elements is the grid
  - Thread blocks break this down into manageable sizes
    - 512 threads per block
  - SIMD instruction executes 32 elements at a time
  - Thus grid size = 16 blocks
  - Block is analogous to a strip-mined vector loop with vector length of 32
  - Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  - Current-generation GPUs (Fermi) have 7-16 multithreaded SIMD processors
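
The sizing in this example follows from two ceiling divisions. A minimal sketch in plain C, with hypothetical function names and the warp size of 32 taken from the slides:

```c
#include <assert.h>

enum { WARP_SIZE = 32 };  /* SIMD granularity stated in the slides */

/* Number of thread blocks needed to cover n elements (ceiling division). */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Number of warps (threads of SIMD instructions) inside one block. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

For the 8192-element vectors, `blocks_for(8192, 512)` gives the 16 blocks of the slide, and `warps_per_block(512)` shows that each block holds 16 warps of 32 threads.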


```c
// Standard C code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

// Parallel C code
__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
```
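
To see why the parallel SAXPY kernel needs the `if (i < n)` guard, the launch can be emulated on the host in plain C. This is only a sequential sketch, not how a GPU executes the kernel: the CUDA built-ins are passed as ordinary parameters, and the names `saxpy_thread` and `saxpy_emulated_launch` are hypothetical.

```c
#include <assert.h>

/* Body of one "thread": same logic as the parallel kernel, with the
   CUDA built-ins passed in explicitly (hypothetical parameter names). */
void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;
    if (i < n)                     /* guard: the last block may overshoot n */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for saxpy_parallel<<<nblocks, block_dim>>>(...). */
void saxpy_emulated_launch(int block_dim, int n, float a,
                           const float *x, float *y)
{
    assert(block_dim > 0 && n >= 0);
    int nblocks = (n + block_dim - 1) / block_dim;
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
}
```

With `n = 5` and 4 threads per block, two blocks are launched and threads 5 to 7 fall outside the arrays; the guard turns them into no-ops, so the result matches the serial loop.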

### Terminology (and in NVidia)


- Threads of SIMD instructions (warps)
  - Each has its own IP (up to 48/64 per SIMD processor, Fermi/Kepler)
  - Thread scheduler uses scoreboard to dispatch
  - No data dependencies between threads!
  - Threads are organized into blocks & executed in groups of 32 threads (thread block)
    - Blocks are organized into a grid
- The thread block scheduler schedules blocks to SIMD processors (Streaming Multiprocessors)
- Within each SIMD processor:
  - 32 SIMD lanes (thread processors)
  - Wide and shallow compared to vector processors


### **CUDA Thread Block**


- Programmer declares (Thread) Block:
  - Block size 1 to 512 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a Block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread id numbers within Block
- Thread program uses thread id to select work and address shared data
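
A thread program typically flattens its 1D, 2D or 3D thread id into a single number before addressing shared data. The sketch below assumes the usual ordering in which x varies fastest, then y, then z, and uses the 512-thread limit quoted in this slide; the function names are hypothetical.

```c
#include <assert.h>

enum { MAX_THREADS_PER_BLOCK = 512 };  /* limit quoted in the slide */

/* Returns 1 if a (dx, dy, dz) block shape respects the limit, else 0. */
int block_shape_ok(int dx, int dy, int dz)
{
    if (dx < 1 || dy < 1 || dz < 1)
        return 0;
    return (long)dx * dy * dz <= MAX_THREADS_PER_BLOCK;
}

/* Linear id of thread (x, y, z) inside a block whose extent is dx by dy
   in the first two dimensions: x varies fastest, then y, then z. */
int linear_thread_id(int x, int y, int z, int dx, int dy)
{
    assert(dx > 0 && dy > 0);
    return x + y * dx + z * dx * dy;
}
```

An 8x8x8 block is accepted (exactly 512 threads) while 16x16x4 is rejected; inside the 8x8x8 block, thread (3, 2, 1) has linear id 83.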


### **Parallel Memory Sharing**



### **CUDA Memory Model Overview**



# Hardware Implementation: Memory Architecture


- Device memory (DRAM)
  - Slow (200-300 cycles)
  - Local, global, constant, and texture memory
- On-chip memory
  - Fast (1 cycle)
  - Registers, shared memory, constant/texture cache



Courtesy NVIDIA


### **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - PU (Processing Unit) cores with wider vector units
    - x86 many-core: Intel MIC / Xeon KNL
    - other many-core: IBM Power BlueGene/Q Compute, ShenWay 260
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., **Xeon KNC** with PCI-Express, **PEZY-SC**)
    - ISA-free architectures, code compiled to silicon: FPGA
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - focus on tensor/neural nets cores: NVidia, IBM, Intel NNP, Google TPU
  - heterogeneous PUs in a SoC: multicore PUs with GPU-cores

• ...

### Machine learning w/ neural nets & deep learning...



Key algorithms to train & classify use matrix products, but require lower precision numbers!


# NVidia Volta Architecture: the new Tensor Cores



Figure 8. Tensor Core 4x4 Matrix Multiply and Accumulate



Figure 9. Mixed Precision Multiply and Accumulate in Tensor Core

http://www.nvidia.com/content/gated-pdfs/Volta-Architecture-Whitepaper-v1.1.pdf

For each SM: 8 Tensor Cores x 64 FMA ops/cycle = 1024 floating-point ops/cycle (about 1k FLOPs/cycle!)
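
The operation in Figure 8, D = A x B + C on 4x4 matrices, can be written as a small reference routine in plain C. This is a sketch of the arithmetic only: on Volta the inputs A and B are FP16 with FP16 or FP32 accumulation, whereas plain `float` is used here for portability, and the function names are hypothetical.

```c
#include <assert.h>

enum { TILE = 4 };

/* D = A x B + C on row-major 4x4 tiles stored as 16 floats each. */
void mma_4x4(const float *A, const float *B, const float *C, float *D)
{
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j) {
            float acc = C[i * TILE + j];
            for (int k = 0; k < TILE; ++k)
                acc += A[i * TILE + k] * B[k * TILE + j];  /* one FMA */
            D[i * TILE + j] = acc;      /* 4*4*4 = 64 FMAs per tile */
        }
}

/* Floating-point operations per clock for one SM: each FMA counts as 2. */
int tensor_flops_per_cycle_per_sm(int tensor_cores, int fma_per_core)
{
    assert(tensor_cores > 0 && fma_per_core > 0);
    return tensor_cores * fma_per_core * 2;
}
```

`tensor_flops_per_cycle_per_sm(8, 64)` gives 1024; multiplied by the 80 SMs and the 1530 MHz boost clock of the V100 this is roughly 125 TFLOP/s, in line with the tensor peak quoted for that GPU.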

# NVidia competitors with neural net features: IBM TrueNorth chip array (August'2014)



# NVidia competitors with neural net features: the IBM TrueNorth architecture



### NVidia competitors with neural net features: Intel Nervana Neural Network Processor, NNP


### History

- Nervana Engine announced in May'16
- Key features:
  - ASIC chip, focused on matrix multiplication, convolutions, ... (for neural nets)
  - HBM2: 4x 8GB in-package storage & 1TB/sec memory access b/w
  - no h/w managed cache hierarchy (saves die area, higher compute density)
  - built-in networking (6 bi-directional high-b/w links)
  - separate pipelines for computation and data management
  - proprietary numeric format Flexpoint, in-between floating point and fixed point precision
- Nervana acquired by Intel in August 2…
  - renamed the project to "Lake Cre…
  - later to Nervana NNP, launched in Oct…
- Loihi test chip w/ self-learning capab… announced in Sept'17, to be launched in …




https://www.top500.org/news/intel-will-ship-


### NVidia competitors with neural net features: Google Tensor Processing Unit, TPU (April'17)


- The Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
- 700 MHz clock rate
- Peak: 92T operations/second
  - 65,536 \* 2 \* 700M
- >25X as many MACs vs GPU
- >100X as many MACs vs CPU
- 4 MiB of on-chip Accumulator memory
- 24 MiB of on-chip Unified Buffer, (activation memory)
- 3.5X as much on-chip memory vs GPU
- Two 2133MHz DDR3 DRAM channels
- 8 GiB of off-chip weight DRAM memory
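
The peak figure in the list above and the behaviour of a single multiply-accumulate cell can be sketched in plain C. The signed 8-bit operands and the function names are assumptions made for illustration; the 32-bit accumulator width follows the accumulator size given in the chip floor plan.

```c
#include <assert.h>
#include <stdint.h>

/* One multiply-accumulate cell: 8-bit operands, 32-bit accumulator. */
int32_t mac8(int32_t acc, int8_t weight, int8_t activation)
{
    return acc + (int32_t)weight * (int32_t)activation;
}

/* Dot product of n 8-bit values, accumulated in 32 bits. */
int32_t dot8(const int8_t *w, const int8_t *a, int n)
{
    assert(n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc = mac8(acc, w[i], a[i]);
    return acc;
}

/* Peak operations/second: each MAC counts as 2 operations per clock. */
double peak_ops_per_second(long macs, double clock_hz)
{
    return (double)macs * 2.0 * clock_hz;
}
```

`peak_ops_per_second(65536, 700e6)` evaluates to about 9.18e13, i.e. the 92T operations/second of the slide. The wide accumulator matters because a single 8-bit product can already reach 16129, far beyond the 8-bit range.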

# TPU: High-level Chip Architecture



### NVidia competitors with neural net features: Google Tensor Processing Unit, TPU (April'17)

TPU: a Neural Network Accelerator Chip

Chip floor plan (figure), with the share of chip area where the slide states it:

- Unified Buffer for Local Activations (96Kx256x8b = 24 MiB): 29%
- Matrix Multiply Unit (256x256x8b = 64K MAC)
- Accumulators (4Kx256x32b = 4 MiB): 6%
- Activation Pipeline: 6%
- Control: 2%
- Host Interface: 2%
- DDR3 DRAM ports: 3%
- PCIe Interface: 3%
- Misc. I/O: 1%

TPUs are intensively used by Google, namely in RankBrain, StreetView & Google Translate


https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processir

# NVidia competitors with neural net features: Google TPUv2 (September'17)






- 16 GB of HBM
- 600 GB/s mem BW
- Scalar unit: 32b float
- MXU: 32b float accumulation but reduced precision for multipliers
- 45 TFLOPS



### **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - PU (Processing Unit) cores with wider vector units
    - x86 many-core: Intel MIC / Xeon KNL
    - other many-core: IBM Power BlueGene/Q Compute, ShenWay 260
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., **Xeon KNC** with PCI-Express, **PEZY-SC**)
    - ISA-free architectures, code compiled to silicon: FPGA
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - focus on tensor/neural nets cores: NVidia, IBM, Intel NNP, Google TPU
  - heterogeneous PUs in a SoC: multicore PUs with GPU-cores
    - x86 multicore coupled with SIMT/SIMD cores: Intel i5/i7
    - ARMv8 cores coupled with SIMT/SIMD cores: NVidia Tegra


### Intel multicore coupled with GPU-cores



### **NVidia Tegra: SoC partnership with ARM** (1)


- Tegra 2 in Android (2010) ...
- Tegra 3 in Audi infotainment (2012) ...



Tegra 3 Nov'2011


Tegra 4: replace the 32-bit ARM Cortex A9 by Cortex A15, and add 72 CUDA-cores



Tegra 4 May'2013

### NVidia Tegra: SoC partnership with ARM (2)


Replace the GPU block by 192 GPU-cores (from Kepler) and offer either 32/64-bit CPU cores => **Tegra K1** 




### **NVidia Tegra: SoC partnership with ARM** (3)


Replace the 5x 32-bit ARM by 2x4 32-bit Cortex (A57 & A53) and the 192 Kepler CUDA cores by 256 Maxwell => Tegra X1




### **NVidia Tegra: pathway towards ARM-64** (1)


Upgrade 32-bit ARM to 64-bit ARM (*Denver 2 & A57*) and replace Maxwell cores by Pascal ones => Parker Aug'2016



## "PARKER" CPU COMPLEX

- 2x Denver2 + 4x Cortex-A57 Fully Coherent HMP system
  - Proprietary Coherent Interconnect
- ARM V8 64-bit
- Highest performance ARM CPU
  - 2nd generation Denver core
  - Significant Perf/W improvements
- Dynamic Code Optimization
  - OoO execution without the power
  - Optimize once, use many times
- 7-wide superscalar
- Low power retention states


### NVidia Tegra: pathway towards ARM-64 (2)


 Increment #ARMv8-cores (custom architecture) and replace Pascal cores by Volta (w/ tensor cores) => Xavier Jan'2018?



| NVIDIA ARM SoCs       | Xavier                    | Parker                                   | Erista (Tegra X1)                        |
|-----------------------|---------------------------|------------------------------------------|------------------------------------------|
| CPU                   | 8x NVIDIA Custom ARM      | 2x NVIDIA Denver + 4x ARM Cortex-A57     | 4x ARM Cortex-A57 + 4x ARM Cortex-A53    |
| GPU                   | Volta, 512 CUDA Cores     | Pascal, 256 CUDA Cores                   | Maxwell, 256 CUDA Cores                  |
| Memory                | ?                         | LPDDR4, 128-bit Bus                      | LPDDR3, 64-bit Bus                       |
| Video Processing      | 7680x4320 Encode & Decode | 3840x2160p60 Decode, 3840x2160p60 Encode | 3840x2160p60 Decode, 3840x2160p30 Encode |
| Transistors           | 7B                        | ?                                        | ?                                        |
| Manufacturing Process | TSMC 16nm FinFET+         | TSMC 16nm FinFET+                        | TSMC 20nm Planar                         |




# Past: processor family distribution of all systems






### Processor generations in November'17




# Accelerator family distribution over all systems Nov'17

