# Master Informatics Eng.

# 2015/16 *A.J.Proença*

### Data Parallelism 2 (Cell BE, FPGA, MIC, GPU, ...) (most slides are borrowed)

AJProença, Advanced Architectures, MEI, UMinho, 2015/16

# **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - CPU cores with wider vectors and/or SIMD cores:
    - DSP VLIW cores with vector capabilities: Texas Instruments (...?)
    - PPC cores coupled with SIMD cores: Cell Broadband Engine (past...)
    - ARM64 cores coupled with SIMD cores: from Tegra to Denver (NVidia) (...?)
    - x86 many-core: Intel MIC / Xeon Phi / Knights C/L, AMD FirePro...
  - devices requiring a host scalar processor: accelerator devices
    - typically on disjoint physical memories (e.g., MIC through PCI-Express)
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - ISA-free architectures, code compiled to silicon: FPGA

# Texas Instruments: Keystone DSP architecture


# Cell Broadband Engine (PPE)



# Cell Broadband Engine (SPE)



# Cell Broadband Engine (EIB)



# Cell Broadband Engine (chip)



# NVidia: pathway towards ARM-64 (1)



## Tegra 3


## Tegra 4

# NVidia: pathway towards ARM-64 (2)

- Replace the GPU block with 192 GPU cores (from Kepler) and offer a choice of 32-bit or 64-bit CPU => Tegra K1




# NVidia: pathway towards ARM-64 (3)

- Keep both 32-bit ARM and 64-bit ARM (Denver) and replace the Kepler cores with Maxwell cores => Parker



# What is an FPGA

### Field-Programmable Gate Arrays (FPGA)

A fabric with 1000s of simple configurable logic cells with LUTs, on-chip SRAM, configurable routing and I/O cells



# FPGA as a multiple configurable ISA



# Intel MIC: Many Integrated Core

Many Processing Cores

Wide-Vector Units

Multi-Threading

# From:

- Larrabee (80-core GPU)
- SCC (Single-chip Cloud Computer, 24x dual-core tiles)

# to MIC:

- Knights Ferry (pre-production)
- Knights Corner (Xeon Phi co-processors, up to 61 Pentium-based cores)
- Knights Landing (next generation, with 72 64-bit Atom-based cores)


# Intel Knights Corner architecture



# The new Knights Landing architecture




# **Graphical Processing Units**

- Question to GPU architects:
  - Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
- Key ideas:
  - Heterogeneous execution model
    - CPU is the *host*, GPU is the *device*
  - Develop a C-like programming language for GPU
  - Unify all forms of GPU parallelism as CUDA threads
  - Programming model follows SIMT:
     *"Single Instruction Multiple Thread"*
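
The host/device split and the SIMT model above can be sketched on the CPU in plain Python. This is only an emulation under stated assumptions: `saxpy_kernel` and `launch` are hypothetical names, not CUDA API calls, and the threads run one after another here, whereas a GPU would run them in parallel.

```python
# Minimal CPU-side emulation of SIMT: every logical thread runs the same
# kernel body, and only its block/thread indices differ.
# All names are illustrative; none of this is the CUDA API.

def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Kernel body executed by one logical thread: y[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:                               # guard: the grid may overshoot n
        y[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Host-side 'launch': visit every (block, thread) pair sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y)
print(y)
```

The grid is rounded up to a whole number of blocks, so the last block contains threads with no element to process; the `i < n` guard keeps them idle.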



Copyright © 2012, Elsevier Inc. All rights reserved.

# **Classifying GPUs**

- Don't fit nicely into SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general purpose code with care

|                            | Static: Discovered at Compile Time | Dynamic: Discovered at Runtime |
|----------------------------|------------------------------------|--------------------------------|
| Instruction-Level Parallelism | VLIW                            | Superscalar                    |
| Data-Level Parallelism     | SIMD or Vector                     | GPU device                     |
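
The cost of conditional execution can be illustrated with a small emulation. It assumes, as a simplification, that a group of 32 threads shares one instruction stream and that each side of a branch is issued once whenever at least one thread needs it, with the other threads masked off. The function name and the example branch are hypothetical.

```python
# Sketch of branch divergence under SIMT: both sides of a branch are issued
# in turn, each under a mask, when the lanes of a warp disagree.
WARP_SIZE = 32

def run_warp_with_branch(values):
    """Apply `v * 2 if v is even else v + 1` to one warp of values.
    Returns (results, passes), where passes counts issued branch sides."""
    assert len(values) == WARP_SIZE
    mask_then = [v % 2 == 0 for v in values]  # lanes taking the 'then' side
    mask_else = [not m for m in mask_then]    # lanes taking the 'else' side
    results = list(values)
    passes = 0
    if any(mask_then):                        # issue 'then' side once
        passes += 1
        for lane in range(WARP_SIZE):
            if mask_then[lane]:
                results[lane] = values[lane] * 2
    if any(mask_else):                        # issue 'else' side once
        passes += 1
        for lane in range(WARP_SIZE):
            if mask_else[lane]:
                results[lane] = values[lane] + 1
    return results, passes

uniform = [2 * i for i in range(WARP_SIZE)]   # every lane takes the same side
divergent = list(range(WARP_SIZE))            # lanes alternate between sides
_, uniform_passes = run_warp_with_branch(uniform)
divergent_results, divergent_passes = run_warp_with_branch(divergent)
print(uniform_passes, divergent_passes)
```

Each thread still gets its own result, which is the illusion of MIMD, but the divergent warp needs two passes where the uniform one needs a single pass.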

# Performance gap between NVidia GPUs and Intel CPUs



# Performance gap between several computing devices (SP)



http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

# Number of cores/processing elements in several devices



# **NVIDIA GPU Architecture**

- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files

- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor





# The GPU as a compute device: the G80

# NVidia GPU structure & scalability




# The NVidia Fermi architecture



# **Fermi Architecture Innovations**

- Each SIMD processor has
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
- Fast double precision
- Caches for GPU memory
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions
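
The chime figure quoted above follows from the SIMD width and the lane count. The sketch below redoes that arithmetic, assuming each scheduler feeds its own group of 16 lanes; the function and variable names are illustrative only.

```python
import math

def chime_cycles(simd_width, lanes):
    """Clock cycles needed to push one SIMD instruction through the lanes."""
    return math.ceil(simd_width / lanes)

# Figures taken from the slide: width 32, 16 lanes, two thread schedulers.
fermi_chime = chime_cycles(simd_width=32, lanes=16)
schedulers = 2

# Assumption: each scheduler issues one SIMD instruction per chime.
lane_ops_per_cycle = schedulers * 32 // fermi_chime
print(fermi_chime, lane_ops_per_cycle)
```

With these numbers a 32-wide instruction occupies 16 lanes for 2 cycles, and two schedulers together sustain 32 lane operations per clock cycle.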



# Families in NVidia GPU

| GPU                   | G80         | GT200                | Fermi                 |
|-----------------------|-------------|----------------------|-----------------------|
| Transistors           | 681 million | 1.4 billion          | 3.0 billion           |
| CUDA Cores            | 128         | 240                  | 512                   |
| Double-Precision      | None        | 30 FMA ops per clock | 256 FMA ops per clock |
| ECC Memory Protection | No          | No                   | Yes                   |
| Concurrent Kernels    | No          | No                   | Up to 16              |

| GPU                      | GT200 (Tesla)       | GF110 (Fermi)       | GK104 (Kepler)       |
|--------------------------|---------------------|---------------------|----------------------|
| Transistors              | 1.4 billion         | 3.0 billion         | 3.54 billion         |
| CUDA Cores               | 240                 | 512                 | 1536                 |
| Graphics Core Clock      | 648 MHz             | 772 MHz             | 1006 MHz             |
| Shader Core Clock        | 1476 MHz            | 1544 MHz            | n/a                  |
| GFLOPs                   | 1063                | 1581                | 3090                 |
| Texture Units            | 80                  | 64                  | 128                  |
| Texel fill-rate          | 51.8 Gigatexels/sec | 49.4 Gigatexels/sec | 128.8 Gigatexels/sec |
| Memory Clock             | 2484 MHz            | 4008 MHz            | 6008 MHz             |
| Memory Bandwidth         | 159 GB/sec          | 192.4 GB/sec        | 192.26 GB/sec        |
| Max # of Active Displays | 2                   | 2                   | 4                    |
| TDP                      | 183W                | 244W                | 195W                 |
# **NVIDIA GPU Memory Structures**

- Each SIMD lane has a private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors is GPU Memory (Global Memory)
  - Host can read and write GPU memory
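
The three memory levels can be pictured with a per-block sum, emulated here on the CPU. It is a sketch under assumptions: `block_sum_kernel` is a hypothetical name, the block size is a power of two, and threads are stepped sequentially, so the barrier a real kernel would need between phases is only marked in a comment.

```python
def block_sum_kernel(global_in, global_out, block_idx, block_dim):
    """Emulate one thread block summing its slice of global memory."""
    shared = [0.0] * block_dim  # Shared Memory: one buffer per thread block
    # Phase 1: each thread copies one element from Global Memory.
    for thread_idx in range(block_dim):
        i = block_idx * block_dim + thread_idx  # private to this thread
        shared[thread_idx] = global_in[i] if i < len(global_in) else 0.0
    # A real kernel would need a block-wide barrier here.
    # Phase 2: tree reduction inside the shared buffer.
    stride = block_dim // 2
    while stride > 0:
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2
    # One thread writes the block result back to Global Memory.
    global_out[block_idx] = shared[0]

data = [float(i) for i in range(16)]  # Global Memory, filled by the host
block_dim = 8                         # assumed power of two
grid_dim = len(data) // block_dim
partial = [0.0] * grid_dim
for block_idx in range(grid_dim):
    block_sum_kernel(data, partial, block_idx, block_dim)
print(partial, sum(partial))
```

Only `data` and `partial` are visible to the host; the `shared` buffer exists per block and the index `i` per thread, mirroring the Global, Shared and Local levels listed above.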



# Fermi: Multithreading and Memory Hierarchy




# From Fermi into Kepler: The Memory Hierarchy



# From Fermi into Kepler: **Compute capabilities**

|                                           | FERMI GF100 | FERMI GF104 | KEPLER GK104    | KEPLER GK110    |
|-------------------------------------------|-------------|-------------|-----------------|-----------------|
| Compute Capability                        | 2.0         | 2.1         | 3.0             | 3.5             |
| Threads / Warp                            | 32          | 32          | 32              | 32              |
| Max Warps / Multiprocessor                | 48          | 48          | 64              | 64              |
| Max Threads / Multiprocessor              | 1536        | 1536        | 2048            | 2048            |
| Max Thread Blocks / Multiprocessor        | 8           | 8           | 16              | 16              |
| 32-bit Registers / Multiprocessor         | 32768       | 32768       | 65536           | 65536           |
| Max Registers / Thread                    | 63          | 63          | 63              | 255             |
| Max Threads / Thread Block                | 1024        | 1024        | 1024            | 1024            |
| Shared Memory Size Configurations (bytes) | 16K / 48K   | 16K / 48K   | 16K / 32K / 48K | 16K / 32K / 48K |
| Max X Grid Dimension                      | 2^16-1      | 2^16-1      | 2^32-1          | 2^32-1          |
| Hyper-Q                                   | No          | No          | No              | Yes             |
| Dynamic Parallelism                       | No          | No          | No              | Yes             |

# **Overview of GK110 Kepler Architecture**



# From Fermi to Kepler core: SM and the SMX Architecture



# Top500: Accelerator distribution over all 500 systems





