## **MSc Informatics Eng.**

#### 2013/14 *A.J.Proença*

#### Data Parallelism 2 (Cell BE, FPGA, MIC, GPU, ...) (most slides are borrowed)

AJProença, Computer Systems & Performance, MEI, UMinho, 2013/14

#### **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - CPU cores with wider vectors and/or SIMD cores:
    - <u>DSP</u> VLIW cores with vector capabilities: **Texas Instruments** (...?)
    - PPC cores coupled with SIMD cores: Cell Broadband Engine (past...)
    - <u>ARM64</u> cores coupled with SIMD cores: project Denver/BSC (NVidia) (...?)
    - <u>x86</u> many-cores: Intel MIC / Xeon Phi / Knights C/L, AMD FirePro...
  - devices with no scalar processor: accelerator devices
    - CPU-cores + accel devices (disjoint physical memories) => PCI-Express
    - focus on SIMT/SIMD to hide memory latency: GPU-type architecture
    - ISA-free architectures, code compiled to silicon: FPGA


#### Texas Instruments: Keystone DSP architecture





#### Cell Broadband Engine (PPE)



#### **Cell Broadband Engine** (SPE)




#### **Cell Broadband Engine** (EIB)



#### Cell Broadband Engine (chip)




#### NVidia: pathway towards ARM-64 (1)

Tegra 4



#### => Tegra 5


#### NVidia: pathway towards ARM-64 (2)

 Replace the 32-bit ARM by a novel 64-bit ARM (*Denver*) and the Kepler SMX by the Maxwell SMX => Tegra 6




#### What is an FPGA

#### Field-Programmable Gate Arrays (FPGA)

A fabric with 1000s of simple configurable logic cells with LUTs, on-chip SRAM, configurable routing and I/O cells



## FPGA as a multiple configurable ISA



#### Intel MIC: Many Integrated Core



#### Intel Knights Corner architecture




### The new Knights Landing architecture






#### The GPU as a compute device: the G80



#### NVidia GPU structure & scalability




#### The NVidia Fermi architecture



#### GT200 and Fermi SIMD processor



#### GPU: NVidia Fermi versus AMD Cayman



## Fermi Architecture Innovations

- Each SIMD processor has
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
- Fast double precision
- Caches for GPU memory
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions
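
The chime quoted for the Fermi SIMD processor follows from the SIMD width and the number of lanes. The sketch below is only an illustration: the function name `chime_cycles` is hypothetical, and it assumes that every lane retires one element per clock cycle.

```python
import math


def chime_cycles(simd_width: int, lanes: int) -> int:
    """Clock cycles needed to push one SIMD instruction through the lanes.

    Assumption: each lane processes one element per clock cycle.
    """
    return math.ceil(simd_width / lanes)


# Fermi figures from the list above: SIMD width = 32, 16 SIMD lanes.
fermi_chime = chime_cycles(simd_width=32, lanes=16)
print(fermi_chime)  # 2
```

With two thread schedulers, each issuing one SIMD instruction per chime, this matches the statement that two threads of SIMD instructions are scheduled every two clock cycles.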



Copyright © 2012, Elsevier Inc. All rights reserved.


## Families in NVidia GPU

| GPU                   | G80         | GT200                | Fermi                 |
|-----------------------|-------------|----------------------|-----------------------|
| Transistors           | 681 million | 1.4 billion          | 3.0 billion           |
| CUDA Cores            | 128         | 240                  | 512                   |
| Double-Precision      | None        | 30 FMA ops per clock | 256 FMA ops per clock |
| ECC Memory Protection | No          | No                   | Yes                   |
| Concurrent Kernels    | No          | No                   | Up to 16              |

| GPU                      | GT200 (Tesla)       | GF110 (Fermi)       | GK104 (Kepler)       |
|--------------------------|---------------------|---------------------|----------------------|
| Transistors              | 1.4 billion         | 3.0 billion         | 3.54 billion         |
| CUDA Cores               | 240                 | 512                 | 1536                 |
| Graphics Core Clock      | 648 MHz             | 772 MHz             | 1006 MHz             |
| Shader Core Clock        | 1476 MHz            | 1544 MHz            | n/a                  |
| GFLOPs                   | 1063                | 1581                | 3090                 |
| Texture Units            | 80                  | 64                  | 128                  |
| Texel fill-rate          | 51.8 Gigatexels/sec | 49.4 Gigatexels/sec | 128.8 Gigatexels/sec |
| Memory Clock             | 2484 MHz            | 4008 MHz            | 6008 MHz             |
| Memory Bandwidth         | 159 GB/sec          | 192.4 GB/sec        | 192.26 GB/sec        |
| Max # of Active Displays | 2                   | 2                   | 4                    |
| TDP                      | 183 W               | 244 W               | 195 W                |
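
The GFLOPs row can be cross-checked against the core counts and clocks in the same table. This is a rough sketch, not a vendor formula: `peak_gflops` is a hypothetical helper, and the flops-per-core-per-cycle values are assumptions (2 for a fused multiply-add on Fermi and Kepler, 3 for the dual-issued multiply-add plus multiply on GT200). For GK104, whose shader clock is listed as n/a, the sketch assumes the cores run at the graphics core clock.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_core_per_cycle: int) -> float:
    """Peak single-precision GFLOP/s = cores x clock x flops per core per cycle."""
    return cores * clock_mhz * flops_per_core_per_cycle / 1000.0


# Core counts and clocks are taken from the table; the last argument is assumed.
gt200 = peak_gflops(240, 1476, 3)    # assumed MAD + MUL per cycle
gf110 = peak_gflops(512, 1544, 2)    # assumed one FMA per cycle
gk104 = peak_gflops(1536, 1006, 2)   # assumed one FMA per cycle at the core clock

for name, value in [("GT200", gt200), ("GF110", gf110), ("GK104", gk104)]:
    print(f"{name}: {round(value)} GFLOP/s")
```

Under these assumptions the rounded results agree with the 1063, 1581 and 3090 GFLOPs listed in the table.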

## **NVIDIA GPU Memory Structures**

- Each SIMD lane has a private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors is GPU Memory (Global Memory)
  - Host can read and write GPU memory







#### From Fermi into Kepler: The Memory Hierarchy




#### From Fermi into Kepler: Compute capabilities

|                                           | FERMI<br>GF100 | FERMI<br>GF104 | KEPLER<br>GK104 | KEPLER<br>GK110 |
|-------------------------------------------|----------------|----------------|-----------------|-----------------|
| Compute Capability                        | 2.0            | 2.1            | 3.0             | 3.5             |
| Threads / Warp                            | 32             | 32             | 32              | 32              |
| Max Warps / Multiprocessor                | 48             | 48             | 64              | 64              |
| Max Threads / Multiprocessor              | 1536           | 1536           | 2048            | 2048            |
| Max Thread Blocks / Multiprocessor        | 8              | 8              | 16              | 16              |
| 32-bit Registers / Multiprocessor         | 32768          | 32768          | 65536           | 65536           |
| Max Registers / Thread                    | 63             | 63             | 63              | 255             |
| Max Threads / Thread Block                | 1024           | 1024           | 1024            | 1024            |
| Shared Memory Size Configurations (bytes) | 16K / 48K      | 16K / 48K      | 16K / 32K / 48K | 16K / 32K / 48K |
| Max X Grid Dimension                      | 2^16-1         | 2^16-1         | 2^32-1          | 2^32-1          |
| Hyper-Q                                   | No             | No             | No              | Yes             |
| Dynamic Parallelism                       | No             | No             | No              | Yes             |
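
The per-multiprocessor limits in this table bound how many warps can be resident at once. The sketch below estimates that number for a given kernel configuration. It is a simplified model under stated assumptions: the names `LIMITS` and `resident_warps` are hypothetical, and the model ignores shared-memory usage and register allocation granularity, which real hardware also takes into account.

```python
WARP_SIZE = 32  # Threads / Warp, identical in all four columns

# Per-multiprocessor limits copied from the GF100 and GK110 columns.
LIMITS = {
    "GF100": {"max_warps": 48, "max_blocks": 8, "registers": 32768, "max_regs_per_thread": 63},
    "GK110": {"max_warps": 64, "max_blocks": 16, "registers": 65536, "max_regs_per_thread": 255},
}


def resident_warps(gpu: str, threads_per_block: int, regs_per_thread: int) -> int:
    """Estimate the resident warps per multiprocessor (simplified model)."""
    lim = LIMITS[gpu]
    if threads_per_block > 1024:
        raise ValueError("more than 1024 threads per block")
    if regs_per_thread > lim["max_regs_per_thread"]:
        raise ValueError("too many registers per thread for this GPU")
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks_by_warps = lim["max_warps"] // warps_per_block
    blocks_by_regs = lim["registers"] // (regs_per_thread * warps_per_block * WARP_SIZE)
    blocks = min(lim["max_blocks"], blocks_by_warps, blocks_by_regs)
    return blocks * warps_per_block


for gpu in ("GF100", "GK110"):
    warps = resident_warps(gpu, threads_per_block=256, regs_per_thread=32)
    print(gpu, warps, f"{warps / LIMITS[gpu]['max_warps']:.0%}")
```

For a kernel with 256 threads per block and 32 registers per thread, this model is register-limited on GF100 (32 of 48 warps) but reaches the full 64 warps on GK110, thanks to the larger register file.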

### Kepler GK110 Die & Architecture




## **Overview of GK110 Kepler Architecture**




#### From Fermi to Kepler core: SM and the SMX Architecture

Block diagrams of the Fermi SM and the Kepler SMX (figure not reproduced):

- SM (Fermi): instruction cache; 2 warp schedulers, each with 1 dispatch unit; register file (32,768 x 32-bit); Core, LD/ST and SFU units; interconnect network; 64 KB shared memory / L1 cache; uniform cache
- SMX (Kepler): instruction cache; 4 warp schedulers, each with 2 dispatch units; register file (65,536 x 32-bit); Core, DP Unit, LD/ST and SFU units; interconnect network; 64 KB shared memory / L1 cache; 48 KB read-only data cache; Tex units

### **Top500: Performance from Accelerators**




# **Roofline Performance Model**

- Basic idea:
  - Plot peak floating-point throughput as a function of arithmetic intensity
  - Ties together floating-point performance and memory performance for a target machine
- Arithmetic intensity
  - Floating-point operations per byte read
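
The roofline itself is the minimum of the compute ceiling and the memory ceiling. The sketch below evaluates it at a few arithmetic intensities; the function name `attainable_gflops` is hypothetical, and the peak (1581 GFLOP/s) and bandwidth (192.4 GB/s) are the GF110 (Fermi) figures from the families table earlier in these slides, used here only as illustrative single-precision inputs.

```python
def attainable_gflops(arithmetic_intensity: float,
                      peak_gflops: float,
                      bandwidth_gb_s: float) -> float:
    """Roofline bound: min(compute ceiling, bandwidth x arithmetic intensity)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


PEAK = 1581.0      # GFLOP/s, GF110 figure (illustrative input)
BANDWIDTH = 192.4  # GB/s, GF110 figure (illustrative input)

# Arithmetic intensity at which the two ceilings meet (the ridge point).
ridge_point = PEAK / BANDWIDTH

for ai in (0.25, 1.0, 4.0, 16.0):
    bound = attainable_gflops(ai, PEAK, BANDWIDTH)
    regime = "memory-bound" if ai < ridge_point else "compute-bound"
    print(f"AI = {ai:5.2f} FLOP/byte -> {bound:7.1f} GFLOP/s ({regime})")
```

With these inputs the ridge point is a little above 8 FLOP/byte, so kernels below that intensity are limited by memory bandwidth rather than by floating-point throughput.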








