### **MSc Informatics Eng.**

A.J.Proença

### Data Parallelism 2 (Cell BE, FPGA, GPU, MIC, ...)

(most slides are borrowed)

AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13

### **Beyond Vector/SIMD architectures**

- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - CPU cores with wider vectors and/or SIMD cores:
    - DSP VLIW cores with vector capabilities: Texas Instruments
    - PPC cores coupled with SIMD cores: Cell Broadband Engine
    - ARM64 cores coupled with SIMD cores: project Denver/BSC (NVidia)
    - upcoming x86 many-cores: Intel MIC, AMD FirePro...
  - devices with no scalar processor: accelerator devices
    - ISA-free architectures, code compiled to silicon: FPGA
    - CPU-cores + accel devices (disjoint physical memories) => PCI-Express
    - focus on SIMT/SIMD to hide memory latency: GPU-type architecture


### Texas Instruments: Keystone DSP architecture






### Cell Broadband Engine (PPE)



- Heterogeneous multicore processor
- 1 x Power Processor Element (PPE)
  - 64-bit Power-architecture-compliant processor
  - Dual-issue, in-order execution, 2-way SMT processor
  - PowerPC Processor Unit (PPU)
    - 32 KB L1 IC, 32 KB L1 DC, VMX unit
  - PowerPC Processor Storage Subsystem (PPSS)
    - 512 KB L2 Cache
  - General-purpose processor to run OS and control-intensive code
  - Coordinates the tasks performed by the remaining cores

Meeting on Parallel Routine Optimization and Applications, May 26-27, 2008




### Cell Broadband Engine (SPE)

### **Cell Broadband Engine** (EIB)





### Cell Broadband Engine (chip)




### NVidia: Project Denver

- Pick a successful SoC: Tegra 3
- Replace the 32-bit ARM Cortex-A9 cores with 64-bit ARM cores
- Add some Fermi SIMT cores into the same chip?...

### Intel: Many Integrated Core




### Intel MIC architecture




### What is an FPGA


### Field-Programmable Gate Arrays (FPGA)

A fabric with 1000s of simple configurable logic cells with LUTs, on-chip SRAM, configurable routing and I/O cells
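
To make the LUT idea concrete, the following is a minimal Python sketch of a k-input lookup table. The configuration bits are simply the truth table of the desired function, so the same cell can be reconfigured to compute any Boolean function of its inputs. The names `make_lut`, `xor3` and `maj3` are illustrative only and do not correspond to any vendor tool or primitive.

```python
def make_lut(config_bits):
    """Return a function that emulates a k-input LUT.

    config_bits[i] is the output for the input combination whose
    binary encoding is i (input 0 is the least significant bit).
    """
    k = (len(config_bits) - 1).bit_length()
    if len(config_bits) != 1 << k:
        raise ValueError("a k-input LUT needs exactly 2**k configuration bits")

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        # The inputs form the address into the truth table.
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return config_bits[index]

    return lut


# Two configurations of the same 3-input cell (hypothetical examples):
xor3 = make_lut([0, 1, 1, 0, 1, 0, 0, 1])  # parity of the three inputs
maj3 = make_lut([0, 0, 0, 1, 0, 1, 1, 1])  # majority of the three inputs
```

Taken together, `xor3` and `maj3` are the sum and carry outputs of a one-bit full adder, which hints at how arithmetic can be built from a handful of configured cells plus routing.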



### FPGA as a multiple configurable ISA



### The GPU as a compute device: the G80




### The CUDA programming model

- CUDA stands for Compute Unified Device Architecture
- CUDA is a recent programming model, designed for
  - Manycore architectures
  - Wide SIMD parallelism
  - Scalability
- CUDA provides:
  - A thread abstraction to deal with SIMD
  - Synchronization & data sharing between small groups of threads
- CUDA programs are written in C with extensions
- OpenCL is inspired by CUDA, but is hardware- and software-vendor neutral
  - Programming model essentially identical

### CUDA Devices and Threads

- A compute device
  - Is a coprocessor to the CPU or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device

- Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - Very little creation overhead, requires LARGE register bank
  - GPU needs 1000s of threads for full efficiency
    - Multi-core CPU needs only a few

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign

### CUDA basic model: Single-Program Multiple-Data (SPMD)


- A CUDA program is an integrated CPU + GPU application written in C
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on the GPU, in thread blocks
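
The split between serial host code and a kernel executed by every thread can be sketched without a GPU. The Python model below is only an illustration under stated assumptions: `launch` and `scale_kernel` are hypothetical names, the loops run sequentially where a real device would run threads in parallel, and only the 1D indexing scheme (global index = block index × block size + thread index) is meant to mirror CUDA.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1D kernel launch.

    Every (block, thread) pair runs the same kernel function, which is
    the SPMD idea; a real GPU would run these instances in parallel.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, n, a, x, y):
    """Kernel body: each thread handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < n:  # guard: the last block may be only partially used
        y[i] = a * x[i]


# Serial "host" part: set up data, choose the launch shape, launch.
n = 10
x = [float(i) for i in range(n)]
y = [0.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
launch(scale_kernel, grid_dim, block_dim, n, 2.0, x, y)
```

The bounds check inside the kernel matters whenever `n` is not a multiple of the block size, as here: three blocks of four threads cover ten elements, and the two surplus threads do nothing.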



### Programming Model: SPMD + SIMT/SIMD

- Hierarchy
  - Device => Grids
  - Grid => Blocks
  - Block => Warps
  - Warp => Threads
- Single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed in lock-step, called single-instruction multiple-thread (SIMT)
- A single instruction is executed on multiple threads (SIMD)
  - Warp size defines SIMD granularity (32 threads)
- Synchronization within a block using shared memory
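
The hierarchy can be read as successive integer divisions of a thread index. The sketch below assumes a 1D launch and that warps are formed from consecutive thread IDs within a block; `locate` and `warps_per_block` are hypothetical helper names introduced here for illustration.

```python
WARP_SIZE = 32  # SIMD granularity given in the slide


def locate(global_tid, block_dim):
    """Map a global 1D thread index to (block, warp in block, lane in warp)."""
    block, tid_in_block = divmod(global_tid, block_dim)
    warp, lane = divmod(tid_in_block, WARP_SIZE)
    return block, warp, lane


def warps_per_block(block_dim):
    """Number of warps needed for one block; the last warp may be partly idle."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE
```

For example, with 256 threads per block, global thread 300 is thread 44 of block 1, which places it in lane 12 of warp 1. A block of 100 threads still occupies 4 warps, with 28 lanes of the last warp unused.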




### The Computational Grid: Block IDs and Thread IDs




### Terminology (and in NVidia)

- Threads of SIMD instructions (warps)
  - Each has its own PC (up to 48/64 per SIMD processor, Fermi/Kepler)
  - Thread scheduler uses scoreboard to dispatch
  - No data dependencies between threads!
  - Threads are organized into blocks (*thread blocks*) & executed in groups of 32 threads (warps)
    - Blocks are organized into a grid
- The <u>thread block scheduler</u> schedules blocks to SIMD processors (*Streaming Multiprocessors*)
- Within each SIMD processor:
  - 32 SIMD lanes (thread processors)
  - Wide and shallow compared to vector processors




### CUDA Thread Block

Thread IDs: 0 1 2 3 4 5 6 7

float x = input[threadID];

float y = func(x);

output[threadID] = y;

- Programmer declares (Thread) Block:
  - Block size 1 to 512 concurrent threads (up to 1024 on Fermi and Kepler)
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a Block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread id numbers within Block
- Thread program uses thread id to select work and address shared data
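
For 2D and 3D block shapes, the thread id used to select work is obtained by linearizing the thread's coordinates, with x varying fastest: id = x + y·Dx + z·Dx·Dy. The Python sketch below models this and replays the slide's `func(x)` fragment over one block. `flat_thread_id` and `run_block` are hypothetical names, and the loop is sequential where the hardware would run the threads concurrently.

```python
def flat_thread_id(idx, block_dim):
    """Linearize a thread index (x, y, z) inside a block of shape (Dx, Dy, Dz)."""
    x, y, z = idx
    dx, dy, dz = block_dim
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread index outside block dimensions")
    return x + y * dx + z * dx * dy


def run_block(block_dim, func, inp):
    """Emulate one thread block: each thread reads, computes and writes one element."""
    dx, dy, dz = block_dim
    out = [None] * (dx * dy * dz)
    for z in range(dz):
        for y in range(dy):
            for x in range(dx):
                tid = flat_thread_id((x, y, z), block_dim)
                out[tid] = func(inp[tid])  # same program, different data
    return out
```

A 1D block is simply the case `block_dim = (Dx, 1, 1)`, where the thread id equals the x coordinate.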


### CUDA Memory Model Overview


- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memories



### Parallel Memory Sharing



### Hardware Implementation: Memory Architecture


- Device memory (DRAM)
  - Slow (200-300 cycles)
  - Local, global, constant, and texture memory
- On-chip memory
  - Fast (1 cycle)
  - Registers, shared memory, constant/texture cache




### **NVIDIA GPU Memory Structures**

- Each SIMD Lane has private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors is GPU Memory (Global Memory)
  - Host can read and write GPU memory



Copyright © 2012, Elsevier Inc. All rights reserved.

### Families in NVidia GPU

| GPU | G80 | GT200 | Fermi |
|---|---|---|---|
| Transistors | 681 million | 1.4 billion | 3.0 billion |
| CUDA Cores | 128 | 240 | 512 |
| Double-Precision Floating Point | None | 30 FMA ops per clock | 256 FMA ops per clock |
| Single-Precision Floating Point | 128 MADD ops per clock | 240 MADD ops per clock | 512 FMA ops per clock |
| Warp Schedulers per Streaming Multiprocessor (SM) | 1 | 1 | 2 |
| Special Function Units per SM | 2 | 2 | 4 |
| Shared Memory per SM | 16 KB | 16 KB | Configurable 48 KB or 16 KB |
| L1 Cache per SM | None | None | Configurable 16 KB or 48 KB |
| L2 Cache | None | None | 768 KB |
| ECC Memory Protection | No | No | Yes |
| Concurrent Kernels | No | No | Up to 16 |




### NVidia GPU structure & scalability


### The NVidia Fermi architecture




### GT200 and Fermi SIMD processor



### Fermi: Multithreading and Memory Hierarchy




### **Fermi Architecture Innovations**

- Each SIMD processor has
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
- Fast double precision
- Caches for GPU memory
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions
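
The chime figure follows directly from the SIMD width and the lane count. The short Python sketch below reproduces that arithmetic using only the numbers on this slide; `chime_cycles` and `simd_instructions_per_cycle` are hypothetical helper names, and the model deliberately ignores stalls, memory latency and special-function throughput.

```python
def chime_cycles(simd_width, lanes):
    """Cycles needed for one SIMD instruction to pass all its elements through the lanes."""
    return -(-simd_width // lanes)  # ceiling division


def simd_instructions_per_cycle(schedulers, simd_width, lanes):
    """Idealized issue rate when each scheduler issues one SIMD instruction per chime."""
    return schedulers / chime_cycles(simd_width, lanes)


fermi_chime = chime_cycles(simd_width=32, lanes=16)
fermi_rate = simd_instructions_per_cycle(schedulers=2, simd_width=32, lanes=16)
```

With 32-wide SIMD instructions on 16 lanes the chime is 2 cycles, and two schedulers each issuing once per chime give the "two threads of SIMD instructions every two clock cycles" stated above, that is, one SIMD instruction per cycle on average in this idealized model.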


### From Fermi into Kepler: The Memory Hierarchy



### From Fermi into Kepler: **Compute capabilities**

|  | FERMI GF100 | FERMI GF104 | KEPLER GK104 | KEPLER GK110 |
|---|---|---|---|---|
| Compute Capability | 2.0 | 2.1 | 3.0 | 3.5 |
| Threads / Warp | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 48 | 48 | 64 | 64 |
| Max Threads / Multiprocessor | 1536 | 1536 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 8 | 8 | 16 | 16 |
| 32-bit Registers / Multiprocessor | 32768 | 32768 | 65536 | 65536 |
| Max Registers / Thread | 63 | 63 | 63 | 255 |
| Max Threads / Thread Block | 1024 | 1024 | 1024 | 1024 |
| Shared Memory Size Configurations (bytes) | 16K, 48K | 16K, 48K | 16K, 32K, 48K | 16K, 32K, 48K |
| Max X Grid Dimension | 2^16-1 | 2^16-1 | 2^32-1 | 2^32-1 |
| Hyper-Q | No | No | No | Yes |
| Dynamic Parallelism | No | No | No | Yes |
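
The limits in this table interact: the number of blocks that can be resident on one multiprocessor is bounded by the block limit, the warp limit and the register file at the same time. The Python sketch below is a deliberately simplified estimate that uses only the GF100 and GK110 columns. It ignores shared-memory usage and the hardware's register allocation granularity, so real occupancy can be lower. `LIMITS`, `resident_blocks` and `occupancy` are hypothetical names.

```python
WARP_SIZE = 32

# Per-multiprocessor limits copied from the table (GF100 and GK110 columns).
LIMITS = {
    "GF100": {"max_warps": 48, "max_blocks": 8, "registers": 32768,
              "max_regs_per_thread": 63, "max_threads_per_block": 1024},
    "GK110": {"max_warps": 64, "max_blocks": 16, "registers": 65536,
              "max_regs_per_thread": 255, "max_threads_per_block": 1024},
}


def resident_blocks(arch, threads_per_block, regs_per_thread):
    """Simplified count of blocks that fit on one multiprocessor at once."""
    lim = LIMITS[arch]
    if not 1 <= threads_per_block <= lim["max_threads_per_block"]:
        raise ValueError("invalid block size")
    if not 1 <= regs_per_thread <= lim["max_regs_per_thread"]:
        raise ValueError("invalid register count per thread")
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    by_warps = lim["max_warps"] // warps
    # Assumption: registers are charged per full warp, with no extra rounding.
    by_regs = lim["registers"] // (regs_per_thread * warps * WARP_SIZE)
    return min(lim["max_blocks"], by_warps, by_regs)


def occupancy(arch, threads_per_block, regs_per_thread):
    """Resident warps divided by the maximum warps per multiprocessor."""
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    blocks = resident_blocks(arch, threads_per_block, regs_per_thread)
    return blocks * warps / LIMITS[arch]["max_warps"]
```

Under these assumptions, a kernel with 256 threads per block and 32 registers per thread fits 4 blocks on GF100 (register-limited, two thirds of the warp slots) but 8 blocks on GK110 (all 64 warp slots), which illustrates what the doubled register file buys.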


### Kepler GK110 Die & Architecture






### **Overview of GK110 Kepler Architecture**




### Example

- Multiply two vectors of length 8192
  - Code that works over all elements is the grid
  - Thread blocks break this down into manageable sizes: 512 threads per block
  - SIMD instruction executes 32 elements at a time
  - Thus grid size = 16 blocks
  - Block is analogous to a strip-mined vector loop with vector length of 32
  - Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  - Current-generation GPUs (Fermi) have 7-16 multithreaded SIMD processors
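
The numbers in this example can be checked with a few lines of Python. The sketch below assumes one thread per vector element and, for the last helper, an even spread of blocks over the SIMD processors, which is a simplification of what the thread block scheduler actually does. `launch_shape` and `max_blocks_per_processor` are hypothetical names.

```python
def launch_shape(n, threads_per_block=512, warp_size=32):
    """Return (blocks in grid, warps per block, total warps) for n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    return blocks, warps_per_block, blocks * warps_per_block


def max_blocks_per_processor(blocks, processors):
    """Largest number of blocks any one SIMD processor receives under an even spread."""
    return (blocks + processors - 1) // processors


grid = launch_shape(8192)
```

For 8192 elements this gives 16 blocks of 16 warps each, 256 SIMD threads in total. On a 16-processor Fermi every processor receives one block, while on a 7-processor part some processors must work through three blocks.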








### GPU: NVidia Fermi versus AMD Cayman


