# Master Informatics Eng.

### 2016/17 A.J.Proença

#### Data Parallelism 2 (SIMD++, Intel MIC) (most slides are borrowed)

AJProença, Advanced Architectures, MiEI, UMinho, 2016/17

## **Beyond Vector/SIMD architectures**


- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-coupled access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures
  - CPU cores with wider vectors and/or SIMD cores:
    - <u>DSP</u> VLIW cores with vector capabilities: **Texas Instruments** (...?)
    - <u>PPC</u> cores coupled with SIMD cores: Cell (past...), IBM Power BQC...
    - <u>ARM64</u> cores coupled with SIMD cores: from Tegra to Parker (NVidia) (...?)
    - <u>x86</u> many-core: Intel MIC / Xeon KNL, AMD FirePro...
    - other many-core: ShenWay 260, Adapteva Epiphany-V...
  - coprocessors (require a host scalar processor): accelerator devices
    - on disjoint physical memories (e.g., Xeon KNC with PCI-Express, PEZY-SC)
    - focus on SIMT/SIMD to hide memory latency: GPU-type approach
    - ISA-free architectures, code compiled to silicon: **FPGA**
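
To make the "wider vectors" point concrete, the following minimal sketch counts how many instructions one element-wise operation needs as the number of SIMD lanes grows. The function name and the lane counts (1, 4, 8) are illustrative assumptions, not figures taken from the slides.

```python
import math


def vector_instructions(n_elements: int, lanes: int) -> int:
    """Instructions needed to apply one element-wise operation to
    n_elements values when each instruction processes `lanes` values."""
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    # The last instruction may be only partially filled, hence the ceiling.
    return math.ceil(n_elements / lanes)


# Hypothetical widths for 64-bit elements:
# 1 lane = scalar, 4 lanes = 256-bit vector, 8 lanes = 512-bit vector.
n = 1000
counts = {lanes: vector_instructions(n, lanes) for lanes in (1, 4, 8)}
print(counts)
```

The count drops in proportion to the vector width, which is only a rough proxy for speed-up: memory bandwidth and loop remainders usually limit the real gain.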


# Texas Instruments: Keystone DSP architecture




## IBM Cell Broadband Engine (PPE)



## IBM Cell Broadband Engine (SPE)




## IBM Cell Broadband Engine (EIB)



## IBM Cell Broadband Engine (chip)




### IBM Power BlueGene/Q Compute (chip)




#### Features:

- launched in 2010/11 (TOP500: #1 in Jun'12, #4 in Jun'16)
- **18 cores** (16 compute, 1 OS support, 1 redundant)
  - each 4-way multi-threaded
  - 64-bit Power ISA
  - 1.6 GHz
  - L1 I/D cache = 16 kB/16 kB
  - each core has a Quad FPU (4-wide double-precision SIMD)
- shared L2 cache: 32 MB
- dual memory controller
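
The features above are enough for a back-of-the-envelope peak-performance estimate of one chip. The sketch below multiplies the listed compute cores, clock rate and SIMD width. The factor of 2 FLOPs per lane per cycle (one fused multiply-add) is an assumption, since the slide does not state it, and the function name is hypothetical.

```python
def peak_gflops(cores: int, clock_ghz: float, simd_lanes: int,
                flops_per_lane_per_cycle: int) -> float:
    """Theoretical peak in GFLOP/s: every lane of every core is busy
    on every clock cycle."""
    return cores * clock_ghz * simd_lanes * flops_per_lane_per_cycle


COMPUTE_CORES = 16   # from the slide: 16 of the 18 cores run application code
CLOCK_GHZ = 1.6      # from the slide
SIMD_LANES = 4       # from the slide: 4-wide double-precision Quad FPU
FMA_FLOPS = 2        # ASSUMPTION: one fused multiply-add per lane per cycle

peak = peak_gflops(COMPUTE_CORES, CLOCK_GHZ, SIMD_LANES, FMA_FLOPS)
print(f"{peak:.1f} GFLOP/s per chip")
```

Under that assumption the estimate is 204.8 GFLOP/s per chip; without fused multiply-add it would be half of that.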


### IBM Power BlueGene/Q Compute (Sequoia system)




## NVidia: pathway towards ARM-64 (1)



Figure panels: Tegra 3, Tegra 4

## NVidia: pathway towards ARM-64 (2)

- Replace the GPU block with 192 GPU cores (from Kepler) and keep the 5x 32-bit CPU cores (Cortex A15) => Tegra K1



### NVidia: pathway towards ARM-64 (3)

- Replace the 5x 32-bit ARM cores with 2x4 64-bit Cortex cores (A57 & A53, ARM v8) and the 192 Kepler CUDA cores with 256 Maxwell ones => Tegra X1




### NVidia: pathway towards ARM-64 (4)

- Replace the 4x A53 cores with 2x 64-bit Denver 2 cores (keeping the 4x A57) and the Maxwell CUDA cores with Pascal ones => Parker

### **TEGRA KEY FEATURE EVOLUTION**

|         | TK1                                                                | TX1                                              | "PARKER"                                                                         |
|---------|--------------------------------------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------|
| GPU     | Kepler, 192 CUDA cores                                             | Maxwell, 256 CUDA cores                          | Pascal, 256 CUDA cores                                                           |
| CPU     | 4+1 A15, 2MB+512K L2<br>ARM v7 32b<br>Or 2 Denver 1, 2MB L2<br>64b | 4x A57 2MB L2 +<br>4x A53 512KB L2<br>ARM v8 64b | 2x Denver 2 2MB L2 +<br>4x A57 2MB L2<br>ARM v8 64b Coherent HMP<br>Architecture |
| Camera  | 4 cameras                                                          | 6 cameras                                        | Auto HDR<br>12 cameras                                                           |
| Memory  | 64b LPDDR2/3, DDR3L<br>15 GB/s (LP3, DDR3L)                        | 64b LPDDR4, 25GB/s                               | 128b LPDDR4, 50 GB/s, ECC                                                        |
| Display | Dual Pipeline<br>4K@30fps 24bpp                                    | Dual Pipeline<br>4K@60fps                        | Triple Pipeline<br>4K@60fps                                                      |
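
The Memory row can be checked with simple arithmetic: dividing the quoted bandwidth by the bus width in bytes gives the implied transfer rate. The sketch below assumes 1 GB = 10^9 bytes and that the quoted figures are peak values. Neither is stated in the table, and the function name is hypothetical.

```python
def implied_transfer_rate_gt_s(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Transfer rate (in GT/s) implied by a peak bandwidth and a bus width.

    ASSUMPTION: 1 GB = 1e9 bytes and the bandwidth is a peak figure.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bandwidth_gb_s / bytes_per_transfer


tx1 = implied_transfer_rate_gt_s(25, 64)       # TX1: 64b LPDDR4, 25 GB/s
parker = implied_transfer_rate_gt_s(50, 128)   # Parker: 128b LPDDR4, 50 GB/s
print(tx1, parker)
```

Both rows imply the same rate per pin (3.125 GT/s), so the doubling of bandwidth from TX1 to Parker in this table follows from the wider bus alone.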


### NVidia: pathway towards ARM-64 (5)




# Intel MIC: Many Integrated Core





## Intel Knights Corner architecture


## The new Knights Landing architecture




#### Intel Knights Landing in 2016: Xeon Phi with 72 cores




#### **PEZY-SC:** <u>P</u>eta <u>E</u>xa <u>Z</u>etta <u>Y</u>otta-<u>S</u>uper<u>C</u>omputer: a 1024-core many-core processor chip

Green500 list, June 2016:

| Green500 Rank | TOP500 Rank | MFLOPS/W | Site | Computer | Total Power (kW) |
|---------------|-------------|----------|------|----------|------------------|
| 1 | 94 | 6,673.84 | Advanced Center for Computing and Communication, RIKEN | Shoubu - ZettaScaler-1.6, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SCnp | 149.99 |
| 2 | 486 | 6,195.22 | Computational Astrophysics Laboratory, RIKEN | Satsuki - ZettaScaler-1.6, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SCnp | 46.89 |
| 3 | 1 | 6,051.30 | National Supercomputing Center in Wuxi | Sunway TaihuLight - Sunway MPP, Sunway <u>SW26010</u> 260C 1.45GHz, Sunway | 15,371.00 |
| 4 | 440 | 5,272.09 | GSI Helmholtz Center | ASUS ESC4000 FDR/G2S, Intel Xeon E5-2690v2 10C 3GHz, Infiniband FDR, AMD FirePro S9150 | 57.15 |
| 5 | 446 | 4,778.46 | Institute of Modern Physics (IMP), Chinese Academy of Sciences | Sugon Cluster W780I, Xeon E5-2640v3 8C 2.6GHz, ..., NVIDIA Tesla K80 | |
| 6 | 122 | 4,112.11 | Stanford Research Computing Center | XStream - Cray CS-Storm, Intel Xeon E5-2680v2 ..., Infiniband FDR, Nvidia K80 | |

Chip package marking: PEZY-SC, 2nd Generation Many Core Processor with 1024 Cores, supported by 2013 NEDO Project, PEZY Computing K.K.
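
The efficiency and power figures of the Green500 list above together imply the sustained performance of each system, since MFLOPS/W times watts gives MFLOPS. The sketch below does that conversion for the three top entries. It assumes the power column is in kW, and the function and variable names are hypothetical.

```python
def sustained_tflops(mflops_per_watt: float, power_kw: float) -> float:
    """Sustained performance in TFLOP/s implied by an energy efficiency
    (MFLOPS/W) and a total power draw (ASSUMED to be in kW)."""
    watts = power_kw * 1_000
    mflops = mflops_per_watt * watts
    return mflops / 1_000_000  # 1 TFLOP/s = 1e6 MFLOP/s


# (MFLOPS/W, power in kW) copied from the top three Green500 rows.
systems = {
    "Shoubu": (6673.84, 149.99),
    "Satsuki": (6195.22, 46.89),
    "Sunway TaihuLight": (6051.30, 15371.00),
}

for name, (efficiency, power) in systems.items():
    print(f"{name}: {sustained_tflops(efficiency, power):,.1f} TFLOP/s")
```

This gives roughly 1 PFLOP/s for Shoubu and about 93 PFLOP/s for Sunway TaihuLight: similar efficiency, but about a hundred times more power and performance, which is why the two lists rank them so differently.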






Adapteva announcement in Oct'16: Epiphany-V, a 1024-core RISC chip




#### Top500: Processor family distribution over all systems


