**Advanced Architectures** 



## **Master Informatics Eng.**

2019/20 *A.J.Proença* 

## Data Parallelism 5 (other PUs, ...)

(most slides are borrowed)

## **Beyond Vector/SIMD architectures**


### • Vector/SIMD-extended architectures are hybrid approaches

- mix (super)**scalar + vector** op capabilities on a single device
- highly pipelined approach to reduce memory access penalty
- tightly-coupled access to shared memory: lower latency

### Evolution of Vector/SIMD-extended architectures

#### - PU (Processing Unit) cores with wider vector units

- x86 many-core: Intel MIC / Xeon KNL
- others: ...

#### - coprocessors (require a host scalar processor): accelerator devices

- on disjoint physical memories (e.g., Xeon KNC with PCI-Express, PEZY-SC)
- ISA-free architectures: ...
- focus on SIMT/SIMD to hide memory latency: GPU-type approach

• ...

#### - heterogeneous PUs in a SoC: multicore PUs with GPU-cores

• ...

## **PEZY-SC:** <u>P</u>eta <u>E</u>xa <u>Z</u>etta <u>Y</u>otta-<u>S</u>uper<u>C</u>omputer: a 1024-core many-core processor chip, each core 8-way SMT

Green500 list, June 2015:

| Green500<br>Rank | TOP500<br>Rank | MFLOPS/W | Site | System | Total<br>Power (kW) |
|------------------|----------------|----------|------|--------|---------------------|
| 1 | 160 | 7031.4 | RIKEN | ExaScaler-1.4 80Brick, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SC | 50.3 |
| 2 | 392 | 6841.3 | High Energy Accelerator Research Organization /KEK | ExaScaler-1.4 16Brick, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband, PEZY-SC | 28.3 |
| 3 | 366 | 6217.9 | High Energy Accelerator Research Organization /KEK | ExaScaler 32U256SC Cluster, Intel Xeon E5-2660v2 10C, Infiniband FDR, PEZY-SC | 32.6 |
| 4 | 215 | 5272.1 | GSI Helmholtz Center | ASUS ESC4000 FDR/G2S, Xeon E5-2690v2 10C 3GHz, Infiniband FDR, AMD FirePro S9150 | |
| 5 | 469 | 4258.1 | GSIC Center, Tokyo Institute of Technology | LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, K20x | |

[Figure: PEZY-SC chip, a many-core processor with 1024 cores (PEZY Computing K.K., supported by the 2013 NEDO project), with DDR3/4, ARM and PCIe Gen3 interfaces]

AJProença, Advanced Architectures, MiEI, UMinho

## PEZY-SC: <u>P</u>eta <u>E</u>xa <u>Z</u>etta <u>Y</u>otta-<u>S</u>uper<u>C</u>omputer: a 1024-core many-core processor chip





# **Evolution:** the **PEZY-SC2**

PEZY-SC with 2x 32-bit ARM cores (2015)

PEZY-SC2 with 8x 64-bit MIPS cores sharing 40 MiB LLC (2017)



## **PEZY-SC2** in Green500

## Green500 List for November 2017

| Rank | TOP500<br>Rank | System | Cores | Rmax<br>(TFlop/s) | Power<br>(kW) | Power Efficiency<br>(GFlops/watts) |
|------|----------------|--------|-------|-------------------|---------------|------------------------------------|
| 1 | 259 | <b>Shoubu system B</b> - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.<br>Advanced Center for Computing and Communication, RIKEN<br>Japan | 794,400 | 842.0 | 50 | 17.009 |
| 2 | 307 | <b>Suiren2</b> - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.<br>High Energy Accelerator Research Organization /KEK<br>Japan | 762,624 | 788.2 | 47 | 16.759 |
| 3 | 276 | <b>Sakura</b> - ZettaScaler-2.2, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband EDR, PEZY-SC2, PEZY Computing / Exascaler Inc.<br>PEZY Computing K.K.<br>Japan | 794,400 | 824.7 | 50 | 16.657 |
| 4 | 149 | <b>DGX SaturnV Volta</b> - NVIDIA DGX-1 Volta36, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla V100, Nvidia<br>NVIDIA Corporation<br>United States | 22,440 | 1,070.0 | 97 | 15.113 |
| 5 | 4 | <b>Gyoukou</b> - ZettaScaler-2.2 HPC system, Xeon D-1571 16C 1.3GHz, Infiniband EDR, <u>PEZY-SC2</u> 700Mhz, ExaScaler<br>Japan Agency for Marine-Earth Science and Technology<br>Japan | 19,860,000 | 19,135.8 | 1,350 | 14.173 |

None of these systems are in the Nov'19 list, only a PEZY-SC2 in 2<sup>nd</sup> position.

## The PEZY-SCx Road Map


## PEZY-SCx Processor Roadmap

|                         | PEZY-SC     | PEZY-SC2     | PEZY-SC3       | PEZY-SC4      |
|-------------------------|-------------|--------------|----------------|---------------|
| Process                 | 28nm        | 16nm         | 7nm            | 5nm           |
| Die Size                | 412mm2      | 620mm2       | 700mm2         | 740mm2        |
| Number of Cores         | 1,024       | 2,048        | 8,192          | 16,384        |
| Core Voltage            | 0.9V        | 0.8V         | 0.65V          | 0.55V         |
| Core Clock              | 733MHz      | 1GHz         | 1.33GHz        | 1.6GHz        |
| DRAM-IO                 | DDR4        | DDR4         | DDR4/5         | DDR5          |
| DDR Clock               | 2,133MHz    | 2,666MHz     | 3.6GHz         | 4GHz          |
| Port                    | 8           | 4            | 4              | 4             |
| Wide-IO Clock           |             | 2GHz DDR     | 3 GHz DDR      | 3GHz DDR      |
| Wide-IO Width           | -           | 1,024bit     | 2,048bit       | 4,096bit      |
| Wide-IO Ports           |             | 4            | ~              | 8             |
| Memory Bandwidth        | 153.6GB/s   | 2.1TB/s      | 12.2TB/s       | 24.4TB/s      |
| Peripheral IO           | PCIe Gen3   | PCIe Gen4    | Custom Optical | Custom Optical |
| Peripheral IO lane      | 24          | 32           | 128            | 512           |
| Peripheral IO Bandwidth | 32GB/s      | 64GB/s       | 256GB/s        | 1TB/s         |
| DP Performance          | 1.5TFLOPS   | 4.1TFLOPS    | 21.8TFLOPS     | 52.5TFLOPS    |
| SP Performance          | 3.0TFLOPS   | 8.2TFLOPS    | 43.6TFLOPS     | 105TFLOPS     |
| HP Performance          | -           | 16.4TFLOPS   | 87.2TFLOPS     | 210TFLOPS     |
| Power Consumption       | 100W        | 200W         | 400W           | 640W          |
| Power Efficiency        | 15GFLOPS/W  | 20.5GFLOPS/W | 54.5GFLOPS/W   | 82.0GFLOPS/W  |
| System Efficiency       | 6.7GFLOPS/W | 15GFLOPS/W   | 40GFLOPS/W     | 60GFLOPS/W    |
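The DP Performance and Power Efficiency rows of the roadmap can be cross-checked with simple arithmetic. The sketch below is only an illustration: the figures are copied from the table, while the value of 2 DP flops per core per cycle (one fused multiply-add) is an assumption that is not stated in the roadmap, and the helper names are hypothetical.

```python
# Roadmap figures copied from the table above:
# name -> (cores, core clock in GHz, listed DP TFLOPS, power in W)
ROADMAP = {
    "PEZY-SC": (1024, 0.733, 1.5, 100),
    "PEZY-SC2": (2048, 1.0, 4.1, 200),
    "PEZY-SC3": (8192, 1.33, 21.8, 400),
    "PEZY-SC4": (16384, 1.6, 52.5, 640),
}

# ASSUMPTION (not in the roadmap): each core retires one DP fused
# multiply-add per cycle, counted as 2 flops.
FLOPS_PER_CYCLE = 2


def dp_peak_tflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak DP rate in TFLOPS: cores x clock x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


def power_efficiency(dp_tflops, power_w):
    """Chip power efficiency in GFLOPS/W."""
    return dp_tflops * 1000.0 / power_w


for name, (cores, ghz, listed_tflops, watts) in ROADMAP.items():
    estimate = dp_peak_tflops(cores, ghz)
    efficiency = power_efficiency(listed_tflops, watts)
    print(f"{name}: estimated {estimate:.2f} TFLOPS "
          f"(listed {listed_tflops}), {efficiency:.1f} GFLOPS/W")
```

Under that assumption the estimates land within about 0.1 TFLOPS of the listed DP figures, and the listed performance divided by the listed power gives the Power Efficiency row.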

PEZY-SC3 expected by the end of 2019

## **Beyond Vector/SIMD architectures**


### • Vector/SIMD-extended architectures are hybrid approaches

- mix (super)**scalar + vector** op capabilities on a single device
- highly pipelined approach to reduce memory access penalty
- tightly-coupled access to shared memory: lower latency

### Evolution of Vector/SIMD-extended architectures

#### - PU (Processing Unit) cores with wider vector units

- <u>x86</u> many-core: Intel MIC / Xeon KNL
- others: IBM Power BlueGene/Q, Sunway SW26010, Matrix-2000, A64FX Arm

#### - coprocessors (require a host scalar processor): accelerator devices

- on disjoint physical memories (e.g., Xeon KNC with PCI-Express, PEZY-SC)
- ISA-free architectures: ...
- ...

#### - heterogeneous PUs in a SoC: multicore PUs with GPU-cores

• ...

## IBM Power BlueGene/Q Compute (chip)



Features:

- launched in 2010/11 (TOP500: #1 in Jun12, #4 in Jun16)
- 18 cores
  - 16 compute, 1 OS support, 1 redundant
  - 64-bit Power ISA
  - 1.6 GHz
  - L1 I/D cache => 16 kiB / 16 kiB
  - each core: <u>quad-FPU</u> (4-wide double precision SIMD)
  - each core: 4-way SMT
- shared L2 cache: 32 MiB
- dual memory controller
- IBM ended development of BlueGene project in 2015...



## IBM Power BlueGene/Q Compute (Sequoia)



## The Sunway TaihuLight: #1 in TOP500 from Jun'16 to Nov'17



#### One card with two nodes (two SW26010 chips)



#### SW26010: the 4x64-core 64-bit RISC processor (w/ 256-bit vector instructions & only cache L1)





## Replacing the KNC in Tianhe-2A: the Matrix-2000 accelerator

[Figure: Matrix-2000 block diagram: super-nodes (SN) made of clusters of cores (C), linked by an on-chip interconnection, with DDR4 memory channels]

## Matrix-2000 accelerator


## • Chip specification

- **128 cores**
  - 4 super-nodes (SN)
  - 8 clusters per SN
  - 4 cores per cluster
  - Core
    - Self-defined 256-bit vector ISA
    - 16 DP flops/cycle per core
- Peak performance: <u>2.4576Tflops@1.2GHz</u>

4 SNs × 8 clusters × 4 cores × 16 flops/cycle × 1.2 GHz = 2.4576 Tflops

- Peak power dissipation: ~240 W
- Interface
  - 8 DDR4-2400 channels
  - x16 PCIe 3.0 EP port
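The peak-performance product in the chip specification can be reproduced as a small calculation. This is a minimal sketch with hypothetical function and variable names; all numbers come from the specification above, and the GFLOPS/W figure is a peak-over-peak ratio derived here, not a measured efficiency.

```python
def matrix2000_peak(super_nodes, clusters_per_sn, cores_per_cluster,
                    flops_per_cycle, clock_hz):
    """Return (total cores, peak DP flop/s) for a hierarchical many-core chip."""
    cores = super_nodes * clusters_per_sn * cores_per_cluster
    return cores, cores * flops_per_cycle * clock_hz


# Values from the slide: 4 SNs, 8 clusters per SN, 4 cores per cluster,
# 16 DP flops/cycle per core, 1.2 GHz clock.
total_cores, peak_flops = matrix2000_peak(4, 8, 4, 16, 1.2e9)

PEAK_POWER_W = 240  # "~240 W" on the slide, so the ratio below is approximate
peak_gflops_per_watt = peak_flops / 1e9 / PEAK_POWER_W

print(f"{total_cores} cores, {peak_flops / 1e12:.4f} Tflops peak, "
      f"about {peak_gflops_per_watt:.1f} GFLOPS/W at peak power")
```

The same cores × flops/cycle × clock product applies to the other chips in this deck once their per-core flops per cycle are known.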




## Fujitsu's A64FX Arm & PEZY-SC2 in Green500

#### Green500 List for November 2019

| Rank | TOP500<br>Rank | System | Cores | Rmax<br>(TFlop/s) | Power<br>(kW) | Power Efficiency<br>(GFlops/watts) |
|------|----------------|--------|-------|-------------------|---------------|------------------------------------|
| 1          | 159    | <b>A64FX prototype</b> - Fujitsu A64FX, Fujitsu A64FX<br>48C 2GHz, Tofu interconnect D, Fujitsu<br>Fujitsu Numazu Plant<br>Japan | 36,864    | 1,999.5   | 118    | 16.876           |
| 2          | 420    | NA-1 - ZettaScaler-2.2, Xeon D-1571 16C 1.3GHz,<br>Infiniband EDR, <u>PEZY-SC2</u> 700Mhz , PEZY<br>Computing / Exascaler Inc.<br>PEZY Computing K.K.<br>Japan                                                              | 1,271,040 | 1,303.2   | 80     | 16.256           |
| 3          | 24     | AiMOS - IBM Power System AC922, IBM POWER9<br>20C 3.45GHz, Dual-rail Mellanox EDR Infiniband,<br>NVIDIA Volta GV100, IBM<br>Rensselaer Polytechnic Institute Center for<br>Computational Innovations (CCI)<br>United States | 130,000   | 8,045.0   | 510    | 15.771           |
| 4          | 373    | Satori - IBM Power System AC922, IBM POWER9<br>20C 2.4GHz, Infiniband EDR, NVIDIA Tesla V100<br>SXM2 , IBM<br>MIT/MGHPCC Holyoke, MA<br>United States                                                                       | 23,040    | 1,464.0   | 94     | 15.574           |
| 5          | 1      | Summit - IBM Power System AC922, IBM POWER9<br>22C 3.07GHz, NVIDIA Volta GV100, Dual-rail<br>Mellanox EDR Infiniband , IBM<br>DOE/SC/Oak Ridge National Laboratory<br>United States                                         | 2,414,592 | 148,600.0 | 10,096 | 14.719           |
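The Power Efficiency column is essentially Rmax divided by power. The sketch below recomputes it from the Rmax and Power columns of the table above; the names are hypothetical, and small gaps against the listed values are expected, presumably because the power column is shown rounded to whole kW.

```python
# (system, Rmax in TFlop/s, power in kW, listed GFlops/W), copied from the table.
GREEN500_NOV2019 = [
    ("A64FX prototype", 1999.5, 118, 16.876),
    ("NA-1", 1303.2, 80, 16.256),
    ("AiMOS", 8045.0, 510, 15.771),
    ("Satori", 1464.0, 94, 15.574),
    ("Summit", 148600.0, 10096, 14.719),
]


def gflops_per_watt(rmax_tflops, power_kw):
    """TFlop/s divided by kW: the factors of 1000 cancel, giving GFlops/W."""
    return rmax_tflops / power_kw


def relative_gap(recomputed, listed):
    """Relative difference between a recomputed and a listed value."""
    return abs(recomputed - listed) / listed


for name, rmax, power, listed in GREEN500_NOV2019:
    recomputed = gflops_per_watt(rmax, power)
    print(f"{name}: recomputed {recomputed:.3f}, listed {listed:.3f}, "
          f"gap {100 * relative_gap(recomputed, listed):.2f}%")
```

Note how Summit, first in TOP500 by Rmax, ranks fifth here: the Green500 ordering rewards flops per watt, not absolute performance.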



# *Fujitsu's A64FX Arm Chip:* 48+4 cores

#### A64FX Arm:

- Armv8.2-A spec with <u>512-bit SVE extensions</u>
- CMG (core memory group) specification: 13 cores, L2\$ 8 MiB, Mem 8 GiB, 256 GB/s

- HP math and a dot-product engine
- 4 core memory groups interconnected with a double ring bus
- cores in CMG linked by a crossbar to a 16-way associative 8 MiB L2 cache and to the HBM2 mem controller
- No L3 cache
- a Tofu3 controller on the die
- <u>Cray CS500 will use A64FX package</u>
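The chip-level totals follow from the per-CMG specification and the four core memory groups. The sketch below aggregates them; the class and function names are hypothetical, and the split of each CMG into 12 compute cores plus 1 assistant core is an assumption made here to match the "48+4 cores" in the slide title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreMemoryGroup:
    """Per-CMG figures as given on the slide."""
    cores: int = 13
    l2_mib: int = 8
    mem_gib: int = 8
    bandwidth_gbs: int = 256


NUM_CMGS = 4  # "4 core memory groups" on the slide

# ASSUMPTION: 12 of the 13 cores in each CMG are compute cores and 1 is an
# assistant core, which is one way to read "48+4 cores".
COMPUTE_CORES_PER_CMG = 12


def chip_totals(cmg, num_cmgs=NUM_CMGS):
    """Sum the per-CMG resources over all CMGs on the chip."""
    return {
        "cores": cmg.cores * num_cmgs,
        "compute_cores": COMPUTE_CORES_PER_CMG * num_cmgs,
        "assistant_cores": (cmg.cores - COMPUTE_CORES_PER_CMG) * num_cmgs,
        "l2_mib": cmg.l2_mib * num_cmgs,
        "mem_gib": cmg.mem_gib * num_cmgs,
        "bandwidth_gbs": cmg.bandwidth_gbs * num_cmgs,
    }


totals = chip_totals(CoreMemoryGroup())
print(totals)
```

With these inputs the chip has 52 cores (48+4), 32 MiB of L2, 32 GiB of memory and 1024 GB/s of aggregate memory bandwidth.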









## **Beyond Vector/SIMD architectures**


#### • Vector/SIMD-extended architectures are hybrid approaches

- mix (super)scalar + vector op capabilities on a single device
- highly pipelined approach to reduce memory access penalty
- tightly-coupled access to shared memory: lower latency

### Evolution of Vector/SIMD-extended architectures

#### - PU (Processing Unit) cores with wider vector units

- x86 many-core: Intel MIC / Xeon KNL
- others: IBM BlueGene/Q Compute, Sunway SW26010, Matrix-2000, A64FX Arm

#### - coprocessors (require a host scalar processor): accelerator devices

- on disjoint physical memories (e.g., Xeon KNC with PCI-Express, PEZY-SC)
- ISA-free architectures, code compiled to silicon: FPGA
- ...

#### - heterogeneous PUs in a SoC: multicore PUs with GPU-cores

• ...

## What is an FPGA


## Field-Programmable Gate Arrays (FPGA)

A fabric with 1000s of simple configurable logic cells with LUTs (look-up tables), on-chip SRAM, configurable routing and I/O cells
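To make the idea of a configurable logic cell concrete, the sketch below models a k-input LUT as a truth table stored in a small memory and indexed by the input bits. It is a toy software model with hypothetical names (`LUT`, `from_function`), not how any FPGA tool represents its configuration: the point is only that loading different table contents makes the same cell compute a different Boolean function.

```python
class LUT:
    """Toy model of a k-input look-up table: 2**k stored bits, one output."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 2 ** num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.table = tuple(int(bit) & 1 for bit in truth_table)

    @classmethod
    def from_function(cls, num_inputs, fn):
        """'Configure' the LUT by tabulating fn over all input combinations."""
        table = []
        for index in range(2 ** num_inputs):
            # First input is the most significant bit of the table index.
            bits = [(index >> (num_inputs - 1 - j)) & 1 for j in range(num_inputs)]
            table.append(fn(*bits))
        return cls(num_inputs, table)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.table[index]


# Two 3-input LUTs configured as the sum and carry outputs of a full adder.
sum_lut = LUT.from_function(3, lambda a, b, c: a ^ b ^ c)
carry_lut = LUT.from_function(3, lambda a, b, c: (a & b) | (a & c) | (b & c))

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", carry_lut(a, b, c), sum_lut(a, b, c))
```

In a real device the configurable routing then wires such cells together, which this model leaves out.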



## FPGA as a multiple configurable ISA



## FPGA as a computing accelerator



## CPU vs. GPU vs. FPGA vs. ASIC










## Intel<sup>®</sup> FPGA Programmable Acceleration Card (PAC) D5005

#### Introduction

This high-performance FPGA acceleration card for data centers offers both inline and lookaside acceleration. Expanding upon the Intel® FPGA Programmable Acceleration Card (PAC) portfolio, it offers inline high-speed interfaces up to 100 Gbps for video transcode and streaming analytics applications. It provides the performance and versatility of FPGA acceleration and is one of several platforms supported by the Acceleration Stack for Intel Xeon® CPUs with FPGAs. This acceleration stack provides a common developer interface for both application and accelerator function developers, and includes drivers, application programming interfaces (APIs), and an FPGA interface manager. Together with acceleration libraries and development tools, the acceleration stack saves developers time and enables code re-use across multiple Intel FPGA platforms.






#### Targeted Workloads

#### **Power and Thermals**

## Integrating programmable acceleration cards at Intel



#### Intel® Xeon® Processor + Field Programmable Gate Array Tool Flow



Programming Interfaces: OpenCL

Field Programmable Gate Array (FPGA)



https://www.slideshare.net/insideHPC/using-xeon-fpga-for-accelerating-hpc-workloads


[Figure: tool flow: Kernels → OpenCL Compiler → bitstream → FPGA]