#### **Advanced Architectures**



2020/21 *A.J.Proença* 

#### From ILP to Multithreading (online)

(most slides are borrowed)

AJProença, Advanced Architectures, MiEI, UMinho, 2020/21


#### Key issues for parallelism in a single-core



#### Pipelining & superscalarity: a review



- The analysed pipelines were only those in the P6 **Execution Unit**, assuming that the Instruction Control Unit issues, at each clock cycle, all the instructions required for parallel execution
- The image suggests (i) a 3-way superscalar engine and (ii) an execution engine with 6 functional units


#### Intel Sunny Cove microarchitecture: 30 functional units



#### **Comments on the slides on performance evaluation** (1)



#### Assembly version for combine4

- data type: integer; operation: multiplication

| .L24: |                    | # Loop:          |
|-------|--------------------|------------------|
| imull | (%eax,%edx,4),%ecx | # t *= data[i]   |
| incl  | %edx               | # i++            |
| cmpl  | %esi,%edx          | # i:length       |
| jl    | .L24               | # if < goto Loop |
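
For reference, the loop above corresponds to C code of roughly the following shape. This is only a minimal sketch: it assumes a plain `int` array instead of the vector abstraction that the original `combine4` presumably uses, and the name `combine4_sketch` and its signature are hypothetical.

```c
/* Hypothetical stand-in for combine4 (integer data, multiplication).
   Assumption: a raw int array replaces the original vector type. */
void combine4_sketch(const int *data, int length, int *dest)
{
    int t = 1;                        /* identity for multiplication; lives in a register */
    for (int i = 0; i < length; i++)  /* incl / cmpl / jl in the assembly loop */
        t *= data[i];                 /* imull with a memory operand: load + multiply */
    *dest = t;                        /* single store, outside the loop */
}
```

Keeping the accumulator `t` in a local variable is what lets the compiler hold it in a register, so each iteration needs only one memory read.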

- Translating the 1<sup>st</sup> iteration into RISC-like instructions

| load  | (%eax,%edx.0,4) | → | t.1    |
|-------|-----------------|---|--------|
| imull | t.1, %ecx.0     | → | %ecx.1 |
| incl  | %edx.0          | → | %edx.1 |
| cmpl  | %esi, %edx.1    | → | cc.1   |
| jl    | -taken cc.1     |   |        |

| Instruction | Timing (clock cycles) |
|-------------|-----------------------|
| load        | 3 + miss penalty?     |
| imull       | +4                    |
| incl        | +1                    |
| cmpl        | +1                    |
| jl          | +1                    |

**Expected duration:** 10+ clock cycles per vector element
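
The expected duration can be checked by adding the per-instruction timings. This is a sketch of the arithmetic, under the assumption that the five RISC-like instructions of one iteration execute strictly one after another with no overlap, and that the load hits in the cache:

$$
T_{\text{iteration}} = \underbrace{3}_{\text{load}} + \underbrace{4}_{\text{imull}} + \underbrace{1}_{\text{incl}} + \underbrace{1}_{\text{cmpl}} + \underbrace{1}_{\text{jl}} = 10 \ \text{clock cycles}
$$

Any cache miss adds its penalty to the load term, hence "10+".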

#### **Comments on the slides on performance evaluation** (2)

#### **Features that lead to CPE=2:**

#### in the hardware

- pipelined execution units with 1 clock-cycle/issue
- mem hierarchy with cache
- out-of-order execution
- at least 5-way superscalar
- 1 more arithmetic unit & 1 more load unit
- speculative jump

#### at the code level

- loop unroll 2x
- 2-way parallelism
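
The two code-level features can be sketched in C as follows. This is an illustrative version, not the code from the original slides: the function name `product_unroll2x2` and the use of a plain `int` array are assumptions. With two independent accumulators the two multiplications of an iteration do not depend on each other, so a pipelined multiplier with a 4-cycle latency can overlap them, which is one way to read the CPE of 2 (4 cycles per 2 elements).

```c
/* Hypothetical sketch: loop unrolled 2x with 2-way parallelism
   (two independent accumulators), integer multiplication. */
int product_unroll2x2(const int *data, int length)
{
    int t0 = 1;              /* accumulates elements at even indices */
    int t1 = 1;              /* accumulates elements at odd indices  */
    int limit = length - 1;  /* last index at which a full pair fits */
    int i;

    /* Two elements per iteration; t0 and t1 form separate dependency chains */
    for (i = 0; i < limit; i += 2) {
        t0 *= data[i];
        t1 *= data[i + 1];
    }

    /* Leftover element when length is odd */
    for (; i < length; i++)
        t0 *= data[i];

    return t0 * t1;          /* combine the two partial products */
}
```

Unrolling alone would only reduce the loop overhead (`incl`, `cmpl`, `jl`) per element; it is the second accumulator that breaks the single multiplication dependency chain.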




Package Top side view

Package bottom side view


#### Ampere<sup>™</sup> Altra<sup>™</sup> processor complex

## ARM

#### 80 64-bit Arm CPU cores @ 3.0 GHz Turbo

- 4-Wide superscalar aggressive out-of-order execution
- Single threaded cores for performance and security isolation




Ampere Altra: 80 cores






## China

#### Sunway SW26010: 256+4 cores (#1 in TOP500, June 2016)





#### What is needed to increase the #cores in a chip?


Using the same microelectronics technology, **<u>remove</u>** parts from the core

#### Which parts?

- L3 cache
- AVX-512

. . .

- reduce L2 cache
- in-order execution
- fewer functional units




#### SMT in architectures designed by other companies


For each manufacturer, identify the maximum hardware support for SMT in each core (how many ways):

- Intel Xeon
- AMD Epyc
- Fujitsu A64FX
- IBM POWER9
- Sunway SW26010
- Apple A14