## Master Informatics Eng.

2018/19 A.J.Proença

## Instruction-Level Parallelism & Data Parallelism

(some slides are borrowed, mod's in green)

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

XX

#### **Performance Summary (single-core)**



1

## **Background for Advanced Architectures**

#### $\sim$

#### Key concepts to revise:

- numerical data representation (for error analysis)
- ISA (Instruction Set Architecture)
- how C compilers generate code (a look into assembly code)
  - how scalar and structured data are allocated
  - how control structures are implemented
  - how to call/return from function/procedures
  - what architecture features impact performance

#### - Improvements to enhance performance in a single PU

- ILP: pipeline, multiple issue, ...
- thread-level parallelism
- data parallelism: SIMD/vector processing, ...
- memory hierarchy: cache levels, ...

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

3

#### **Pipeline Summary**

# The BIG Picture Pipelining improves performance by increasing instruction throughput Executes multiple instructions in parallel Each instruction has the same latency Subject to hazards Structure, data, control Instruction set design affects complexity of

pipeline implementation

## Processor arch: beyond Instruction-Level Parallelism



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

5

**Does Multiple Issue Work?** 



## **Multiple Issue and Static Scheduling**

- To achieve CPI < 1, need to complete multiple instructions per clock cycle
- Solutions:

 $\mathcal{K}$ 

- statically scheduled superscalar processors
- VLIW (very long instruction word) processors
- dynamically scheduled superscalar processors



## Internal x86 roadmap

|            |           | NetB     | urst        |                    |      |        |            |          |               |            |              |           |          |            |
|------------|-----------|----------|-------------|--------------------|------|--------|------------|----------|---------------|------------|--------------|-----------|----------|------------|
| Willamette | Northwood | Prescott | Tejas       | Nehalem (NetBurst) |      |        |            |          |               |            |              |           |          |            |
|            |           |          | ų           | Cedar Mill (Tejas) |      |        |            |          |               |            |              |           |          |            |
|            |           | Ļ        | Prescott-2M | Cedar Mill         |      |        |            |          |               |            |              |           |          |            |
|            |           |          | Smithfield  | Presler            |      |        |            |          |               |            |              |           |          |            |
| P6         |           |          | С           | ore                | Neh  | alem   | Sandy B    | ridge    | На            | swell      | Sk           | ylake     |          |            |
| Coppermine | Tualatin  | Banias   | Dothan      | Yonah              | Core | Penryn | Nehalem    | Westmere | Sandy Bridge  | Ivy Bridge | Haswell      | Broadwell | Skylake  | Cannonlake |
| 180 nm     | 130 r     | nm       | 90 nm       | 65 nm              |      | 45     | inm        | 3        | 2 nm          | 22 n       | m            | 14        | nm       | 10 nm      |
|            |           |          |             |                    |      | Во     | nnell      | Sa       | aitwell       | Silvern    | nont         | Airmont   | Goldmont |            |
| Atc        |           |          |             | om Bonnell         |      |        | Silvermont |          | t             | Goldmont   |              |           |          |            |
| P5 Xeon F  |           |          |             | Phi                |      |        |            |          | Knigh<br>Corn |            | Knig<br>Land |           |          |            |

## Internal architecture of Intel P6 processors





AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

~~

Some capabilities of Intel P6



| Instruction               | Latency_ | Cycles <u>/</u> Issue |
|---------------------------|----------|-----------------------|
| Load / Store              | 3        | 1                     |
| Integer Multiply          | 4        | 1                     |
| Integer Divide            | 36       | 36                    |
| Double/Single FP Multiply | 5        | 2                     |
| Double/Single FP Add      | 3        | 1                     |
| Double/Single FP Divide   | 38       | 38                    |

A detailed example: generic & abstract form of combine

```
void abstract combine4(vec ptr v, data t *dest)
  int i;
  int length = vec length(v);
  data t *data = get vec start(v);
  data t t = IDENT;
  for (i = 0; i < length; i++)</pre>
    t = t OP data[i];
  *dest = t;
```

- Procedure to perform addition (w/ some improvements)
- compute the sum of all vector elements
- store the result in a given memory location
- structure and operations on the vector defined by ADT
- Metrics

XX

- Clock-cycles Per Element, CPE

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

11

#### Converting instructions with registers into operations with tags

| XX -    |           |                                      |            |                |
|---------|-----------|--------------------------------------|------------|----------------|
| • Asse  | embly v   | ersion for combine                   | <b>e</b> 4 |                |
| – da    | ata type: | <i>integer</i> ; operation: <i>m</i> | ulti       | iplication     |
|         | .L24:     |                                      | #          | Loop:          |
|         | imull     | (%eax,%edx,4),%ecx                   | #          | t *= data[i]   |
|         | incl      | %edx                                 | #          | i++            |
|         | cmpl      | %esi,%edx                            | #          | i:length       |
|         | jl        | . L24                                | #          | if < goto Loop |
|         | [         | _                                    |            |                |
| • Tran  | slating   | 1 <sup>st</sup> iteration            |            |                |
| . L24 : |           |                                      |            |                |

```
(%eax,%edx.0,4) → t.1
imull (%eax,%edx,4),%ecx
                             load
                             imull t.1, %ecx.0
                                   %edx.0
incl
      %edx
                             incl
      %esi,%edx
                             cmpl
                                   %esi, %edx.1
cmpl
jl
      .L24
                             jl
                                   -taken cc.1
```

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

→ %ecx.1

 $\rightarrow$  %edx.1

→ cc.1

## Visualizing instruction execution in P6: 1 iteration of the multiplication cycle on combine



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

13

#### Visualizing instruction execution in P6: 3 iterations of the same cycle on combine



Visualizing instruction execution in P6: 4 iterations of the addition cycle on combine



#### • Performance

- it can start a new iteration at each clock cycle
- theoretical CPE: 1.0
- it requires parallel execution of 4 integer operations

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

15



#### Performance

- expected CPE: 2.0

## Machine dependent optimization techniques: loop unroll (1)

 $\sim$ 

```
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* junta 3 elem's no mesmo ciclo */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+1]
            + data[i+2];
    }
    /* completa os restantes elem's */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}
</pre>
```

**Optimization 4**:

- merges several (3) iterations in a single loop cycle
- reduces cycle
   overhead in loop
   iterations
- runs the extra work at the end
- -CPE: 1.33

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

17

## Machine dependent optimization techniques: loop unroll (2)



## Machine dependent optimization techniques: loop unroll (3)



1 iteration for each 4 cycles

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

19

## Machine dependent optimization techniques: loop unroll (4)

| ~     |
|-------|
| - X X |
|       |

CPE value for several cases of loop unroll:

| Degree  | of Unroll | 1    | 2    | 3    | 4    | 8    | 16   |
|---------|-----------|------|------|------|------|------|------|
| Integer | Addition  | 2.00 | 1.50 | 1.33 | 1.50 | 1.25 | 1.06 |
| Integer | Product   | 4.00 |      |      |      |      |      |
| fp      | Addition  | 3.00 |      |      |      |      |      |
| fp      | Product   | 5.00 |      |      |      |      |      |

- only improves the integer addition

- · remaining cases are limited to the unit latency
- result does not linearly improve with the degree of unroll
  - subtle effects determine the exact allocation of operations

What else can be done?



Machine dependent optimization techniques: loop unroll with parallelism (1)

void combine6(vec ptr v, int \*dest) int length = vec length(v); -accumulate in 2 int limit = length-1; int \*data = get\_vec\_start(v); int x0 = 1;int x1 = 1;int i; /\* junta 2 elem's de cada vez \*/ for (i = 0; i < limit; i+=2) {</pre> x0 \*= data[i]; -Performance x1 \*= data[i+1]; } -CPE: 2.0 /\* completa os restantes elem's \*/ for (; i < length; i++) {</pre> x0 \*= data[i]; \*dest = x0 \* x1;

Sequential ... versus parallel!

## **Optimization 5**:

- different products
  - can be in parallel, if **OP** is associative!
- -merge at the end
- - -improvement 2x

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

XX

21

## Machine dependent optimization techniques: loop unroll with parallelism (2)



AJProença, Advanced Architectures, MiEI, UMinho, 2018/19



# Code optimization techniques: comparative analyses of combine

| Method                      | Intege | er    | <b>Real</b> (single precision) |        |  |
|-----------------------------|--------|-------|--------------------------------|--------|--|
|                             | +      | *     | +                              | *      |  |
| Abstract -g                 | 42.06  | 41.86 | 41.44                          | 160.00 |  |
| Abstract -02                | 31.25  | 33.25 | 31.25                          | 143.00 |  |
| Move vec length             | 20.66  | 21.25 | 21.15                          | 135.00 |  |
| Access to data              | 6.00   | 9.00  | 8.00                           | 117.00 |  |
| Accum. in temp              | 2.00   | 4.00  | 3.00                           | 5.00   |  |
| Unroll 4x                   | 1.50   | 4.00  | 3.00                           | 5.00   |  |
| Unroll 16x                  | 1.06   | 4.00  | 3.00                           | 5.00   |  |
| Unroll 2x, paral. 2x        | 1.50   | 2.00  | 2.00                           | 2.50   |  |
| <i>Unroll</i> 4x, paral. 4x | 1.50   | 2.00  | 1.50                           | 2.50   |  |
| <i>Unroll</i> 8x, paral. 4x | 1.25   | 1.25  | 1.50                           | 2.00   |  |
| Theoretical Optimiz         | 1.00   | 1.00  | 1.00                           | 2.00   |  |
| Worst : Best                | 39.7   | 33.5  | 27.6                           | 80.0   |  |

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

XX

25

## Processor arch: beyond Instruction-Level Parallelism

| 1 |                                                                          |
|---|--------------------------------------------------------------------------|
| • | When exploiting ILP, goal is to minimize CPI                             |
|   | Pipeline CPI (efficient to exploit loop-level parallelism) =>            |
|   | <ul> <li>Ideal pipeline CPI +</li> </ul>                                 |
|   | <ul> <li>Structural stalls +</li> </ul>                                  |
|   | <ul> <li>Data hazard stalls +</li> </ul>                                 |
|   | Control stalls +                                                         |
|   | <ul> <li>Memory stalls cache techniques</li> </ul>                       |
|   | > Multiple issue =>                                                      |
|   | <ul> <li>find enough parallelism to keep pipeline(s) occupied</li> </ul> |
| • | Multithreading =>                                                        |
|   | Find additional ways to keep pipeline(s) occupied                        |
|   |                                                                          |
|   |                                                                          |

# **Multithreading**

- Performing multiple threads of execution in parallel
  - Replicate registers, PC/IP, etc.
  - Fast switching between threads
- Fine-grain multithreading / time-multiplexed MT
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (eg, data hazards)

Chapter 7 — Multicores, Multiprocessors, and Clusters — 27

# **Simultaneous Multithreading**

- In multiple-issue dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

**HT**: Hyper-Threading, Intel trade mark for their SMT implementation MT in Xeon Phi KNC: 4-way SMT with time-mux MT, **not HT**!



# **Multithreading Example**



Chapter 7 — Multicores, Multiprocessors, and Clusters — 29

# **Multithreading Example**



Chapter 7 — Multicores, Multiprocessors, and Clusters — 30



## Processor arch: beyond Instruction-Level Parallelism



# **Instruction and Data Streams**

|                        |          | Data Streams                       |                                       |  |  |  |
|------------------------|----------|------------------------------------|---------------------------------------|--|--|--|
|                        |          | Single                             | Multiple                              |  |  |  |
| Instruction<br>Streams | Single   | SISD:<br>Intel Pentium 4           | <b>SIMD</b> : SSE instructions of x86 |  |  |  |
|                        | Multiple | <b>MISD</b> :<br>No examples today | MIMD:<br>Intel Xeon e5345             |  |  |  |

Introduction

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

Chapter 7 — Multicores, Multiprocessors, and Clusters — 33

# Introduction

- SIMD architectures can exploit significant datalevel parallelism for:
  - matrix-oriented scientific computing
  - media-oriented <u>image</u> and <u>sound</u> processing
- SIMD is more energy efficient than MIMD
  - only needs to fetch one instruction per data operation
  - makes SIMD attractive for personal mobile devices
- SIMD allows programmers to continue to think sequentially





## **Vector Architectures**

- Basic idea:
  - Read sets of data elements (<u>gather</u> from memory) into "vector registers"
  - Operate on those registers
  - Store/<u>scatter</u> the results back into memory
- Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth



Vector Architectures



AJProença, Advanced Architectures, MiEI, UMinho, 2016/17

37

Vector Architectures

## Challenges

- Start up time
  - Latency of vector functional unit
  - Assume the same as Cray-1
    - Floating-point add => 6 clock cycles
    - Floating-point multiply => 7 clock cycles
    - Floating-point divide => 20 clock cycles
    - Vector load => 12 clock cycles
- Improvements:
  - > 1 element per clock cycle (1)
  - Non-64 wide vectors (2)
  - IF statements in vector code (3)
  - Memory system optimizations to support vector processors (4)
  - Multiple dimensional matrices (5)
  - Sparse matrices (6)
  - Programming a vector computer (7)



## Vector Programming (7)

- Compilers are a key element to give hints on whether a code section will vectorize or not
- Check if loop iterations have data dependencies, otherwise vectorization is compromised
- Vector Architectures have a too high cost, but simpler variants are currently available on off-the-shelf devices; however:
  - most do not support non-unit stride => care must be taken in the design of data structures
  - same applies for gather-scatter...





No mask registers



## SIMD Implementations

## Implementations:

- Intel MMX (1996)
  - Eight 8-bit integer ops or four 16-bit integer ops
- Streaming SIMD Extensions (SSE) (1999)
  - Eight 16-bit integer ops
  - Four 32-bit integer/fp ops or two 64-bit integer/fp ops
- Advanced Vector eXtensions (AVX) (2010)
  - Eight 32-bit fp ops or Four 64-bit fp ops (integer in AVX-2)
  - 512-bits wide in AVX-512 (and also in Larrabee & Phi-KC)

## Operands must be in consecutive and aligned memory locations





#### Extensões de processamento vetorial no Intel 64

## Additional features in Intel x86



43

## **Beyond Vector/SIMD architectures**

#### $\sim$

- Vector/SIMD-extended architectures are hybrid approaches
  - mix (super)scalar + vector op capabilities on a single device
  - highly pipelined approach to reduce memory access penalty
  - tightly-closed access to shared memory: lower latency
- Evolution of Vector/SIMD-extended architectures

#### - CPU cores with wider vectors and/or SIMD cores:

- <u>DSP</u> VLIW cores with vector capabilities: **Texas Instruments** (...?)
- PPC cores coupled with SIMD cores: Cell (past...) , IBM Power BQC...
- <u>ARM64</u> cores coupled with SIMD cores: from Tegra to Parker (NVidia) (...?)
- <u>x86</u> many-core: Intel MIC / Xeon KNL, AMD FirePro...
- other many-core: ShenWay 260, Adapteva Epiphany-V...

#### - coprocessors (require a host scalar processor): accelerator devices

- on disjoint physical memories (e.g., Xeon KNC with PCI-Expr, PEZY-SC)
- focus on SIMT/SIMD to hide memory latency: GPU-type approach
- ISA-free architectures, code compiled to silica: **FPGA**

## Intel MIC: Many Integrated Core



## Intel Knights Corner architecture

46





## Intel Knights Landing architecture

AJProença, Advanced Architectures, MiEI, UMinho, 2018/19

47

# INTEL® XEON PHI™ X200 PROCESSOR OVERVIEW





AJProença, Advanced Architectures, MiEI, UMinho, 2018/19



http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf





#1 from June'16 TOP500: Sunway TaihuLight http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

One card with two nodes (two SW26010 chips)



SW26010: the 4x64-core 64-bit RISC processor (with SIMD extensions)

