**Advanced Architectures** 

#### **Master Informatics Eng.**

2020/21 A.J.Proença

#### **Vector computing & SIMD extensions**

(most slides are borrowed)

AJProença, Advanced Architectures, MiEI, UMinho, 2020/21

公



## **Instruction and Data Streams**

#### Flynn's Taxonomy of Computers \*

|                        |          | Data Streams               |                                       |  |
|------------------------|----------|----------------------------|---------------------------------------|--|
|                        |          | Single                     | Multiple                              |  |
| Instruction<br>Streams | Single   | SISD:<br>Intel Pentium 4   | <b>SIMD</b> : SSE instructions of x86 |  |
|                        | Multiple | MISD:<br>No examples today | MIMD:<br>Intel Xeon e5345             |  |

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

\* Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966

Chapter 7 — Multicores, Multiprocessors, and Clusters — 3

# Introduction

- SIMD architectures can exploit significant datalevel parallelism for:
  - matrix-oriented <u>scientific computing</u>
  - media-oriented <u>image</u> and <u>sound</u> processing
- SIMD is more energy efficient than MIMD
  - only needs to fetch one instruction per data operation
  - makes SIMD attractive for personal mobile devices
- SIMD allows programmers to continue to think sequentially



## **SIMD Parallelism**

#### Vector architectures

- Read sets of data elements (<u>gather</u> from memory) into "vector registers"
- Operate on those registers
- Store/<u>scatter</u> the results back into memory
- SIMD & extensions on scalar processors
- Graphics Processor Units (GPUs) (next set of slides)



## **Vector Architectures**

### Basic idea:

- Read sets of data elements (<u>gather</u> from memory) into "vector registers"
- Operate on those registers
- Store/<u>scatter</u> the results back into memory
- Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth







Can overlap execution of multiple vector instructions

- Consider machine with 32 elements per vector register and 8 lanes:



Complete 24 operations/cycle while issuing 1 short instruction/cycle 8/19/2009 John Kubiatowicz Parallel Architecture: 35

## **VMIPS**







AJProença, Advanced Architectures, MiEI, UMinho, 2020/21

### **VMIPS Instructions**

- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from address
- Example: DAXPY (<u>D</u>ouble-precision <u>A x X Plus Y</u>)

| L.D     | F0,a     | ; | load scalar a          |
|---------|----------|---|------------------------|
| LV      | V1,Rx    | ; | load vector X          |
| MULVS.D | V2,V1,F0 | ; | vector-scalar multiply |
| LV      | V3,Ry    | ; | load vector Y          |
| ADDVV   | V4,V2,V3 | ; | add                    |
| SV      | Ry,V4    | ; | store the result       |

 Requires the execution of 6 instructions versus almost 600 for MIPS (assuming DAXPY is operating on a vector with 64 elements)



### **Vector Execution Time**

#### Execution time depends on three factors:

- Length of operand vectors
- Structural hazards
- Data dependencies
- VMIPS functional units consume one element per clock cycle
  - Execution time is approximately the vector length

#### Convoy

 Set of vector instructions that could potentially execute together in one unit of time, *chime*



## Challenges

#### Start up time

- Latency of vector functional unit
- Assume the same as Cray-1
  - Floating-point add => 6 clock cycles
  - Floating-point multiply => 7 clock cycles
  - Floating-point divide => 20 clock cycles
  - Vector load => 12 clock cycles

#### Improvements:

- > 1 element per clock cycle (1)
- Non-64 wide vectors (2)
- IF statements in vector code (3)
- Memory system optimizations to support vector processors (4)
- Multiple dimensional matrices (5)
- Sparse matrices (6)
- Programming a vector computer (7)



## Multiple Lanes (1)

- Element *n* of vector register *A* is "hardwired" to element *n* of vector register *B*
  - Allows for multiple hardware lanes







Complete 24 operations/cycle while issuing 1 short instruction/cycle 8/19/2009 John Kubiatowicz Parallel Architecture: 35

## Vector Length Register (2)

- Handling vector length not known at compile time
- Use Vector Length Register (VLR)
- Use strip mining for vectors over the maximum length:
  low = 0;

VL = (n % MVL); /\*find odd-size piece using modulo op % \*/ for (j = 0; j <= (n/MVL); j=j+1) { /\*outer loop\*/

for (i = low; i < (low+VL); i=i+1) /\*runs for length VL\*/
Y[i] = a \* X[i] + Y[i] ; /\*main operation\*/
low = low + VL; /\*start of next vector\*/</pre>

VL = MVL; /\*reset the length to maximum vector length\*/





}

Copyright © 2012, Elsevier Inc. All rights reserved.

## Vector Mask Registers (3)

- Handling IF statements in Vector Loops: for (i = 0; i < 64; i=i+1) if (X[i] != 0) X[i] = X[i] - Y[i];
- Use vector mask register to "disable" elements:

| LV      | V1,Rx    | ;load vector X into V1        |
|---------|----------|-------------------------------|
| LV      | V2,Ry    | ;load vector Y                |
| L.D     | F0,#0    | ;load FP zero into F0         |
| SNEVS.D | V1,F0    | ;sets VM(i) to 1 if V1(i)!=F0 |
| SUBVV.D | V1,V1,V2 | ;subtract under vector mask   |
| SV      | Rx,V1    | ;store the result in X        |

#### GFLOPS rate decreases!



## Memory Banks (4)

- Memory system must be designed to support high bandwidth for vector loads and stores
- Spread accesses across multiple banks
  - Control bank addresses independently
  - Load or store non sequential words
  - Support multiple vector processors sharing the same memory
- Example (Cray T932, 1996; Ford acquired 1 out of 13, \$39M):
  - 32 processors, each generating 4 loads and 2 stores per cycle
  - Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
  - How many memory banks needed?



### Stride (5)

Handling <u>multidimensional arrays</u> in Vector Architectures:

- Must vectorize multiplication of rows of B with columns of D
- Use non-unit stride (in VMIPS: load/store vector with stride)
- Bank conflict (stall) occurs when the same bank is hit faster than bank busy time:
  - #banks / Least\_Common\_Multiple (stride, #banks) < bank busy time</p>



### **Scatter-Gather** (6)

Handling <u>sparse matrices</u> in Vector Architectures:

#### Use index vector:

| LV      | Vk, | Rk          | ;load K       |
|---------|-----|-------------|---------------|
| LVI     | Va, | (Ra+Vk)     | ;load A[K[]]  |
| LV      | Vm, | Rm          | ;load M       |
| LVI     | Vc, | (Rc+Vm)     | ;load C[M[]]  |
| ADDVV.D | Va, | Va, Vc ;add | them          |
| SVI     | (Ra | +Vk), Va    | ;store A[K[]] |



## **Vector Programming** (7)

- Compilers are a key element to give hints on whether a code section will vectorize or not
- Check if loop iterations have data dependencies, otherwise vectorization is compromised
- Vector Architectures have a too high cost, but simpler variants are currently available on off-the-shelf devices; however:
  - most do not support non-unit stride => care must be taken in the design of data structures
  - same applies for gather-scatter...



# **SIMD Extensions**

#### Media applications operate on data types narrower than the native word size

for (i=0: i<n: i++)

Х

Y

X + Y

z[i] = x[i] + y[i];

Figure 1 Scalar and vectorized loop versions with Intel® SSE, AVX and AVX-512.

x8+y8 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0

- Intel SIMD Ext started with 64-bit wide vectors and grew to wider vectors and more capabilities
- Current AVX generation is 512-bit wide
- Limitations, compared to vector architectures (before AVX...):
  - Number of data operands encoded into op code
  - No sophisticated addressing modes (strided, scatter-gather)
  - No mask registers



# **SIMD Implementations**

- Intel implementations:
  - MMX (1996)
    - Eight 8-bit integer ops or four 16-bit integer ops
  - Streaming SIMD Extensions (SSE) (1999)
    - Eight 16-bit integer ops
    - Four 32-bit integer/fp ops or two 64-bit integer/fp ops
  - Advanced Vector eXtensions (AVX) (2010...)
    - Eight 32-bit fp ops or Four 64-bit fp ops (integers in AVX-2)
    - 512-bits wide in AVX-512 (and also in Larrabee & Phi-KNĆ)
  - Operands <u>must / should be in consecutive and</u> <u>aligned</u> memory locations
- AMD Zen/Epyc (Opteron follow-up): with AVX-2
- ARM v8 (64-bit) implementations (next...)



#### **Reading suggestions** (from CAQA 5<sup>th</sup> Ed)

#### • Vector architecture: 4.2

• SIMD instruction set extensions for multimedia: 4.3

#### For the slides on GPU (later)

公

- Graphic processing units: 4.4
- Detecting and enhancing loop-level parallelism: 4.5