A

## **Master Informatics Eng.**

2015/16
A.J.Proença

Data Parallelism 1 (vector, SIMD ext., GPU)
(most slides are borrowed)

AJProença, Advanced Architectures, MEI, UMinho, 2015/16

#### Chapter 7 — Multicores, Multiprocessors, and Clusters —

## Introduction

- SIMD architectures can exploit significant datalevel parallelism for:
  - matrix-oriented <u>scientific computing</u>
  - media-oriented image and sound processing
- SIMD is more energy efficient than MIMD
  - only needs to fetch one instruction per data operation
  - makes SIMD attractive for personal mobile devices
- SIMD allows programmers to continue to think sequentially

# MK®

## **Instruction and Data Streams**

An alternate classification

|                        |          | Data Streams               |                               |
|------------------------|----------|----------------------------|-------------------------------|
|                        |          | Single                     | Multiple                      |
| Instruction<br>Streams | Single   | SISD:<br>Intel Pentium 4   | SIMD: SSE instructions of x86 |
|                        | Multiple | MISD:<br>No examples today | MIMD:<br>Intel Xeon e5345     |

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

Chapter 7 — Multicores, Multiprocessors, and Clusters — 2

## **SIMD Parallelism**

- Vector architectures (slides 5 to 18)
- SIMD & extensions (slides 19 to 23)
- Graphics Processor Units (GPUs) (next set)
- For x86 processors:
  - Expected grow:2 more cores/chip/year
  - SIMD width: 2x every 4 years
  - Potential speedup: SIMD 2x that from MIMD!



## **Vector Architectures**

- Basic idea:
  - Read sets of data elements (gather from memory) into "vector registers"
  - Operate on those registers
  - Store/scatter the results back into memory
- Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

\_

#### **VMIPS** ■ Example architecture: VMIPS ■ Loosely based on Cray-1 (next slide) Vector registers Main memory Each register holds a 64-element, 64 bits/element vector Register file has 16 read ports and 8 write ports FP add/subtract Vector functional units FP multiply ■ Fully pipelined, new op each clock-cycle Data & control hazards are detected Vector load-store unit Integer Fully pipelined 1 word/clock-cycle after initial latency Scalar registers

## Vector Instruction Parallelism



#### Can overlap execution of multiple vector instructions

- Consider machine with 32 elements per vector register and 8 lanes:



Complete 24 operations/cycle while issuing 1 short instruction/cycle 8/19/2009 Parallel Architectures



32 general-purpose registers

32 floating-point registers

Crossbar switches

# /ector Architectur

### **VMIPS Instructions**

- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from address
- Example: DAXPY (<u>Double-precision A x X Plus Y</u>)

```
L.D
        F0,a
                    : load scalar a
T.V
        V1, Rx
                    : load vector X
                    ; vector-scalar multiply
MULVS.D V2,V1,F0
LV
        V3, Rv
                    ; load vector Y
ADDVV
        V4, V2, V3
                    ; add
SV
        Rv.V4
                    ; store the result
```

 Requires the execution of 6 instructions versus almost 600 for MIPS (assuming DAXPY is operating on a vector with 64 elements)

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

.

# **Challenges**

- Start up time
  - Latency of vector functional unit
  - Assume the same as Cray-1
    - Floating-point add => 6 clock cycles
    - Floating-point multiply => 7 clock cycles
    - Floating-point divide => 20 clock cycles
    - Vector load => 12 clock cycles
- Improvements:
  - > 1 element per clock cycle (1)
  - Non-64 wide vectors (2)
  - IF statements in vector code (3)
  - Memory system optimizations to support vector processors (4)
  - Multiple dimensional matrices (5)
  - Sparse matrices (6)
  - Programming a vector computer (7)

ector Architectures

#### **Vector Execution Time**

- Execution time depends on three factors:
  - Length of operand vectors
  - Structural hazards
  - Data dependencies
- VMIPS functional units consume one element per clock cycle
  - Execution time is approximately the vector length
- Convoy
  - Set of vector instructions that could potentially execute together in one unit of time, chime

M<

Copyright © 2012, Elsevier Inc. All rights reserved

10

## **Multiple Lanes** (1)

 Element n of vector register A is "hardwired" to element n of vector register B

Allows for multiple hardware lanes





## **Vector Length Register (2)**

- Handling vector length not known at compile time
- Use Vector Length Register (VLR)
- Use strip mining for vectors over the maximum length:

```
VL = (n % MVL); /*find odd-size piece using modulo op % */
for (j = 0; j \le (n/MVL); j=j+1) \{ /*outer loop*/
  for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
   Y[i] = a * X[i] + Y[i] ; /*main operation*/
  low = low + VL; /*start of next vector*/
  VL = MVL; /*reset the length to maximum vector length*/
       Value of i
```

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

## **Memory Banks** (4)

- Memory system must be designed to support high bandwidth for vector loads and stores
- Spread accesses across multiple banks
  - Control bank addresses independently
  - Load or store non sequential words
  - Support multiple vector processors sharing the same memory
- Example (Cray T932, 1996; Ford acquired 1 out of 13, \$39M):
  - 32 processors, each generating 4 loads and 2 stores per cycle
  - Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
  - How many memory banks needed?

## **Vector Mask Registers (3)**

Handling IF statements in Vector Loops:

```
for (i = 0; i < 64; i=i+1)
     if (X[i] != 0)
         X[i] = X[i] - Y[i];
```

Use vector mask register to "disable" elements:

```
LV
          V1,Rx
                      ;load vector X into V1
          V2,Rv
                      :load vector Y
LV
L.D
          F0,#0
                      ;load FP zero into F0
                      :sets VM(i) to 1 if V1(i)!=F0
SNEVS.D
         V1.F0
SUBVV.D
         V1,V1,V2
                      ; subtract under vector mask
          Rx, V1
                      ; store the result in X
```

GFLOPS rate decreases!

M<

Copyright © 2012, Elsevier Inc. All rights reserved

## Stride (5)

Handling multidimensional arrays in Vector Architectures:

```
for (i = 0; i < 100; i=i+1) {
   for (j = 0; j < 100; j=j+1) {
   A[i][j] = 0.0;
   for (k = 0; k < 100; k=k+1)
           A[i][j] = A[i][j] + B[i][k] * D[k][j];
```

- Must vectorize multiplication of rows of B with columns of D
- Use non-unit stride (in VMIPS: load/store vector with stride)
- Bank conflict (stall) occurs when the same bank is hit faster than bank busy time:
  - #banks / Least Common Multiple (stride, #banks) < bank busy time</p>

Handling <u>sparse matrices</u> in Vector Architectures:

for (i = 0; i < n; i=i+1)  

$$A[K[i]] = A[K[i]] + C[M[i]];$$

Use index vector:

```
LV
         Vk, Rk
                         :load K
T.V.T
        Va, (Ra+Vk)
                         ;load A[K[]]
LV
        Vm, Rm
                         ;load M
TJVT
        Vc. (Rc+Vm)
                         ;load C[M[]]
ADDVV.D Va, Va, Vc
                         ; add them
SVI
         (Ra+Vk), Va
                         ;store A[K[]]
```

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

17



- · Single Instruction Multiple Data (SIMD)
- Central controller broadcasts instructions to multiple processing elements (PEs)
  - Only requires one controller for whole array
  - Only requires storage for one copy of program
  - All computations fully synchronized
- Recent Return to Popularity:
  - GPU (Graphics Processing Units) have SIMD properties
  - However, also multicore behavior, so mix of SIMD and MIMD (more later)
- Dual between Vector and SIMD execution

John Kuhiatowicz

Parallel Architecture: 36

- Compilers are a key element to give hints on whether a code section will vectorize or not
- Check if loop iterations have data dependencies, otherwise vectorization is compromised
- Vector Architectures have a too high cost, but simpler variants are currently available on off-the-shelf devices; however:
  - most do not support non-unit stride => care must be taken in the design of data structures
  - same applies for gather-scatter...

M<

Copyright © 2012, Elsevier Inc. All rights reserved

10

## **SIMD Extensions**

 Media applications operate on data types narrower than the native word size



- Limitations, compared to vector instructions:
  - Number of data operands encoded into op code
  - No sophisticated addressing modes (strided, scatter-gather)
  - No mask registers

## **SIMD Implementations**

- Implementations:
  - Intel MMX (1996)
    - Eight 8-bit integer ops or four 16-bit integer ops
  - Streaming SIMD Extensions (SSE) (1999)
    - Eight 16-bit integer ops
    - Four 32-bit integer/fp ops or two 64-bit integer/fp ops
  - Advanced Vector eXtensions (AVX) (2010)
    - Eight 32-bit fp ops or Four 64-bit fp ops (integer in AVX-2)
    - 512-bits wide in AVX-512 (and also in Larrabee & MIC-KC)
  - Operands must be in consecutive and aligned memory locations

M<

Copyright © 2012, Elsevier Inc. All rights reserved.

21

## **Example SIMD Code**

Example DAXPY:

```
;load scalar a
;copy a into F1 for SIMD MUL
;copy a into F2 for SIMD MUL
;copy a into F3 for SIMD MUL
;copy a into F3 for SIMD MUL
; last address to load

;load X[i], X[i+1], X[i+2], X[i+3]
;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
;load Y[i], Y[i+1], Y[i+2], Y[i+3]
;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
;store into Y[i],Y[i+1],Y[i+2],Y[i+3]
;increment index to X
    L.D
                     F0,a
                     F1, F0
                     F2, F0
                     F3, F0
    DADDIU R4, Rx, #512 ; last address to load
Loop:
    L.4D
                     F4,0[Rx]
    MUL.4D F4,F4,F0
    L.4D
                     F8,0[Rv]
    ADD.4D F8,F8,F4
                     0[Ry],F8
    S.4D
                                          ;increment index to X
    DADDIU
                   Rx, Rx, #32
                                          ;increment index to Y
    DADDIU
                    Ry, Ry, #32
    DSUBU
                    R20, R4, Rx
                                          ; compute bound
    BNEZ
                     R20, Loop
                                           ; check if done
```

#### A Brief History of x86 SIMD



AJProença, Advanced Architectures, MEI, UMinho, 2015/16

#### Reading suggestions (from CAQA 5th Ed)

| • | Concepts and challenges in ILP: section                                      | 3.1     |
|---|------------------------------------------------------------------------------|---------|
| • | Exploiting ILP w/ multiple issue & static scheduling:                        | 3.7     |
| • | Exploiting ILP w/ dyn sched, multiple issue & specul:                        | 3.8     |
| • | Multithread: exploiting TLP on uniprocessors:                                | 3.12    |
| • | Multiprocessor cache coherence and snooping coherence protocol with example: | 5.2     |
| • | Basics on directory-based cache coherence:                                   | 5.4     |
| • | Models of memory consistency:                                                | 5.6     |
| • | A tutorial by Sarita Ave & K. Gharachorloo (see link at w                    | ebsite) |

AJProença, Advanced Architectures, MEI, UMinho, 2015/16