## **MSc Informatics Eng.**

2014/15 *A.J.Proença* 

#### Data Parallelism 3 (MIC/CUDA programming) (most slides are borrowed)

AJProença, Advanced Architectures, MEI, UMinho, 2014/15

XX



1

Programming Models for Intel® Xeon® processors and Intel® Many Integrated Core (Intel MIC) Architecture

> Scott McMillan Senior Software Engineer Software & Services Group

April 11, 2012 TACC-Intel Highly Parallel Computing Symposium



#### Programming Intel<sup>®</sup> MIC-based Systems MPI+Offload

- MPI ranks on Intel<sup>®</sup> Xeon<sup>®</sup> processors (only)
- All messages into/out of processors
- Offload models used to accelerate MPI ranks
- Intel® Cilk<sup>™</sup> Plus, OpenMP\*, Intel® Threading Building Blocks, Pthreads\* within Intel<sup>®</sup> MIC
- Homogenous network of hybrid nodes:







# **Offload Code Examples**

#### • C/C++ Offload Pragma

```
#pragma offload target (mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5)/count);
    pi += 4.0/(1.0+t*t);
}
pi /= count;</pre>
```

#### Function Offload Example

#pragma offload target(mic)
in(transa, transb, N, alpha, beta) \
in(A:length(matrix\_elements)) \
in(B:length(matrix\_elements)) \
inout(C:length(matrix\_elements))
sgemm(&transa, &transb, &N, &N, &N, &N, &AIpha, A, &N, B, &N, &beta, C, &N);

# Fortran Offload Directive !dir\$ omp offload target(mic) !\$omp parallel do do i=1,10 A(i) = B(i) \* C(i) enddo C/C++ Language Extension class \_Cilk\_Shared common { int data1; char \*data2; class common \*next; void process(); }; \_Cilk\_Shared class common obj1, obj2; \_Cilk\_spawn \_Offload obj1.process();

\_Cilk\_spawn \_Onload obj1.process();



12

#### Programming Intel® MIC-based Systems Many-core Hosted

 MPI ranks on Intel® MIC (only) Data All messages into/out of Intel® MIC Xeon MIC MPI Intel<sup>®</sup> Cilk<sup>™</sup> Plus, OpenMP\*, Intel<sup>®</sup> Threading Building Data Blocks, Pthreads used directly within MPI processes MIC Xeon Network • Programmed as homogenous network of many-core CPUs: Data Xeon MIC Data MIC Xeon





#### Programming Intel® MIC-based Systems Symmetric

- MPI ranks on Intel® MIC and Intel® Xeon® processors
- Messages to/from any core
- Intel® Cilk<sup>™</sup> Plus, OpenMP\*, Intel® Threading Building Blocks, Pthreads\* used directly within MPI processes
- Programmed as heterogeneous network of homogeneous nodes:





Keys to Productive Performance on Intel® MIC Architecture

- Choose the right Multi-core centric or Many-core centric model for your application
- Vectorize your application (today)
  - Use the Intel vectorizing compiler
- Parallelize your application (today)
  - With MPI (or other multi-process model)
  - With threads (via Intel® Cilk<sup>™</sup> Plus, OpenMP\*, Intel® Threading Building Blocks, Pthreads, etc.)
- Go asynchronous to overlap computation and communication



inte

# **Options for Thread Parallelism**





## **Options for Vectorization**

| Intel® Math Kernel Library                                                                         | Ease of use / code<br>maintainability (depends<br>on problem) |
|----------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| Array Notation: Intel® Cilk™ Plus                                                                  |                                                               |
| Automatic vectorization                                                                            |                                                               |
| Semiautomatic vectorization with annotation:<br>#pragma vector, #pragma ivdep, and #pragma<br>simd |                                                               |
| C/C++ Vector Classes (F32vec16, F64vec8)                                                           |                                                               |
| Vector intrinsics (mm_add_ps, addps)                                                               | Programmer control                                            |



## **Summary**

- Intel® MIC Architecture offers familiar and flexible programming models
- Hybrid MPI/threading is becoming increasingly important as core counts grow
- Intel tools support hybrid programming today, exploiting existing standards
- Hybrid parallelism on Intel® Xeon® processors + Intel® MIC delivers superior productivity through code reuse
- Hybrid programming today on Intel® Xeon® processors readies you for Intel® MIC





Copyright © 2012, Intel Corporation. All rights reserved.

Sponsors of Tomorrow: (Inte

#### "Stand-alone" Intel<sup>®</sup> MIC Architecture Computing Environment

- Intel<sup>®</sup> MIC Architecture software environment includes a highly functional, Linux<sup>\*</sup> OS running on the co-processor with:
  - A familiar interactive shell
  - IP Addressability [headless node]
  - A local file system with subdirectories, file reads, writes, etc
  - standard i/o including printf
  - Virtual memory management
  - Process, thread management & scheduling
  - Interrupt and exception handling
  - Semaphores, mutexes, etc...
- What does this mean?
  - A large majority of existing code even with OS oriented calls like fork() can port with a simple recompile
  - Intel MIC Architecture natively supports parallel coding models like Intel<sup>®</sup> Cilk<sup>™</sup> Plus, Intel<sup>®</sup> Threading Building Blocks, pThreads<sup>\*</sup>, OpenMP<sup>\*</sup>

|         | fooey.c                                                                |
|---------|------------------------------------------------------------------------|
|         | main()<br>{<br>printf("running Foo()\n");<br>Foo();<br>}               |
|         | Foo()<br>{<br>printf("fooey\n");<br>}                                  |
|         | Intel MIC Architecture<br>(Knights Corner console)                     |
| e<br>*, | mymic>ls<br>fooey<br>mymic>./fooey<br>running Foo()<br>fooey<br>mymic> |
| S       | ponsors of Tomorrow. (intel)                                           |

3/13/2012 Copyright © 2012, Intel Corporation. All rights reserved. Intel\* Many Integrated Core Architecture (Intel\* MIC Architecture)

## Stand-alone Example: Computing Pi

```
# define NSET 1000000
int main ( int argc, const char** argv )
ſ
   long int i;
     float num_inside, Pi;
     num_inside = 0.0f;
    #pragma omp parallel for reduction(+:num_inside)
for( i = 0; i < NSET; i++ )</pre>
              float x, y, distance_from_zero;
     {
                            // Generate x, y random numbers in [0,1)
x = float(rand()) / float(RAND_MAX + 1);
                            y = float(rand()) / float(RAND_MAX + 1);
                            distance_from_zero = sqrt(x*x + y*y);
if ( distance_from_zero <= 1.0f )</pre>
                            num inside += 1.0f;
    Pi = 4.0f * ( num_inside / NSET );
   printf("Value of Pi = %f \n",Pi);
}
```

Original Source Code Compiler command line switch targets platform 19

3/13/2012

## **Co-Processing Example: Computing Pi**

```
# define NSET 1000000
int main ( int argc, const char** argv )
    long int i;
    float num_inside, Pi;
    num inside = 0.0f;
    #pragma offload target (MIC)
    #pragma omp parallel for reduction(+:num_inside)
    for( i = 0; i < NSET; i++ )</pre>
           float x, y, distance_from_zero;
                       // Generate x, y random numbers in [0,1)
x = float(rand()) / float(RAND_MAX + 1);
                       y = float(rand()) / float(RAND_MAX + 1);
                        distance_from_zero = sqrt(x*x + y*y);
                        if ( distance_from_zero <= 1.0f )
                        num_inside += 1.0f;
   Pi = 4.0f * ( num_inside / NSET );
   printf("Value of Pi = %f \n",Pi);
}
```

3/13/2012

A one line change from the CPU version

#### The CUDA programming model



#### XX

- A compute device
  - is a coprocessor to the CPU or host
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads - SIMT
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - very little creation overhead, requires LARGE register bank •
  - GPU needs 1000s of threads for full efficiency
    - multi-core CPU needs only a few

AJProença, Advanced Architectures, MEI, UMinho, 2014/15





AJProença, Advanced Architectures, MEI, UMinho, 2014/15

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

17

#### Programming Model: SPMD + SIMT/SIMD

#### XX

- Hierarchy
  - Device => Grids
  - Grid => Blocks
  - Block => Warps
  - Warp => Threads
- Single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed • in a lock-step way called singleinstruction multiple-thread (SIMT)
- Single instruction are executed on multiple threads (SIMD)
  - Warp size defines SIMD granularity (32 threads)
- Synchronization within a block uses shared memory





#### The Computational Grid: Block IDs and Thread IDs



21

Graphical Processing Units

# Example



- Code that works over all elements is the grid
- Thread blocks break this down into manageable sizes
  512 threads per block
- SIMD instruction executes 32 elements at a time
- Thus grid size = 16 blocks
- Block is analogous to a strip-mined vector loop with vector length of 32
- Block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-16 multithreaded SIMD processors



Copyright © 2012, Elsevier Inc. All rights reserved.

**Terminology** (and in NVidia)

Threads of SIMD instructions (warps)

- Each has its own IP (up to 48/64 per SIMD processor, Fermi/Kepler)
- Thread scheduler uses scoreboard to dispatch
- No data dependencies between threads!
- Threads are organized into blocks & executed in groups of 32 threads (*thread block*)
  - Blocks are organized into a grid
- The <u>thread block scheduler</u> schedules blocks to SIMD processors (*Streaming Multiprocessors*)
- Within each SIMD processor:
  - 32 SIMD lanes (thread processors)
  - Wide and shallow compared to vector processors



### CUDA Thread Block

**CUDA Thread Block** 

#### 1

- Programmer declares (Thread) Block:
  - Block size 1 to 512 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a Block execute the same thread program
- · Threads share data and synchronize while doing their share of the work
- Threads have thread id numbers within Block
- Thread program uses thread id to select work and address shared data

AJProença, Advanced Architectures, MEI, UMinho, 2014/15





#### Parallel Memory Sharing

## **CUDA Memory Model Overview**

#### $\sim$



Hardware Implementation: Memory Architecture



**Device memory** 

Courtesy NVIDIA

# **NVIDIA GPU Memory Structures**

- Each SIMD Lane has private section of off-chip DRAM
  - "Private memory" (Local Memory)
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor also has local memory (Shared Memory)
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors is GPU Memory (Global Memory)
  - Host can read and write GPU memory







# Vector Processor versus CUDA core



Copyright © 2012, Elsevier Inc. All rights reserved.

Graphical Processing Units

Graphical Processing Units

# **Conditional Branching**

- Like vector architectures, GPU branch hardware uses internal masks
- Also uses
  - Branch synchronization stack
    - Entries consist of masks for each SIMD lane
    - I.e. which threads commit their results (all threads execute)
  - Instruction markers to manage when a branch diverges into multiple execution paths
    - Push on divergent branch
  - ...and when paths converge
    - Act as barriers
    - Pops stack
- Per-thread-lane 1-bit predicate register, specified by programmer

