#### Advanced Architectures


**Background for Advanced Architectures** 


#### Key concepts to revise:

- numerical data representation (for error analysis)
- ISA (Instruction Set Architecture)
- how C compilers generate code (a look into assembly code)
  - how scalar and structured data are allocated
  - how control structures are implemented
  - how functions/procedures are called and how they return
  - what architecture features impact performance
- Improvements to enhance performance in a single CPU
  - ILP: pipeline, multiple issue, SIMD/vector processing, ...
  - memory hierarchy: cache levels, ...
  - thread-level parallelism

#### The cache hierarchy in multicore architectures



### Intel's 2012 release: Sandy/Ivy Bridge (8-core)

*Figure: Sandy Bridge-E 8-core (32 nm) die layout. Two QPI PHYs and two PCIe PHYs with their QPI and PCIe agents and power control; eight CPU cores, each paired with a 2.5 MB last-level cache slice, connected through ring-bus stops; a memory agent with two 2-channel DDR3 I/O PHYs. Alongside: two Intel Xeon® E5-2600 sockets, each with 8 cores and up to 20 MB of cache, linked by QPI 1 and QPI 2.*

AJProença, Sistemas de Computação, UMinho, 2013/14

#### Example of a chip with RISC processors: 2x ARM cores in the iPhone 5's A6




#### Example of a chip with RISC processors: 4+1 ARM cores in NVIDIA's Tegra 4i



#### Example of a chip with RISC processors: 4+4 ARM cores in the Exynos 5 Octa (Galaxy S4)




#### Intel x86 versus ARM processors


### Key textbook for AA

#### **Computer Architecture: A Quantitative Approach, 5th Edition**

John L. Hennessy & David A. Patterson

#### Table of Contents

- Chap 1: Fundamentals of Quantitative Design and Analysis
- Chap 2: Memory Hierarchy Design
- Chap 3: Instruction-Level Parallelism and Its Exploitation
- Chap 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
- Chap 5: Multiprocessors and Thread-Level Parallelism
- Chap 6: The Warehouse-Scale Computer
- App A: Instruction Set Principles
- App B: Review of Memory Hierarchy
- App C: Pipelining: Basic and Intermediate Concepts
- App D: Storage Systems
- App E: Embedded Systems
- App F: Interconnection Networks
- App G: Vector Processors
- App H: Hardware and Software for VLIW and EPIC
- App I: Large-Scale Multiprocessors and Scientific Applications
- App J: Computer Arithmetic
- App K: Survey of Instruction Set Architectures
- App L: Historical Perspectives



### **Recommended textbook** (1)

### Recommended textbook (2)



## **Understanding Performance**

- Algorithm + Data Structures
  - Determines number of operations executed
  - Determines how efficiently data is accessed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

COD: Chapter 1 — Computer Abstractions and Technology — 14

## **Response Time and Throughput**

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

# **CPU Time**

### (single-core)

- $\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \dfrac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count
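The trade-off between clock rate and cycle count can be made concrete with a small sketch. The cycle counts and clock rates below are hypothetical values chosen for illustration, and `cpu_time` is an illustrative helper, not something defined in the course material.

```python
def cpu_time(clock_cycles: int, clock_rate_hz: float) -> float:
    """CPU time in seconds: clock cycles divided by clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical baseline: 20 billion cycles on a 2 GHz clock.
baseline = cpu_time(20_000_000_000, 2.0e9)

# Hypothetical redesign: the clock doubles to 4 GHz,
# but the program now needs 1.2x as many cycles.
redesign = cpu_time(24_000_000_000, 4.0e9)

print(f"baseline: {baseline:.1f} s, redesign: {redesign:.1f} s")
```

In this sketch the redesign wins (6 s against 10 s) even though it executes more cycles, because the clock rate grew faster than the cycle count.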





## **Instruction Count and CPI**
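A sketch of the standard relations that tie instruction count and CPI to CPU time, where CPI is the average number of clock cycles per instruction. The symbols follow the CPU Time definition used earlier in these slides.

$$\text{CPU Clock Cycles} = \text{Instruction Count} \times \text{CPI}$$

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

Instruction count depends on the program, the ISA and the compiler; CPI depends on the CPU hardware and on the instruction mix.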






# **Pipeline Summary**

### The BIG Picture

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation
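The throughput-versus-latency point can be sketched with an idealized model. It assumes perfectly balanced stages and no hazards, which real pipelines do not achieve; the stage count and stage time are hypothetical.

```python
def unpipelined_time(n_instr: int, n_stages: int, stage_time_ps: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instr * n_stages * stage_time_ps


def pipelined_time(n_instr: int, n_stages: int, stage_time_ps: int) -> int:
    """The first instruction fills the pipeline; after that,
    one instruction completes per stage time (no hazards assumed)."""
    return (n_stages + n_instr - 1) * stage_time_ps


# Hypothetical 5-stage pipeline with 200 ps stages, 1000 instructions.
n, k, t = 1000, 5, 200
latency_ps = k * t  # per-instruction latency is the same in both designs
speedup = unpipelined_time(n, k, t) / pipelined_time(n, k, t)
print(f"latency: {latency_ps} ps, speedup: {speedup:.2f}x")
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows, while the latency of each individual instruction stays unchanged.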

### Performance Summary (single-core)



## **Does Multiple Issue Work?**

### The BIG Picture

- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
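How real dependencies cap ILP can be sketched with a toy model: given which earlier instructions each instruction depends on, the longest dependency chain bounds how many instructions can complete per cycle. The model assumes unit latency, unlimited issue width and an unlimited instruction window, so it is an optimistic upper bound; `ilp_upper_bound` and the dependency lists are hypothetical.

```python
def ilp_upper_bound(deps: list[list[int]]) -> float:
    """deps[i] lists the indices of earlier instructions whose
    results instruction i needs. Returns instructions per cycle
    achievable with unlimited hardware and unit latencies."""
    depth: list[int] = []
    for preds in deps:
        # An instruction can execute one step after its latest producer.
        depth.append(1 + max((depth[p] for p in preds), default=0))
    critical_path = max(depth, default=0)
    return len(deps) / critical_path if critical_path else 0.0


independent = [[], [], [], []]           # no dependencies at all
chain = [[], [0], [1], [2]]              # each needs the previous result
mixed = [[], [], [0, 1], [], [2, 3], [4]]

print(ilp_upper_bound(independent), ilp_upper_bound(chain), ilp_upper_bound(mixed))
```

A fully dependent chain gets no benefit from multiple issue, whatever the issue width; a limited window, memory delays and conservative handling of possible aliasing push real machines below even this bound.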





Chapter 4 — The Processor — 19

# **Memory Hierarchy Levels**



- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses
- If accessed data is absent
  - Miss: block copied from lower level
    - Time taken: miss penalty
    - Miss ratio: misses/accesses = 1 − hit ratio
  - Then accessed data supplied from upper level
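Hit ratio, miss ratio and miss penalty combine into the average memory access time (AMAT = hit time + miss ratio × miss penalty). A minimal sketch follows; the access counts and cycle figures are hypothetical.

```python
def hit_ratio(hits: int, accesses: int) -> float:
    return hits / accesses


def miss_ratio(hits: int, accesses: int) -> float:
    # Equivalent to 1 - hit ratio.
    return (accesses - hits) / accesses


def amat(hit_time: float, miss_ratio_: float, miss_penalty: float) -> float:
    """Average memory access time, in the same unit as its inputs."""
    return hit_time + miss_ratio_ * miss_penalty


# Hypothetical run: 950 hits in 1000 accesses,
# 1-cycle hit time, 20-cycle miss penalty.
mr = miss_ratio(950, 1000)
print(f"miss ratio: {mr:.2f}, AMAT: {amat(1, mr, 20):.2f} cycles")
```

Even a 5% miss ratio doubles the average access time in this example, which is why the miss penalty matters as much as the hit time.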

Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 21

# **Multilevel Caches**

- Primary cache private to CPU/core
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- High-end systems include L3 cache
- Main memory services L2/3 cache misses
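The benefit of adding a second cache level can be sketched by applying the average-access-time idea recursively: a miss at one level costs the average access time of the next level. All latencies and miss ratios below are hypothetical, and the miss ratios are local ones (misses at a level divided by accesses that reach that level).

```python
def multilevel_amat(levels: list[tuple[float, float]], memory_time: float) -> float:
    """levels: (hit_time, local_miss_ratio) pairs, ordered from L1 outward.
    memory_time: cost of servicing a miss of the last cache level."""
    time = memory_time
    # Work from the level closest to memory back towards L1.
    for hit_time, local_miss_ratio in reversed(levels):
        time = hit_time + local_miss_ratio * time
    return time


# Hypothetical figures, in clock cycles.
l1_only = multilevel_amat([(1, 0.02)], memory_time=400)
l1_l2 = multilevel_amat([(1, 0.02), (20, 0.25)], memory_time=400)
print(f"L1 only: {l1_only:.1f} cycles, L1+L2: {l1_l2:.1f} cycles")
```

With these numbers the L2 cache cuts the average access time from about 9 cycles to about 3.4, even though L2 itself is much slower than L1.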

## **The Memory Hierarchy**

### **The BIG Picture**

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- Decisions at each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy
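Three of these decisions (block placement, finding a block, replacement on a miss) can be sketched with a tiny set-associative cache model using LRU replacement. The class and the address trace are hypothetical, and the write policy is not modeled: only reads are simulated.

```python
from collections import OrderedDict


class Cache:
    def __init__(self, num_sets: int, ways: int, block_size: int):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One OrderedDict per set: keys are tags,
        # ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        block = address // self.block_size
        index = block % self.num_sets   # block placement: which set
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:                # finding a block: tag match
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:     # replacement on a miss: evict LRU
            lines.popitem(last=False)
        lines[tag] = True
        self.misses += 1
        return False


trace = [0, 4, 64, 0, 68]  # hypothetical byte addresses
direct_mapped = Cache(num_sets=4, ways=1, block_size=16)
two_way = Cache(num_sets=2, ways=2, block_size=16)
for addr in trace:
    direct_mapped.access(addr)
    two_way.access(addr)
print(direct_mapped.hits, direct_mapped.misses, two_way.hits, two_way.misses)
```

Both caches hold four 16-byte blocks, but addresses 0 and 64 conflict in the direct-mapped cache and keep evicting each other, while the 2-way cache keeps both blocks resident.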



# **Memory Hierarchy**



Copyright © 2012, Elsevier Inc. All rights reserved.



### Homework


- Identify every Intel Xeon microarchitecture, from Core up to the latest releases, and build a table with:
  - year, max clock frequency, # pipeline stages, degree of superscalarity, # simultaneous threads, vector support, # cores, type/bandwidth of external interfaces, ...
  - UMA/NUMA; for each cache level: size, latency, line size, direct-mapped/associative, bandwidth to access lower memory hierarchy levels, ... (homework for the following week)
- Identify the CPU generations at the SeARCH cluster
- Suggestion: create GoogleDocs tables, shared by groups of students, with everyone contributing <u>critically</u> to build these tables

AJProença, Advanced Architectures, MiEl, UMinho, 2015/16