### MSc Informatics Eng.

2011/12

A.J.Proença

### Concepts from undergrad Computer Systems (1)

(most slides are borrowed)

AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12

**Computing Systems & Performance** 







### Key textbook for CSP

Computer Architecture, 5th Edition

Hennessy & Patterson

#### **Table of Contents**

#### Printed Text

- Chap 1: Fundamentals of Quantitative Design and Analysis
- Chap 2: Memory Hierarchy Design
- Chap 3: Instruction-Level Parallelism and Its Exploitation
- Chap 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures
- Chap 5: Multiprocessors and Thread-Level Parallelism
- Chap 6: The Warehouse-Scale Computer
- App A: Instruction Set Principles
- App B: Review of Memory Hierarchy
- App C: Pipelining: Basic and Intermediate Concepts

#### Online

- App D: Storage Systems
- App E: Embedded Systems
- App F: Interconnection Networks
- App G: Vector Processors
- App H: Hardware and Software for VLIW and EPIC
- App I: Large-Scale Multiprocessors and Scientific Applications
- App J: Computer Arithmetic
- App K: Survey of Instruction Set Architectures
- App L: Historical Perspectives

### **Background for Computing Systems & Performance**


Key concepts to revise:

- numerical data representation (for error analysis)
- ISA (Instruction Set Architecture)
- how C compilers generate code (a look into assembly code)
  - how scalar and structured data are allocated
  - how control structures are implemented
  - how to call/return from functions/procedures
- which architecture features impact performance
- techniques introduced to improve performance in a single CPU
  - ILP: pipelining, superscalarity, multi-threading, vector processing, ...
  - memory hierarchy: cache levels, ...

## **Understanding Performance**

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed



Chapter 1 — Computer Abstractions and Technology — 5

# **CPU Time**

$$
\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
$$

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count
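The trade-off between clock rate and cycle count can be sketched numerically. This is a minimal illustration, not taken from the slides: the cycle counts, the clock rates and the 1.2x cycle penalty are hypothetical values chosen only to show how the formula behaves.

```python
def cpu_time(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time in seconds: clock cycles divided by clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical program: 20e9 clock cycles on a 2 GHz processor.
baseline = cpu_time(20e9, 2e9)

# Hypothetical redesign: the clock rate doubles to 4 GHz, but the same
# program now needs 1.2x as many cycles (assumed penalty, for illustration).
redesign = cpu_time(20e9 * 1.2, 4e9)

print(f"baseline: {baseline:.1f} s, redesign: {redesign:.1f} s")
```

In this made-up case the redesign still wins, because the gain in clock rate outweighs the extra cycles. With a larger cycle penalty the conclusion could reverse.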

## **Response Time and Throughput**

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...
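One way to reason about the two questions above is a deliberately simplified model: independent, equal-length tasks, each running on a single processor, with no queueing or scheduling overhead. The function name `batch_metrics` and all numbers are assumptions made for this sketch.

```python
import math


def batch_metrics(num_tasks: int, task_seconds: float, num_processors: int):
    """Return (response_time, throughput) under the simplified model.

    response_time: seconds for one task to run from start to finish.
    throughput: tasks completed per second for the whole batch.
    """
    response_time = task_seconds
    # Tasks run in "rounds" of num_processors tasks at a time.
    rounds = math.ceil(num_tasks / num_processors)
    total_time = rounds * task_seconds
    throughput = num_tasks / total_time
    return response_time, throughput


baseline = batch_metrics(8, 2.0, 1)         # one processor, 2 s per task
faster_cpu = batch_metrics(8, 1.0, 1)       # processor twice as fast
more_cpus = batch_metrics(8, 2.0, 4)        # four of the original processors

print(baseline, faster_cpu, more_cpus)
```

In this model a faster processor improves both metrics, while adding processors improves only throughput. In a real system with tasks waiting in a queue, extra processors can also shorten the observed response time, which the sketch does not capture.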


# **Instruction Count and CPI**

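$$
\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction (CPI)}
$$

$$
\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}
$$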

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix
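The effect of the instruction mix on the average CPI can be sketched as a weighted sum. The instruction classes, their fractions and their per-class CPI values below are hypothetical, as are the instruction count and clock rate.

```python
def average_cpi(mix: dict) -> float:
    """Weighted average CPI.

    mix maps an instruction class name to (fraction_of_instructions, cpi).
    """
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix.values())


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = Instruction Count x CPI / Clock Rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical mix: 50% ALU (1 cycle), 30% load/store (2), 20% branch (3).
mix = {"alu": (0.5, 1), "load_store": (0.3, 2), "branch": (0.2, 3)}
cpi = average_cpi(mix)
seconds = cpu_time(1e9, cpi, 2e9)  # 1e9 instructions on a 2 GHz clock

print(f"average CPI = {cpi:.2f}, CPU time = {seconds:.3f} s")
```

Changing only the fractions in `mix`, with the same hardware CPI per class, changes the average CPI and therefore the CPU time.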



## **Performance Summary**

### The BIG Picture

$$
\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}
$$

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, T<sub>c</sub>


# **Does Multiple Issue Work?**

### The BIG Picture

- Yes, but not as much as we'd like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well
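How real dependencies limit ILP can be sketched with a toy model. It assumes unit-latency instructions and unlimited issue width, so the longest dependency chain is the only limit. The function names and the six-instruction sequence are hypothetical.

```python
def critical_path_length(deps: list) -> int:
    """Length, in instructions, of the longest dependency chain.

    deps[i] lists the indices of earlier instructions whose results
    instruction i needs. Every instruction is assumed to take one cycle.
    """
    depth = []
    for i, preds in enumerate(deps):
        if any(p >= i for p in preds):
            raise ValueError("an instruction may only depend on earlier ones")
        depth.append(1 + max((depth[p] for p in preds), default=0))
    return max(depth, default=0)


def ideal_ilp(deps: list) -> float:
    """Upper bound on instructions per cycle with unlimited issue width."""
    n = len(deps)
    return n / critical_path_length(deps) if n else 0.0


# Hypothetical sequence:
#   0: load a      1: load b      2: c = a + b   (needs 0, 1)
#   3: d = c * c   (needs 2)      4: load e      5: f = d + e (needs 3, 4)
dependent = [[], [], [0, 1], [2], [], [3, 4]]
independent = [[], [], [], [], [], []]

print(ideal_ilp(dependent), ideal_ilp(independent))
```

Even with unlimited hardware, the dependent sequence cannot finish in fewer than four cycles, so its ILP is bounded at 1.5, while six independent instructions could in principle issue together. Real processors fall below these bounds because of the limited issue window and the memory delays listed above.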

Chapter 4 — The Processor — 11

# **Pipeline Summary**

### **The BIG Picture**

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation
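The throughput gain from pipelining can be sketched with the usual idealised model: perfectly balanced stages and no hazards, which is an assumption real pipelines do not meet. The 5-stage depth and the 200 ps stage time are illustrative values.

```python
def unpipelined_time(n_instructions: int, stages: int, stage_time_ps: int) -> int:
    """Each instruction goes through all stages before the next one starts."""
    return n_instructions * stages * stage_time_ps


def pipelined_time(n_instructions: int, stages: int, stage_time_ps: int) -> int:
    """Ideal pipeline: fill it once, then complete one instruction per stage time."""
    if n_instructions == 0:
        return 0
    return (stages + n_instructions - 1) * stage_time_ps


STAGES, STAGE_PS = 5, 200  # hypothetical pipeline

# Latency of a single instruction is the same with or without pipelining.
single = pipelined_time(1, STAGES, STAGE_PS)

# For a long instruction stream the speedup approaches the number of stages.
n = 1_000_000
speedup = unpipelined_time(n, STAGES, STAGE_PS) / pipelined_time(n, STAGES, STAGE_PS)

print(f"single-instruction latency: {single} ps, speedup: {speedup:.3f}")
```

Structure, data and control hazards insert stalls that this model ignores, so measured speedups are lower than the stage count.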


# **Memory Hierarchy Levels**



### **The Memory Hierarchy**

### The BIG Picture

- Common principles apply at all levels of the memory hierarchy
  - Based on notions of caching
- At each level in the hierarchy
  - Block placement
  - Finding a block
  - Replacement on a miss
  - Write policy

**Memory Hierarchy**

Figure: typical size and access time at each level of the memory hierarchy for (a) a server and (b) a personal mobile device.

| Level | (a) Server: size | (a) Server: speed | (b) Personal mobile device: size | (b) Personal mobile device: speed |
|---|---|---|---|---|
| CPU registers | 1000 bytes | 300 ps | 500 bytes | 500 ps |
| Level 1 cache | 64 KB | 1 ns | 64 KB | 2 ns |
| Level 2 cache | 256 KB | 3-10 ns | 256 KB | 10-20 ns |
| Level 3 cache | 2-4 MB | 10-20 ns | (none) | (none) |
| Memory | 4-16 GB | 50-100 ns | 256-512 MB | 50-100 ns |
| Disk storage (a) / Flash memory (b) | 4-16 TB | 5-10 ms | 4-8 GB | 25-50 µs |



Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 13
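Block placement, finding a block and replacement on a miss can be sketched for the simplest organisation, a direct-mapped cache. The class below is a hypothetical toy model: it tracks only tags, handles reads only (no write policy), and the cache geometry and addresses are made up.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that records hits and misses for reads."""

    def __init__(self, num_blocks: int, block_bytes: int):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.tags = [None] * num_blocks  # None marks an invalid entry

    def split(self, address: int):
        """Block placement: map an address to (tag, index)."""
        block_address = address // self.block_bytes
        index = block_address % self.num_blocks
        tag = block_address // self.num_blocks
        return tag, index

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index = self.split(address)
        if self.tags[index] == tag:      # finding a block: compare stored tag
            return True
        self.tags[index] = tag           # replacement: only one candidate entry
        return False


cache = DirectMappedCache(num_blocks=8, block_bytes=16)
# 0x00 and 0x04 share a block; 0x80 maps to the same index with another tag.
results = [cache.access(a) for a in (0x00, 0x04, 0x80, 0x00)]
print(results)
```

The last access misses because 0x80 evicted the block holding 0x00. A set-associative cache would have more than one candidate entry per index and would need a replacement policy to choose among them.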



- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache
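The benefit of a second cache level can be sketched with the standard average memory access time (AMAT) relation, hit time plus miss rate times miss penalty, which is not stated on the slide itself. All hit times, miss rates and the memory latency below are hypothetical.

```python
def amat(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns


MAIN_MEMORY_NS = 100.0  # assumed main memory access time

# Primary cache only: every L1 miss goes to main memory.
l1_only = amat(hit_time_ns=1.0, miss_rate=0.05, miss_penalty_ns=MAIN_MEMORY_NS)

# With an L2 cache: an L1 miss costs the average L2 access time instead.
# The L2 miss rate here is local, i.e. the fraction of L2 accesses that miss.
l2_access = amat(hit_time_ns=10.0, miss_rate=0.2, miss_penalty_ns=MAIN_MEMORY_NS)
with_l2 = amat(hit_time_ns=1.0, miss_rate=0.05, miss_penalty_ns=l2_access)

print(f"L1 only: {l1_only:.2f} ns, L1 + L2: {with_l2:.2f} ns")
```

With these assumed numbers the L2 cache cuts the average access time by more than half, even though it is ten times slower than L1, because it is still much faster than main memory.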


### Homework



- Identify the microarchitecture of all AMD and Intel processors, from Hammer and Core up to the latest releases, and build a table with:
  - year, max clock frequency, # pipeline stages, degree of superscalarity, # simultaneous threads, vector support, # cores, type/bandwidth of external interfaces, ...
  - UMA/NUMA; for each cache level: size, latency, line size, direct-mapped/associative, bandwidth to access lower memory hierarchy levels, ... (homework for the following week)
- Identify the CPU generations at the SeARCH cluster


Suggestion: create a Google Docs table shared by all students, to which everyone contributes <u>critically</u>.

Copyright © 2012, Elsevier Inc. All rights reserved.
