Read Chapter 1
Introduction
What do the following mean:
- X is faster than Y
- X is n times faster than Y
- X is n% faster than Y
You need to answer these in terms of execution time for a given task.
X is n times faster than Y:
Execution time_Y / Execution time_X = n
X is n% faster than Y:
Execution time_Y / Execution time_X = 1 + n/100
We will never use the word slower in this context, only faster.
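These two definitions can be sketched as a pair of small helpers (a sketch, not from the text; the example execution times are hypothetical):

```python
def times_faster(time_y, time_x):
    """X is n times faster than Y: n = Execution time_Y / Execution time_X."""
    return time_y / time_x

def percent_faster(time_y, time_x):
    """X is n% faster than Y: Execution time_Y / Execution time_X = 1 + n/100."""
    return (time_y / time_x - 1) * 100

# If Y takes 15 s and X takes 10 s on the same task:
print(times_faster(15, 10))    # X is 1.5 times faster than Y
print(percent_faster(15, 10))  # X is 50% faster than Y
```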
Section 1.1: Traditional Performance Growth
- Moore's Law (1965): Transistor count doubles every 2 years
Actually closer to 18 months
- Traditionally, more transistors meant smaller and faster transistors, so
performance also grew rapidly.
- Today's $500 laptop is faster, has more memory, and more disk space than
the fastest multi-million dollar computer of 1975 (Cray 1).
- Single processor performance grew at a rate close to 50% per year from the 1980s until about 2002.
- Clock rates have stagnated at around 3 GHz.
- Recent performance growth has come from increased complexity and number of processors, and has been at a slower rate.
- Here is Figure 1.1.
- Performance increases no longer just come from clock rate and instruction-level parallelism (ILP).
- Now data-level parallelism (DLP), thread-level parallelism (TLP) and request-level parallelism (RLP)
are more important.
Section 1.2: Classes of Computers
Types of computers:
- Personal Mobile Device (PMD): cell phones and tablets
price
energy efficiency
responsiveness
size
- Desktop
price
performance
- Server
price
throughput
availability
scalability
- Clusters
price-performance
throughput
energy-proportionality
- Embedded: cameras, automobiles, microwave ovens, etc.
price
energy efficiency
application-specific performance
Application Parallelism
- Data-Level Parallelism (DLP): many data items operated on at the same time
- Task-Level Parallelism (TLP): independent tasks of an application
Hardware Parallelism
- Instruction-Level Parallelism (ILP):
exploits DLP with compiler help, e.g. pipelining, speculative execution
- Vector Architectures and Graphic Processor Units (GPU):
exploit DLP by having a single instruction operate on a collection of data
- Thread-Level Parallelism (TLP):
can exploit either DLP or TLP
- Request-Level Parallelism (RLP):
exploits TLP as specified by the programmer or OS
Hardware Classifications
- Single instruction stream, single data stream (SISD)
Traditional uniprocessor, can use ILP
- Single instruction stream, multiple data stream (SIMD)
The same instruction executes on multiple processors using different data.
e.g. vector processors and GPUs
- Multiple instruction stream, single data stream (MISD)
not useful, included only for completeness
- Multiple instruction stream, multiple data stream (MIMD)
independent processors that can act on independent data.
e.g. multicore
Section 1.3: Defining Computer Architecture
In CS 3843 we concentrated on the ISA.
Computer Architecture has 3 main components:
Instruction Set Architecture (ISA)
Organization (also called microarchitecture): memory systems, bus structure, internal CPU design
Hardware: detailed logic design, packaging technology
Design the organization and hardware to meet goals
Example:
The Intel 8088 (1979) and 8086 (1978) had the same ISA.
The 8088 had an 8-bit memory bus while the 8086 had a 16-bit bus.
The 8088 was very successful because it was cheap to build computers with an 8-bit bus (in 1979).
Instruction Set Architecture
This is what we used to define the architecture of the X86 and Y86 in CS 3843.
Class of ISA
usually general-purpose register architectures with operands in registers or memory.
classified as register-memory or load-store.
Memory Addressing
usually byte addressing
Addressing Modes
covered extensively in CS 3843
Types and sizes of operands
often: integers of 8, 16, 32, and 64 bits and floating point of 32 and 64 bits.
Operations
instructions: e.g. add, shift, and, branch
Control flow instructions
a subset of above, including conditional branches, call and return
ISA encoding
fixed length or variable length.
In CS 3843 you should have seen this for Y86 in detail and X86 examples.
Authors' View of Computer Architecture
ISA, organization, hardware
Organization
Also called microarchitecture
High level aspects of computer's design, e.g. memory system, memory interconnect, CPU design
Example: Intel and AMD use the same ISA, but different organizations.
Hardware
refers to detailed logic design and packaging technology
Example: Intel Core i7 and Intel Xeon 7560
Some aspects of computer architectural design
- Application area
- Software compatibility: programming language or object code
- OS requirements: address space, memory management, protection
- Standards: e.g. IEEE floating point, SATA
Today's News: September 3
There are no recitations scheduled for the second week of class.
Recitations will start on September 10.
Section 1.4: Trends in Technology
Integrated Circuit Technology
- transistor density: 35% per year
- die size: 10 to 20% per year
- combines to 40% to 50% per year: Moore's law
- 6-core i7 Sandy Bridge: 2.27 billion transistors
- GPU about double this
Semiconductor RAM
- density increased 60% per year in the 1990's
- slowed to 25%-40% per year
- August 2012 1GB: $17
Semiconductor Flash
- Used in PMDs
- capacity increasing 50% to 60% per year
- 15 to 20 times cheaper than DRAM
- August 2012 512GB SSD: $400
Magnetic Disk
- density increase was 60% to 100% per year
- slowed to 40% per year
- 15 to 25 times cheaper than Flash
- 300 to 500 times cheaper than DRAM
- August 2012 2TB: $100
Network Technology
Trends in Bandwidth and Latency
- bandwidth or throughput is total work done in a unit of time
- latency or response time is the time between start and completion of an event
- bandwidth has increased faster than latency has decreased
- Here is Figure 1.9.
ClassQue: Figure 1.9
Wires
- transistor density improvements mainly from decrease in feature size usually given in nm (nanometers)
1980: 4000, 1990: 1000, 2000: 180, 2010: 65, 2012: 32
- transistor count improves quadratically with a linear decrease in feature size
- signal delay of a wire increases with resistance and capacitance
- these get worse as wires shrink in size
Section 1.5: Trends in Power and Energy
Issues
- maximum power
- sustained power: metric is thermal design power (TDP)
- used both for power supplies and cooling requirements
Energy vs. Power
- power is energy per unit time
- power is measured in watts
- energy is measured in joules
- 1 watt = 1 joule per second
Microprocessor Energy and Power
- Energy for a transition 0 → 1 or 1 → 0:
Energy_dynamic ∝ capacitive load × Voltage^2
- Power_dynamic ∝ capacitive load × Voltage^2 × frequency
- For a fixed task, slowing the clock rate reduces the power, but not the energy
- Lowering the voltage lowers both
- Voltages have been reduced from 5 volts to 1 volt over 20 years, but can't go much lower
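A toy numerical sketch of these proportionalities (the capacitance, voltage, and frequency values are hypothetical; only the ratios matter):

```python
def dynamic_power(cap_load, voltage, frequency):
    # Power_dynamic is proportional to capacitive load * Voltage^2 * frequency
    return cap_load * voltage**2 * frequency

def dynamic_energy(cap_load, voltage, transitions):
    # Energy_dynamic is proportional to capacitive load * Voltage^2,
    # per transition, so a fixed task has a fixed number of transitions
    return cap_load * voltage**2 * transitions

# Halving the clock frequency halves the power ...
p_full = dynamic_power(1e-9, 1.0, 3e9)
p_half = dynamic_power(1e-9, 1.0, 1.5e9)
assert p_half == p_full / 2

# ... but leaves the energy for a fixed task unchanged, while lowering
# the voltage from 1.0 V to 0.8 V cuts energy quadratically (to 64%):
e_1v = dynamic_energy(1e-9, 1.0, 1e9)
e_08v = dynamic_energy(1e-9, 0.8, 1e9)
assert abs(e_08v / e_1v - 0.64) < 1e-9
```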
Techniques for improving energy efficiency
- Do nothing well
- Dynamic Voltage-Frequency Scaling (DVFS)
- Design for the typical use
- Overclocking: run one core faster when other cores not needed (turbo mode)
ClassQue: Power and Energy
Section 1.6: Trends in Cost
Cost of an Integrated Circuit
- die and wafer: See Figure 1.15.
How big is 300mm?
- See Figure 1.13 and
Figure 1.14
- cost of IC = (cost of die + cost of testing die + cost of packaging) / final test yield
- cost of die = cost of wafer / (dies per wafer × die yield)
- dies per wafer = π × (wafer diameter/2)^2 / die area − π × wafer diameter / sqrt(2 × die area)
- die yield depends on complexity and die area
- large fixed costs are important when quantities are not large
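The dies-per-wafer formula can be sketched as follows (the 300 mm wafer and 1 cm² die area are hypothetical example values, not from the text):

```python
import math

def dies_per_wafer(wafer_diameter, die_area):
    """dies per wafer = pi*(d/2)^2 / A  -  pi*d / sqrt(2*A)
    The first term is the wafer area divided by the die area; the second
    approximates the partial dies wasted around the wafer's edge."""
    return (math.pi * (wafer_diameter / 2)**2 / die_area
            - math.pi * wafer_diameter / math.sqrt(2 * die_area))

# A 300 mm (30 cm) diameter wafer with 1 cm^2 dies:
print(round(dies_per_wafer(30.0, 1.0)))  # 640
```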
Section 1.7: Dependability
- MTTF: mean time to failure
- MTTR: mean time to repair
- MTBF: mean time between failures = MTTF + MTTR
- availability = MTTF / (MTTF + MTTR)
- failure rate = 1/MTTF
- failure rates add (if small), MTTF's do not
Example 1:
A system has 2 disks, each with a MTTF of 1,000,000 hours and a power supply with a MTTF of 200,000 hours.
What is the system MTTF?
Solution:
failure rate = 2 × 1/1,000,000 + 1/200,000 = 7/1,000,000
MTTF = 1,000,000/7 = 143,000 hours.
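Example 1's rule that failure rates of independent components add (while MTTFs do not) can be sketched as:

```python
def system_mttf(mttfs):
    """Combine per-component MTTFs (hours) by summing failure rates 1/MTTF."""
    failure_rate = sum(1.0 / m for m in mttfs)
    return 1.0 / failure_rate

# Two disks at 1,000,000 hours each plus a power supply at 200,000 hours:
mttf = system_mttf([1_000_000, 1_000_000, 200_000])
print(round(mttf))  # 142857, i.e. 1,000,000/7 hours
```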
Example 2:
A disk drive has a MTTF of 1,000,000 hours. What is the probability that it will last 50 years without failure?
Solution:
Hours per year = 24*365.25 = 8766.
Probability of dying in a given year: 8766/1,000,000 = .008766
Probability of not dying in a given year: 1 - .008766 = .991234
Probability of not dying in 50 years: .991234^50 = .6439
What is wrong with this solution?
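The year-by-year calculation shown above can be reproduced as follows (this mirrors the solution exactly as given; deciding what is wrong with it is left as the question asks):

```python
# Reproduce Example 2's solution step by step.
hours_per_year = 24 * 365.25         # 8766
mttf = 1_000_000                     # hours
p_fail_year = hours_per_year / mttf  # .008766
p_survive_year = 1 - p_fail_year     # .991234
p_survive_50 = p_survive_year ** 50

print(round(p_survive_50, 4))  # 0.6439
```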
ClassQue: MTTF 1
Section 1.8: Measuring, Reporting, and Summarizing Performance
A computer user is interested in response time
An operator of a warehouse-scale computer is interested in throughput
Comparing two computers
Examples (Also see ClassQue Questions)
- Machine A is 40% faster than B and B is 40% faster than C. How much faster is A than C?
Solution: E_B/E_A = 1.4 and E_C/E_B = 1.4, so E_C/E_A = 1.4 × 1.4 = 1.96
- Machine performance increases by 40% per year for 10 years. By what percentage does performance increase
over this period?
Solution: 1.4^10 = 28.93. This is 1 + n/100 for n = 2793.
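Both comparisons above can be checked directly (a sketch of the arithmetic, nothing more):

```python
# A is 40% faster than B, and B is 40% faster than C:
ratio_ac = 1.4 * 1.4
assert abs(ratio_ac - 1.96) < 1e-9  # A is 1.96 times (96%) faster than C

# 40% growth per year compounded over 10 years:
growth = 1.4 ** 10
percent_increase = (growth - 1) * 100
print(round(growth, 2))           # 28.93
print(round(percent_increase))    # 2793
```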
Today's News: September 5
Time
- wall-clock time, response time, elapsed time: includes everything, e.g. disk and memory accesses, OS overhead
- CPU time: does not include time waiting for I/O or time running other programs
ClassQue: About Time
Benchmarks
Types of benchmarks:
- kernels: small pieces of an application
- toy programs
- synthetic benchmarks: artificial programs written to mimic the behavior of real applications
Tricks people play
- benchmark-specific hardware
- benchmark-specific compilers
- benchmark-specific compiler flags
- source code modification
SPEC: Standard Performance Evaluation Corporation
- set of standardized benchmarks
- have evolved over time
- separate sets for integer, floating point, and server
- performance results should be reproducible
Summarizing Performance Results
ClassQue: geometric mean
Section 1.9: Quantitative Principles of Computer Design
Take advantage of parallelism
- Multiple processors
- Pipelining
- Caches
Principle of locality
- programs tend to reuse data and instructions that have been used recently
- a typical program will spend 90% of its execution time on 10% of the code
- temporal locality: recently used instructions and data are likely to be used again in the near future
- spatial locality: items with addresses near each other tend to be accessed close together in time
Focus on the common case
- adds are more common than multiplies
- for database, disk access may be most important
Amdahl's law
A given enhancement will only improve a part of a program.
Example: on a multicore machine, some of the code will only use one core.
Speedup
Speedup = Performance for entire task using the enhancement when possible
          / Performance for entire task without using the enhancement
or
Speedup = Execution time for entire task without using the enhancement
          / Execution time for entire task using the enhancement when possible
Amdahl's law depends on Fraction_enhanced and Speedup_enhanced.
Note: Fraction_enhanced is relative to the original design.
Fraction_unenhanced = 1 - Fraction_enhanced
Execution time_old = Execution time_old-unenhanced + Execution time_old-enhanced
Execution time_old = Execution time_old × Fraction_unenhanced + Execution time_old × Fraction_enhanced
Execution time_new = Execution time_old × Fraction_unenhanced + (Execution time_old / Speedup_enhanced) × Fraction_enhanced
Execution time_new = Execution time_old × (Fraction_unenhanced + Fraction_enhanced / Speedup_enhanced)
Speedup_overall = Execution time_old / Execution time_new
Amdahl's Law:
Speedup_overall = 1 / (Fraction_unenhanced + Fraction_enhanced / Speedup_enhanced)
or
Speedup_overall = 1 / (1 - Fraction_enhanced + Fraction_enhanced / Speedup_enhanced)
You must be able to use this formula, but you also need to understand when the formula cannot be used directly.
You can only use the formula directly when you know both the enhanced fraction and the enhanced speedup.
Important:
To use the formula as given, you must know the enhanced fraction,
which is the fraction of the time spent on the enhanced part
when run on the old system.
ClassQue: speedup 1
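Amdahl's Law as stated above can be sketched as a one-line function (the 0.3/1.8 values are the ones used in the first example of this section):

```python
def speedup_overall(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law. fraction_enhanced is the fraction of the OLD
    execution time spent in the part that can be enhanced."""
    return 1.0 / ((1 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 30% of the old execution time by a factor of 1.8:
print(round(speedup_overall(0.3, 1.8), 4))  # 1.1538
```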
Examples
- A new design makes the floating point processor of the CPU 80% faster than before. What is the overall speedup for a task
in which floating point operations took up 30% of the CPU time with the old design?
Solution:
Method 1: use the formula
The enhanced fraction is 30% = .3. The unenhanced fraction is .7. The enhanced speedup is 1.8. The overall speedup is
1/(.7 + .3/1.8) = 1.1538.
Method 2: use the definition directly
Execution time_old = .7 × Execution time_old + .3 × Execution time_old
Execution time_new = .7 × Execution time_old + .3 × Execution time_old/1.8
Execution time_new = (.7 + .3/1.8) × Execution time_old = .86667 × Execution time_old
Speedup = 1/.86667 = 1.15385
- A new design makes the floating point processor of the CPU 80% faster than before. What is the overall speedup for a task
in which floating point operations took up 10% of the CPU time with the new design?
Solution: We cannot use the formula directly since the enhanced fraction in the formula is based on the old design.
Method 1: first calculate Fraction_enhanced (relative to the old design).
Suppose the total execution time on the new system is E.
The unenhanced time on the new system is .9E and the enhanced time is .1E.
The unenhanced time on the old system is still .9E and the enhanced time is .1E × 1.8 = .18E.
The fraction enhanced (relative to the old system) is
.18E/(.9E + .18E) = .1667.
The speedup is
1/(1 - .1667 + .1667/1.8) = 1.08.
Method 2: Use the definition of Speedup directly
Relative to the new design, the enhanced fraction is .1 and the unenhanced fraction is .9.
Execution time_new = .9 × Execution time_new + .1 × Execution time_new
Execution time_old = .9 × Execution time_new + .1 × Execution time_new × 1.8
Speedup_overall = Execution time_old / Execution time_new
= (.9 × Execution time_new + .1 × Execution time_new × 1.8) / Execution time_new = .9 + .18 = 1.08.
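Method 1 above can be sketched as a conversion helper: given a fraction measured on the new design, recover the old-design fraction that Amdahl's Law needs (a sketch of this example's arithmetic, not a general library routine):

```python
def old_fraction_enhanced(new_fraction, speedup_enhanced):
    """Convert the fraction of NEW execution time spent in the enhanced
    part into the corresponding fraction of OLD execution time."""
    old_enhanced = new_fraction * speedup_enhanced  # un-speed-up that part
    old_unenhanced = 1 - new_fraction               # unchanged part
    return old_enhanced / (old_unenhanced + old_enhanced)

# 10% of new-design time, with an enhanced speedup of 1.8:
f = old_fraction_enhanced(0.1, 1.8)
print(round(f, 4))                  # 0.1667
speedup = 1 / ((1 - f) + f / 1.8)   # Amdahl's Law with the corrected fraction
print(round(speedup, 2))            # 1.08
```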
Processor Performance Equations
Dependencies
- Clock cycle time: hardware technology and organization
- CPI: organization and ISA
- Instruction count: ISA and compiler technology
More on CPI
- CPU clock cycles = Σ (i=1 to n) IC_i × CPI_i
- CPI = Σ (i=1 to n) (IC_i / Instruction count) × CPI_i
Examples
- In a particular task, 23% of the instructions are floating point instructions, each of which takes 5 cycles to execute.
All other instructions take 1 cycle to execute. What is the average CPI for this task?
Solution: CPI = .23 × 5 + .77 × 1 = 1.92
- A new design can reduce the number of cycles for a floating point operation to 4, without changing the clock speed.
What is the new CPI and what is the expected speedup?
Solution: CPI = .23 × 4 + .77 × 1 = 1.69
Speedup = 1.92/1.69 = 1.136
- What is wrong with the following method of solving the above problem using the Amdahl's Law equation?
Fraction_enhanced = .23, Fraction_unenhanced = .77, and Speedup_enhanced = 5/4 = 1.25.
Therefore
Speedup_overall = 1/(.77 + .23/1.25) = 1.0482
Answer: .23 is the fraction of instructions that are enhanced, not the fraction of time spent executing the enhanceable part.
The correct enhanced fraction is (.23 × 5)/(.23 × 5 + .77) = 1.15/1.92 = .5990. This gives
Speedup_overall = 1/(.4010 + .5990/1.25) = 1.136
Section 1.10: Putting It All Together
- This section compares three server systems using the SPECPower benchmark.
- These are configured with a small amount of memory and small disks.
- Comparing just the cost of the hardware and software, the systems are ordered by performance (ops/$): System 1, System 2, System 3.
- Comparing power requirements, the systems are ordered by performance (ops/joule): System 3, System 2, System 1.
Section 1.11: Fallacies and Pitfalls
Fallacy: a commonly held misbelief
Pitfall: an easily made mistake
Fallacies
- Multiprocessors are a silver bullet.
- Hardware enhancements that increase performance improve energy efficiency or are at least neutral.
- Benchmarks remain valid indefinitely.
- The rated MTTF of disks is almost 140 years, so disks rarely fail.
- Peak performance tracks observed performance.
Pitfalls
- Amdahl's law
- Single point of failure
- Fault detection can lower availability