Read Chapter 1
Introduction
What do the following mean:
- X is faster than Y
- X is n times faster than Y
- X is n% faster than Y
You need to answer these in terms of execution time for a given task.
X is n times faster than Y:
Execution time_Y / Execution time_X = n
X is n% faster than Y:
Execution time_Y / Execution time_X = 1 + n/100
We will never use the word slower in this context, only faster.
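These two definitions can be sketched as a pair of small helpers (a sketch, not from the text; the example execution times are hypothetical):

```python
def times_faster(time_y, time_x):
    """X is n times faster than Y: n = Execution time_Y / Execution time_X."""
    return time_y / time_x

def percent_faster(time_y, time_x):
    """X is n% faster than Y: Execution time_Y / Execution time_X = 1 + n/100."""
    return (time_y / time_x - 1) * 100

# If Y takes 15 s and X takes 10 s on the same task:
print(times_faster(15, 10))    # X is 1.5 times faster than Y
print(percent_faster(15, 10))  # X is 50% faster than Y
```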
Section 1.1: Traditional Performance Growth
- Moore's Law (1965): Transistor count doubles every 2 years
Actually closer to 18 months
- Traditionally, more transistors meant smaller and faster transistors, so
performance also grew rapidly.
- Today's $500 laptop is faster, has more memory, and more disk space than
the fastest multi-million dollar computer of 1975 (Cray 1).
- Single processor performance grew at a rate close to 50% per year from the 1980s until about 2002.
- Clock rates have stagnated at around 3 GHz.
- Recent performance growth has come from increased complexity and number of processors, and has been at a slower rate.
- Here is Figure 1.1.
- Performance increases no longer just come from clock rate and instruction-level parallelism (ILP).
- Now data-level parallelism (DLP), thread-level parallelism (TLP) and request-level parallelism (RLP)
are more important.
Section 1.2: Classes of Computers
Types of computers:
- Personal Mobile Device (PMD): cell phones and tablets
price
energy efficiency
responsiveness
size
- Desktop
price
performance
- Server
price
throughput
availability
scalability
- Clusters
price-performance
throughput
energy-proportionality
- Embedded: cameras, automobiles, microwave ovens, etc.
price
energy efficiency
application-specific performance
Application Parallelism
- Data-Level Parallelism (DLP): many data items operated on at the same time
- Task-Level Parallelism (TLP): independent tasks of an application
Hardware Parallelism
- Instruction-Level Parallelism (ILP):
exploits DLP with compiler help, e.g. pipelining, speculative execution
- Vector Architectures and Graphic Processor Units (GPU):
exploit DLP by having a single instruction operate on a collection of data
- Thread-Level Parallelism (TLP):
can exploit either DLP or TLP
- Request-Level Parallelism (RLP):
exploits TLP as specified by the programmer or OS
Hardware Classifications
- Single instruction stream, single data stream (SISD)
Traditional uniprocessor, can use ILP
- Single instruction stream, multiple data stream (SIMD)
The same instruction executes on multiple processors using different data.
e.g. vector processors and GPUs
- Multiple instruction stream, single data stream (MISD)
not useful, included only for completeness
- Multiple instruction stream, multiple data stream (MIMD)
independent processors that can act on independent data.
e.g. multicore
Section 1.3: Defining Computer Architecture
In CS 3843 we concentrated on the ISA.
Computer Architecture has 3 main components:
Instruction Set Architecture (ISA)
Organization (also called microarchitecture): memory systems, bus structure, internal CPU design
Hardware: detailed logic design, packaging technology
Design the organization and hardware to meet goals
Example:
The Intel 8088 (1979) and 8086 (1978) had the same ISA.
The 8088 had an 8-bit memory bus while the 8086 had a 16-bit bus.
The 8088 was very successful because it was cheap to build computers with an 8-bit bus (in 1979).
Instruction Set Architecture
This is what we used to define the architecture of the X86 and Y86 in CS 3843.
Class of ISA
usually general-purpose register architectures with operands in registers or memory.
classified as register-memory or load-store.
Memory Addressing
usually byte addressing
Addressing Modes
covered extensively in CS 3843
Types and sizes of operands
often: integers of 8, 16, 32, and 64 bits and floating point of 32 and 64 bits.
Operations
instructions: e.g. add, shift, and, branch
Control flow instructions
a subset of above, including conditional branches, call and return
ISA encoding
fixed length or variable length.
In CS 3843 you should have seen this for Y86 in detail and X86 examples.
Authors' View of Computer Architecture
ISA, organization, hardware
Organization
Also called microarchitecture
High level aspects of computer's design, e.g. memory system, memory interconnect, CPU design
Example: Intel and AMD use the same ISA, but different organizations.
Hardware
refers to detailed logic design and packaging technology
Example: Intel Core i7 and Intel Xeon 7560
Some aspects of computer architectural design
- Application area
- Software compatibility: programming language or object code
- OS requirements: address space, memory management, protection
- Standards: e.g. IEEE floating point, SATA
Today's News: September 3
There are no recitations scheduled for the second week of class.
Recitations will start on September 10.
Section 1.4: Trends in Technology
Integrated Circuit Technology
- transistor density: 35% per year
- die size: 10 to 20% per year
- combines to 40% to 50% per year: Moore's law
- 6-core i7 Sandy Bridge: 2.27 billion transistors
- GPU about double this
Semiconductor RAM
- density increased 60% per year in the 1990's
- slowed to 25%-40% per year
- August 2012 1GB: $17
Semiconductor Flash
- Used in PMDs
- capacity increasing 50% to 60% per year
- 15 to 20 times cheaper than DRAM
- August 2012 512GB SSD: $400
Magnetic Disk
- density increase was 60% to 100% per year
- slowed to 40% per year
- 15 to 25 times cheaper than Flash
- 300 to 500 times cheaper than DRAM
- August 2012 2TB: $100
Network Technology
Trends in Bandwidth and Latency
- bandwidth or throughput is total work done in a unit of time
- latency or response time is the time between start and completion of an event
- bandwidth has increased faster than latency has decreased
- Here is Figure 1.9.
ClassQue: Figure 1.9
Wires
- transistor density improvements mainly from decrease in feature size usually given in nm (nanometers)
1980: 4000, 1990: 1000, 2000: 180, 2010: 65, 2012: 32
- transistor count improves quadratically with a linear decrease in feature size
- signal delay of a wire increases with resistance and capacitance
- these get worse as wires shrink in size
Section 1.5: Trends in Power and Energy
Issues
- maximum power
- sustained power: metric is thermal design power (TDP)
- used both for power supplies and cooling requirements
Energy vs. Power
- power is energy per unit time
- power is measured in watts
- energy is measured in joules
- 1 watt = 1 joule per second
Microprocessor Energy and Power
- Energy for a transition 0 → 1 or 1 → 0:
Energy_dynamic ∝ capacitive load × Voltage^2
- Power_dynamic ∝ capacitive load × Voltage^2 × frequency
- For a fixed task, slowing the clock rate reduces the power, but not the energy
- Lowering the voltage lowers both
- Voltages have been reduced from 5 volts to 1 volt over 20 years, but can't go much lower
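A toy numerical sketch of these proportionalities (the capacitance, voltage, and frequency values are hypothetical; only the ratios matter):

```python
def dynamic_power(cap_load, voltage, frequency):
    # Power_dynamic is proportional to capacitive load * Voltage^2 * frequency
    return cap_load * voltage**2 * frequency

def dynamic_energy(cap_load, voltage, transitions):
    # Energy_dynamic is proportional to capacitive load * Voltage^2,
    # per transition, so a fixed task has a fixed number of transitions
    return cap_load * voltage**2 * transitions

# Halving the clock frequency halves the power ...
p_full = dynamic_power(1e-9, 1.0, 3e9)
p_half = dynamic_power(1e-9, 1.0, 1.5e9)
assert p_half == p_full / 2

# ... but leaves the energy for a fixed task unchanged, while lowering
# the voltage from 1.0 V to 0.8 V cuts energy quadratically (to 64%):
e_1v = dynamic_energy(1e-9, 1.0, 1e9)
e_08v = dynamic_energy(1e-9, 0.8, 1e9)
assert abs(e_08v / e_1v - 0.64) < 1e-9
```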
Techniques for improving energy efficiency
- Do nothing well
- Dynamic Voltage-Frequency Scaling (DVFS)
- Design for the typical use
- Overclocking: run one core faster when other cores not needed (turbo mode)
ClassQue: Power and Energy
Section 1.6: Trends in Cost
Cost of an Integrated Circuit
- die and wafer: See Figure 1.15.
How big is 300mm?
- See Figure 1.13 and
Figure 1.14
- cost of IC = (cost of die + cost of testing die + cost of packaging) / final test yield
- cost of die = cost of wafer / (dies per wafer × die yield)
- dies per wafer = π × (wafer diameter/2)^2 / die area − π × wafer diameter / sqrt(2 × die area)
- die yield depends on complexity and die area
- large fixed costs are important when quantities are not large
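The dies-per-wafer formula can be sketched as follows (the 300 mm wafer and 1 cm² die area are hypothetical example values, not from the text):

```python
import math

def dies_per_wafer(wafer_diameter, die_area):
    """dies per wafer = pi*(d/2)^2 / A  -  pi*d / sqrt(2*A)
    The first term is the wafer area divided by the die area; the second
    approximates the partial dies wasted around the wafer's edge."""
    return (math.pi * (wafer_diameter / 2)**2 / die_area
            - math.pi * wafer_diameter / math.sqrt(2 * die_area))

# A 300 mm (30 cm) diameter wafer with 1 cm^2 dies:
print(round(dies_per_wafer(30.0, 1.0)))  # 640
```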
Section 1.7: Dependability
- MTTF: mean time to failure
- MTTR: mean time to repair
- MTBF: mean time between failures = MTTF + MTTR
- availability = MTTF / (MTTF + MTTR)
- failure rate = 1/MTTF
- failure rates add (if small), MTTF's do not
Example 1:
A system has 2 disks, each with a MTTF of 1,000,000 hours and a power supply with a MTTF of 200,000 hours.
What is the system MTTF?
Solution:
failure rate = 2 × 1/1,000,000 + 1/200,000 = 7/1,000,000
MTTF = 1,000,000/7 = 143,000 hours.
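Example 1's rule that failure rates of independent components add (while MTTFs do not) can be sketched as:

```python
def system_mttf(mttfs):
    """Combine per-component MTTFs (hours) by summing failure rates 1/MTTF."""
    failure_rate = sum(1.0 / m for m in mttfs)
    return 1.0 / failure_rate

# Two disks at 1,000,000 hours each plus a power supply at 200,000 hours:
mttf = system_mttf([1_000_000, 1_000_000, 200_000])
print(round(mttf))  # 142857, i.e. 1,000,000/7 hours
```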
Example 2:
A disk drive has a MTTF of 1,000,000 hours. What is the probability that it will last 50 years without failure?
Solution:
Hours per year = 24*365.25 = 8766.
Probability of dying in a given year: 8766/1,000,000 = .008766
Probability of not dying in a given year: 1 - .008766 = .991234
Probability of not dying in 50 years: .991234^50 = .6439
What is wrong with this solution?
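The year-by-year calculation shown above can be reproduced as follows (this mirrors the solution exactly as given; deciding what is wrong with it is left as the question asks):

```python
# Reproduce Example 2's solution step by step.
hours_per_year = 24 * 365.25         # 8766
mttf = 1_000_000                     # hours
p_fail_year = hours_per_year / mttf  # .008766
p_survive_year = 1 - p_fail_year     # .991234
p_survive_50 = p_survive_year ** 50

print(round(p_survive_50, 4))  # 0.6439
```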
ClassQue: MTTF 1
Section 1.8: Measuring, Reporting, and Summarizing Performance
A computer user is interested in response time
An operator of a warehouse-scale computer is interested in throughput
Comparing two computers
Examples (Also see ClassQue Questions)
- Machine A is 40% faster than B and B is 40% faster than C. How much faster is A than C?
Solution: E_B/E_A = 1.4 and E_C/E_B = 1.4, so E_C/E_A = 1.4 × 1.4 = 1.96
- Machine performance increases by 40% per year for 10 years. By what percentage does performance increase
over this period?
Solution: 1.4^10 = 28.93. This is 1 + n/100 for n = 2793.
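Both comparisons above can be checked directly (a sketch of the arithmetic, nothing more):

```python
# A is 40% faster than B, and B is 40% faster than C:
ratio_ac = 1.4 * 1.4
assert abs(ratio_ac - 1.96) < 1e-9  # A is 1.96 times (96%) faster than C

# 40% growth per year compounded over 10 years:
growth = 1.4 ** 10
percent_increase = (growth - 1) * 100
print(round(growth, 2))           # 28.93
print(round(percent_increase))    # 2793
```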
Today's News: September 5
Time
- wall-clock time, response time, elapsed time: includes everything, e.g. disk and memory accesses, OS overhead
- CPU time: does not include time waiting for I/O or time running other programs
ClassQue: About Time
Benchmarks
Types of benchmarks:
- kernels: small pieces of an application
- toy programs
- synthetic benchmarks: artificial programs written to mimic the behavior of real applications
Tricks people play
- benchmark-specific hardware
- benchmark-specific compilers
- benchmark-specific compiler flags
- source code modification
SPEC: Standard Performance Evaluation Corporation
- set of standardized benchmarks
- have evolved over time
- separate sets for integer, floating point, and server
- performance results should be reproducible
Summarizing Performance Results
ClassQue: geometric mean
Section 1.9: Quantitative Principles of Computer Design
Take advantage of parallelism
- Multiple processors
- Pipelining
- Caches
Principle of locality
- programs tend to reuse data and instructions that have been used recently
- a typical program will spend 90% of its execution time on 10% of the code
- temporal locality: recently used instructions and data are likely to be used again in the near future
- spatial locality: items with addresses near each other tend to be accessed close together in time
Focus on the common case
- adds are more common than multiplies
- for database, disk access may be most important
Amdahl's law
A given enhancement will only improve a part of a program.
Example: on a multicore machine, some of the code will only use one core.
Speedup
Speedup = Performance for entire task using the enhancement when possible
          / Performance for entire task without using the enhancement
or
Speedup = Execution time for entire task without using the enhancement
          / Execution time for entire task using the enhancement when possible
Amdahl's law depends on Fraction_enhanced and Speedup_enhanced.
Note: Fraction_enhanced is relative to the original design.
Fraction_unenhanced = 1 - Fraction_enhanced
Execution time_old = Execution time_old-unenhanced + Execution time_old-enhanced
Execution time_old = Execution time_old × Fraction_unenhanced + Execution time_old × Fraction_enhanced
Execution time_new = Execution time_old × Fraction_unenhanced + (Execution time_old / Speedup_enhanced) × Fraction_enhanced
Execution time_new = Execution time_old × (Fraction_unenhanced + Fraction_enhanced / Speedup_enhanced)
Speedup_overall = Execution time_old / Execution time_new
Amdahl's Law:
Speedup_overall = 1 / (Fraction_unenhanced + Fraction_enhanced / Speedup_enhanced)
or
Speedup_overall = 1 / (1 - Fraction_enhanced + Fraction_enhanced / Speedup_enhanced)
You must be able to use this formula, but you also need to understand when the formula cannot be used directly.
You can only use the formula directly when you know both the enhanced fraction and the enhanced speedup.
Important:
To use the formula as given, you must know the enhanced fraction,
which is the fraction of the time spent on the enhanced part
when run on the old system.
ClassQue: speedup 1
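Amdahl's Law as stated above can be sketched as a one-line function (the 0.3/1.8 values are the ones used in the first example of this section):

```python
def speedup_overall(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law. fraction_enhanced is the fraction of the OLD
    execution time spent in the part that can be enhanced."""
    return 1.0 / ((1 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Enhancing 30% of the old execution time by a factor of 1.8:
print(round(speedup_overall(0.3, 1.8), 4))  # 1.1538
```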
Examples
- A new design makes the floating point processor of the CPU 80% faster than before. What is the overall speedup for a task
in which floating point operations took up 30% of the CPU time with the old design?
Solution:
Method 1: use the formula
The enhanced fraction is 30% = .3. The unenhanced fraction is .7. The enhanced speedup is 1.8. The overall speedup is
1/(.7 + .3/1.8) = 1.1538.
Method 2: use the definition directly
Execution time_old = .7 × Execution time_old + .3 × Execution time_old
Execution time_new = .7 × Execution time_old + .3 × Execution time_old/1.8
Execution time_new = (.7 + .3/1.8) × Execution time_old = .86667 × Execution time_old
Speedup = 1/.86667 = 1.15385
- A new design makes the floating point processor of the CPU 80% faster than before. What is the overall speedup for a task
in which floating point operations took up 10% of the CPU time with the new design?
Solution: We cannot use the formula directly since the enhanced fraction in the formula is based on the old design.
Method 1: first calculate Fraction_enhanced (relative to the old design).
Suppose the total execution time on the new system is E.
The unenhanced time on the new system is .9E and the enhanced time is .1E.
The unenhanced time on the old system is still .9E and the enhanced time is .1E × 1.8 = .18E.
The fraction enhanced (relative to the old system) is
.18E/(.9E + .18E) = .1667.
The speedup is
1/(1 - .1667 + .1667/1.8) = 1.08.
Method 2: Use the definition of Speedup directly
Relative to the new design, the enhanced fraction is .1 and the unenhanced fraction is .9.
Execution time_new = .9 × Execution time_new + .1 × Execution time_new
Execution time_old = .9 × Execution time_new + .1 × Execution time_new × 1.8
Speedup_overall = Execution time_old / Execution time_new
= (.9 × Execution time_new + .1 × Execution time_new × 1.8) / Execution time_new = .9 + .18 = 1.08.
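Method 1 above can be sketched as a conversion helper: given a fraction measured on the new design, recover the old-design fraction that Amdahl's Law needs (a sketch of this example's arithmetic, not a general library routine):

```python
def old_fraction_enhanced(new_fraction, speedup_enhanced):
    """Convert the fraction of NEW execution time spent in the enhanced
    part into the corresponding fraction of OLD execution time."""
    old_enhanced = new_fraction * speedup_enhanced  # un-speed-up that part
    old_unenhanced = 1 - new_fraction               # unchanged part
    return old_enhanced / (old_unenhanced + old_enhanced)

# 10% of new-design time, with an enhanced speedup of 1.8:
f = old_fraction_enhanced(0.1, 1.8)
print(round(f, 4))                  # 0.1667
speedup = 1 / ((1 - f) + f / 1.8)   # Amdahl's Law with the corrected fraction
print(round(speedup, 2))            # 1.08
```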
Processor Performance Equations
Dependencies
- Clock cycle time: hardware technology and organization
- CPI: organization and ISA
- Instruction count: ISA and compiler technology
More on CPI
- CPU clock cycles = Σ (i=1 to n) IC_i × CPI_i
- CPI = Σ (i=1 to n) (IC_i / Instruction count) × CPI_i
Examples
- In a particular task, 23% of the instructions are floating point instructions, each of which takes 5 cycles to execute.
All other instructions take 1 cycle to execute. What is the average CPI for this task?
Solution: CPI = .23 × 5 + .77 × 1 = 1.92
- A new design can reduce the number of cycles for a floating point operation to 4, without changing the clock speed.
What is the new CPI and what is the expected speedup?
Solution: CPI = .23 × 4 + .77 × 1 = 1.69
Speedup = 1.92/1.69 = 1.136
- What is wrong with the following method of solving the above problem using the Amdahl's Law equation?
Fraction_enhanced = .23, Fraction_unenhanced = .77, and Speedup_enhanced = 5/4 = 1.25.
Therefore
Speedup_overall = 1/(.77 + .23/1.25) = 1.0482
Answer: .23 is the fraction of instructions that are enhanced, not the fraction of time spent executing the enhanceable part.
The correct enhanced fraction is (.23 × 5)/(.23 × 5 + .77) = 1.15/1.92 = .5990. This gives
Speedup_overall = 1/(.4010 + .5990/1.25) = 1.136
Section 1.10: Putting It All Together
- This section compares three server systems using the SPECPower benchmark.
- These are configured with a small amount of memory and small disks.
- Comparing just the cost of the hardware and software, the systems are ordered by performance (ops/$): System 1, System 2, System 3.
- Comparing power requirements, the systems are ordered by performance (ops/joule): System 3, System 2, System 1.
Section 1.11: Fallacies and Pitfalls
Fallacy: a commonly held misbelief
Pitfall: an easily made mistake
Fallacies
- Multiprocessors are a silver bullet.
- Hardware enhancements that increase performance improve energy efficiency or are at least neutral.
- Benchmarks remain valid indefinitely.
- The rated MTTF of disks is almost 140 years, so disks rarely fail.
- Peak performance tracks observed performance.
Pitfalls
- Amdahl's law
- Single point of failure
- Fault detection can lower availability