Introduction
- Goal: replace large, inefficient processors with many smaller, efficient processors to get better performance per joule
- Multiprocessors, clusters
- Scalability, availability, power efficiency
- Task-level (process-level) parallelism
- High throughput for independent jobs
- Parallel processing program
- Single program run on multiple processors
- Multicore microprocessors
- Chips with multiple processors (cores)
- Shared memory multiprocessors (SMP)
Hardware and Software
Challenge: design hardware and software that enable parallel processing programs to execute efficiently, in both performance and energy, as the number of cores scales.
Hardware
- Serial: e.g., Pentium 4
- Parallel: e.g., quad-core Xeon E5345
Software
- Sequential: e.g., matrix multiplication
- Concurrent: e.g., operating system
Either sequential or concurrent software can run on either serial or parallel hardware
We use “parallel processing program” to mean either sequential or concurrent software running on parallel hardware
Parallel Programming
- It’s hard to create parallel software
- Parallel software must deliver a significant performance improvement
- Otherwise, just use a faster uniprocessor, since it’s easier!
- Difficulties of parallel programming:
- Partitioning
- Coordination
- Communications overhead
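Why partitioning matters can be sketched with a small calculation. The model below is a simplifying assumption, not something stated in these notes: all workers run fully in parallel, so elapsed time equals the largest share of work given to any one worker, and coordination and communication costs are ignored. The name `parallel_speedup` is hypothetical.

```python
def parallel_speedup(loads):
    """Speedup over a single processor when worker i receives loads[i] units of work.

    Assumed model: elapsed parallel time is the largest individual load;
    coordination and communication overhead are ignored.
    """
    total_work = sum(loads)        # time on one processor
    parallel_time = max(loads)     # the slowest worker finishes last
    return total_work / parallel_time


# 100 workers with perfectly even shares of 100 units of work.
balanced = [1.0] * 100

# Same total work, but one worker receives 5% of it.
skewed = [5.0] + [95.0 / 99] * 99

print(parallel_speedup(balanced))          # 100.0
print(round(parallel_speedup(skewed), 2))  # 20.0
```

Under this model a single worker holding 5% of the work caps the speedup at about 20 on 100 processors, which is why an even partition is the first thing to get right.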
Amdahl’s Law
The sequential part of a program limits the achievable speedup
Example: can 100 processors give a 90× speedup?
- $T_{new} = \frac{T_{parallelizable}}{100} + T_{sequential}$
- $Speedup = \frac{1}{(1-F_{parallelizable}) + \frac{F_{parallelizable}}{100}} = 90$
- Solving: $F_{parallelizable} \approx 0.999$
The sequential part can be at most about 0.1% of the original execution time
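The arithmetic in this example can be checked with a short script. This is a minimal sketch of the speedup formula used in this section; the function names `amdahl_speedup` and `required_parallel_fraction` are hypothetical, and the second function simply solves the speedup formula for the parallelizable fraction.

```python
def amdahl_speedup(f_parallel, n_processors):
    """Speedup when a fraction f_parallel of the work is spread over n_processors."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / n_processors)


def required_parallel_fraction(target_speedup, n_processors):
    """Parallelizable fraction needed to reach target_speedup on n_processors.

    Obtained by solving 1 / ((1 - F) + F / n) = S for F.
    """
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)


f = required_parallel_fraction(90, 100)
print(f"{f:.4f}")                             # 0.9989
print(f"{amdahl_speedup(0.999, 100):.1f}")    # 91.0
```

The exact solution is slightly below 0.999, so rounding up to 0.999 gives a speedup just above the 90× target.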
Parallel Processing
- The following techniques can enable parallel processing
- SIMD, vector (section 6.3)
- SISD: single instruction stream, single data stream
- Uniprocessor, e.g., Intel Pentium 4
- Multithreading (section 6.4)
- MIMD: multiple instruction streams, multiple data streams
- Multicore processor, e.g., Intel Core i7
- SMPs and clusters (section 6.5)
- SPMD: single program, multiple data
- Typical way to write programs for a multicore processor
- One program runs on multiple processors
- Different processors execute different sections of the code
- GPUs (section 6.6)
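The single program, multiple data style listed above can be illustrated with a small sketch. It uses Python threads purely to show the structure; CPython threads do not execute bytecode in parallel, so this is not a performance demonstration. The names `spmd_program` and `run_spmd` and the choice of a summation workload are assumptions made for illustration.

```python
import threading


def spmd_program(rank, num_workers, data, partial, result, barrier):
    """The one program that every worker runs; `rank` selects its share."""
    chunk = (len(data) + num_workers - 1) // num_workers  # ceiling division
    lo = rank * chunk
    hi = min(lo + chunk, len(data))
    partial[rank] = sum(data[lo:hi])   # each worker sums its own slice

    barrier.wait()                     # wait until all partial sums are written

    if rank == 0:
        # A different section of the same program: only worker 0 combines.
        result[0] = sum(partial)


def run_spmd(data, num_workers=4):
    partial = [0] * num_workers
    result = [None]
    barrier = threading.Barrier(num_workers)
    threads = [
        threading.Thread(
            target=spmd_program,
            args=(rank, num_workers, data, partial, result, barrier),
        )
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(run_spmd(list(range(1, 101))))  # 5050
```

Every worker executes the same function, and conditional code on `rank` is what makes different workers run different sections of it.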