Multiprocessor Architecture Basics
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Multiprocessor Architecture
Abstract models are (mostly) fine for understanding algorithm correctness and progress. To understand how concurrent algorithms actually perform, you need to understand something about multiprocessor architectures.
Pieces
Processors, threads, interconnect, memory, caches.
Old-School Multiprocessor
(Diagram: several processors, each with its own cache, connected by a bus to shared memory.)
Old School
Processors are on different chips and share off-chip memory resources. Communication between processors is typically slow.
Multicore Architecture
(Diagram: multiple cores on one chip, each with its own cache, sharing a bus and memory.)
Multicore
All processors are on the same chip and share on-chip memory resources. Communication between processors is now very fast.
SMP vs NUMA
(Diagram: SMP with a single shared memory vs. NUMA with memory split across nodes.) SMP: symmetric multiprocessor. NUMA: non-uniform memory access. CC-NUMA: cache-coherent NUMA.
Future Multicores
Short term: SMP. Long term: most likely a combination of SMP and NUMA properties.
Understanding the Pieces
Let's try to understand the pieces that make up a multiprocessor machine, and how they fit together.
Processors
Cycle: fetch and execute one instruction. Cycle times change: 1980, 10 million cycles/sec; 2005, 3,000 million cycles/sec.
Computer Architecture
Measure time in cycles; absolute cycle times change. Memory access: ~100s of cycles. This ratio changes slowly, and mostly gets worse.
Threads
A thread is the execution of a sequential program; it is software, not hardware. A processor can run a thread, put it aside (for example, when the thread does I/O or runs out of time), and run another thread.
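A minimal Java sketch of the idea (the class and thread names are illustrative, not from the slides): two software threads are started, and the runtime is free to interleave them, suspend one, and resume it later, possibly on a different processor.

// Two software threads sharing whatever processors are available.
// The scheduler may suspend either thread at any point and resume it later.
public class ThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 3; i++) {
                System.out.println(Thread.currentThread().getName() + " step " + i);
            }
        };
        Thread a = new Thread(work, "thread-A");
        Thread b = new Thread(work, "thread-B");
        a.start();
        b.start();
        a.join();   // wait for both threads to finish
        b.join();
    }
}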
Analogy
You work in an office. When you leave for lunch, someone else takes over your office. If you don't take a break, a security guard shows up and escorts you to the cafeteria. When you return, you may get a different office.
Interconnect
Bus: like a tiny Ethernet, a broadcast medium connecting processors to memory and processors to processors. Network: like a tiny LAN, mostly used on large machines.
Interconnect
The interconnect is a finite resource. Processors can be delayed if others consume too much of it, so avoid algorithms that use too much bandwidth.
Processor and Memory are Far Apart
(Diagram: processor and memory separated by the interconnect.)
Reading from Memory
(Diagram sequence: the processor sends an address over the interconnect, waits many cycles ("zzz…"), and eventually gets the value back.)
Writing to Memory
(Diagram sequence: the processor sends an address and value over the interconnect, waits many cycles ("zzz…"), and eventually gets an acknowledgment back.)
Cache: Reading from Memory
(Diagram sequence: the processor sends out the address as before, but the value returned from memory is also installed in the processor's cache.)
Cache Hit
(Diagram: the processor asks its cache whether it holds the address; the cache answers "Yes!" and supplies the value without using the interconnect.)
Cache Miss
(Diagram sequence: the cache answers "No…", so the address goes out over the interconnect, and the value returned from memory is installed in the cache.)
Local Spinning
With caches, spinning becomes practical. The first read loads the flag bit into the cache; as long as the flag doesn't change, every later read hits in the cache and uses no interconnect. When the flag does change, there is a one-time cost (see cache coherence below).
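A minimal Java sketch of local spinning, assuming a volatile flag (the class and method names are illustrative, not from the slides):

// The waiting thread re-reads a flag that stays in its cache until the
// other thread's write invalidates the cached copy.
public class SpinFlag {
    private volatile boolean ready = false;  // volatile so the write becomes visible

    void awaitReady() {
        while (!ready) {
            // spin: after the first read, this hits in the local cache
            // until a writer invalidates the line
        }
    }

    void announce() {
        ready = true;  // one-time coherence traffic to invalidate other cached copies
    }
}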
Granularity
Caches operate at a larger granularity than a word. A cache line is a fixed-size block containing the address (today 64 or 128 bytes).
Locality
Temporal locality: if you use an address now, you will probably use it again soon, so fetch it from the cache, not memory. Spatial locality: if you use an address now, you will probably use a nearby address soon, one in the same cache line.
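A minimal Java sketch of spatial locality (illustrative, not from the slides): each row of a Java 2D array is stored contiguously, so a row-by-row traversal reuses each cache line, while a column-by-column traversal jumps to a different row, and usually a different line, on almost every access.

public class LocalityDemo {
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++)         // consecutive addresses:
            for (int j = 0; j < a[i].length; j++)  // good use of each cache line
                sum += a[i][j];
        return sum;
    }

    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < a[0].length; j++)      // strided addresses:
            for (int i = 0; i < a.length; i++)     // a new line on almost every access
                sum += a[i][j];
        return sum;
    }
}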
Hit Ratio
The proportion of requests that hit in the cache: a measure of the effectiveness of the caching mechanism. Depends on the locality of the application.
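A small worked example with made-up numbers (the counts and cycle costs below are assumptions, not measurements) showing how the hit ratio determines the average memory-access time:

public class HitRatio {
    public static void main(String[] args) {
        long hits = 950, misses = 50;           // hypothetical access counts
        double hitCost = 2.0, missCost = 100;   // cycles, rough orders of magnitude
        double hitRatio = (double) hits / (hits + misses);
        double avgCycles = hitRatio * hitCost + (1 - hitRatio) * missCost;
        System.out.printf("hit ratio = %.2f, avg access = %.1f cycles%n",
                          hitRatio, avgCycles);
    }
}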
L1 and L2 Caches
(Diagram: the L1 cache sits next to the processor, with the larger L2 cache between it and memory.) L1: small and fast, 1 or 2 cycles. L2: larger and slower, 10s of cycles, ~128-byte lines.
When a Cache Becomes Full…
Need to make room for a new entry by evicting an existing one. This requires a replacement policy, usually some kind of least-recently-used heuristic.
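A minimal Java sketch of a least-recently-used policy (illustrative and software-level, not a description of the hardware), using LinkedHashMap's access-order mode:

import java.util.LinkedHashMap;
import java.util.Map;

// Evicts the entry that was touched longest ago once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() refreshes recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once we exceed capacity
    }
}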
Fully Associative Cache
Any line can be anywhere in the cache. Advantage: can replace any line. Disadvantage: hard to find lines.
Direct Mapped Cache
Every address has exactly one slot. Advantage: easy to find a line. Disadvantage: must replace a fixed line.
K-way Set Associative Cache
Each slot holds k lines. Advantage: fairly easy to find a line. Advantage: some choice in which line to replace.
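A minimal sketch, with assumed line size and set count, of how an address maps to a set in a k-way set-associative cache: the line number is the address divided by the line size, and the set is that number modulo the number of sets; the line may then sit in any of the set's k ways.

public class SetIndex {
    static final int LINE_SIZE = 64;   // bytes per cache line (assumed)
    static final int NUM_SETS  = 512;  // number of sets in the cache (assumed)

    static int setOf(long address) {
        long lineNumber = address / LINE_SIZE;   // strip the offset within the line
        return (int) (lineNumber % NUM_SETS);    // pick the set; any of its k ways may hold the line
    }

    public static void main(String[] args) {
        System.out.println(setOf(0x12345L));     // neighbouring addresses share a set
    }
}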
Multicore Set Associativity
k is 8 or even 16, and growing. Why? Because cores share sets: threads accessing different data cut the effective cache size each of them sees.
Cache Coherence
A and B both cache address x. A writes to x and updates its cache. How does B find out? There are many cache coherence protocols in the literature.
MESI
Modified: the cached data has been modified and must be written back to memory. Exclusive: not modified, and this is the only cached copy. Shared: not modified, and may be cached elsewhere. Invalid: the cache contents are not meaningful.
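A minimal Java sketch of common MESI transitions (illustrative, not a complete protocol implementation); the bus diagrams that follow walk through the same transitions. "Remote" events are requests snooped from other caches on the bus.

public enum MesiState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    // Another cache asks to read the line: modified/exclusive copies downgrade to shared.
    MesiState onRemoteRead() {
        return (this == INVALID) ? INVALID : SHARED;
    }

    // Another cache asks to write the line: every other copy becomes invalid.
    MesiState onRemoteWrite() {
        return INVALID;
    }

    // This processor writes the line: it becomes the only, modified copy.
    MesiState onLocalWrite() {
        return MODIFIED;
    }

    // This processor reads the line; otherCopiesExist says whether the bus
    // snoop found the line in another cache.
    MesiState onLocalRead(boolean otherCopiesExist) {
        if (this == INVALID) {
            return otherCopiesExist ? SHARED : EXCLUSIVE;
        }
        return this;  // a hit leaves the state unchanged
    }
}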
Processor Issues Load Request
(Diagram: a processor broadcasts "load x" on the bus; memory holds the data and the caches start out empty.)
Memory Responds
(Diagram: memory answers "Got it!" and supplies the data; the requesting cache installs the line in the Exclusive (E) state.)
Processor Issues Load Request
(Diagram: a second processor broadcasts "load x" on the bus; the first cache still holds the line in the Exclusive (E) state.)
Other Processor Responds
(Diagram: the cache holding the line answers "Got it" and supplies the data; both cached copies are now in the Shared (S) state.)
Modify Cached Data
(Diagram: a processor modifies the data in its own cache while another cache still holds a Shared (S) copy.)
Write-Through Cache
(Diagram: with a write-through cache, the write ("Write x!") is immediately broadcast on the bus so memory and the other caches see the new value.)
Write-Through Caches
Immediately broadcast changes. Good: memory and the caches always agree, and maybe more read hits. Bad (the "show stoppers"): bus traffic on every write, and most writes are to unshared data, for example loop indexes.
Write-Back Caches
Accumulate changes in the cache and write them back when the line is evicted, either because the cache needs the room for something else or because another processor wants the line.
Invalidate
(Diagram: the writing processor broadcasts "Invalidate x" on the bus; its copy becomes Modified (M) and the other cache's Shared copy becomes Invalid (I).)
Recall: Real Memory is Relaxed
Remember the flag principle? Alice's and Bob's flag variables start out false. Alice writes true to her flag and then reads Bob's; Bob writes true to his flag and then reads Alice's. At least one of them must see the other's flag as true.
Not Necessarily So
Sometimes the compiler (or the hardware) reorders memory operations. This can improve cache performance and interconnect use, but it causes unexpected concurrent interactions.
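A minimal Java sketch of the flag scenario with plain, non-volatile fields (class and field names are illustrative, not from the slides); under relaxed memory, each thread's write may still sit in a write buffer when the other thread reads, so both can observe false.

public class BrokenFlags {
    static boolean aliceFlag = false;  // plain fields: reads and writes may be reordered/buffered
    static boolean bobFlag = false;

    public static void main(String[] args) throws InterruptedException {
        Thread alice = new Thread(() -> {
            aliceFlag = true;            // write my flag...
            boolean sawBob = bobFlag;    // ...then read the other flag
            System.out.println("Alice saw Bob's flag: " + sawBob);
        });
        Thread bob = new Thread(() -> {
            bobFlag = true;
            boolean sawAlice = aliceFlag;
            System.out.println("Bob saw Alice's flag: " + sawAlice);
        });
        alice.start(); bob.start();
        alice.join(); bob.join();
        // The flag principle says at least one printout should be true,
        // but with relaxed memory both can be false.
    }
}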
Write Buffers
(Diagram: a write buffer holds writes on their way from the processor to memory, absorbing repeated writes to the same address and batching writes together.)
Volatile
In Java, if a variable is declared volatile, operations on it won't be reordered, and the write buffer is spilled to memory before the thread is allowed to continue past a write. This is expensive, so use it only when needed.
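A minimal Java sketch (illustrative, not from the slides) of the same flag protocol with volatile fields, which restores the flag principle: volatile writes are flushed before the thread continues, so at most one of the two threads can read false.

public class VolatileFlags {
    static volatile boolean aliceFlag = false;
    static volatile boolean bobFlag = false;

    static void alice() {
        aliceFlag = true;       // flushed to memory before the next statement
        if (!bobFlag) {
            // at most one of Alice and Bob can reach its branch:
            // they cannot both read the other's flag as false
        }
    }

    static void bob() {
        bobFlag = true;
        if (!aliceFlag) {
            // see comment above: at most one thread gets here
        }
    }
}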
creativecommons.org/licenses/by-sa/2.5/ This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 License. You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions: Attribution. You must attribute the work to "The Art of Multiprocessor Programming" (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.