Shared Counters and Parallelism
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
A Shared Pool
An unordered set of objects:
put() inserts an item, blocking if the pool is full.
remove() removes and returns an item, blocking if the pool is empty.

public interface Pool<T> {
  public void put(T x);
  public T remove();
}
Simple Locking Implementation
Every put() and remove() goes through one lock.
Problem: hot-spot contention.
Problem: sequential bottleneck.
Solution to contention: a queue lock.
Solution to the sequential bottleneck: ?
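For reference, here is a minimal sketch of the single-lock pool described above, assuming the Pool<T> interface from the earlier slide; the class name and the condition-based blocking are illustrative assumptions, not the book's code.

import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// One lock guards the whole pool, so every put() and remove() serializes on it:
// exactly the hot spot and sequential bottleneck named above.
class LockedPool<T> implements Pool<T> {
  private final ArrayDeque<T> items = new ArrayDeque<>();
  private final int capacity;
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  LockedPool(int capacity) { this.capacity = capacity; }

  public void put(T x) {
    lock.lock();
    try {
      while (items.size() == capacity) notFull.awaitUninterruptibly(); // block if full
      items.addLast(x);
      notEmpty.signal();
    } finally { lock.unlock(); }
  }

  public T remove() {
    lock.lock();
    try {
      while (items.isEmpty()) notEmpty.awaitUninterruptibly(); // block if empty
      T x = items.removeFirst();
      notFull.signal();
      return x;
    } finally { lock.unlock(); }
  }
}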
Counting Implementation
put() and remove() each take a ticket from their own shared counter (… 19, 20, 21 …) and use it to pick a slot.
Only the counters are sequential.
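A hedged sketch of this counting-based pool, using atomic counters as stand-ins for the shared counter (any counter from this lecture would do); the class name, the spinning hand-off, and the modulo indexing are assumptions, not the book's code.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Only the two counters are shared; each slot is its own handoff.
class CountingPool<T> implements Pool<T> {
  private final AtomicInteger putCounter = new AtomicInteger();    // "put" counter
  private final AtomicInteger removeCounter = new AtomicInteger(); // "remove" counter
  private final AtomicReferenceArray<T> slots;

  CountingPool(int size) { slots = new AtomicReferenceArray<>(size); }

  public void put(T x) {
    int slot = putCounter.getAndIncrement() % slots.length();
    while (!slots.compareAndSet(slot, null, x)) {}   // spin while the slot is still full
  }

  public T remove() {
    int slot = removeCounter.getAndIncrement() % slots.length();
    T x;
    while ((x = slots.getAndSet(slot, null)) == null) {}  // spin while the slot is empty
    return x;
  }
}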
Shared Counter
Threads fetch successive values (… 0, 1, 2, 3 …).
No duplication.
No omission.
Not necessarily linearizable.
Shared Counters
Can we build a shared counter with low memory contention and real parallelism?
Locking: queue locks can reduce contention, but they are no help with the parallelism issue …
Software Combining Tree
Contention: all spinning is local.
Parallelism: potential n / log n speedup.
Combining Trees
Example: the root holds 0; one thread adds +3, another adds +2.
The two threads meet at a node and combine their sums into +5.
The combined sum is added to the root, which becomes 5.
The prior value is returned to the children and then to the threads: one thread gets 0, the other gets 3.
What if?
What if threads don't arrive together? Should the first thread stay or should it go? How long should it wait? Waiting times add up …
Idea: use a multi-phase algorithm in which threads wait in parallel.
Combining Status

enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT };

IDLE: nothing going on.
FIRST: 1st thread is a partner for combining; it will return to check for a 2nd thread.
SECOND: 2nd thread has arrived with a value for combining.
RESULT: 1st thread has deposited a result for the 2nd thread.
ROOT: special case, the root node.
Node Synchronization
Use “meta locking”:
Short-term: synchronized methods give consistency during a method call.
Long-term: a boolean locked field gives consistency across calls.
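Putting the two kinds of state together, a plausible sketch of the Node fields; the field names follow the code on the following slides, the rest is assumption.

// Sketch of the combining-tree Node state implied above; precombine(),
// combine(), op() and distribute() appear on the slides that follow.
class Node {
  enum CStatus { IDLE, FIRST, SECOND, RESULT, ROOT };
  boolean locked;     // long-term lock: consistency across calls
  CStatus cStatus;    // combining state
  int firstValue;     // 1st thread's contribution
  int secondValue;    // 2nd thread's contribution
  int result;         // result passed back down (at the root: the counter itself)
  Node parent;        // bookkeeping: position in the tree
  // all methods are synchronized: short-term consistency during a method call
}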
Phases
Precombining: set up the combining rendezvous.
Combining: collect and combine operations.
Operation: hand off to the higher thread.
Distribution: distribute results to waiting threads.
Precombining Phase
Moving up the tree, examine each node's status:
If IDLE: promise to return to look for a partner, set the status to FIRST, and continue up.
At the ROOT: turn back.
If FIRST: I'm willing to combine, but lock the node for now and set its status to SECOND.
Code
Tree class: in charge of navigation.
Node class: combining state, synchronization state, bookkeeping.
Precombining Navigation

Node node = myLeaf;          // start at leaf
while (node.precombine()) {  // move up while instructed to do so
  node = node.parent;
}
Node stop = node;            // remember where we stopped
Precombining Node

synchronized boolean precombine() {
  while (locked) wait();        // short-term synchronization: wait while the node
                                // is locked (in use by an earlier combining phase)
  switch (cStatus) {            // check combining status
    case IDLE:                  // node was IDLE: I will return to look for the
      cStatus = CStatus.FIRST;  // 2nd thread's input value;
      return true;              // continue up the tree
    case FIRST:                 // I'm the 2nd thread: the 1st thread promised to
      locked = true;            // return, so lock the node so it won't leave without me;
      cStatus = CStatus.SECOND; // prepare to deposit the 2nd thread's input value;
      return false;             // end of precombining phase, don't continue up the tree
    case ROOT:
      return false;             // at the root the precombining phase ends,
                                // don't continue up the tree
    default:
      throw new PanicException(); // always check for unexpected values!
  }
}
Combining Phase
At a SECOND node, the 2nd thread locks out the 1st until the 2nd returns with its value (+3).
The 2nd thread deposits the value to be combined (+2), unlocks the node, and waits.
The 1st thread moves up the tree with the combined value (+5) …

Combining (reloaded)
If the 2nd thread has not yet deposited its value, the 1st thread is alone: it locks out its late partner and moves up with just its own value (+3), stopping at the root.
The 2nd thread's late precombining-phase visit is locked out.
Combining Navigation

node = myLeaf;                        // start at leaf
int combined = 1;                     // add 1
while (node != stop) {                // revisit the nodes visited in precombining
  combined = node.combine(combined);  // accumulate combined values, if any
  stack.push(node);                   // we will retraverse the path in reverse order …
  node = node.parent;                 // move up the tree
}
Combining Phase Node

synchronized int combine(int combined) {
  while (locked) wait();  // if the node is locked by the 2nd thread,
                          // wait until it deposits its value
  locked = true;          // lock out late attempts to combine
                          // (by threads still in precombining)
  firstValue = combined;  // remember my (1st thread) contribution
  switch (cStatus) {      // check status
    case FIRST:
      return firstValue;                // I (1st thread) am alone
    case SECOND:
      return firstValue + secondValue;  // not alone: combine with the 2nd thread
    default: …
  }
}

How do we know that no thread acquires the lock between the wait() and locked = true? Because the methods are synchronized.
Operation Phase
At the root: add the combined value (+5) to the root (now 5), and start back down.

Operation Phase (reloaded)
At a SECOND node: leave the value (+2) to be combined, unlock, and wait.
Operation Phase Navigation

prior = stop.op(combined);

At the node where we stopped: provide the collected sum and wait for the combining result.
Operation on Stopped Node
Only ROOT and SECOND are possible here. Why?

synchronized int op(int combined) {
  switch (cStatus) {
    case ROOT:                   // at the root:
      int prior = result;        // add the sum to the root,
      result += combined;
      return prior;              // return the prior value
    case SECOND:                 // intermediate node:
      secondValue = combined;    // deposit value for later combining …
      locked = false;            // unlock node (locked in precombining),
      notifyAll();               // then notify the 1st thread
      while (cStatus != CStatus.RESULT) wait();  // wait for the 1st thread to deliver results
      locked = false;            // unlock node (locked by the 1st thread
      notifyAll();               // in the combining phase)
      cStatus = CStatus.IDLE;
      return result;             // & return
    default: …
  }
}
Distribution Phase
The 1st thread moves down with the result.
At a SECOND node: leave the result for the 2nd thread & lock the node, then keep pushing results down the tree.
The 2nd thread awakens, unlocks the node (back to IDLE), and takes its value (3).
Distribution Phase Navigation

while (!stack.empty()) {     // traverse the path in reverse order
  node = stack.pop();
  node.distribute(prior);    // distribute results to waiting 2nd threads
}
return prior;                // return result to caller
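Assembled, the four navigation fragments above form the tree's increment operation; this is a sketch only, and the method name, the stack type, and the omitted exception handling are assumptions.

// Sketch: the four phases composed into one method (assumed wrapper).
public int getAndIncrement() {
  java.util.Deque<Node> stack = new java.util.ArrayDeque<>();
  // Phase 1: precombining, set up the rendezvous
  Node node = myLeaf;
  while (node.precombine()) {
    node = node.parent;
  }
  Node stop = node;
  // Phase 2: combining, collect and combine operations on the way up
  node = myLeaf;
  int combined = 1;
  while (node != stop) {
    combined = node.combine(combined);
    stack.push(node);
    node = node.parent;
  }
  // Phase 3: operation, hand off the combined value at the stop node
  int prior = stop.op(combined);
  // Phase 4: distribution, hand results back down to waiting threads
  while (!stack.isEmpty()) {
    node = stack.pop();
    node.distribute(prior);
  }
  return prior;
}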
Distribution Phase

synchronized void distribute(int prior) {
  switch (cStatus) {
    case FIRST:                   // no 2nd thread combined with me:
      cStatus = CStatus.IDLE;     // reset
      locked = false;             // & unlock the node
      notifyAll();
      return;
    case SECOND:
      result = prior + firstValue;
      cStatus = CStatus.RESULT;   // notify the 2nd thread that its result is available
      notifyAll();                // (the 2nd thread will release the lock)
      return;
    default: …
  }
}
Bad News: High Latency
Each operation traverses a path of length log n.
Good News: Real Parallelism
Two threads' operations (+2 and +3) are combined into one (+5): one thread does the root work for both.
Throughput Puzzles
Ideal circumstances: all n threads move together and combine; n increments in O(log n) time.
Worst circumstances: all n threads are slightly skewed and lock one another out; n increments take O(n · log n) time.
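A back-of-the-envelope illustration (not from the slides): with n = 64 threads and a 6-level tree, ideal combining finishes 64 increments in roughly 6 node-traversal times, while the worst case costs roughly 64 · 6 = 384.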
Index Distribution Benchmark

void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {              // how many iterations
    i = r.getAndIncrement();       // take a number
    Thread.sleep(random() % work); // pretend to work
  }                                // (more work, less concurrency)
}

work is the expected time between increments of the counter.
Performance (Simulated)
Here are some graphs.
Throughput: average increments in 1 million cycles.
Latency: average cycles per increment.
[Graphs: latency and throughput vs. concurrency (0 to 300), comparing a spin lock (Splock) with combining trees (Ctree[n]).]
Load Fluctuations
Combining is sensitive: if arrival rates drop, so do combining rates, and performance deteriorates!
Test: vary “work”, the duration between accesses …
Combining Rate vs Work
[Graph: combining rate as a function of “work”.]
Better to Wait Longer
[Graph: latency vs. processors for short, medium, and indefinite waits; waiting longer at a node gives lower latency.]
Conclusions
Combining trees give linearizable counters.
They work well under high contention, but are sensitive to load fluctuations.
They can be used for getAndMumble() operations generally.
And now for something completely different …
A Balancer
A balancer has input wires and output wires.
Tokens Traverse Balancers
The i-th token to traverse a balancer may enter on either wire, but leaves on wire i (mod 2).
Arbitrary input distribution, balanced output distribution.
Quiescent state: all tokens have exited.
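A small worked example (not from the slides): if five tokens traverse a single balancer and the network quiesces, three have exited on the top wire and two on the bottom, regardless of which input wires they arrived on.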
Smoothing Network
Outputs satisfy the 1-smooth property: in a quiescent state, any two output wires differ by at most 1.
Counting Network
Outputs satisfy the step property: in a quiescent state, the output counts are non-increasing from top to bottom and differ by at most 1.
Counting Networks Count!
Attach a counter to each output wire: with width 4, the wires hand out 0, 4, 8, …; 1, 5, 9, …; 2, 6, …; 3, 7, ….
Multiple counters distribute the load.
The step property guarantees no duplication or omission. How?
For example, if 5 and 9 are taken before 4 and 8, the step property guarantees that in-flight tokens will take the missing values.
Counting Networks
Good for counting the number of tokens:
low contention,
no sequential bottleneck,
high throughput,
practical networks of depth Θ(log² w).
Counting Network
[Figures: tokens 1 through 5 traverse the network and exit on successive output wires.]
Bitonic[k] Counting Network
Bitonic[k] is not Linearizable
Problem: Red finished before Yellow started, yet Red took 2 and Yellow took 0.
But it is “Quiescently Consistent”
It has the step property in any quiescent state (one in which all tokens have exited).
Shared Memory Implementation

class Balancer {
  boolean toggle;                // state
  Balancer[] next;               // output connections to the next balancers
  synchronized boolean flip() {  // getAndComplement
    boolean oldValue = this.toggle;
    this.toggle = !this.toggle;
    return oldValue;
  }
}
Shared Memory Implementation

Balancer traverse(Balancer b) {
  while (!b.isLeaf()) {          // stop when we exit the network
    boolean toggle = b.flip();   // flip state
    if (toggle)
      b = b.next[0];
    else
      b = b.next[1];
  }
  return b;                      // exit on wire
}
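A hedged sketch of how a width-w network hands out values once traverse() returns the exit balancer; the per-wire formula i + k·w matches the “0, 4, 8, …” slide, but the wrapper class, the wireId() accessor, and the per-wire locks are assumptions.

// Sketch: output wire i hands out i, i+width, i+2*width, ... (assumed wrapper).
class NetworkCounter {
  private final int width;
  private final Balancer entry;     // where tokens enter the network
  private final int[] wireCount;    // local count per output wire
  private final Object[] wireLock;  // one lock per wire: contention stays per-wire

  NetworkCounter(Balancer entry, int width) {
    this.entry = entry;
    this.width = width;
    this.wireCount = new int[width];
    this.wireLock = new Object[width];
    for (int i = 0; i < width; i++) wireLock[i] = new Object();
  }

  public int getAndIncrement() {
    Balancer exit = traverse(entry);   // traverse() as on the slide above
    int wire = exit.wireId();          // assumed: identifies the exit wire
    synchronized (wireLock[wire]) {
      return wire + width * wireCount[wire]++;
    }
  }
}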
Bitonic[2k] Inductive Structure
Two Bitonic[k] networks feed a Merger[2k].
Bitonic[4] Counting Network
Two Bitonic[2] networks feed a Merger[4].
Bitonic[8] Layout
Two Bitonic[4] networks feed a Merger[8].
Unfolded Bitonic[8] Network
Unfolding the recursion: four Merger[2]s, then two Merger[4]s, then one Merger[8].
Bitonic[k] Depth
Width k, depth (log₂ k)(log₂ k + 1)/2.
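For example, Bitonic[8] has width 8 and depth (log₂ 8)(log₂ 8 + 1)/2 = 3 · 4 / 2 = 6.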
Proof by Induction
Base: Bitonic[2] is a single balancer; it has the step property by definition.
Step: if Bitonic[k] has the step property, so does Bitonic[2k].
Bitonic[2k] Schematic
Two Bitonic[k] networks feed a Merger[2k].
Need to Prove only Merger[2k]
Induction hypothesis: the Bitonic[k] outputs have the step property. Need to prove: Merger[2k] preserves it.
Merger[2k] Schematic
Two Merger[k] networks followed by a final row of balancers.
Merger[2k] Layout
Induction Step
Assume Bitonic[k] has the step property, and assume Merger[k] has the step property whenever it gets two inputs of size k/2 that each have the step property; prove that Merger[2k] has the step property.
Assume Bitonic[k] and Merger[k] and Prove Merger[2k]
Induction hypothesis: the two Bitonic[k] inputs to Merger[2k] have the step property. Need to prove: its outputs do too.
Proof: Lemma 1
If a sequence has the step property, so do its even subsequence and its odd subsequence.
Lemma 2
Take one input's even subsequence together with the other's odd subsequence, and vice versa: the sums of these two combined sequences differ by at most 1.
Bitonic[2k] Layout Details
Within Merger[2k], one Merger[k] receives the even subsequence of one Bitonic[k]'s output and the odd subsequence of the other's; the second Merger[k] receives the remaining odd and even subsequences.
By Induction Hypothesis
The outputs of the two Bitonic[k] networks have the step property.
By Lemma 1
even odd odd even Merger[k] Art of Multiprocessor Programming 172 By Lemma 1 Merger[k] All subsequences have step property
By Lemma 2
even odd odd even Merger[k] Art of Multiprocessor Programming 173 By Lemma 2 Merger[k] Diff at most 1
By Induction Hypothesis
Merger[k] Art of Multiprocessor Programming 174 By Induction Hypothesis Merger[k] Outputs have step property
By Lemma 2
Merger[k] Art of Multiprocessor Programming 175 By Lemma 2 Merger[k] At most one diff
Last Row of Balancers
The final row of balancers pairs wire i from one Merger[k] with wire i from the other; its outputs are the outputs of the last layer, and they smooth out the at-most-one difference.
So Counting Networks Count
The Merger[2k] outputs therefore have the step property. QED.
Periodic Network Block
Block[2k] Schematic
Built recursively from two Block[k] networks.
Block[2k] Layout
Periodic[8]
Network Depth
Each Block[k] has depth log₂ k.
Need log₂ k blocks.
Grand total of (log₂ k)².
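For width 8: each Block[8] has depth log₂ 8 = 3 and 3 blocks are needed, for a total depth of 9 (versus 6 for Bitonic[8]).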
Lower Bound on Depth
Theorem: the depth of any width-w counting network is at least Ω(log w).
Theorem: there exists a counting network of Θ(log w) depth.
Unfortunately, the proof is non-constructive and the constants are in the 1000s.
Sequential Theorem
If a balancing network counts sequentially, meaning that tokens traverse one at a time, then it counts even when tokens traverse concurrently.
Order Doesn't Matter
Whether Red goes first and Blue second, or Blue first and Red second, the balancers end in the same states, and the output distribution is the same.
Index Distribution Benchmark

void indexBench(int iters, int work) {
  int i = 0;
  while (i < iters) {
    i = fetchAndInc();             // take a number from the network
    Thread.sleep(random() % work); // pretend to work
  }
}
Performance (Simulated)
[Graphs: throughput vs. number of processors for a spin lock, an MCS queue lock, a 64-leaf combining tree, and an 80-balancer counting network; higher is better. All graphs taken from Herlihy, Lim, Shavit, copyright ACM.]
Combining and counting are pretty close, but they beat the hell out of the competition!
Saturation and Performance
Undersaturated: P < w log w.
Saturated: P = w log w (optimal performance).
Oversaturated: P > w log w.
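Applying the formula (arithmetic only, assuming log base 2): a width-16 network saturates around P = 16 · log₂ 16 = 64 concurrent threads.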
Throughput vs. Size
[Graph: throughput vs. number of processors for Bitonic[4], Bitonic[8], and Bitonic[16].]
Shared Pool
As before, put() and remove() take numbers (… 19, 20, 21 …) from counting networks of depth log² w.
Can we do better?
Counting Trees
A counting tree has a single input wire and interleaved output wires.
It has the step property in a quiescent state.
Inductive Construction
Tree[2k] is a balancer whose top wire feeds Tree₀[k] (the k even outputs) and whose bottom wire feeds Tree₁[k] (the k odd outputs).
The top wire gets at most 1 more token than the bottom, so the top step sequence has at most one extra token on the last wire of its step.
Hence Tree[2k] has the step property in a quiescent state.
Implementing Counting Trees
[Example figures: tokens traverse the tree one at a time.]
Problems: contention, and a sequential bottleneck (every token toggles the root balancer).
Diffraction Balancing
If an even number of tokens visit a balancer, its toggle bit remains unchanged!
Idea: put a prism array in front of the balancer, where pairs of tokens can meet and “diffract”, one to each output wire, without touching the toggle bit.
Diffracting Tree
A diffracting balancer (a prism array plus a toggle) behaves the same as a balancer; the tree is built from diffracting balancers, with smaller prisms (k, k/2, …) at lower levels.
High load: lots of diffraction + few toggles.
Low load: low diffraction + few toggles.
High throughput with low contention.
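A minimal sketch of a diffracting balancer in the spirit of the slide above; the slot protocol, the spin limit, and all names are assumptions (published implementations are more careful about adaptivity), not the book's code.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;

// Try to pair up ("diffract") in a random prism slot; fall back to the toggle.
class DiffractingBalancer {
  private final AtomicReference<Thread>[] prism;  // prism array: pairing slots
  private boolean toggle;                         // fallback toggle bit
  private final int spinLimit;

  @SuppressWarnings("unchecked")
  DiffractingBalancer(int prismSize, int spinLimit) {
    prism = (AtomicReference<Thread>[]) new AtomicReference[prismSize];
    for (int i = 0; i < prismSize; i++) prism[i] = new AtomicReference<>();
    this.spinLimit = spinLimit;
  }

  // Returns true for output wire 0 and false for wire 1, like flip() above.
  boolean traverse() {
    int i = ThreadLocalRandom.current().nextInt(prism.length);
    AtomicReference<Thread> slot = prism[i];
    Thread me = Thread.currentThread();
    if (slot.compareAndSet(null, me)) {
      for (int spin = 0; spin < spinLimit; spin++) {
        if (slot.get() != me) return true;             // a partner diffracted me: wire 0
      }
      if (!slot.compareAndSet(me, null)) return true;  // partner arrived at the last moment
      // timed out alone: fall through to the toggle
    } else {
      Thread first = slot.get();
      if (first != null && slot.compareAndSet(first, null)) {
        return false;                                  // diffracted with the waiter: wire 1
      }
      // slot busy or race lost: fall through to the toggle
    }
    synchronized (this) {                              // ordinary balancer behavior
      boolean oldValue = toggle;
      toggle = !toggle;
      return oldValue;
    }
  }
}

Two tokens that meet in a prism slot leave on opposite wires without touching the toggle, which is why the toggle bit stays correct: diffracted tokens are removed in pairs, one per output wire.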
Performance
[Graphs: throughput and latency vs. concurrency P (0 to 300) for the diffracting tree (Dtree), the combining tree (Ctree), and the MCS lock.]
Summary
We can build a linearizable parallel shared counter.
By relaxing our coherence requirements, we can build a shared counter with low memory contention and real parallelism.
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License (creativecommons.org/licenses/by-sa/2.5/).
You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions:
Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.