Futures, Scheduling, and Work Distribution
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
How to write Parallel Apps?
Split a program into parallel parts in an effective way, and manage the threads that run them.
Matrix Multiplication
In the product C = A·B, entry c[row][col] is the dot product of row row of A and column col of B. A first attempt assigns one worker thread to each matrix entry:

class Worker extends Thread {          // each worker is a thread
  int row, col;
  Worker(int row, int col) {           // which matrix entry to compute
    this.row = row;
    this.col = col;
  }
  public void run() {                  // actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
// a, b, c, and n are fields of an enclosing matrix class (not shown)
The multiply() method creates the workers, starts them, and waits for them to finish:

void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);   // create n x n threads
  for (int row …)
    for (int col …)
      worker[row][col].start();                  // start them
  for (int row …)
    for (int col …)
      worker[row][col].join();                   // wait for them to finish
}

What's wrong with this picture?
Thread Overhead
Threads require resources: memory for stacks, setup and teardown costs, and scheduler overhead. For short-lived threads, the ratio of useful work to overhead is bad.
Thread Pools
It is more sensible to keep a pool of long-lived threads. Threads are assigned short-lived tasks: run the task, rejoin the pool, and wait for the next assignment.
Thread Pool = Abstraction
A pool insulates the programmer from the platform: big machine, big pool; small machine, small pool. The code is portable and works across platforms, so you can worry about the algorithm, not the platform.
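For instance, a pool can be sized to whatever machine the program lands on. A minimal sketch (the one-thread-per-core sizing policy is an assumption, not from the slides):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSetup {
    // One worker thread per available core: a big machine gets a big pool,
    // a small machine a small pool, and the code that submits tasks never changes.
    static final ExecutorService exec =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
}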
ExecutorService Interface
Found in java.util.concurrent. A task is a Runnable object if no result value is expected; the executor calls its run() method. A task is a Callable<T> object if a result value of type T is expected; the executor calls its T call() method.
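A concrete illustration of the two task kinds (the lambda bodies here are made up for the example): one Runnable and one Callable<Integer>, submitted to the same executor.

import java.util.concurrent.*;

class TaskKinds {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newCachedThreadPool();

        Runnable hello = () -> System.out.println("hello");   // no result expected
        Callable<Integer> answer = () -> 6 * 7;               // produces an Integer

        Future<?> done = exec.submit(hello);         // Future<?>: completion only
        Future<Integer> value = exec.submit(answer); // Future<Integer>: a value

        done.get();                         // blocks until hello has run
        System.out.println(value.get());    // blocks, then prints 42
        exec.shutdown();
    }
}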
Future<T>
Callable<T> task = …;
Future<T> future = executor.submit(task);
T value = future.get();

Submitting a Callable<T> task returns a Future<T> object. The Future's get() method blocks until the value is available.
Future<?>
Runnable task = …;
Future<?> future = executor.submit(task);
future.get();

Submitting a Runnable task returns a Future<?> object. The Future's get() method blocks until the computation is complete.
Note
Executor service submissions, like New England traffic signs, are purely advisory in nature. The executor, like the Boston driver, is free to ignore any such advice, and could execute the tasks sequentially.
Matrix Addition
Adding two matrices decomposes into 4 parallel additions of half-size submatrices.

Matrix Addition Task

class AddTask implements Runnable {
  Matrix a, b;                         // the matrices to add
                                       // (result c and executor exec come from the
                                       //  enclosing scope, not shown)
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0];     // base case: add directly
    } else {
      (partition a, b into half-size matrices aij and bij)   // constant-time operation
      // submit 4 tasks
      Future<?> f00 = exec.submit(add(a00, b00));
      …
      Future<?> f11 = exec.submit(add(a11, b11));
      // let them finish
      f00.get(); …; f11.get();
    }
  }
}
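The slide code is a sketch: Matrix, add(...), c, and exec are not defined there. A self-contained version of the same divide-and-conquer idea, written here against plain double[][] arrays with an assumed power-of-two dimension (everything below is illustrative, not the book's code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class ParallelMatrixAdd {
    static final ExecutorService exec = Executors.newCachedThreadPool();

    // Adds the dim x dim blocks of a and b rooted at (row, col), storing into c.
    // dim is assumed to be a power of two.
    static void add(double[][] a, double[][] b, double[][] c,
                    int row, int col, int dim) {
        if (dim == 1) {
            c[row][col] = a[row][col] + b[row][col];   // base case: add directly
            return;
        }
        int half = dim / 2;
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 2; i++)                    // submit 4 half-size tasks
            for (int j = 0; j < 2; j++) {
                int r = row + i * half, cc = col + j * half;
                futures.add(exec.submit(() -> add(a, b, c, r, cc, half)));
            }
        for (Future<?> f : futures) {                  // let them finish
            try {
                f.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = i; b[i][j] = j; }
        add(a, b, c, 0, 0, n);
        System.out.println(java.util.Arrays.deepToString(c));
        exec.shutdown();
    }
}

The cached thread pool matters here: get() blocks inside a task, so a pool with a fixed number of threads could deadlock, while a cached pool simply creates more threads.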
Dependencies
The matrix example is not typical: its tasks are independent, so we don't need the results of one task to complete another. Often tasks are not independent.
Fibonacci
F(n) = 1 if n = 0 or 1, and F(n) = F(n-1) + F(n-2) otherwise. Note the potential parallelism, and the dependencies.
Disclaimer
This Fibonacci implementation is egregiously inefficient, so don't try this at home or at work! But it illustrates our point: how to deal with dependencies. Exercise: make this implementation efficient!
Multithreaded Fibonacci
import java.util.concurrent.*;

class FibTask implements Callable<Integer> {
  static ExecutorService exec = Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) { arg = n; }
  public Integer call() throws Exception {   // get() may throw checked exceptions
    if (arg > 2) {
      // parallel calls
      Future<Integer> left = exec.submit(new FibTask(arg - 1));
      Future<Integer> right = exec.submit(new FibTask(arg - 2));
      // pick up & combine results
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}
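A minimal usage sketch (assumed, not on the slides; FibDemo is placed in the same package as FibTask): submit the root task to the same pool and pick up its result.

class FibDemo {
    public static void main(String[] args) throws Exception {
        java.util.concurrent.Future<Integer> result =
            FibTask.exec.submit(new FibTask(10));
        System.out.println("fib(10) = " + result.get());   // prints 55
        FibTask.exec.shutdown();
    }
}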
Dynamic Behavior
A multithreaded program is a directed acyclic graph (DAG) that unfolds dynamically. Each node is a single unit of work.
Fibonacci DAG
(Figure: the DAG unfolds as fib(4) spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) nodes. The edges are call and get dependencies.)
How Parallel is That?
Define work: the total time on one processor. Define critical-path length: the longest dependency path. Can't beat that!
Unfolded DAG
(Figure: the unfolded DAG of the computation.)

Parallelism?

The serial fraction is 3/18 = 1/6, so Amdahl's Law says speedup cannot exceed 6.

Work?

T_1 is the time needed on one processor. Just count the nodes: T_1 = 18.

Critical Path?

T_∞ is the time needed on as many processors as you like. Find the longest path: T_∞ = 9.
Notation Watch
T_P = time on P processors. T_1 = work (time on 1 processor). T_∞ = critical-path length (time on infinitely many processors).
Simple Laws
Work Law: T_P ≥ T_1/P, because in one step we can't do more than P work. Critical Path Law: T_P ≥ T_∞, because we can't beat infinite resources.
Performance Measures
Speedup on P processors is the ratio T_1/T_P: how much faster the computation runs with P processors. Linear speedup means T_1/T_P = Θ(P). The maximum speedup (the average parallelism) is T_1/T_∞.
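As a worked check, using the numbers from the unfolded-DAG example above together with the work and critical path laws:

\[
\text{parallelism} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{18}{9} \;=\; 2,
\qquad
T_P \;\ge\; \max\!\Big(\frac{T_1}{P},\, T_\infty\Big) \;=\; \max\!\Big(\frac{18}{P},\, 9\Big).
\]

So no matter how many processors are available, that particular DAG never runs more than twice as fast as it does on one processor.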
Sequential Composition
Running A and then B: the work is T_1(A) + T_1(B), and the critical path is T_∞(A) + T_∞(B).
Parallel Composition
Running A and B in parallel: the work is T_1(A) + T_1(B), and the critical path is max{T_∞(A), T_∞(B)}.
Matrix Addition
Recall: a matrix addition decomposes into 4 parallel additions of half-size submatrices.
Addition
Let A_P(n) be the running time for an n x n matrix on P processors. For example, A_1(n) is the work and A_∞(n) is the critical-path length.
The work is A_1(n) = 4 A_1(n/2) + Θ(1): four spawned additions, plus constant time to partition and synchronize. This solves to A_1(n) = Θ(n²), the same as double-loop summation.
The critical-path length is A_∞(n) = A_∞(n/2) + Θ(1): the spawned additions run in parallel, plus constant time to partition and synchronize. This solves to A_∞(n) = Θ(log n).
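For completeness, here is one way to unroll the two recurrences (a standard argument, assumed rather than taken from the slides):

\[
A_1(n) \;=\; 4^{\log_2 n} A_1(1) \;+\; \Theta(1)\sum_{i=0}^{\log_2 n - 1} 4^{i} \;=\; \Theta(n^2),
\qquad
A_\infty(n) \;=\; A_\infty(1) \;+\; \Theta(1)\cdot \log_2 n \;=\; \Theta(\log n).
\]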
Matrix Multiplication Redux
(Figure: C = A·B decomposed into half-size blocks.)

First Phase …

8 multiplications of half-size matrices.

Second Phase …

4 additions to combine the partial products.
Multiplication
The work is M_1(n) = 8 M_1(n/2) + A_1(n): eight half-size multiplications, plus the final additions. Since A_1(n) = Θ(n²), this gives M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³), the same as the serial triple-nested loop.
Multiplication
The critical-path length is M_∞(n) = M_∞(n/2) + A_∞(n): the half-size multiplications run in parallel, followed by the final addition. Since A_∞(n) = Θ(log n), this gives M_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n).
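The same unrolling works here (again a standard argument, not spelled out on the slides); there are log_2 n levels of recursion:

\[
M_1(n) \;=\; 8^{\log_2 n} M_1(1) \;+\; \sum_{i=0}^{\log_2 n - 1} 8^{i}\,\Theta\!\big((n/2^{i})^{2}\big)
       \;=\; \Theta(n^3) \;+\; \Theta(n^2)\sum_{i=0}^{\log_2 n - 1} 2^{i} \;=\; \Theta(n^3),
\]
\[
M_\infty(n) \;=\; M_\infty(1) \;+\; \sum_{i=0}^{\log_2 n - 1} \Theta\!\big(\log (n/2^{i})\big)
            \;=\; \Theta\!\Big(\sum_{j=1}^{\log_2 n} j\Big) \;=\; \Theta(\log^{2} n).
\]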
Parallelism
The parallelism is M_1(n)/M_∞(n) = Θ(n³/log² n). To multiply two 1000 x 1000 matrices, that is roughly 1000³/10² = 10⁷, much more than the number of processors on any real machine.
Shared-Memory Multiprocessors
Parallel applications do not have direct access to the hardware processors. They share the machine with a mix of other jobs, all running together, coming and going dynamically.
Ideal Scheduling Hierarchy
Ideally, a user-level scheduler maps tasks directly onto processors.
Realistic Scheduling Hierarchy
In reality, a user-level scheduler maps tasks onto threads, and a kernel-level scheduler maps threads onto processors.
For Example
Initially, all P processors are available for the application. A serial computation takes over one processor, leaving P-1 for us. Then it waits for I/O, and we get that processor back, and so on.
Speedup
We map threads onto P processors, but cannot count on a P-fold speedup: what if the kernel doesn't cooperate? We can still try for a speedup proportional to P.
Scheduling Hierarchy
The user-level scheduler tells the kernel which threads are ready. The kernel-level scheduler, which we model as synchronous (for analysis, not correctness!), picks p_i threads to schedule at step i.
Greedy Scheduling
A node is ready if its predecessors are done. A greedy scheduler schedules as many ready nodes as possible. An optimal scheduler is greedy (why?), but not every greedy scheduler is optimal.
Greedy Scheduling
There are P processors. Complete step: ≥ P nodes are ready, so run any P of them. Incomplete step: < P nodes are ready, so run them all.
Theorem
For any greedy scheduler, T_P ≤ T_1/P + T_∞. Here T_P is the actual time; the T_1/P term says we can do no better than the work divided among the processors; the T_∞ term says we can do no better than the critical-path length.
T_P ≤ T_1/P + T_∞

Proof: The number of complete steps is at most T_1/P, because each complete step performs P work. The number of incomplete steps is at most T_∞, because each incomplete step shortens the unexecuted critical path by 1. QED
Near-Optimality
Theorem: any greedy scheduler is within a factor of 2 of optimal. Remark: computing the optimal schedule is NP-hard!
Proof of Near-Optimality
Let T_P* be the optimal time. From the work and critical path laws, T_P* ≥ max{T_1/P, T_∞}. By the theorem, T_P ≤ T_1/P + T_∞ ≤ 2 max{T_1/P, T_∞} ≤ 2 T_P*. QED
Work Distribution
(Cartoon: some threads are swamped with work while others sit idle.)

Work Dealing

(Cartoon: a busy thread deals surplus work out to an idle one.)

The Problem with Work Dealing

(Cartoon: the thread that is already overloaded is the one stuck doing the redistributing.)

Work Stealing

(Cartoon: an idle thread with no work of its own steals work from a busy one.)
Lock-Free Work Stealing
Each thread keeps a pool of ready work and removes work from it without synchronizing. If a thread runs out of work, it steals someone else's, choosing the victim at random.
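A sketch of a worker's main loop under this scheme. It is only illustrative: the shared deque array and its initialization are assumed, ThreadID.get() is the per-thread index helper used later on these slides, and popBottom()/popTop() are the DEQueue operations defined on the slides that follow.

class WorkStealingWorker implements Runnable {
    BDEQueue[] deque;                       // one work DEQueue per thread (assumed shared)
    java.util.Random random = new java.util.Random();

    public void run() {
        int me = ThreadID.get();                        // this thread's index
        while (true) {
            Runnable task = deque[me].popBottom();      // take local work first
            while (task == null) {                      // out of work: go stealing
                Thread.yield();
                int victim = random.nextInt(deque.length);   // pick a victim at random
                if (victim != me)
                    task = deque[victim].popTop();      // try to steal its oldest task
            }
            task.run();
        }
    }
}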
Local Work Pools
Each work pool is a double-ended queue.
Work DEQueue

(Figure: a work DEQueue¹ holding tasks; the owner works at the bottom end with pushBottom and popBottom. ¹DEQueue: double-ended queue.)
Obtain Work
Obtain work with popBottom, and run the task until it blocks or terminates.
New Work
New work appears when a node is unblocked or spawned; add it with pushBottom.
Whatcha Gonna do When the Well Runs Dry?
(Cartoon: the local DEQueue is empty.)
Steal Work from Others
Pick a random thread's DEQueue.
Steal this Task!
The thief calls popTop on the victim's DEQueue.
Task DEQueue
The task DEQueue methods are pushBottom, popBottom, and popTop. pushBottom and popBottom never happen concurrently, since only the owner calls them; they are also the most common operations, so make them fast (minimize the use of CAS).
Ideal
Ideally the DEQueue would be wait-free, linearizable, and constant time. Fortune cookie: "It is better to be young, rich and beautiful, than old, poor, and ugly."
Compromise
We compromise: the popTop method may fail if a concurrent popTop succeeds, or a concurrent popBottom takes the last task. Blame the victim!
Dreaded ABA Problem
(Figure sequence: a thief reads top and prepares its CAS; meanwhile the owner removes the tasks and new ones are pushed, so top comes back to the same index. The thief's CAS then succeeds even though the tasks beneath it have changed: the dreaded ABA problem.)
Fix for Dreaded ABA
The fix: attach a stamp (a version counter) to top, so a recycled index value can be told apart from the original.
Bounded DEQueue
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
  AtomicStampedReference<Integer> top;  // index & stamp, updated together (synchronized)
  volatile int bottom;                  // index of the bottom task: no need to synchronize,
                                        // but a memory barrier is needed (hence volatile)
  Runnable[] tasks;                     // array holding the tasks
}

The pushBottom(), popTop(), and popBottom() methods of BDEQueue follow.
pushBottom()
void pushBottom(Runnable r) {
  tasks[bottom] = r;   // bottom is the index at which to store the new task
  bottom++;            // adjust the bottom index
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;     // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;     // compute new value & stamp
  if (bottom <= oldTop)                                 // quit if the queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))  // try to steal the task (CAS)
    return r;
  return null;                                          // give up if a conflict occurs
}
Take Work
Runnable popBottom() {
  if (bottom == 0)                        // make sure the queue is non-empty
    return null;
  bottom--;                               // prepare to grab the bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;              // read top & prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)                    // top & bottom one or more apart: no conflict
    return r;
  if (bottom == oldTop) {                 // at most one task left
    bottom = 0;                           // the DEQueue will be empty even if the
                                          // CAS is unsuccessful (why?)
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r;                           // I win the CAS and keep the last task
  }                                       // if I lose the CAS, a thief must have won
  top.set(newTop, newStamp);              // failed to get the last task (bottom could be
  bottom = 0;                             // less than top): must still reset top and
  return null;                            // bottom, since the deque is empty
}
Old English Proverb
"May as well be hanged for stealing a sheep as a goat." From which we conclude: stealing was punished severely, and sheep were worth more than goats.
Variations
Stealing is expensive: you pay for a CAS and take only one task. What if we move more than one task at a time, or randomly balance the loads?
Work Balancing
(Example: one queue has 5 tasks and another has 2; after balancing, one holds ⌊(2+5)/2⌋ = 3 and the other ⌈(2+5)/2⌉ = 4.)
Work-Balancing Thread
public void run() {
  int me = ThreadID.get();
  while (true) {
    Runnable task = queue[me].deq();               // keep running tasks
    if (task != null) task.run();
    int size = queue[me].size();
    if (random.nextInt(size + 1) == size) {        // with probability 1/|queue|
      int victim = random.nextInt(queue.length);   // choose a random victim
      int min = …, max = …;
      synchronized (queue[min]) {                  // lock queues in canonical order
        synchronized (queue[max]) {
          balance(queue[min], queue[max]);         // rebalance the queues
        }
      }
    }
  }
}
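The balance() helper is not shown on these slides. A hypothetical sketch, assuming a queue type (here called WorkQueue) that exposes size(), deq(), and enq(), and that the caller already holds both locks, as in the run() method above:

static void balance(WorkQueue q0, WorkQueue q1) {
    // Identify the smaller and larger of the two queues, then move tasks
    // from the larger to the smaller until their sizes differ by at most
    // one (as in the 5-and-2 example above).
    WorkQueue qMin = (q0.size() < q1.size()) ? q0 : q1;
    WorkQueue qMax = (qMin == q0) ? q1 : q0;
    while (qMax.size() > qMin.size() + 1) {
        Runnable task = qMax.deq();
        if (task == null) break;   // defensive: nothing left to move
        qMin.enq(task);
    }
}

Note that the caller locks the two queues in canonical (index) order, not by size, which is what prevents two balancing threads from deadlocking on each other.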
Work Stealing & Balancing
Both approaches give a clean separation between the application and the scheduling layer, work well when the number of available processors fluctuates, and work on "black-box" operating systems.
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License (creativecommons.org/licenses/by-sa/2.5/). You are free: to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work), under the following conditions. Attribution: you must attribute the work to "The Art of Multiprocessor Programming" (but not in any way that suggests that the authors endorse you or your use of the work). Share Alike: if you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.