sched: Describe CFS load-balancer

Add some scribbles on how and why the load-balancer works.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
parent f4e26b120b
commit e9c84cb8d5
@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp

#ifdef CONFIG_SMP
/**************************************************
- * Fair scheduling class load-balancing methods:
- */
* Fair scheduling class load-balancing methods.
*
* BASICS
*
* The purpose of load-balancing is to achieve the same basic fairness the
* per-cpu scheduler provides, namely provide a proportional amount of compute
* time to each task. This is expressed in the following equation:
*
*   W_i,n/P_i == W_j,n/P_j for all i,j                               (1)
*
* Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
* W_i,0 is defined as:
*
*   W_i,0 = \Sum_j w_i,j                                             (2)
*
* Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
* is derived from the nice value as per prio_to_weight[].
*
* The weight average is an exponential decay average of the instantaneous
* weight:
*
*   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0               (3)
*
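As a rough illustration of the update step in (3), the following C sketch applies one round of decay to the n-th weight average. The helper name `decay_weight_avg`, its signature, and the use of truncating integer arithmetic are assumptions made for this example; it is not kernel code.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper, not a kernel function: one update step of
 * equation (3) for the n-th weight average.
 *
 *   W' = ((2^n - 1) * W + W_0) / 2^n
 *
 * Assumes avg * (2^n - 1) + inst does not overflow unsigned long.
 * Integer division truncates toward zero.
 */
static unsigned long decay_weight_avg(unsigned long avg, unsigned long inst,
				      unsigned int n)
{
	unsigned long scale;

	/* The shift below is only defined for n smaller than the word size. */
	assert(n < sizeof(unsigned long) * CHAR_BIT);
	scale = 1UL << n;	/* 2^n */

	return (avg * (scale - 1) + inst) / scale;
}
```

With an average of 1024 (the weight of one nice-0 task) and an instantaneous weight of 2048, the n = 1 average moves halfway to 1536, the n = 2 average moves a quarter of the way to 1280, and the n = 0 average simply tracks the instantaneous weight.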
* P_i is the cpu power (or compute capacity) of cpu i; typically it is the
* fraction of 'recent' time available for SCHED_OTHER task execution. But it
* can also include other factors [XXX].
*
* To achieve this balance we define a measure of imbalance which follows
* directly from (1):
*
*   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }    (4)
*
* We then move tasks around to minimize the imbalance. In the continuous
* function space it is obvious this converges; in the discrete case we get
* a few fun cases generally called infeasible weight scenarios.
*
* [XXX expand on:
*     - infeasible weights;
*     - local vs global optima in the discrete case. ]
*
*
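The imbalance measure (4) from the BASICS section can be sketched as a small C function. The `struct cpu_stat` type, the function names, and the use of doubles are hypothetical and exist only for this example. The sketch also assumes that avg(W/P) is the plain mean of the per-cpu W_i/P_i ratios, which the comment does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu record; the field names are illustrative only. */
struct cpu_stat {
	double weight;	/* W_i */
	double power;	/* P_i, must be non-zero */
};

/* Assumption: avg(W/P) is the mean of the per-cpu ratios. */
static double avg_ratio(const struct cpu_stat *c, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t k = 0; k < n; k++)
		sum += c[k].weight / c[k].power;
	return sum / (double)n;
}

/* Equation (4): imbalance between cpu i and cpu j. */
static double imbalance(const struct cpu_stat *c, size_t n, size_t i, size_t j)
{
	double avg, wi, wj, hi, lo;

	assert(i < n && j < n);
	avg = avg_ratio(c, n);
	wi = c[i].weight / c[i].power;
	wj = c[j].weight / c[j].power;
	hi = wi > avg ? wi : avg;	/* max{ avg(W/P), W_i/P_i } */
	lo = wj < avg ? wj : avg;	/* min{ avg(W/P), W_j/P_j } */
	return hi - lo;
}
```

Clamping both terms against the average means the measure is zero whenever cpu i is at or below the average and cpu j is at or above it, so only pairs that straddle the average report an imbalance.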
* SCHED DOMAINS
*
* In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
* for all i,j solution, we create a tree of cpus that follows the hardware
* topology where each level pairs two lower groups (or better). This results
* in O(log n) layers. Furthermore we reduce the number of cpus going up the
* tree to only the first of the previous level and we decrease the frequency
* of load-balance at each level inversely proportional to the number of cpus
* in the groups.
*
* This yields:
*
*   log_2 n     1     n
*   \Sum     { --- * --- * 2^i } = O(n)                              (5)
*   i = 0      2^i   2^i
*    |          |     |     `- size of each group
*    |          |     `- number of cpus doing load-balance
*    |          `- freq
*    `- sum over all levels
*
* Coupled with a limit on how many tasks we can migrate every balance pass,
* this makes (5) the runtime complexity of the balancer.
*
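To see why the sum in (5) is O(n), note that each term reduces to n / 2^i, so for a power-of-two n the total is 2n - 1. The C sketch below evaluates the sum term by term. The function name `balance_work` and the restriction to n = 2^log2_n are assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: evaluate the sum in equation (5) for
 * n = 2^log2_n cpus, in units of "cpus examined per base period".
 */
static unsigned long balance_work(unsigned int log2_n)
{
	unsigned long n, total = 0;

	assert(log2_n < sizeof(unsigned long) * CHAR_BIT - 1);
	n = 1UL << log2_n;

	for (unsigned int i = 0; i <= log2_n; i++) {
		unsigned long group_size = 1UL << i;		/* 2^i */
		unsigned long balancers = n / group_size;	/* n / 2^i */

		/* freq (1 / 2^i) * balancers * group size == n / 2^i */
		total += (balancers * group_size) / group_size;
	}
	return total;
}
```

For 8 cpus the levels contribute 8 + 4 + 2 + 1 = 15, which is 2n - 1 and therefore linear in n.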
* An important property here is that each CPU is still (indirectly) connected
* to every other cpu in at most O(log n) steps.
*
* The adjacency matrix of the resulting graph is given by:
*
*           log_2 n
*   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
*           k = 0
*
* And you'll find that:
*
*   A^(log_2 n)_i,j != 0 for all i,j                                 (7)
*
* Showing there's indeed a path between every cpu in at most O(log n) steps.
* The task movement gives a factor of O(m), giving a convergence complexity
* of:
*
*   O(nm log n), n := nr_cpus, m := nr_tasks                         (8)
*
*
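Equations (6) and (7) can be checked by brute force for a small machine. The C sketch below builds the adjacency matrix for 8 cpus, raises it to a power using boolean arithmetic, and tests whether every entry is non-zero. All names (`DEMO_CPUS`, `adjacent`, `bool_matmul`, `all_connected_within`) are hypothetical, and a power-of-two cpu count is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_LOG2_CPUS	3
#define DEMO_CPUS	(1u << DEMO_LOG2_CPUS)

/* Equation (6): entry A_i,j of the adjacency matrix. */
static bool adjacent(unsigned int i, unsigned int j)
{
	for (unsigned int k = 0; k <= DEMO_LOG2_CPUS; k++) {
		if (i % (1u << k) == 0 && (i >> (k + 1)) == (j >> (k + 1)))
			return true;
	}
	return false;
}

/* Boolean matrix product out = a * b; out must not alias a or b. */
static void bool_matmul(bool out[DEMO_CPUS][DEMO_CPUS],
			bool a[DEMO_CPUS][DEMO_CPUS],
			bool b[DEMO_CPUS][DEMO_CPUS])
{
	for (unsigned int i = 0; i < DEMO_CPUS; i++) {
		for (unsigned int j = 0; j < DEMO_CPUS; j++) {
			out[i][j] = false;
			for (unsigned int k = 0; k < DEMO_CPUS; k++) {
				if (a[i][k] && b[k][j]) {
					out[i][j] = true;
					break;
				}
			}
		}
	}
}

/* Equation (7): true if every entry of A^steps is non-zero. */
static bool all_connected_within(unsigned int steps)
{
	bool a[DEMO_CPUS][DEMO_CPUS];
	bool cur[DEMO_CPUS][DEMO_CPUS];
	bool next[DEMO_CPUS][DEMO_CPUS];

	assert(steps >= 1);
	for (unsigned int i = 0; i < DEMO_CPUS; i++)
		for (unsigned int j = 0; j < DEMO_CPUS; j++)
			a[i][j] = adjacent(i, j);

	memcpy(cur, a, sizeof(cur));
	for (unsigned int s = 1; s < steps; s++) {
		bool_matmul(next, cur, a);
		memcpy(cur, next, sizeof(cur));
	}

	for (unsigned int i = 0; i < DEMO_CPUS; i++)
		for (unsigned int j = 0; j < DEMO_CPUS; j++)
			if (!cur[i][j])
				return false;
	return true;
}
```

With 8 cpus, cpu 7 is adjacent only to cpus 6 and 7, so two steps are not enough to reach cpu 0, while three steps (log_2 8) connect every pair, for instance along 7, 6, 4, 0.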
* WORK CONSERVING
*
* In order to avoid CPUs going idle while there's still work to do, new idle
* balancing is more aggressive and has the newly idle cpu iterate up the domain
* tree itself instead of relying on other CPUs to bring it work.
*
* This adds some complexity to both (5) and (8) but it reduces the total idle
* time.
*
* [XXX more?]
*
*
* CGROUPS
*
* Cgroups make a horror show out of (2); instead of a simple sum we get:
*
*                                s_k,i
*   W_i,0 = \Sum_j \Prod_k w_k * -----                               (9)
*                                 S_k
*
* Where
*
*   s_k,i = \Sum_j w_i,j,k and S_k = \Sum_i s_k,i                    (10)
*
* w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
*
* The big problem is S_k: it's a global sum needed to compute a local (W_i)
* property.
*
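A simplified reading of (9) and (10) can be sketched in C. The sketch assumes a flat hierarchy with a single level of cgroups, so the product over k collapses to one factor per group and the cpu weight becomes the sum of each group's weight scaled by the fraction s_k,i / S_k of its load on that cpu. The names `cgroup_cpu_weight`, `DEMO_NR_CPUS` and `DEMO_NR_GROUPS` are hypothetical, and nested groups are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS	2
#define DEMO_NR_GROUPS	2

/*
 * Hypothetical helper for a flat (single-level) cgroup hierarchy.
 * s[k][i] is s_k,i from equation (10): the summed task weight of
 * cgroup k on cpu i. group_weight[k] is w_k.
 */
static double cgroup_cpu_weight(const double group_weight[DEMO_NR_GROUPS],
				double s[DEMO_NR_GROUPS][DEMO_NR_CPUS],
				size_t cpu)
{
	double total = 0.0;

	assert(cpu < DEMO_NR_CPUS);
	for (size_t k = 0; k < DEMO_NR_GROUPS; k++) {
		double S_k = 0.0;

		/* S_k is a global sum: it needs s_k,i from every cpu. */
		for (size_t i = 0; i < DEMO_NR_CPUS; i++)
			S_k += s[k][i];

		/* A group with no runnable tasks contributes nothing. */
		if (S_k > 0.0)
			total += group_weight[k] * s[k][cpu] / S_k;
	}
	return total;
}
```

The inner loop over all cpus is the point the comment makes about S_k: computing the weight of one cpu requires reading per-cpu state from every other cpu.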
* [XXX write more on how we solve this.. _after_ merging pjt's patches that
*  rewrite all of this once again.]
*/

static unsigned long __read_mostly max_load_balance_interval = HZ/10;