    [PATCH] sched: implement smpnice · 2dd73a4f
    Peter Williams authored
    Problem:
    
    The introduction of separate run queues per CPU has brought with it "nice"
    enforcement problems that are best described by a simple example.
    
    For the sake of argument, suppose that on a single CPU machine with a
    nice==19 hard spinner and a nice==0 hard spinner running, the nice==0
    task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
    suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners
    and 2 nice==0 hard spinners running.  The user of this system would be
    entitled to expect that the nice==0 tasks each get 95% of a CPU and the
    nice==19 tasks only get 5% each.  However, whether this expectation is
    met is pretty much down to luck, as there are four equally likely
    distributions of the tasks to the CPUs that the load balancing code will
    consider to be balanced with loads of 2.0 for each CPU.  Two of these
    distributions involve one nice==0 and one nice==19 task per CPU and in
    these circumstances the user's expectations will be met.  The other two
    distributions both involve both nice==0 tasks being on one CPU and both
    nice==19 tasks being on the other CPU; in these each task will get 50%
    of a CPU and the user's expectations will not be met.
    
    Solution:
    
    The solution to this problem that is implemented in the attached patch is
    to use weighted loads when determining if the system is balanced and, when
    an imbalance is detected, to move an amount of weighted load between run
    queues (as opposed to a number of tasks) to restore the balance.  Once
    again, the easiest way to explain why both of these measures are necessary
    is to use a simple example.  Suppose that (in a slight variation of the
    above example) we have a two CPU system with 4 nice==0 and 4 nice==19
    hard spinning tasks running and that the 4 nice==0 tasks are on one CPU
    and the 4 nice==19 tasks are on the other CPU.  The weighted loads for
    the two CPUs would be 4.0 and 0.2 respectively and the load balancing
    code would move 2 tasks, resulting in one CPU with a load of 2.0 and the
    other with a load of 2.2.  If this were considered to be a big enough
    imbalance to justify moving a task and that task were moved using the
    current move_tasks() then it would move the highest priority task that
    it found, and this would result in one CPU with a load of 3.0 and the
    other with a load of 1.2, which would result in the movement of a task
    in the opposite direction and so on -- infinite loop.  If, on the other
    hand, an amount of load to be moved is calculated from the imbalance (in
    this case 0.1) and move_tasks() skips tasks until it finds ones whose
    contributions to the weighted load are less than this amount, it would
    move two of the nice==19 tasks, resulting in a system with 2 nice==0 and
    2 nice==19 tasks on each CPU and loads of 2.1 for each CPU.
    
    One of the advantages of this mechanism is that on a system where all tasks
    have nice==0 the load balancing calculations would be mathematically
    identical to the current load balancing code.
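    To make the weighted-load arithmetic in the examples concrete, here is a
    minimal illustrative sketch in C.  The scale and the per-nice weights
    below are chosen for readability so that nice==0 contributes 1.0 and
    nice==19 contributes 0.05, matching the 95%/5% split above; they are not
    the patch's actual nice-to-weight mapping.

        /*
         * Illustrative sketch only: the weights below reproduce the 95%/5%
         * split used in the example, not the patch's actual weight table.
         */
        #define EXAMPLE_LOAD_SCALE  100UL

        static unsigned long example_load_weight(int nice)
        {
            return nice == 19 ? EXAMPLE_LOAD_SCALE / 20   /* 0.05 of a CPU */
                              : EXAMPLE_LOAD_SCALE;       /* a full nice==0 task */
        }

        /* A run queue's raw_weighted_load is the sum over its runnable tasks. */
        static unsigned long example_weighted_load(const int *nice, int nr_running)
        {
            unsigned long sum = 0;
            int i;

            for (i = 0; i < nr_running; i++)
                sum += example_load_weight(nice[i]);
            return sum;
        }

    With every task at nice==0 each weight is the full scale, so a CPU's
    weighted load reduces to nr_running times a constant, which is why the
    balancing arithmetic is unchanged in that case.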
    
    Notes:
    
    struct task_struct:
    
    has a new field, load_weight, which (in a trade-off of space for speed)
    stores the contribution that this task makes to a CPU's weighted load
    when it is runnable.
    
    struct runqueue:
    
    has a new field raw_weighted_load which is the sum of the load_weight
    values for the currently runnable tasks on this run queue.  This field
    always needs to be updated when nr_running is updated, so two new inline
    functions inc_nr_running() and dec_nr_running() have been created to
    make sure that this happens.  This also offers a convenient way to
    optimize this part of the smpnice mechanism away when CONFIG_SMP is not
    defined.
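    A minimal sketch of this bookkeeping; the stand-in structures and helper
    bodies follow the description above rather than the patch verbatim.

        /* Sketch only: stand-ins carrying just the fields discussed above. */
        struct task_struct_sketch {
            unsigned int load_weight;   /* contribution to raw_weighted_load
                                         * while the task is runnable */
        };

        struct runqueue_sketch {
            unsigned long nr_running;
        #ifdef CONFIG_SMP
            unsigned long raw_weighted_load;  /* sum of runnable load_weights */
        #endif
        };

        /* nr_running and raw_weighted_load must always change together. */
        static inline void inc_nr_running(struct task_struct_sketch *p,
                                          struct runqueue_sketch *rq)
        {
            rq->nr_running++;
        #ifdef CONFIG_SMP
            rq->raw_weighted_load += p->load_weight;
        #endif
        }

        static inline void dec_nr_running(struct task_struct_sketch *p,
                                          struct runqueue_sketch *rq)
        {
            rq->nr_running--;
        #ifdef CONFIG_SMP
            rq->raw_weighted_load -= p->load_weight;
        #endif
        }

    When CONFIG_SMP is not defined the weighted-load updates compile away
    and only the nr_running accounting remains.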
    
    int try_to_wake_up():
    
    in this function the value SCHED_LOAD_SCALE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on.  While this would be valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero, it will lead to anomalous load balancing if the
    distribution is skewed in either direction.  To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant
    task or by the average load_weight per task for the queue in question (as
    appropriate).
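    The shape of the substitution can be sketched as follows; the helper and
    the simplified test are illustrative assumptions, and the patch's actual
    wake-balance condition is more involved.

        /*
         * Sketch only.  A single task used to be assumed to contribute
         * SCHED_LOAD_SCALE; the average weight per runnable task on the
         * candidate queue (or the waking task's own weight) is used instead.
         */
        #define SCHED_LOAD_SCALE  128UL   /* resolution of load values */

        static unsigned long avg_load_per_task(unsigned long raw_weighted_load,
                                               unsigned long nr_running)
        {
            return nr_running ? raw_weighted_load / nr_running
                              : SCHED_LOAD_SCALE;
        }

        /*
         * A simplified stand-in for the kind of test involved: previously
         * the right-hand side was SCHED_LOAD_SCALE itself.
         */
        static int pull_waker_here(unsigned long this_load,
                                   unsigned long source_load,
                                   unsigned long raw_weighted_load,
                                   unsigned long nr_running)
        {
            return this_load + source_load <=
                   avg_load_per_task(raw_weighted_load, nr_running);
        }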
    
    int move_tasks():
    
    The modifications to this function were complicated by the fact that
    active_load_balance() uses it to move exactly one task without checking
    whether an imbalance actually exists.  This precluded the simple
    overloading of max_nr_move with max_load_move and necessitated the addition
    of the latter as an extra argument to the function.  The internal
    implementation is then modified to move up to max_nr_move tasks and
    max_load_move of weighted load.  This slightly complicates the code where
    move_tasks() is called; if active_load_balance() is ever changed to not
    use move_tasks(), the implementation of move_tasks() should be simplified
    accordingly.
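    The dual limit can be sketched like this; the flat task array and the
    loop structure are a simplification of the description above, not the
    patch's list walking.

        /* Sketch only: a flat array stands in for the source queue's task lists. */
        struct movable_task_sketch {
            unsigned int load_weight;
        };

        static int move_tasks_sketch(struct movable_task_sketch *tasks,
                                     int nr_tasks,
                                     unsigned long max_nr_move,
                                     unsigned long max_load_move)
        {
            unsigned long pulled = 0, load_moved = 0;
            int i;

            for (i = 0; i < nr_tasks && pulled < max_nr_move; i++) {
                /* Skip tasks that would move more weighted load than the
                 * remaining imbalance calls for. */
                if (tasks[i].load_weight > max_load_move - load_moved)
                    continue;

                /* ...pull_task() would migrate tasks[i] here... */
                load_moved += tasks[i].load_weight;
                pulled++;
            }
            return pulled;    /* number of tasks actually moved */
        }

    Under this scheme a caller like active_load_balance(), which wants
    exactly one task regardless of load, would pass a max_nr_move of 1
    together with an effectively unlimited max_load_move.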
    
    struct sched_group *find_busiest_group():
    
    Similar to try_to_wake_up(), there are places in this function where
    SCHED_LOAD_SCALE is used to represent the load contribution of a single
    task and the same issues arise.  A similar solution is adopted except
    that it is now the average per-task contribution to a group's load (as
    opposed to a run queue's) that is required.  As this value is not
    directly available from the group, it is calculated on the fly as the
    queues in the group are visited when determining the busiest group.

    A key change to this function is that it is no longer necessary to scale
    down *imbalance on exit as move_tasks() uses the load in its scaled form.
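    Computing the group's average per-task contribution on the fly can be
    sketched as below.  This is a simplification -- the real function tracks
    considerably more state -- and flat arrays stand in for the walk over
    the group's run queues.

        /*
         * Sketch only: accumulate each queue's raw_weighted_load and
         * nr_running while walking the group, then derive the average load
         * that one runnable task in this group contributes.  This average
         * replaces the old one-task-equals-SCHED_LOAD_SCALE assumption when
         * judging how big an imbalance is worth acting on.
         */
        static unsigned long group_avg_load_per_task(const unsigned long *queue_load,
                                                     const unsigned long *queue_nr_running,
                                                     int nr_queues)
        {
            unsigned long sum_load = 0, sum_nr_running = 0;
            int i;

            for (i = 0; i < nr_queues; i++) {
                sum_load += queue_load[i];
                sum_nr_running += queue_nr_running[i];
            }
            return sum_nr_running ? sum_load / sum_nr_running : 0;
        }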
    
    void set_user_nice():
    
    has been modified to update the task's load_weight field when its nice
    value changes and also to ensure that its run queue's raw_weighted_load
    field is updated if the task is runnable.
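    The ordering implied above can be sketched as follows; the structure and
    helper names are illustrative, not the patch's.

        /*
         * Sketch only: when the nice value changes, the old weight must be
         * subtracted from the owning run queue before the new weight is
         * computed and added back, and only if the task is runnable.
         */
        struct reniced_task_sketch {
            unsigned int load_weight;
            int runnable;                         /* currently on a run queue? */
            unsigned long *rq_raw_weighted_load;  /* owning queue's total */
        };

        static void set_user_nice_sketch(struct reniced_task_sketch *p,
                                         unsigned int new_load_weight)
        {
            if (p->runnable)
                *p->rq_raw_weighted_load -= p->load_weight;

            p->load_weight = new_load_weight;  /* derived from the new nice value */

            if (p->runnable)
                *p->rq_raw_weighted_load += p->load_weight;
        }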
    
    From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
    
    With smpnice, sched groups with the highest priority tasks can mask the
    imbalance between the other sched groups within the same domain.  This
    patch fixes some of the scenarios listed below by not considering sched
    groups which are lightly loaded.
    
    a) on a simple 4-way MP system, if we have one high priority and 4 normal
       priority tasks, with smpnice we would like to see the high priority task
       scheduled on one cpu, two other cpus getting one normal task each and the
       fourth cpu getting the remaining two normal tasks.  But with current
       smpnice the extra normal priority task keeps jumping from one cpu to
       another cpu that has a normal priority task.  This is because of the
       busiest_has_loaded_cpus, nr_loaded_cpus logic: we are not including the
       cpu with the high priority task in the max_load calculations but are
       including it in the total and avg_load calculations, leading to
       max_load < avg_load, so load balancing between the cpus running normal
       priority tasks (2 vs 1) will always show an imbalance and the extra
       normal priority task will keep moving from one cpu to another cpu that
       has a normal priority task.
    
    b) 4-way system with HT (8 logical processors).  Package-P0: T0 has the
       highest priority task, T1 is idle.  Package-P1: both T0 and T1 have one
       normal priority task each.  P2 and P3 are idle.  With this patch, one of
       the normal priority tasks on P1 will be moved to P2 or P3.
    
    c) With the current weighted smp nice calculations, it doesn't always make
       sense to look at the highest weighted runqueue in the busy group.
       Consider a load balance scenario on a DP with HT system, with Package-0
       containing one high priority and one low priority task, and Package-1
       containing one low priority task (with the other thread being idle).
       Package-1 thinks that it needs to take the low priority thread from
       Package-0.  But find_busiest_queue() returns the cpu thread with the
       highest priority task, and ultimately (with the help of active load
       balance) we move the high priority task to Package-1.  The same then
       continues with Package-0, moving the high priority task from Package-1
       back to Package-0.  Even without the presence of active load balance,
       load balance will fail to balance the above scenario.  Fix
       find_busiest_queue to use "imbalance" when it is lightly loaded, as
       sketched below.
    
    [kernel@kolivas.org: sched: store weighted load on up]
    [kernel@kolivas.org: sched: add discrete weighted cpu load function]
    [suresh.b.siddha@intel.com: sched: remove dead code]
    Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
    Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
    Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Con Kolivas <kernel@kolivas.org>
    Cc: John Hawkes <hawkes@sgi.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>