• Paul Jackson's avatar
    [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
    Paul Jackson authored
    The cpusets-formalize-intermediate-gfp_kernel-containment patch
    has a deadlock problem.
    
    This patch was part of a set of four patches to make more
    extensive use of the cpuset 'mem_exclusive' attribute to
    manage kernel GFP_KERNEL memory allocations and to constrain
    the out-of-memory (oom) killer.
    
    A task that is changing cpusets in particular ways on a system
    when it is very short of free memory could double trip over
    the global cpuset_sem semaphore (get the lock and then deadlock
    trying to get it again).
    
    The second attempt to get cpuset_sem would be in the routine
    cpuset_zone_allowed().  This was discovered by code inspection.
    I can not reproduce the problem except with an artifically
    hacked kernel and a specialized stress test.
    
    In real life you cannot hit this unless you are manipulating
    cpusets, and are very unlikely to hit it unless you are rapidly
    modifying cpusets on a memory tight system.  Even then it would
    be a rare occurence.
    
    If you did hit it, the task double tripping over cpuset_sem
    would deadlock in the kernel, and any other task also trying
    to manipulate cpusets would deadlock there too, on cpuset_sem.
    Your batch manager would be wedged solid (if it was cpuset
    savvy), but classic Unix shells and utilities would work well
    enough to reboot the system.
    
    The unusual condition that led to this bug is that unlike most
    semaphores, cpuset_sem _can_ be acquired while in the page
    allocation code, when __alloc_pages() calls cpuset_zone_allowed.
    So it easy to mistakenly perform the following sequence:
      1) task makes system call to alter a cpuset
      2) take cpuset_sem
      3) try to allocate memory
      4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem
      5) deadlock
    
    The reason that this is not a serious bug for most users
    is that almost all calls to allocate memory don't require
    taking cpuset_sem.  Only some code paths off the beaten
    track require taking cpuset_sem -- which is good.  Taking
    a global semaphore on the main code path for allocating
    memory would not scale well.
    
    This patch fixes this deadlock by wrapping the up() and down()
    calls on cpuset_sem in kernel/cpuset.c with code that tracks
    the nesting depth of the current task on that semaphore, and
    only does the real down() if the task doesn't hold the lock
    already, and only does the real up() if the nesting depth
    (number of unmatched downs) is exactly one.
    
    The previous required use of refresh_mems(), anytime that
    the cpuset_sem semaphore was acquired and the code executed
    while holding that semaphore might try to allocate memory, is
    no longer required.  Two refresh_mems() calls were removed
    thanks to this.  This is a good change, as failing to get
    all the necessary refresh_mems() calls placed was a primary
    source of bugs in this cpuset code.  The only remaining call
    to refresh_mems() is made while doing a memory allocation,
    if certain task memory placement data needs to be updated
    from its cpuset, due to the cpuset having been changed behind
    the tasks back.
    Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
    Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    4247bdc6
sched.h 41.7 KB