    knfsd: avoid overloading the CPU scheduler with enormous load averages · 59a252ff
    Greg Banks authored
    Avoid overloading the CPU scheduler with enormous load averages
    when handling high call-rate NFS loads.  When the knfsd bottom half
    is made aware of an incoming call by the socket layer, it tries to
    choose an nfsd thread and wake it up.  As long as there are idle
    threads, one will be woken up.
    
    If there are a lot of nfsd threads (a sensible configuration when
    the server is disk-bound or is running an HSM), there will be many
    more nfsd threads than CPUs to run them.  Under a high call-rate
    low service-time workload, the result is that almost every nfsd is
    runnable, but only a handful are actually able to run.  This situation
    causes two significant problems:
    
    1. The CPU scheduler takes over 10% of each CPU, which is robbing
       the nfsd threads of valuable CPU time.
    
    2. At a high enough load, the nfsd threads starve userspace threads
       of CPU time, to the point where daemons like portmap and rpc.mountd
       do not schedule for tens of seconds at a time.  Clients attempting
       to mount an NFS filesystem time out at the very first step (opening
       a TCP connection to portmap) because portmap cannot wake up from
       select() and call accept() in time.
    
    Disclaimer: these effects were observed on a SLES9 kernel; modern
    kernels' schedulers may behave more gracefully.
    
    The solution is simple: keep in each svc_pool a counter of the number
    of threads which have been woken but have not yet run, and do not wake
    any more if that count reaches an arbitrary small threshold.
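
    The counter-and-threshold idea above can be sketched in plain C as
    follows.  This is a single-threaded illustration only, not the code
    in the patch: the names struct pool, pool_try_wake(),
    pool_thread_running() and the threshold value
    POOL_MAX_PENDING_WAKEUPS are all hypothetical, and a real kernel
    implementation would need locking or atomics around the counters.

```c
#include <stdbool.h>

/* Hypothetical threshold; the commit message only says "arbitrary small". */
#define POOL_MAX_PENDING_WAKEUPS 5

/* Hypothetical stand-in for the per-pool state described above. */
struct pool {
    unsigned int idle_threads; /* threads asleep, waiting for work */
    unsigned int nwaking;      /* threads woken but not yet run */
};

/*
 * Called when a new request arrives.  Returns true if an idle thread
 * was woken, false if the request is simply left queued.
 */
bool pool_try_wake(struct pool *p)
{
    if (p->idle_threads == 0)
        return false;          /* nobody to wake */
    if (p->nwaking >= POOL_MAX_PENDING_WAKEUPS)
        return false;          /* enough wakeups already pending */
    p->idle_threads--;
    p->nwaking++;
    return true;
}

/* Called by a woken thread once it actually gets to run. */
void pool_thread_running(struct pool *p)
{
    if (p->nwaking > 0)
        p->nwaking--;
}
```

    With 128 idle threads and a burst of incoming calls, this sketch
    wakes at most POOL_MAX_PENDING_WAKEUPS threads until one of them
    has run, rather than making every idle thread runnable at once.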
    
    Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
    synthetic client threads simulating an rsync (i.e. recursive directory
    listing) workload reading from an i386 RH9 install image (161480
    regular files in 10841 directories) on the server.  That tree is small
    enough to fit in the server's RAM, so no disk traffic was involved.
    This setup gives a sustained call rate in excess of 60000 calls/sec
    before being CPU-bound on the server.  The server was running 128 nfsds.
    
    Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
    taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
    Load average was over 120 before the patch, and 20.9 after.
    
    This patch is a forward-ported version of knfsd-avoid-nfsd-overload
    which has been shipping in the SGI "Enhanced NFS" product since 2006.
    It has been posted before:
    
    http://article.gmane.org/gmane.linux.nfs/10374

    Signed-off-by: Greg Banks <gnb@sgi.com>
    Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>