1. 12 Mar, 2010 40 commits
    • Christoph Hellwig's avatar
      um: use generic ptrace_resume code · 1bd09508
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  This implies defining
      arch_has_single_step in <asm/ptrace.h> and implementing the
      user_enable_single_step and user_disable_single_step functions, which also
      causes the breakpoint information to be cleared on fork, which could be
      considered a bug fix.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      
      XXX: I'm not sure arch_has_single_step() is placed in the exactly correct
      location, please verify in which of the ptrace headers it should really
      be.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1bd09508
    • Christoph Hellwig's avatar
      mips: use generic ptrace_resume code · 55436c91
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT and
      PTRACE_KILL.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      55436c91
    • Christoph Hellwig's avatar
      microblaze: use generic ptrace_resume code · fa1ac57a
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT and
      PTRACE_KILL.  This also makes PTRACE_SINGLESTEP return -EIO while it
      previously succeeded despite not actually causing any kind of single
      stepping.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarMichal Simek <monstr@monstr.eu>
      Cc: John Williams <john.williams@petalogix.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa1ac57a
    • Christoph Hellwig's avatar
      m68knommu: use generic ptrace_resume code · 7a0fde8b
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  m68knommu already defines the
      nessecary user_enable_single_step and user_disable_single_step functions
      for this.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7a0fde8b
    • Christoph Hellwig's avatar
      h8300: use generic ptrace_resume code · 857fb252
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  This implies defining
      arch_has_single_step in <asm/ptrace.h> and implementing the
      user_enable_single_step and user_disable_single_step functions, which also
      causes the breakpoint information to be cleared on fork, which could be
      considered a bug fix.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      857fb252
    • Christoph Hellwig's avatar
      avr32: use generic ptrace_resume code · 1d839317
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  This implies defining
      arch_has_single_step in <asm/ptrace.h> and implementing the
      user_enable_single_step and user_disable_single_step functions, which also
      causes the breakpoint information to be cleared on fork, which could be
      considered a bug fix.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't which is consistent with all architectures using the
      modern ptrace code.
      
      Currently avr32 doesn't implement any code to disable single stepping when
      one of the non-syscall requests is called which seems wrong, but I've left
      it as-is for now.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarHaavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d839317
    • Christoph Hellwig's avatar
      arm: use generic ptrace_resume code · 440e6ca7
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  This implies defining
      arch_has_single_step in <asm/ptrace.h> and implementing the
      user_enable_single_step and user_disable_single_step functions, which also
      causes the breakpoint information to be cleared on fork, which could be
      considered a bug fix.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't and the single stepping disable only happens if the
      tracee process isn't a zombie yet, which is consistent with all
      architectures using the modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      440e6ca7
    • Christoph Hellwig's avatar
      alpha: use generic ptrace_resume code · fd341abb
      Christoph Hellwig authored
      Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
      PTRACE_KILL and PTRACE_SINGLESTEP.  This implies defining
      arch_has_single_step in <asm/ptrace.h> and implementing the
      user_enable_single_step and user_disable_single_step functions, which also
      causes the breakpoint information to be cleared on fork, which could be
      considered a bug fix.
      
      Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
      it previously wasn't, which is consistent with all architectures using the
      modern ptrace code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarMatt Turner <mattst88@gmail.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fd341abb
    • Christoph Hellwig's avatar
      ptrace: move user_enable_single_step & co prototypes to linux/ptrace.h · dacbe41f
      Christoph Hellwig authored
      While in theory user_enable_single_step/user_disable_single_step/
      user_enable_blockstep could also be provided as an inline or macro there's
      no good reason to do so, and having the prototype in one places keeps code
      size and confusion down.
      
      Roland said:
      
        The original thought there was that user_enable_single_step() et al
        might well be only an instruction or three on a sane machine (as if we
        have any of those!), and since there is only one call site inlining
        would be beneficial.  But I agree that there is no strong reason to care
        about inlining it.
      
        As to the arch changes, there is only one thought I'd add to the
        record.  It was always my thinking that for an arch where
        PTRACE_SINGLESTEP does text-modifying breakpoint insertion,
        user_enable_single_step() should not be provided.  That is,
        arch_has_single_step()=>true means that there is an arch facility with
        "pure" semantics that does not have any unexpected side effects.
        Inserting a breakpoint might do very unexpected strange things in
        multi-threaded situations.  Aside from that, it is a peculiar side
        effect that user_{enable,disable}_single_step() should cause COW
        de-sharing of text pages and so forth.  For PTRACE_SINGLESTEP, all these
        peculiarities are the status quo ante for that arch, so having
        arch_ptrace() itself do those is one thing.  But for building other
        things in the future, it is nicer to have a uniform "pure" semantics
        that arch-independent code can expect.
      
        OTOH, all such arch issues are really up to the arch maintainer.  As
        of today, there is nothing but ptrace using user_enable_single_step() et
        al so it's a distinction without a practical difference.  If/when there
        are other facilities that use user_enable_single_step() and might care,
        the affected arch's can revisit the question when someone cares about
        the quality of the arch support for said new facility.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dacbe41f
    • Christoph Hellwig's avatar
      ptrace: use ptrace_request() in the remaining architectures · b3c1e01a
      Christoph Hellwig authored
      Use ptrace_request() in the three remaining architectures that didn't use it
      (m68knommu, h8300, microblaze).  This means:
      
       - ptrace_request now handles PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_DETATCH
         calls that were previously called directly, or in case of h8300 even open
         coded.
       - adds new support for PTRACE_SETOPTIONS/PTRACE_GETEVENTMSG/
         PTRACE_GETSIGINFO/PTRACE_SETSIGINFO
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Acked-by: default avatarGreg Ungerer <gerg@uclinux.org>
      Acked-by: default avatarRoland McGrath <roland@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3c1e01a
    • Miao Xie's avatar
      nodemask: fix the declaration of NODEMASK_ALLOC() · 7baab93f
      Miao Xie authored
      we can't declarate two variable at the same scope by NODEMASK_ALLOC().
      
      This patch fixes it.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7baab93f
    • KAMEZAWA Hiroyuki's avatar
      memcg: update maintainer list · a38374b8
      KAMEZAWA Hiroyuki authored
      Nishimura-san have been working for memcg very good.  His review and tests
      give us much improvements and account migraiton which he is now
      challenging is really important.
      
      He is a stakeholder.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a38374b8
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix oom kill behavior · 867578cb
      KAMEZAWA Hiroyuki authored
      In current page-fault code,
      
      	handle_mm_fault()
      		-> ...
      		-> mem_cgroup_charge()
      		-> map page or handle error.
      	-> check return code.
      
      If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
      called.  But if it's caused by memcg, OOM should have been already
      invoked.
      
      Then, I added a patch: a636b327.  That
      patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
      page_fault_out_of_memory from being invoked in near future.
      
      But Nishimura-san reported that check by jiffies is not enough when the
      system is terribly heavy.
      
      This patch changes memcg's oom logic as.
       * If memcg causes OOM-kill, continue to retry.
       * remove jiffies check which is used now.
       * add memcg-oom-lock which works like perzone oom lock.
       * If current is killed(as a process), bypass charge.
      
      Something more sophisticated can be added but this pactch does
      fundamental things.
      TODO:
       - add oom notifier
       - add permemcg disable-oom-kill flag and freezer at oom.
       - more chances for wake up oom waiter (when changing memory limit etc..)
      Reviewed-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      867578cb
    • Kirill A. Shutemov's avatar
    • Kirill A. Shutemov's avatar
      memcg: update memcg_test.txt to describe memory thresholds · 1e111452
      Kirill A. Shutemov authored
      Decription of sanity check for memory thresholds.
      Signed-off-by: default avatarKirill A.  Shutemov <kirill@shutemov.name>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e111452
    • Kirill A. Shutemov's avatar
      cgroups: add simple listener of cgroup events to documentation · 1d8fd973
      Kirill A. Shutemov authored
      An example of cgroup notification API usage.
      Signed-off-by: default avatarKirill A.  Shutemov <kirill@shutemov.name>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d8fd973
    • Kirill A. Shutemov's avatar
      cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov authored
      Events should be removed after rmdir of cgroup directory, but before
      destroying subsystem state objects.  Let's take reference to cgroup
      directory dentry to do that.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a0a4db54
    • Kirill A. Shutemov's avatar
      cgroups: fix race between userspace and kernelspace · 4ab78683
      Kirill A. Shutemov authored
      Notify userspace about cgroup removing only after rmdir of cgroup
      directory to avoid race between userspace and kernelspace.
      
      eventfd are used to notify about two types of event:
       - control file-specific, like crossing memory threshold;
       - cgroup removing.
      
      To understand what really happen, userspace can check if the cgroup still
      exists.  To avoid race beetween userspace and kernelspace we have to
      notify userspace about cgroup removing only after rmdir of cgroup
      directory.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ab78683
    • KAMEZAWA Hiroyuki's avatar
      memcg: handle panic_on_oom=always case · daaf1e68
      KAMEZAWA Hiroyuki authored
      Presently, if panic_on_oom=2, the whole system panics even if the oom
      happend in some special situation (as cpuset, mempolicy....).  Then,
      panic_on_oom=2 means painc_on_oom_always.
      
      Now, memcg doesn't check panic_on_oom flag. This patch adds a check.
      
      BTW, how it's useful ?
      
      kdump+panic_on_oom=2 is the last tool to investigate what happens in
      oom-ed system.  When a task is killed, the sysytem recovers and there will
      be few hint to know what happnes.  In mission critical system, oom should
      never happen.  Then, panic_on_oom=2+kdump is useful to avoid next OOM by
      knowing precise information via snapshot.
      
      TODO:
       - For memcg, it's for isolate system's memory usage, oom-notiifer and
         freeze_at_oom (or rest_at_oom) should be implemented. Then, management
         daemon can do similar jobs (as kdump) or taking snapshot per cgroup.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      daaf1e68
    • Daisuke Nishimura's avatar
      memcg: update memcg_test.txt · 1080d7a3
      Daisuke Nishimura authored
      Update memcg_test.txt to describe how to test the move-charge feature.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1080d7a3
    • KAMEZAWA Hiroyuki's avatar
      memcg : share event counter rather than duplicate · d2265e6f
      KAMEZAWA Hiroyuki authored
      Memcg has 2 eventcountes which counts "the same" event.  Just usages are
      different from each other.  This patch tries to reduce event counter.
      
      Now logic uses "only increment, no reset" counter and masks for each
      checks.  Softlimit chesk was done per 1000 evetns.  So, the similar check
      can be done by !(new_counter & 0x3ff).  Threshold check was done per 100
      events.  So, the similar check can be done by (!new_counter & 0x7f)
      
      ALL event checks are done right after EVENT percpu counter is updated.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2265e6f
    • KAMEZAWA Hiroyuki's avatar
      memcg: update threshold and softlimit at commit · 430e4863
      KAMEZAWA Hiroyuki authored
      Presently, move_task does "batched" precharge.  Because res_counter or
      css's refcnt are not-scalable jobs for memcg, try_charge_()..  tend to be
      done in batched manner if allowed.
      
      Now, softlimit and threshold check their event counter in try_charge, but
      the charge is not a per-page event.  And event counter is not updated at
      charge().  Moreover, precharge doesn't pass "page" to try_charge() and
      softlimit tree will be never updated until uncharge() causes an event."
      
      So the best place to check the event counter is commit_charge().  This is
      per-page event by its nature.  This patch move checks to there.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      430e4863
    • KAMEZAWA Hiroyuki's avatar
      memcg: use generic percpu instead of private implementation · c62b1a3b
      KAMEZAWA Hiroyuki authored
      When per-cpu counter for memcg was implemneted, dynamic percpu allocator
      was not very good.  But now, we have good one and useful macros.  This
      patch replaces memcg's private percpu counter implementation with generic
      dynamic percpu allocator.
      
      The benefits are
      	- We can remove private implementation.
      	- The counters will be NUMA-aware. (Current one is not...)
      	- This patch makes sizeof struct mem_cgroup smaller. Then,
      	  struct mem_cgroup may be fit in page size on small config.
              - About basic performance aspects, see below.
      
       [Before]
       # size mm/memcontrol.o
         text    data     bss     dec     hex filename
        24373    2528    4132   31033    7939 mm/memcontrol.o
      
       [page-fault-throuput test on 8cpu/SMP in root cgroup]
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             45878618  page-faults                ( +-   0.110% )
            602635826  cache-misses               ( +-   0.105% )
      
         61.005373262  seconds time elapsed   ( +-   0.004% )
      
       Then cache-miss/page fault = 13.14
      
       [After]
       #size mm/memcontrol.o
         text    data     bss     dec     hex filename
        23913    2528    4132   30573    776d mm/memcontrol.o
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             48179400  page-faults                ( +-   0.271% )
            588628407  cache-misses               ( +-   0.136% )
      
         61.004615021  seconds time elapsed   ( +-   0.004% )
      
        Then cache-miss/page fault = 12.22
      
       Text size is reduced.
       This performance improvement is not big and will be invisible in real world
       applications. But this result shows this patch has some good effect even
       on (small) SMP.
      
      Here is a test program I used.
      
       1. fork() processes on each cpus.
       2. do page fault repeatedly on each process.
       3. after 60secs, kill all childredn and exit.
      
      (3 is necessary for getting stable data, this is improvement from previous one.)
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sched.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <signal.h>
      #include <stdlib.h>
      
      /*
       * For avoiding contention in page table lock, FAULT area is
       * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
       */
      #define FAULT_LENGTH	(2 * 1024 * 1024)
      #define PAGE_SIZE	4096
      #define MAXNUM		(128)
      
      void alarm_handler(int sig)
      {
      }
      
      void *worker(int cpu, int ppid)
      {
      	void *start, *end;
      	char *c;
      	cpu_set_t set;
      	int i;
      
      	CPU_ZERO(&set);
      	CPU_SET(cpu, &set);
      	sched_setaffinity(0, sizeof(set), &set);
      
      	start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	if (start == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end = start + FAULT_LENGTH;
      
      	pause();
      	//fprintf(stderr, "run%d", cpu);
      	while (1) {
      		for (c = (char*)start; (void *)c < end; c += PAGE_SIZE)
      			*c = 0;
      		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	int num, i, ret, pid, status;
      	int pids[MAXNUM];
      
      	if (argc < 2)
      		return 0;
      
      	setpgid(0, 0);
      	signal(SIGALRM, alarm_handler);
      	num = atoi(argv[1]);
      	pid = getpid();
      
      	for (i = 0; i < num; ++i) {
      		ret = fork();
      		if (!ret) {
      			worker(i, pid);
      			exit(0);
      		}
      		pids[i] = ret;
      	}
      	sleep(1);
      	kill(-pid, SIGALRM);
      	sleep(60);
      	for (i = 0; i < num; i++)
      		kill(pids[i], SIGKILL);
      	for (i = 0; i < num; i++)
      		waitpid(pids[i], &status, 0);
      	return 0;
      }
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c62b1a3b
    • Kirill A. Shutemov's avatar
      memcg: typo in comment to mem_cgroup_print_oom_info() · 6a6135b6
      Kirill A. Shutemov authored
      s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a6135b6
    • Kirill A. Shutemov's avatar
      memcg: implement memory thresholds · 2e72b634
      Kirill A. Shutemov authored
      It allows to register multiple memory and memsw thresholds and gets
      notifications when it crosses.
      
      To register a threshold application need:
      - create an eventfd;
      - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
      - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
        cgroup.event_control.
      
      Application will be notified through eventfd when memory usage crosses
      threshold in any direction.
      
      It's applicable for root and non-root cgroup.
      
      It uses stats to track memory usage, simmilar to soft limits. It checks
      if we need to send event to userspace on every 100 page in/out. I guess
      it's good compromise between performance and accuracy of thresholds.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2e72b634
    • Kirill A. Shutemov's avatar
      memcg: rework usage of stats by soft limit · 378ce724
      Kirill A. Shutemov authored
      Instead of incrementing counter on each page in/out and comparing it with
      constant, we set counter to constant, decrement counter on each page
      in/out and compare it with zero.  We want to make comparing as fast as
      possible.  On many RISC systems (probably not only RISC) comparing with
      zero is more effective than comparing with a constant, since not every
      constant can be immediate operand for compare instruction.
      
      Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
      since really it's not a generic counter.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      378ce724
    • Kirill A. Shutemov's avatar
      memcg: extract mem_group_usage() from mem_cgroup_read() · 104f3928
      Kirill A. Shutemov authored
      Helper to get memory or mem+swap usage of the cgroup.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      104f3928
    • Kirill A. Shutemov's avatar
      cgroup: implement eventfd-based generic API for notifications · 0dea1168
      Kirill A. Shutemov authored
      This patchset introduces eventfd-based API for notifications in cgroups
      and implements memory notifications on top of it.
      
      It uses statistics in memory controler to track memory usage.
      
      Output of time(1) on building kernel on tmpfs:
      
      Root cgroup before changes:
      	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
      Non-root cgroup before changes:
      	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
      Root cgroup after changes (0 thresholds):
      	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
      Non-root cgroup after changes (0 thresholds):
      	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
      Root cgroup after changes (1 thresholds, never crossed):
      	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
      Non-root cgroup after changes (1 thresholds, never crossed):
      	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
      
      This patch:
      
      Introduce the write-only file "cgroup.event_control" in every cgroup.
      
      To register new notification handler you need:
      - create an eventfd;
      - open a control file to be monitored. Callbacks register_event() and
        unregister_event() must be defined for the control file;
      - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
        Interpretation of args is defined by control file implementation;
      
      eventfd will be woken up by control file implementation or when the
      cgroup is removed.
      
      To unregister notification handler just close eventfd.
      
      If you need notification functionality for a control file you have to
      implement callbacks register_event() and unregister_event() in the
      struct cftype.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0dea1168
    • Daisuke Nishimura's avatar
      memcg: improve performance in moving swap charge · 483c30b5
      Daisuke Nishimura authored
      Try to reduce overheads in moving swap charge by:
      
      - Adds a new function(__mem_cgroup_put), which takes "count" as a arg and
        decrement mem->refcnt by "count".
      - Removed res_counter_uncharge, css_put, and mem_cgroup_put from the path
        of moving swap account, and consolidate all of them into mem_cgroup_clear_mc.
        We cannot do that about mc.to->refcnt.
      
      These changes reduces the overhead from 1.35sec to 0.9sec to move charges
      of 1G anonymous memory(including 500MB swap) in my test environment.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      483c30b5
    • Daisuke Nishimura's avatar
      memcg: move charges of anonymous swap · 02491447
      Daisuke Nishimura authored
      This patch is another core part of this move-charge-at-task-migration
      feature.  It enables moving charges of anonymous swaps.
      
      To move the charge of swap, we need to exchange swap_cgroup's record.
      
      In current implementation, swap_cgroup's record is protected by:
      
        - page lock: if the entry is on swap cache.
        - swap_lock: if the entry is not on swap cache.
      
      This works well in usual swap-in/out activity.
      
      But this behavior make the feature of moving swap charge check many
      conditions to exchange swap_cgroup's record safely.
      
      So I changed modification of swap_cgroup's recored(swap_cgroup_record())
      to use xchg, and define a new function to cmpxchg swap_cgroup's record.
      
      This patch also enables moving charge of non pte_present but not uncharged
      swap caches, which can be exist on swap-out path, by getting the target
      pages via find_get_page() as do_mincore() does.
      
      [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
      [akpm@linux-foundation.org: fix typos]
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02491447
    • Daisuke Nishimura's avatar
      memcg: avoid oom during moving charge · 8033b97c
      Daisuke Nishimura authored
      This move-charge-at-task-migration feature has extra charges on
      "to"(pre-charges) and "from"(left-over charges) during moving charge.
      This means unnecessary oom can happen.
      
      This patch tries to avoid such oom.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8033b97c
    • Daisuke Nishimura's avatar
      memcg: improve performance in moving charge · 854ffa8d
      Daisuke Nishimura authored
      Try to reduce overheads in moving charge by:
      
      - Instead of calling res_counter_uncharge() against the old cgroup in
        __mem_cgroup_move_account() everytime, call res_counter_uncharge() at the end
        of task migration once.
      - removed css_get(&to->css) from __mem_cgroup_move_account() because callers
        should have already called css_get(). And removed css_put(&to->css) too,
        which was called by callers of move_account on success of move_account.
      - Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(),
        repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if
        possible.
      - Instead of calling css_get()/css_put() repeatedly, make use of coalesce
        __css_get()/__css_put() if possible.
      
      These changes reduces the overhead from 1.7sec to 0.6sec to move charges
      of 1G anonymous memory in my test environment.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      854ffa8d
    • Daisuke Nishimura's avatar
      memcg: move charges of anonymous page · 4ffef5fe
      Daisuke Nishimura authored
      This patch is the core part of this move-charge-at-task-migration feature.
       It implements functions to move charges of anonymous pages mapped only by
      the target task.
      
      Implementation:
      - define struct move_charge_struct and a valuable of it(mc) to remember the
        count of pre-charges and other information.
      - At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge()
        repeatedly and count up mc.precharge.
      - At attach(), parse the page table, find a target page to be move, and call
        mem_cgroup_move_account() about the page.
      - Cancel all precharges if mc.precharge > 0 on failure or at the end of
        task move.
      
      [akpm@linux-foundation.org: a little simplification]
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ffef5fe
    • Daisuke Nishimura's avatar
      memcg: add interface to move charge at task migration · 7dc74be0
      Daisuke Nishimura authored
      In current memcg, charges associated with a task aren't moved to the new
      cgroup at task migration.  Some users feel this behavior to be strange.
      These patches are for this feature, that is, for charging to the new
      cgroup and, of course, uncharging from the old cgroup at task migration.
      
      This patch adds "memory.move_charge_at_immigrate" file, which is a flag
      file to determine whether charges should be moved to the new cgroup at
      task migration or not and what type of charges should be moved.  This
      patch also adds read and write handlers of the file.
      
      This patch also adds no-op handlers for this feature.  These handlers will
      be implemented in later patches.  And you cannot write any values other
      than 0 to move_charge_at_immigrate yet.
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7dc74be0
    • Li Zefan's avatar
      cgroups: clean up cgroup_pidlist_find() a bit · b70cc5fd
      Li Zefan authored
      Don't call get_pid_ns() before we locate/alloc the ns.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Acked-by: default avatarPaul Menage <menage@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b70cc5fd
    • Ben Blum's avatar
      cgroups: blkio subsystem as module · 67523c48
      Ben Blum authored
      Modify the Block I/O cgroup subsystem to be able to be built as a module.
      As the CFQ disk scheduler optionally depends on blk-cgroup, config options
      in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
      enhanced to support the new module dependency.
      Signed-off-by: default avatarBen Blum <bblum@andrew.cmu.edu>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67523c48
    • Kirill A. Shutemov's avatar
      cgroups: fix CONTENTS in cgroups documentation · 8ca712ea
      Kirill A. Shutemov authored
      Add a forgotten item into CONTENTS.
      Signed-off-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: default avatarPaul Menage <menage@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ca712ea
    • Ben Blum's avatar
      cgroups: subsystem module unloading · cf5d5941
      Ben Blum authored
      Provides support for unloading modular subsystems.
      
      This patch adds a new function cgroup_unload_subsys which is to be used
      for removing a loaded subsystem during module deletion.  Reference
      counting of the subsystems' modules is moved from once (at load time) to
      once per attached hierarchy (in parse_cgroupfs_options and
      rebind_subsystems) (i.e., 0 or 1).
      Signed-off-by: default avatarBen Blum <bblum@andrew.cmu.edu>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cf5d5941
    • Ben Blum's avatar
      cgroups: subsystem module loading interface · e6a1105b
      Ben Blum authored
      Add interface between cgroups subsystem management and module loading
      
      This patch implements rudimentary module-loading support for cgroups -
      namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
      module initcall, and a struct module pointer in struct cgroup_subsys.
      
      Several functions that might be wanted by modules have had EXPORT_SYMBOL
      added to them, but it's unclear exactly which functions want it and which
      won't.
      Signed-off-by: default avatarBen Blum <bblum@andrew.cmu.edu>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6a1105b
    • Ben Blum's avatar
      cgroups: revamp subsys array · aae8aab4
      Ben Blum authored
      This patch series provides the ability for cgroup subsystems to be
      compiled as modules both within and outside the kernel tree.  This is
      mainly useful for classifiers and subsystems that hook into components
      that are already modules.  cls_cgroup and blkio-cgroup serve as the
      example use cases for this feature.
      
      It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
      which modular subsystems can use to register and depart during runtime.
      The net_cls classifier subsystem serves as the example for a subsystem
      which can be converted into a module using these changes.
      
      Patch #1 sets up the subsys[] array so its contents can be dynamic as
      modules appear and (eventually) disappear.  Iterations over the array are
      modified to handle when subsystems are absent, and the dynamic section of
      the array is protected by cgroup_mutex.
      
      Patch #2 implements an interface for modules to load subsystems, called
      cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
      pointer in struct cgroup_subsys.
      
      Patch #3 adds a mechanism for unloading modular subsystems, which includes
      a more advanced rework of the rudimentary reference counting introduced in
      patch 2.
      
      Patch #4 modifies the net_cls subsystem, which already had some module
      declarations, to be configurable as a module, which also serves as a
      simple proof-of-concept.
      
      Part of implementing patches 2 and 4 involved updating css pointers in
      each css_set when the module appears or leaves.  In doing this, it was
      discovered that css_sets always remain linked to the dummy cgroup,
      regardless of whether or not any subsystems are actually bound to it
      (i.e., not mounted on an actual hierarchy).  The subsystem loading and
      unloading code therefore should keep in mind the special cases where the
      added subsystem is the only one in the dummy cgroup (and therefore all
      css_sets need to be linked back into it) and where the removed subsys was
      the only one in the dummy cgroup (and therefore all css_sets should be
      unlinked from it) - however, as all css_sets always stay attached to the
      dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
      issue should also make sure these cases are addressed in the subsystem
      loading and unloading code.
      
      This patch:
      
      Make subsys[] able to be dynamically populated to support modular
      subsystems
      
      This patch reworks the way the subsys[] array is used so that subsystems
      can register themselves after boot time, and enables the internals of
      cgroups to be able to handle when subsystems are not present or may
      appear/disappear.
      Signed-off-by: default avatarBen Blum <bblum@andrew.cmu.edu>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aae8aab4