1. 03 Apr, 2009 40 commits
    • reparent_thread: don't call kill_orphaned_pgrp() if task_detached() · 0a967a04
      Oleg Nesterov authored
      If task_detached(p) == T, then either
      
        a) p is not the main thread, we will find the group leader on the
           ->children list.
      
      or
      
        b) p is the group leader but its ->exit_state = EXIT_DEAD.  This
           can only happen when the last sub-thread has died, but in that case
           that thread has already called kill_orphaned_pgrp() from
           exit_notify().
      
      In both cases kill_orphaned_pgrp() looks bogus.
      
      Move the task_detached() check up and simplify the code.  This is also
      right from the "common sense" pov: we should do nothing with the detached
      children, except move them to the new parent's ->children list.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ptrace: fix possible zombie leak on PTRACE_DETACH · 4576145c
      Oleg Nesterov authored
      When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed.  If it
      has already passed exit_notify() we can leak a zombie, because a) ptracing
      disables the auto-reaping logic, and b) ->real_parent was not notified
      about the child's death.
      
      ptrace_detach() should follow the ptrace_exit's logic, change the code
      accordingly.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Tested-by: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit() · b1b4c679
      Oleg Nesterov authored
      No functional changes, preparation for the next patch.
      
      Move the "should we release this child" logic into the separate handler,
      __ptrace_detach().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ptrace: simplify ptrace_exit()->ignoring_children() path · 6d69cb87
      Oleg Nesterov authored
      ignoring_children() takes parent->sighand->siglock and checks
      k_sigaction[SIGCHLD] atomically.  But this buys nothing, we can't get the
      "really" wrong result even if we race with sigaction(SIGCHLD).  If we read
      the "stale" sa_handler/sa_flags we can pretend it was changed right after
      the check.
      
      Remove spin_lock(->siglock), and kill "int ign" which caches the result of
      ignoring_children() which becomes rather trivial.
      
      Perhaps it makes sense to export this helper, do_notify_parent() can use
      it too.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ptrace: kill __ptrace_detach(), fix ->exit_state check · 95c3eb76
      Oleg Nesterov authored
      Move the code from __ptrace_detach() to its single caller and kill this
      helper.
      
      Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks.
      Actually, I think task_is_stopped_or_traced() makes more sense, but this
      needs another patch.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: SI_USER: Masquerade si_pid when crossing pid ns boundary · 6588c1e3
      Sukadev Bhattiprolu authored
      When sending a signal to a descendant namespace, set ->si_pid to 0 since
      the sender does not have a pid in the receiver's namespace.
      
      Note:
      	- If rt_sigqueueinfo() sets si_code to SI_USER when sending a
      	  signal across a pid namespace boundary, the value in ->si_pid
      	  will be cleared to 0.
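
      For reference, this is roughly how a receiver observes ->si_pid from
      userspace.  The sketch below is an illustration only (probe_si_pid()
      and the seen_* variables are hypothetical names): it sends SIGUSR1 to
      itself, so sender and receiver share a pid namespace and si_pid is the
      sender's pid.  Per the description above, a receiver in a descendant
      namespace would see si_pid == 0 instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Values recorded by the handler; names are illustrative only. */
static volatile sig_atomic_t seen_code;
static volatile sig_atomic_t seen_pid;
static volatile sig_atomic_t got_signal;

static void on_usr1(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	seen_code = info->si_code;
	seen_pid = (sig_atomic_t)info->si_pid;
	got_signal = 1;
}

/* Send SIGUSR1 to ourselves and record si_code/si_pid. 0 on success. */
static int probe_si_pid(void)
{
	struct sigaction sa;
	sigset_t set;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_usr1;
	sa.sa_flags = SA_SIGINFO;	/* ask for the siginfo_t argument */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	/* Make sure the signal is not blocked, so it is delivered at once. */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	if (sigprocmask(SIG_UNBLOCK, &set, NULL) != 0)
		return -1;

	got_signal = 0;
	if (kill(getpid(), SIGUSR1) != 0)
		return -1;

	return got_signal ? 0 : -1;
}
```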
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: protect cinit from blocked fatal signals · b3bfa0cb
      Sukadev Bhattiprolu authored
      Normally SIG_DFL signals to global and container-init are dropped early.
      But if a signal is blocked when it is posted, we cannot drop the signal
      since the receiver may install a handler before unblocking the signal.
      Once this signal is queued however, the receiver container-init has no way
      of knowing if the signal was sent from an ancestor or descendant
      namespace.  This patch ensures that container-init drops all SIG_DFL
      signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.
      
      If SIGSTOP/SIGKILL originate from a descendant of container-init they are
      never queued (i.e. dropped in sig_ignored() in an earlier patch).
      
      If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
      and container-init processes the signal.
      
      IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
      or container-init, the signal must have been generated internally or must
      have come from an ancestor ns and we process the signal.
      
      Further, the signal_group_exit() check was needed to cover the case of a
      multi-threaded init sending SIGKILL to other threads when doing an exit()
      or exec().  But since the new sig_kernel_only() check covers the SIGKILL,
      the signal_group_exit() check is no longer needed and can be removed.
      
      Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
      container-inits.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: zap_pid_ns_process() should use force_sig() · e4da026f
      Sukadev Bhattiprolu authored
      send_signal() assumes that signals with SEND_SIG_PRIV are generated from
      within the same namespace.  So any nested container-init processes become
      immune to the SIGKILL generated by kill_proc_info() in
      zap_pid_ns_processes().
      
      Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the
      SIGNAL_UNKILLABLE flag ensuring the signal is processed by
      container-inits.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: protect cinit from unblocked SIG_DFL signals · 921cf9f6
      Sukadev Bhattiprolu authored
      Drop early any SIG_DFL or SIG_IGN signals to container-init from within
      the same container.  But queue SIGSTOP and SIGKILL to the container-init
      if they are from an ancestor container.
      
      Blocked, fatal signals (i.e. when SIG_DFL is to terminate) from within the
      container can still terminate the container-init.  That will be addressed
      in the next patch.
      
      Note:	To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
      	in a follow-on patch. Until then, this patch is just a preparatory
      	step.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: add from_ancestor_ns parameter to send_signal() · 7978b567
      Sukadev Bhattiprolu authored
      send_signal() (or its helper) needs to determine the pid namespace of the
      sender.  But a signal sent via kill_pid_info_as_uid() comes from within
      the kernel and send_signal() does not need to determine the pid namespace
      of the sender.  So define a helper for send_signal() which takes an
      additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid()
      use that helper directly.
      
      The 'from_ancestor_ns' parameter will be used in a follow-on patch.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: protect init from unwanted signals more · f008faff
      Oleg Nesterov authored
      (This is a modified version of the patch submitted by Oleg Nesterov
      http://lkml.org/lkml/2008/11/18/249 and tries to address comments that
      came up in that discussion)
      
      init ignores the SIG_DFL signals but we queue them anyway, including
      SIGKILL.  This is mostly OK, the signal will be dropped silently when
      dequeued, but the pending SIGKILL has 2 bad implications:
      
              - it implies fatal_signal_pending(), so we confuse things
                like wait_for_completion_killable/lock_page_killable.
      
              - for the sub-namespace inits, the pending SIGKILL can
                mask (legacy_queue) the subsequent SIGKILL from the
                parent namespace which must kill cinit reliably.
                (preparation, cinits don't have SIGNAL_UNKILLABLE yet)
      
      The patch can't help when init is ptraced, but ptracing of init is not
      "safe" anyway.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: remove 'handler' parameter to tracehook functions · 43918f2b
      Oleg Nesterov authored
      Container-init must behave like global-init to processes within the
      container and hence it must be immune to unhandled fatal signals from
      within the container (i.e. SIG_DFL signals that terminate the process).
      
      But the same container-init must behave like a normal process to processes
      in ancestor namespaces and so if it receives the same fatal signal from a
      process in ancestor namespace, the signal must be processed.
      
      Implementing these semantics requires that send_signal() determine pid
      namespace of the sender but since signals can originate from workqueues/
      interrupt-handlers, determining pid namespace of sender may not always be
      possible or safe.
      
      This patchset implements the design/simplified semantics suggested by
      Oleg Nesterov.  The simplified semantics for container-init are:
      
      	- container-init must never be terminated by a signal from a
      	  descendant process.
      
      	- container-init must never be immune to SIGKILL from an ancestor
      	  namespace (so a process in parent namespace must always be able
      	  to terminate a descendant container).
      
      	- container-init may be immune to unhandled fatal signals (like
      	  SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP
      	  are the only reliable signals to a container-init from ancestor
      	  namespace.
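
      The three rules above can be modelled as a small decision function.
      This is only a toy model under stated assumptions, not the kernel's
      code: enum cinit_action and cinit_signal_action() are hypothetical
      names, and the third rule ("may be immune") is modelled as "always
      dropped".

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical outcome of a signal sent to a container-init. */
enum cinit_action {
	CINIT_DROP,	/* signal is discarded */
	CINIT_PROCESS	/* signal is queued and acted upon */
};

/*
 * Toy model of the simplified semantics:
 *  - SIGKILL/SIGSTOP are processed only when sent from an ancestor
 *    namespace, so a descendant can never terminate container-init.
 *  - Any other signal is processed if container-init installed a
 *    handler for it.
 *  - Unhandled (SIG_DFL) signals are dropped whatever their origin;
 *    this is the "may be immune" rule, modelled here as "always".
 */
static enum cinit_action cinit_signal_action(int sig, bool has_handler,
					     bool from_ancestor_ns)
{
	if (sig == SIGKILL || sig == SIGSTOP)
		return from_ancestor_ns ? CINIT_PROCESS : CINIT_DROP;

	if (has_handler)
		return CINIT_PROCESS;

	return CINIT_DROP;
}
```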
      
      This patch:
      
      Based on an earlier patch submitted by Oleg Nesterov and comments from
      Roland McGrath (http://lkml.org/lkml/2008/11/19/258).
      
      The handler parameter is currently unused in the tracehook functions.
      Besides, the tracehook functions are called with siglock held, so the
      functions can check the handler if they later need to.
      
      Removing the parameter simplifies changes to sig_ignored() in a follow-on
      patch.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • do_wait: fix waiting for the group stop with the dead leader · 90bc8d8b
      Oleg Nesterov authored
      do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is
      not true if the leader is already dead.  Check SIGNAL_STOP_STOPPED instead
      and use signal->group_exit_code.
      
      Trivial test-case:
      
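      	#include <pthread.h>
      	#include <unistd.h>
      
      	/* Sub-thread: sleeps forever, keeping the process alive. */
      	void *tfunc(void *arg)
      	{
      		pause();
      		return NULL;
      	}
      
      	int main(void)
      	{
      		pthread_t thr;
      		pthread_create(&thr, NULL, tfunc, NULL);
      		/* The group leader exits; only the sub-thread remains. */
      		pthread_exit(NULL);
      		return 0;
      	}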
      
      It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but
      bash can't see this.
      
      The bug is very old, and it was reported multiple times. This patch was sent
      more than a year ago (http://marc.info/?t=119713920000003) but it was ignored.
      
      This change also fixes other oddities (but not all) in this area.  For
      example, before this patch:
      
      	$ sleep 100
      	^Z
      	[1]+  Stopped                 sleep 100
      	$ strace -p `pidof sleep`
      	Process 11442 attached - interrupt to quit
      
      strace hangs in do_wait(), because ->exit_code was already consumed by
      bash.  After this patch, strace happily proceeds:
      
      	--- SIGTSTP (Stopped) @ 0 (0) ---
      	restart_syscall(<... resuming interrupted call ...>
      
      To me, this looks much more "natural" and correct.
      
      Another example.  Let's suppose we have the main thread M and sub-thread
      T, the process is stopped, and its parent did wait(WSTOPPED).  Now we can
      ptrace T but not M.  This looks at least strange to me.
      
      Imho, do_wait() should not confuse the per-thread ptrace stops with the
      per-process job control stops.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Kaz Kylheku <kkylheku@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets · 6d7b2f5f
      David Rientjes authored
      Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a
      specific cpu.  Thus, their set of allowed cpus shall not change.
      
      This patch prevents such threads from attaching to non-root cpusets.  They do
      not have mempolicies that restrict them to a subset of system nodes and, since
      their cpumask may never change, they cannot use any of the features of
      cpusets.
      
      The tasks will forever be a member of the root cpuset and will be returned
      when listing the tasks attached to that cpuset.
      
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: allow cpusets to be configured/built on non-SMP systems · db7f47cf
      Paul Menage authored
      Allow cpusets to be configured/built on non-SMP systems
      
      Currently it's impossible to build cpusets under UML on x86-64, since
      cpusets depends on SMP and x86-64 UML doesn't support SMP.
      
      There's code in cpusets that doesn't depend on SMP.  This patch surrounds
      the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to
      allow cpusets to build/run on UP systems (for testing purposes under UML).
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: replace zone allowed functions with node allowed · a1bc5a4e
      David Rientjes authored
      The cpuset_zone_allowed() variants are actually only a function of the
      zone's node.
      
      Cc: Paul Menage <menage@google.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: remove struct cpuset_hotplug_scanner · 7f81b1ae
      Li Zefan authored
      Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: avoid changing cpuset's mems when errno returned · 010cfac4
      Li Zefan authored
      When writing to cpuset.mems, cpuset has to update its mems_allowed before
      calling update_tasks_nodemask(), but this function might return -ENOMEM.
      
      To avoid this rare case, we allocate the memory before changing
      mems_allowed, and then pass to update_tasks_nodemask().  Similar to what
      update_cpumask() does.
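
      The "allocate first, then change state" ordering described above can be
      sketched in plain C.  Everything below is hypothetical and stands in
      for the kernel code (struct toy_cpuset, toy_update_nodemask() and the
      injectable allocator are illustration only); the point is that a
      failed allocation returns -ENOMEM before mems_allowed is touched.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cpuset: one bit per allowed memory node. */
struct toy_cpuset {
	unsigned long mems_allowed;
};

/* Allocator hook; must return malloc()-compatible memory or NULL. */
typedef void *(*alloc_fn)(size_t);

/*
 * Update cs->mems_allowed to new_mems.  The scratch buffer needed by the
 * later "rebind tasks" step is allocated up front, so that an allocation
 * failure leaves the cpuset unchanged.
 */
static int toy_update_nodemask(struct toy_cpuset *cs, unsigned long new_mems,
			       alloc_fn alloc)
{
	unsigned long *scratch = alloc(sizeof(*scratch));

	if (!scratch)
		return -ENOMEM;		/* nothing has been modified yet */

	*scratch = cs->mems_allowed;	/* remember the old mask */
	cs->mems_allowed = new_mems;	/* commit: this step cannot fail */

	/* A real implementation would rebind tasks here using scratch. */
	free(scratch);
	return 0;
}

/* Allocator that always fails, to exercise the error path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```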
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: rewrite update_tasks_nodemask() · 3b6766fe
      Li Zefan authored
      This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
      mems_allowed.
      
      This not only simplifies the code considerably, but also avoids
      allocating an array to hold the mm pointers of all the tasks in the
      cpuset.  This array can be big (size > PAGESIZE) if the cpuset has many
      tasks, so the allocation may fail under memory pressure.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: add 'data' field to struct cgroup_scanner · bd1a8ab7
      Li Zefan authored
      We need to pass some data to test_task() or process_task() in some cases.
      Will be used later.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: fix possible races in cpu/memory hotplug · 0b4217b3
      Li Zefan authored
      Changes to cpuset->cpus_allowed and cpuset->mems_allowed should be
      protected by callback_mutex; otherwise a reader may see wrong cpus/mems.
      This is cpuset's locking rule.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: cleanup cache_charge · 83aae4c7
      Daisuke Nishimura authored
      Current mem_cgroup_cache_charge is a bit complicated especially
      in the case of shmem's swap-in.
      
      This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove redundant message at swapon · 627991a2
      KAMEZAWA Hiroyuki authored
      It has been pointed out that swap_cgroup's message at swapon() is
      pointless, because

        * The value can be calculated very easily if all the necessary
          information is written in Kconfig.

        * There is no need to annoy people at every swapon().

      Besides, memory usage per swp_entry has now been reduced from 8 bytes
      (on 64-bit) to 2 bytes, which I think is reasonably small.
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroups: use css id in swap cgroup for saving memory v5 · a3b2d692
      KAMEZAWA Hiroyuki authored
      Try to use CSS ID for records in swap_cgroup.  By this, on 64bit machine,
      size of swap_cgroup goes down to 2 bytes from 8bytes.
      
      This means, when 2GB of swap is equipped, (assume the page size is 4096bytes)
      
      	From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
      	To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.
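
      The arithmetic above (one swap_cgroup record per swap page) can be
      checked with a few lines of C.  swap_cgroup_bytes() is a hypothetical
      helper written for this illustration, not a kernel function.

```c
#include <stdint.h>

/*
 * Total size of the swap_cgroup records, assuming one record of
 * entry_size bytes per page of swap.
 */
static uint64_t swap_cgroup_bytes(uint64_t swap_bytes, uint64_t page_size,
				  uint64_t entry_size)
{
	return swap_bytes / page_size * entry_size;
}
```

      With 2GB of swap and 4096-byte pages there are 524288 records, so
      8-byte records take 4194304 bytes (4MB) and 2-byte records take
      1048576 bytes (1MB).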
      
      Reduction is large.  Of course, there are trade-offs.  This CSS ID will
      add overhead to swap-in/swap-out/swap-free.
      
      But in general,
        - swap is a resource which users tend to avoid using.
        - If swap is never used, swap_cgroup area is not used.
        - Reading traditional manuals, size of swap should be proportional to
          size of memory. Memory size of machine is increasing now.
      
      I think reducing size of swap_cgroup makes sense.
      
      Note:
        - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
        - memcg can be obsolete at rmdir() but not freed while refcnt from
          swap_cgroup is available.
      
      Changelog v4->v5:
       - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
      Changelog ->v4:
       - fixed not configured case.
       - deleted unnecessary comments.
       - fixed NULL pointer bug.
       - fixed message in dmesg.
      
      [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: charge swapcache to proper memcg · 3c776e64
      Daisuke Nishimura authored
      memcg_test.txt says at 4.1:
      
      	This swap-in is one of the most complicated work. In do_swap_page(),
      	following events occur when pte is unchanged.
      
      	(1) the page (SwapCache) is looked up.
      	(2) lock_page()
      	(3) try_charge_swapin()
      	(4) reuse_swap_page() (may call delete_from_swap_cache())
      	(5) commit_charge_swapin()
      	(6) swap_free().
      
      	Considering following situation for example.
      
      	(A) The page has not been charged before (2) and reuse_swap_page()
      	    doesn't call delete_from_swap_cache().
      	(B) The page has not been charged before (2) and reuse_swap_page()
      	    calls delete_from_swap_cache().
      	(C) The page has been charged before (2) and reuse_swap_page() doesn't
      	    call delete_from_swap_cache().
      	(D) The page has been charged before (2) and reuse_swap_page() calls
      	    delete_from_swap_cache().
      
      	memory.usage/memsw.usage changes to this page/swp_entry will be:

      	   Case             (A)      (B)      (C)      (D)
      	   Event
      	   Before (2)       0/ 1     0/ 1     1/ 1     1/ 1
      	   ================================================
      	   (3)             +1/+1    +1/+1    +1/+1    +1/+1
      	   (4)               -       0/ 0      -      -1/ 0
      	   (5)              0/-1     0/ 0    -1/-1     0/ 0
      	   (6)               -       0/-1      -       0/-1
      	   ================================================
      	   Result           1/ 1     1/ 1     1/ 1     1/ 1

      	In any case, charges to this page should be 1/ 1.
      
      In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
      (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
      charges to the memcg("foo") to which the "current" belongs.
      OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
      to which the page has been charged.
      
      So, if "foo" and "baa" are different (for example, because of a task
      move), this charge will be moved from "baa" to "foo".

      I think this is unexpected behavior.
      
      This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
      to return the memcg to which the swapcache has been charged if PCG_USED bit
      is set.
      IIUC, checking PCG_USED bit of swapcache is safe under page lock.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove mem_cgroup_reclaim_imbalance() remnants · 3918b96e
      KOSAKI Motohiro authored
      Commit 4f98a2fe (vmscan: split LRU lists into anon & file sets) removed
      mem_cgroup_reclaim_imbalance(), but there are some leftovers in
      memcontrol.h.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove mem_cgroup_calc_mapped_ratio() · c137b5ec
      KOSAKI Motohiro authored
      Currently, mem_cgroup_calc_mapped_ratio() is not used at all.  It can be
      removed, as KAMEZAWA-san suggested.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: show memcg information during OOM · e222432b
      Balbir Singh authored
      Add RSS and swap to OOM output from memcg
      
      Display memcg values like failcnt, usage and limit when an OOM occurs due
      to memcg.
      
      Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
      Daisuke Nishimura and KOSAKI Motohiro for review.
      
      Sample output
      -------------
      
      Task in /a/x killed as a result of limit of /a
      memory: usage 1048576kB, limit 1048576kB, failcnt 4183
      memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0
      
      [akpm@linux-foundation.org: compilation fix]
      [akpm@linux-foundation.org: fix kerneldoc and whitespace]
      [akpm@linux-foundation.org: add printk facility level]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e222432b
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki authored
      This patch tries to fix OOM killer problems caused by hierarchy.
      Currently, memcg has its own OOM kill function (in oom_kill.c) and tries
      to kill a task in the memcg.
      
      But when hierarchy is used, this is broken and the correct task cannot
      be killed. For example, in the following cgroup
      
      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit
      The memory usage of all tasks under /groupA, /groupA/01 and /groupA/02 is
      limited to groupA's 1 Gbyte, but the OOM killer only kills tasks in groupA.
      
      This patch makes the bad process be selected from all tasks under the
      hierarchy. BTW, currently, oom_jiffies is updated against groupA
      in the above case. oom_jiffies of the tree should be updated.
      
      To see how oom_jiffies is used, please check mem_cgroup_oom_called()
      callers.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b7f569e
    • KAMEZAWA Hiroyuki's avatar
      memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm · 81d39c20
      KAMEZAWA Hiroyuki authored
      As pointed out, shrinking memcg's limit should return -EBUSY after
      reasonable retries.  This patch tries to fix the current behavior of
      shrink_usage.
      
      Before looking into the "shrink should return -EBUSY" problem, we should
      fix the hierarchical reclaim code.  It compares current usage and current
      limit, but that only makes sense when the kernel reclaims memory because
      it hit limits.  This is also a problem.
      
      What this patch does:
      
        1. Add a new argument "shrink" to hierarchical reclaim. If "shrink==true",
           hierarchical reclaim returns immediately and the caller checks whether
           the kernel should shrink more.
           (At shrinking memory, usage is always smaller than limit. So check for
            usage < limit is useless.)
      
        2. To adjust to the above change, make 2 changes in "shrink"'s retry path.
           2-a. retry_count depends on the # of children because the kernel visits
                the children under the hierarchy one by one.
           2-b. rather than checking the return value of hierarchical_reclaim's
                progress, compare usage-before-shrink and usage-after-shrink.
                If usage-before-shrink <= usage-after-shrink, retry_count is
                decremented.
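
The retry rule in 2-a and 2-b can be sketched in userspace C. This is only an illustration under stated assumptions: `struct group_model`, `reclaim_once()`, `shrink_to_limit()`, `BASE_RETRIES` and the step size of 16 are hypothetical names and values, not code or constants from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a memcg: only what this sketch needs. */
struct group_model {
    unsigned long usage;       /* current usage */
    unsigned long reclaimable; /* how much reclaim can still free */
    int nr_children;           /* number of child groups */
};

#define BASE_RETRIES 5 /* assumed value, not taken from the patch */

/* Hypothetical reclaim pass: frees at most `step`, then returns at once. */
static void reclaim_once(struct group_model *g, unsigned long step)
{
    unsigned long freed = g->reclaimable < step ? g->reclaimable : step;

    assert(g->reclaimable <= g->usage);
    g->usage -= freed;
    g->reclaimable -= freed;
}

/* Returns 0 once usage <= new_limit, -EBUSY when retries run out. */
static int shrink_to_limit(struct group_model *g, unsigned long new_limit)
{
    /* 2-a: more children means more passes before giving up. */
    int retry_count = BASE_RETRIES * (g->nr_children + 1);

    while (retry_count > 0) {
        unsigned long before = g->usage;

        if (before <= new_limit)
            return 0;
        reclaim_once(g, 16);
        /* 2-b: a pass that did not lower usage costs one retry. */
        if (before <= g->usage)
            retry_count--;
    }
    return g->usage <= new_limit ? 0 : -EBUSY;
}
```

The caller, not the reclaim pass, decides whether more shrinking is needed, which matches the "returns immediately" behaviour described in item 1.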
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81d39c20
    • KAMEZAWA Hiroyuki's avatar
      memcg: hierarchical stat · 14067bb3
      KAMEZAWA Hiroyuki authored
      Clean up memory.stat file routine and show "total" hierarchical stat.
      
      This patch does the following:
        - rename get_all_zonestat to get_local_zonestat.
        - remove old mem_cgroup_stat_desc, which is only for per-cpu stat.
        - add mcs_stat to cover both per-cpu and per-lru stat.
        - add "total" stat of hierarchy (*)
        - add a callback system to scan all memcg under a root.
      == "total" is added.
      [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
      cache 0
      rss 0
      pgpgin 0
      pgpgout 0
      inactive_anon 0
      active_anon 0
      inactive_file 0
      active_file 0
      unevictable 0
      hierarchical_memory_limit 50331648
      hierarchical_memsw_limit 9223372036854775807
      total_cache 65536
      total_rss 192512
      total_pgpgin 218
      total_pgpgout 155
      total_inactive_anon 0
      total_active_anon 135168
      total_inactive_file 61440
      total_active_file 4096
      total_unevictable 0
      ==
      (*) maybe the user can calculate the hierarchical stat with his own
         program in userland, but if it can be written in a clean way, it's
         worth showing, I think.
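
As a rough illustration of what a "total" value means, the sketch below sums a local counter over a subtree. It uses plain recursion rather than the callback system the patch adds, and `struct stat_node`, `total_rss()` and `MAX_CHILDREN` are hypothetical names for this example only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4 /* assumed bound for this sketch */

/* Hypothetical model of one group: a local counter plus its children. */
struct stat_node {
    unsigned long rss; /* local "rss" value of this group only */
    struct stat_node *children[MAX_CHILDREN];
    size_t nr_children;
};

/* "total_rss" of a group: its own rss plus that of every descendant. */
static unsigned long total_rss(const struct stat_node *n)
{
    unsigned long sum;

    assert(n != NULL);
    sum = n->rss;
    for (size_t i = 0; i < n->nr_children; i++)
        sum += total_rss(n->children[i]);
    return sum;
}
```

With a parent whose local rss is 0 and two children holding 131072 and 61440 bytes (an invented split), `total_rss()` on the parent gives 192512, the same shape as the `rss 0` / `total_rss 192512` pair in the sample output.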
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14067bb3
    • KAMEZAWA Hiroyuki's avatar
      memcg: use CSS ID · 04046e1a
      KAMEZAWA Hiroyuki authored
      Assign a CSS ID to each memcg and use css_get_next() for scanning the
      hierarchy.
      
      	Assume the following tree.
      
      	group_A (ID=3)
      		/01 (ID=4)
      		   /0A (ID=7)
      		/02 (ID=10)
      	group_B (ID=5)
      	and task in group_A/01/0A hits limit at group_A.
      
      	reclaim will be done in following order (round-robin).
      	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
      	-> group_A -> .....
      
      	Round robin by ID. The last visited cgroup is recorded, and reclaim
      	restarts from it the next time it starts.
      	(A smarter algorithm could be implemented.)
      
      	No cgroup_mutex or hierarchy_mutex is required.
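
The visiting order above can be modelled in a few lines of C. This is a userspace sketch, not the memcg code: `ids_under_root` and `next_reclaim_target()` are hypothetical, and a sorted array stands in for whatever css_get_next() consults internally.

```c
#include <assert.h>
#include <stddef.h>

/* IDs of the groups under group_A in the example tree, ascending. */
static const int ids_under_root[] = { 3, 4, 7, 10 };

/*
 * Return the smallest ID greater than last_visited, wrapping around to
 * the first ID when the end of the list is reached.
 */
static int next_reclaim_target(const int *ids, size_t n, int last_visited)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (ids[i] > last_visited)
            return ids[i];
    return ids[0];
}
```

Starting from `last_visited = 0` and feeding each result back in yields 3, 4, 7, 10, 3, ... . group_B (ID=5) is never returned because it is not in the list of IDs under group_A.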
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04046e1a
    • Li Zefan's avatar
      devcgroup: avoid using cgroup_lock · b4046f00
      Li Zefan authored
      There is nothing special that has to be protected by cgroup_lock,
      so introduce devcgroup_mutex for its own use.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4046f00
    • Li Zefan's avatar
      debug cgroup: remove unneeded cgroup_lock · d969fbe6
      Li Zefan authored
      Since we are in a cgroup write handler, the cgrp is valid, so we don't
      have to hold cgroup_mutex when calling cgroup_task_count().  A similar
      example is in cgroup_tasks_open().
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d969fbe6
    • Li Zefan's avatar
      cgroups: don't change release_agent when remount failed · 0670e08b
      Li Zefan authored
      Remount can fail in either of these cases:
        - wrong mount options are specified, or option 'noprefix' is changed.
        - a to-be-added subsys is already mounted/active.
      
      When using remount to change 'release_agent', in the former failure
      case remount returns an errno with release_agent unchanged, but in the
      latter case remount returns EBUSY with release_agent changed, which is
      unexpected, I think:
      
       # mount -t cgroup -o cpu xxx /cgrp1
       # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
       # cat /cgrp2/release_agent
       agent1
       # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
       mount: /cgrp2 not mounted already, or bad option
       # cat /cgrp2/release_agent
       agent1     <-- ok
       # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
       mount: /cgrp2 is busy
       # cat /cgrp2/release_agent
       agent2     <-- unexpected!
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0670e08b
    • Li Zefan's avatar
      cgroups: show correct file mode · 099fca32
      Li Zefan authored
      We have some read-only files and write-only files, but currently they are
      all set to 0644, which is counter-intuitive and causes trouble for some
      cgroup tools like libcgroup.
      
      This patch adds 'mode' to struct cftype to allow a cgroup subsys to set
      its own files' mode. In most cases cft->mode can be left at the default
      of 0, and cgroup will figure out the proper mode.
      Acked-by: Paul Menage <menage@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      099fca32
    • Li Zefan's avatar
      cgroups: more documentation for remount and release_agent · b6719ec1
      Li Zefan authored
      This won't remove cpuacct from the mounted hierarchy:
       # mount -t cgroup -o cpu,cpuacct xxx /mnt
       # mount -o remount,cpu /mnt
      
      Because for this usage mount(8) will append the new options to the original
      options.
      
      And this does what you want:
       # mount [-t cgroup] -o remount,cpu xxx /mnt
      
      Also document how to specify or change release_agent.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6719ec1
    • Jesper Juhl's avatar
      kernel/cgroup.c: kfree(NULL) is legal · 66bdc9cf
      Jesper Juhl authored
      Reduces object file size a bit:
      
      Before:
      $ size kernel/cgroup.o
         text    data     bss     dec     hex filename
        21593    7804    4924   34321    8611 kernel/cgroup.o
      After:
      $ size kernel/cgroup.o
         text    data     bss     dec     hex filename
        21537    7744    4924   34205    859d kernel/cgroup.o
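
The pattern being removed is a NULL check in front of a free call. A userspace analogue, using free() (which, like kfree(), accepts NULL) and a hypothetical `struct opts_model` rather than the actual kernel structure, might look like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical holder of two optional heap-allocated strings. */
struct opts_model {
    char *release_agent;
    char *name;
};

/* Release both strings; neither call needs an "if (ptr)" guard. */
static void opts_release(struct opts_model *o)
{
    assert(o != NULL);
    free(o->release_agent); /* free(NULL) is a no-op */
    free(o->name);
    o->release_agent = NULL;
    o->name = NULL;
}
```

Dropping the guard removes a compare and branch per call site, which is where a small text-size saving like the one measured above would come from.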
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66bdc9cf
    • KAMEZAWA Hiroyuki's avatar
      cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki authored
      In the following situation, with the memory subsystem,
      
      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty
      
      When tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
      is triggered and the kernel walks the tree under groupA. In this case,
      rmdir /groupA/04 frequently fails with -EBUSY because of a temporary
      refcnt held by the kernel.
      
      In general, a cgroup can be rmdir'd if there are no child groups and
      no tasks. Frequent failures of rmdir() are not useful to users.
      (And the reason for -EBUSY is unknown to users.....in most cases)
      
      This patch tries to modify the above behavior by
      	- retrying if css_refcnt is held by someone.
      	- adding a "return value" to pre_destroy(), which allows a subsystem to
      	  say "we're really busy!"
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec64f515
    • KAMEZAWA Hiroyuki's avatar
      cgroup: CSS ID support · 38460b48
      KAMEZAWA Hiroyuki authored
      Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
      
      This patch attaches unique ID to each css and provides following.
      
       - css_lookup(subsys, id)
         returns a pointer to the struct cgroup_subsys_state of id.
       - css_get_next(subsys, id, rootid, depth, foundid)
         returns the next css under "root" by scanning
      
      When cgroup_subsys->use_id is set, an id for css is maintained.
      
      The cgroup framework only prepares
      	- css_id of root css for subsys
      	- id is automatically attached at creation of css.
      	- id is *not* freed automatically, because the cgroup framework
      	  doesn't know the lifetime of cgroup_subsys_state.
      	  free_css_id() function is provided. This must be called by subsys.
      
      There are several reasons to develop this.
      	- Saving space .... For example, memcg's swap_cgroup is an array of
      	  pointers to cgroup. But it is not necessary to be very fast.
      	  By replacing pointers (8 bytes per entry) with IDs (2 bytes per
      	  entry), we can reduce memory usage considerably.
      
      	- Scanning without lock.
      	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
      	  css under root can be written without locks.
      	  ex)
      	  do {
      		rcu_read_lock();
      		next = css_get_next(subsys, id, root, &found);
      		/* check sanity of next here */
      		css_tryget();
      		rcu_read_unlock();
      		id = found + 1;
      	 } while(...)
      
      Characteristics:
      	- Each css has unique ID under subsys.
      	- Lifetime of ID is controlled by subsys.
      	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
      	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
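
To show how an ID that carries its depth and a stack of ancestor IDs can answer "is this css under that root?" without walking a tree, here is a userspace model. `struct css_id_model`, `css_id_init()`, `css_id_is_under()` and `MAX_DEPTH` are hypothetical; the real layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 8 /* assumed bound for this sketch */

/* Hypothetical model: own ID, depth, and the IDs of all ancestors. */
struct css_id_model {
    unsigned short id;               /* 1..65535, 0 is unused */
    unsigned short depth;            /* 0 for a top-level group */
    unsigned short stack[MAX_DEPTH]; /* stack[d] = ancestor ID at depth d */
};

/* Copy the parent's stack and push the new ID on top of it. */
static void css_id_init(struct css_id_model *c, unsigned short id,
                        const struct css_id_model *parent)
{
    c->id = id;
    c->depth = parent ? (unsigned short)(parent->depth + 1) : 0;
    assert(c->depth < MAX_DEPTH);
    for (unsigned short d = 0; d < c->depth; d++)
        c->stack[d] = parent->stack[d];
    c->stack[c->depth] = id;
}

/* c is under root (or is root) if root's ID sits at root's depth in c. */
static bool css_id_is_under(const struct css_id_model *c,
                            const struct css_id_model *root)
{
    return c->depth >= root->depth && c->stack[root->depth] == root->id;
}
```

The check is a single array lookup, so a scan by ID can filter candidates against a root cheaply. Using the IDs 3, 4, 7, 10 and 5 as an example, a group with stack {3, 4, 7} is under the group with ID 3, while a top-level group with ID 5 is not.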
      
      Design Choices:
      	- scan-by-ID vs. scan-by-tree-walk.
      	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
      	  by the following kind of routine:
      	  scan -> rest a while (release a lock) -> continue from where it
      	  was interrupted.
      	  memcg's hierarchical reclaim does this.
      
      	- When subsys->use_id is set, # of css in the system is limited to
      	  65535.
      
      [bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38460b48