1. 13 Jul, 2009 1 commit
  2. 19 Aug, 2009 1 commit
  3. 17 Aug, 2009 1 commit
    • In order to direct the SIGIO signal to a particular thread of a · 29e1ea2e
      Peter Zijlstra authored
      multi-threaded application we cannot, as the manpage suggests, put a
      TID into the regular fcntl(F_SETOWN) call.  It will still be sent to the
      whole process of which that thread is part.
      
      Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.
      
      The need to direct SIGIO comes from self-monitoring profiling such as with
      perf-counters.  Perf-counters uses SIGIO to notify that new sample data is
      available.  If the signal is delivered to the same task that generated the
      new sample it can augment that data by inspecting the task's user-space
      state right after it returns from the kernel.  This is especially
      convenient for interpreted or virtual-machine-driven environments.
      
      Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
      as argument:
      
      struct f_owner_ex {
      	int   type;
      	pid_t pid;
      };
      
      where type is one of F_OWNER_TID, F_OWNER_PID, or F_OWNER_GID.
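      
      A minimal usage sketch (not part of the patch): direct SIGIO for an fd to
      the calling thread.  F_SETOWN_EX, F_OWNER_TID and struct f_owner_ex assume
      headers from a kernel carrying this patch; gettid() has no glibc wrapper
      here, hence the raw syscall.
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <unistd.h>
      	#include <sys/syscall.h>
      
      	static int direct_sigio_to_this_thread(int fd)
      	{
      		struct f_owner_ex owner = {
      			.type = F_OWNER_TID,
      			.pid  = syscall(SYS_gettid),	/* a TID, not a PID */
      		};
      
      		/* SIGIO for this fd is now delivered to this thread only */
      		return fcntl(fd, F_SETOWN_EX, &owner);
      	}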
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Stephane Eranian <eranian@googlemail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      29e1ea2e
  4. 04 Aug, 2009 2 commits
  5. 21 Jul, 2009 2 commits
  6. 13 Jul, 2009 3 commits
    • Kill the "parent" argument in wait_consider_task(); it was never used. · 7fe62721
      Oleg Nesterov authored
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7fe62721
    • Thanks to Roland, who pointed out de_thread() issues. · f62e589f
      Oleg Nesterov authored
      Currently we add sub-threads to the ->real_parent->children list.  This
      buys nothing but slows down do_wait().
      
      With this patch ->children contains only main threads (group leaders). 
      The only complication is that forget_original_parent() should iterate over
      sub-threads by hand, and de_thread() needs another list_replace() when it
      changes ->group_leader.
      
      Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
      tasks, so we can remove this check (and we can unify do_wait_thread() and
      ptrace_do_wait()).
      
      This change can confuse the optimistic search in mm_update_next_owner(),
      but this is fixable and minor.
      
      Perhaps badness() and oom_kill_process() should be updated, but they
      should be fixed in any case.
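      
      A rough sketch of the new reparenting walk in forget_original_parent()
      (helper names are illustrative, not necessarily the patch's):
      
      	/* ->children now holds only group leaders, so reparenting
      	 * must walk each leader's thread group by hand */
      	list_for_each_entry_safe(p, n, &father->children, sibling) {
      		struct task_struct *t = p;
      		do {
      			t->real_parent = reaper;
      		} while_each_thread(p, t);
      		reparent_leader(father, p, &dead_children);
      	}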
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f62e589f
    • Suggested by Roland. · f444d269
      Oleg Nesterov authored
      do_wait(__WNOTHREAD) can only succeed if the caller is either the
      ptracer, or it is ->real_parent and the child is not traced.  IOW,
      caller == p->parent; otherwise we should not wake up.
      
      Change child_wait_callback() to check this.  Ratan reports a workload with
      CPU load >99% caused by unnecessary wakeups; it should be fixed by this patch.
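      
      A sketch of the check (close to, though not necessarily identical to, the
      final code):
      
      	static int child_wait_callback(wait_queue_t *wait, unsigned mode,
      				       int sync, void *key)
      	{
      		struct wait_opts *wo = container_of(wait, struct wait_opts,
      						    child_wait);
      		struct task_struct *p = key;
      
      		if (!eligible_child(wo, p))
      			return 0;
      
      		/* __WNOTHREAD: only the thread that is p->parent may wake */
      		if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent)
      			return 0;
      
      		return default_wake_function(wait, mode, sync, key);
      	}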
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f444d269
  7. 18 Jul, 2009 1 commit
  8. 13 Jul, 2009 3 commits
    • Ratan Nalumasu reported that a process with many threads does · ff0c2ef4
      Oleg Nesterov authored
      unnecessary wakeups.  Every waiting thread in the process wakes up to loop
      through the children and see that the only ones it cares about are still
      not ready.
      
      Now that we have struct wait_opts we can change do_wait/__wake_up_parent
      to use filtered wakeups.
      
      We can make child_wait_callback() more clever later; right now it only
      checks eligible_child().
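      
      A sketch of the wakeup side (assuming the __wake_up_sync_key() helper;
      details may differ from the patch):
      
      	/* pass the exiting child as the key, so the waiters'
      	 * wait-queue callback can filter out ineligible threads */
      	void __wake_up_parent(struct task_struct *p,
      			      struct task_struct *parent)
      	{
      		__wake_up_sync_key(&parent->signal->wait_chldexit,
      				   TASK_INTERRUPTIBLE, 1, p);
      	}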
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ff0c2ef4
    • Preparation, no functional changes. · 8fe5acbc
      Oleg Nesterov authored
      eligible_child() has a single caller, wait_consider_task().  We can move
      security_task_wait() out from eligible_child(); this allows us to use it
      for filtered wake_up().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8fe5acbc
    • The bug is old; it wasn't caused by recent changes. · b134054b
      Oleg Nesterov authored
      Test case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <signal.h>
      	#include <assert.h>
      	#include <pthread.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      
      	/* tracer thread: attach to the child, then kill it */
      	static void *tfunc(void *arg)
      	{
      		int pid = (long)arg;
      
      		assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
      		kill(pid, SIGKILL);
      
      		sleep(1);
      		return NULL;
      	}
      
      	int main(void)
      	{
      		pthread_t th;
      		long pid = fork();
      
      		if (!pid)
      			pause();
      
      		signal(SIGCHLD, SIG_IGN);
      		assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);
      
      		/* the main thread is not the tracer, so with __WNOTHREAD
      		 * it has nothing to wait for */
      		int r = waitpid(-1, NULL, __WNOTHREAD);
      		printf("waitpid: %d %m\n", r);
      
      		return 0;
      	}
      
      Before the patch this program hangs; after the patch waitpid() correctly
      fails, returning -1 with errno == ECHILD.
      
      The problem is that __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
      ->real_parent is our sub-thread and we ignore SIGCHLD.  But in this case
      we should wake up the other threads which may be sleeping in do_wait().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b134054b
  9. 14 Aug, 2009 1 commit
  10. 11 Aug, 2009 1 commit
    • ERROR: spaces required around that '?' (ctx:VxW) · 9c251cc7
      Andrew Morton authored
      #50: FILE: mm/memcontrol.c:485:
      +	int val = (charge)? 1 : -1;
       	                  ^
      
      total: 1 errors, 0 warnings, 171 lines checked
      
      ./patches/memcg-improve-resource-counter-scalability.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c251cc7
  11. 14 Aug, 2009 1 commit
    • Reduce the resource counter overhead (mostly spinlock) associated with the · 688985ed
      Balbir Singh authored
      root cgroup.  This is part of a series of patches to reduce mem cgroup
      overhead.  I had posted other approaches earlier (including using percpu
      counters).  Those patches will be a natural addition and will be added
      iteratively on top of these.
      
      The patch stops resource counter accounting for the root cgroup.  The data
      for display is derived from the statistics we maintain via
      mem_cgroup_charge_statistics() (which is more scalable).  What happens today
      is that we do double accounting: once using res_counter_charge() and once
      using mem_cgroup_charge_statistics().  For the root, since we don't
      implement limits any more, we don't need to track every charge via
      res_counter_charge() and check for the limit being exceeded and reclaim.
      
      The main mem->res usage_in_bytes can be derived by summing the cache and
      rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
      additional data about swapped out memory.  This patch adds a
      MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
      recursively when hierarchy is enabled.
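      
      A minimal sketch of the charge-path idea (assuming a mem_cgroup_is_root()
      style helper; names may differ from the patch):
      
      	if (mem_cgroup_is_root(mem)) {
      		/* root has no limits: skip the res_counter and rely on
      		 * mem_cgroup_charge_statistics() for the numbers */
      		goto done;
      	}
      	ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);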
      
      The test results I see on a 24-way machine show that
      
      1. The lock contention disappears from /proc/lock_stats
      2. The results of the test are comparable to running with
         cgroup_disable=memory.
      
      Here is a sample of my program runs
      
      Without Patch
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7192804.124144  task-clock-msecs         #     23.937 CPUs
               424691  context-switches         #      0.000 M/sec
                  267  CPU-migrations           #      0.000 M/sec
             28498113  page-faults              #      0.004 M/sec
        5826093739340  cycles                   #    809.989 M/sec
         408883496292  instructions             #      0.070 IPC
           7057079452  cache-references         #      0.981 M/sec
           3036086243  cache-misses             #      0.422 M/sec
      
        300.485365680  seconds time elapsed
      
      With cgroup_disable=memory
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7182183.546587  task-clock-msecs         #     23.915 CPUs
               425458  context-switches         #      0.000 M/sec
                  203  CPU-migrations           #      0.000 M/sec
             92545093  page-faults              #      0.013 M/sec
        6034363609986  cycles                   #    840.185 M/sec
         437204346785  instructions             #      0.072 IPC
           6636073192  cache-references         #      0.924 M/sec
           2358117732  cache-misses             #      0.328 M/sec
      
        300.320905827  seconds time elapsed
      
      With this patch applied
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7191619.223977  task-clock-msecs         #     23.955 CPUs
               422579  context-switches         #      0.000 M/sec
                   88  CPU-migrations           #      0.000 M/sec
             91946060  page-faults              #      0.013 M/sec
        5957054385619  cycles                   #    828.333 M/sec
        1058117350365  instructions             #      0.178 IPC
           9161776218  cache-references         #      1.274 M/sec
           1920494280  cache-misses             #      0.267 M/sec
      
        300.218764862  seconds time elapsed
      
      Data from Prarit (kernel compile with make -j64 on a 64
      CPU/32G machine)
      
      For a single run
      
      Without patch
      
      real 27m8.988s
      user 87m24.916s
      sys 382m6.037s
      
      With patch
      
      real    4m18.607s
      user    84m58.943s
      sys     50m52.682s
      
      With config turned off
      
      real    4m54.972s
      user    90m13.456s
      sys     50m19.711s
      
      NOTE: The data looks counterintuitive due to the increased performance
      with the patch, even over the config being turned off. We probably need
      more runs, but so far all testing has shown that the patches definitely
      help.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      688985ed
  12. 31 Jul, 2009 1 commit
    • build fix · efe9d9b2
      Andrew Morton authored
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      efe9d9b2
  13. 21 Jul, 2009 2 commits
    • Implement reclaim from groups over their soft limit · 932286d8
      Balbir Singh authored
      Permit reclaim from memory cgroups on contention (via the direct reclaim
      path).
      
      memory cgroup soft limit reclaim finds the group that exceeds its soft
      limit by the largest number of pages and reclaims pages from it and then
      reinserts the cgroup into its correct place in the rbtree.
      
      Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
      loops in case all swap is turned off.  The code has been refactored and
      the loop check (loop < 2) has been enhanced for soft limits.  For soft
      limits, we try to do more targeted reclaim.  Instead of bailing out after
      two loops, the routine now reclaims memory proportional to the amount by
      which the soft limit is exceeded.  The proportion has been empirically
      determined.
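      
      A rough sketch of the changed bailout ("excess" being the number of pages
      over the soft limit; identifiers may differ from the final code):
      
      	/* soft limit: stop once we have reclaimed a quarter of the
      	 * excess, or after too many passes; the hard-limit path
      	 * keeps the old "bail after two loops" behaviour */
      	if (total >= (excess >> 2) ||
      	    loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)
      		break;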
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      932286d8
    • Refactor mem_cgroup_hierarchical_reclaim() · 34d1aa5f
      Balbir Singh authored
      Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
      flags, so that new parameters don't have to be passed as we make the
      reclaim routine more flexible.
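      
      A sketch of the flag encoding (illustrative; the patch's names may differ):
      
      	#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
      	#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
      	#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
      	#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)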
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      34d1aa5f
  14. 04 Aug, 2009 1 commit
  15. 21 Jul, 2009 3 commits
    • Organize cgroups over soft limit in an RB-Tree · fd9213af
      Balbir Singh authored
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates; they take place only after a particular
         time interval has passed.
      2. We remove the node from the RB-Tree when the usage goes below the
         soft limit.
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it, when we
      face memory contention.
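      
      A minimal sketch of the ordered insert, using the kernel rbtree API
      (struct and helper names are hypothetical):
      
      	#include <linux/rbtree.h>
      
      	struct soft_limit_node {		/* hypothetical */
      		struct rb_node tree_node;
      		unsigned long long usage_excess; /* usage - soft limit */
      	};
      
      	static void soft_limit_tree_insert(struct rb_root *root,
      					   struct soft_limit_node *node)
      	{
      		struct rb_node **p = &root->rb_node, *parent = NULL;
      
      		while (*p) {
      			struct soft_limit_node *cur;
      
      			parent = *p;
      			cur = rb_entry(parent, struct soft_limit_node,
      				       tree_node);
      			if (node->usage_excess < cur->usage_excess)
      				p = &(*p)->rb_left;
      			else
      				p = &(*p)->rb_right;
      		}
      		rb_link_node(&node->tree_node, parent, p);
      		rb_insert_color(&node->tree_node, root);
      	}
      
      The rightmost node is then the group exceeding its soft limit by the
      largest amount.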
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fd9213af
    • Add an interface to allow get/set of soft limits. Soft limits for memory · 4f02e2ec
      Balbir Singh authored
      plus swap controller (memsw) are currently not supported.  Resource
      counters have been enhanced to support soft limits, and a new type,
      RES_SOFT_LIMIT, has been added.  Unlike hard limits, soft limits can be
      directly set and do not need any reclaim or checks before being set to
      a newer value.
      
      Kamezawa-San raised a question as to whether soft limit should belong to
      res_counter.  Since all resources understand the basic concepts of hard
      and soft limits, it is justified to add soft limits here.  Soft limits are
      a generic resource usage feature; even file system quotas support soft
      limits.
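      
      A sketch of the new counter type next to the existing res_counter ones
      (the final enum may differ):
      
      	enum {
      		RES_USAGE,
      		RES_MAX_USAGE,
      		RES_LIMIT,
      		RES_FAILCNT,
      		RES_SOFT_LIMIT,	/* new: settable without reclaim */
      	};
      
      Userspace then sets it via the memory.soft_limit_in_bytes file described
      later in this series.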
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f02e2ec
    • Soft limits are a new feature for the memory resource controller; something · c09550ff
      Balbir Singh authored
      similar has existed in the group scheduler in the form of shares.  The CPU
      controller's interpretation of shares is very different, though.
      
      Soft limits are the most useful feature to have for environments where the
      administrator wants to overcommit the system, such that only on memory
      contention do the limits become active.  The current soft limits
      implementation provides a soft_limit_in_bytes interface for the memory
      controller and not for memory+swap controller.  The implementation
      maintains an RB-Tree of groups that exceed their soft limit and starts
      reclaiming from the group that exceeds this limit by the maximum amount.
      
      This patch:
      
      Add documentation for soft limits.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c09550ff
  16. 26 Jun, 2009 2 commits
    • ERROR: code indent should use tabs where possible · 13575798
      Andrew Morton authored
      #22: FILE: mm/memcontrol.c:1138:
      + ^I * We access a page_cgroup asynchronously without lock_page_cgroup().$
      
      ERROR: code indent should use tabs where possible
      #23: FILE: mm/memcontrol.c:1139:
      + ^I * Especially when a page_cgroup is taken from a page, pc->mem_cgroup$
      
      ERROR: code indent should use tabs where possible
      #24: FILE: mm/memcontrol.c:1140:
      + ^I * is accessed after testing USED bit. To make pc->mem_cgroup visible$
      
      ERROR: code indent should use tabs where possible
      #25: FILE: mm/memcontrol.c:1141:
      + ^I * before USED bit, we need memory barrier here.$
      
      ERROR: code indent should use tabs where possible
      #26: FILE: mm/memcontrol.c:1142:
      + ^I * See mem_cgroup_add_lru_list(), etc.$
      
      ERROR: code indent should use tabs where possible
      #27: FILE: mm/memcontrol.c:1143:
      + ^I */$
      
      total: 6 errors, 0 warnings, 13 lines checked
      
      ./patches/memcg-add-comments-explaining-memory-barriers.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      13575798
    • Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge(). · 9c983a9e
      KAMEZAWA Hiroyuki authored
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c983a9e
  17. 29 Jun, 2009 1 commit
  18. 23 Jun, 2009 2 commits
  19. 20 Aug, 2009 9 commits
    • Add functionality that enables users to move all threads in a threadgroup · aba7244f
      Ben Blum authored
      at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  The
      current implementation makes use of a per-threadgroup rwsem that's taken
      for reading in the fork() path to prevent newly forking threads within the
      threadgroup from "escaping" while the move is in progress.
      
      Cgroup subsystems that need to perform per-thread actions in their
      "attach" callback are (currently) responsible for doing their own
      synchronization, since this occurs outside of the critical section that
      locks against cloning within a thread group.
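      
      Example usage (1000 being the tgid of a multithreaded process, as in the
      examples later in this series):
      
      	# echo 1000 > cgroup.procs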
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aba7244f
    • Cc: "Eric W. Biederman" <ebiederm@xmission.com> · bd3246a9
      Andrew Morton authored
      Cc: Ben Blum <bblum@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd3246a9
    • Add an rwsem that lives in a threadgroup's sighand_struct (next to the · d70f0c1f
      Ben Blum authored
      sighand's atomic count, to piggyback on its cacheline), and two functions
      in kernel/cgroup.c (for now) for easily+safely obtaining and releasing it.
      
      If another part of the kernel later wants to use such a locking mechanism,
      the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that
      CGROUPS and the other system would both depend on, and the lock/unlock
      functions could be moved to sched.c or so.
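      
      A minimal sketch, assuming the field and helper names below (the patch's
      actual identifiers may differ):
      
      	struct sighand_struct {
      		atomic_t		count;	/* shares a cacheline */
      	#ifdef CONFIG_CGROUPS
      		struct rw_semaphore	threadgroup_fork_lock;
      	#endif
      		/* ... */
      	};
      
      	/* every CLONE_THREAD fork takes the lock for reading ... */
      	static inline void threadgroup_fork_read_lock(struct sighand_struct *s)
      	{
      	#ifdef CONFIG_CGROUPS
      		down_read(&s->threadgroup_fork_lock);
      	#endif
      	}
      
      	/* ... while the whole-threadgroup attach takes it for writing */
      	static inline void threadgroup_fork_write_lock(struct sighand_struct *s)
      	{
      	#ifdef CONFIG_CGROUPS
      		down_write(&s->threadgroup_fork_lock);
      	#endif
      	}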
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d70f0c1f
    • Alter the ss->can_attach and ss->attach functions to be able to deal with · abb5bfda
      Ben Blum authored
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, the new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      to be added (for example, one that counts tasks within a group) this
      would need to be reworked to tell the subsystem the right
      information.
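      
      A sketch of the widened callbacks (a "threadgroup" flag alongside the
      leader; exact prototypes may differ from the patch):
      
      	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
      			  struct task_struct *tsk, bool threadgroup);
      	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
      		       struct cgroup *old_cgrp, struct task_struct *tsk,
      		       bool threadgroup);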
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      abb5bfda
    • Changes css_set freeing mechanism to be under RCU · eabd4e3a
      Ben Blum authored
      This is a prepatch for making the procs file writable. In order to free the
      old css_sets for each task to be moved as they're being moved, the freeing
      mechanism must be RCU-protected, or else we would have to have a call to
      synchronize_rcu() for each task before freeing its old css_set.
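      
      A sketch of the mechanism (assuming an rcu_head field in css_set and a
      free_css_set_rcu() callback; names may differ):
      
      	static void free_css_set_rcu(struct rcu_head *obj)
      	{
      		struct css_set *cg = container_of(obj, struct css_set,
      						  rcu_head);
      		kfree(cg);
      	}
      
      	/* on the put path, instead of synchronize_rcu() + kfree(cg): */
      	call_rcu(&cg->rcu_head, free_css_set_rcu);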
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eabd4e3a
    • Separate all pidlist allocation requests into a single function that · bd43d26a
      Ben Blum authored
      judges, based on the requested size, whether the array needs to be
      vmalloc()ed or can be gotten via kmalloc(), and similarly for
      kfree/vfree.
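      
      A sketch of the size switch (helper names illustrative):
      
      	static void *pidlist_allocate(int count)
      	{
      		if (count * sizeof(pid_t) > PAGE_SIZE)
      			return vmalloc(count * sizeof(pid_t));
      		else
      			return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
      	}
      
      	static void pidlist_free(void *p)
      	{
      		/* is_vmalloc_addr() tells us which allocator was used */
      		if (is_vmalloc_addr(p))
      			vfree(p);
      		else
      			kfree(p);
      	}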
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd43d26a
    • Previously there was a problem in which two processes from different pid · 532a060d
      Ben Blum authored
      namespaces reading the tasks or procs file could result in one process
      seeing results from the other's namespace.  Rather than one pidlist for
      each file in a cgroup, we now keep a list of pidlists keyed by namespace
      and file type (tasks versus procs) in which entries are placed on demand.
      Each pidlist has its own lock, and because the pidlists themselves are
      passed around in the seq_file's private pointer, we don't have to touch
      the cgroup or its master list except when creating and destroying
      entries.
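      
      A sketch of the per-(namespace, file type) pidlist (fields illustrative):
      
      	struct cgroup_pidlist {
      		struct {
      			enum cgroup_filetype type;	/* tasks or procs */
      			struct pid_namespace *ns;	/* reader's pid ns */
      		} key;
      		pid_t *list;			/* the pid array itself */
      		int length;
      		struct list_head links;		/* on the cgroup's list */
      		struct mutex mutex;		/* per-pidlist lock */
      	};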
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      532a060d
    • struct cgroup used to have a bunch of fields for keeping track of the · 69f3e71d
      Ben Blum authored
      pidlist for the tasks file.  Those are now separated into a new struct
      cgroup_pidlist, of which there are two, one for procs and one for tasks.
      The way the seq_file operations are set up is changed so that just the
      pidlist struct gets passed around as the private data.
      
      Interface example: Suppose a multithreaded process has pid 1000 and other
      threads with ids 1001, 1002, 1003:
      $ cat tasks
      1000
      1001
      1002
      1003
      $ cat cgroup.procs
      1000
      $
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      69f3e71d
    • The following series adds a "cgroup.procs" file to each cgroup that · f4db74a9
      Paul Menage authored
      reports unique tgids rather than pids, and allows all threads in a
      threadgroup to be atomically moved to a new cgroup.
      
      The subsystem "attach" interface is modified to support attaching whole
      threadgroups at a time, which could introduce potential problems if any
      subsystem were to need to access the old cgroup of every thread being
      moved.  The attach interface may need to be revised if this becomes the
      case.
      
      Also added is functionality for read/write locking all CLONE_THREAD
      fork()ing within a threadgroup, by means of an rwsem that lives in the
      sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
      with the sighand's atomic count.  This scheme should introduce no extra
      overhead in the fork path when there's no contention.
      
      The final patch reveals a potential race when forking before a
      subsystem's attach function is called; one potential solution, in case any
      subsystem has this problem, is to hang on to the group's fork mutex
      through the attach() calls, though no subsystem yet demonstrates a need for
      an extended critical section.
      
      
      
      This patch:
      
      Revert
      
      commit 096b7fe0
      Author:     Li Zefan <lizf@cn.fujitsu.com>
      AuthorDate: Wed Jul 29 15:04:04 2009 -0700
      Commit:     Linus Torvalds <torvalds@linux-foundation.org>
      CommitDate: Wed Jul 29 19:10:35 2009 -0700
      
          cgroups: fix pid namespace bug
      
      
      This is in preparation for some clashing cgroups changes that subsume the
      original commit's functionality.
      
      The original commit fixed a pid namespace bug which Ben Blum fixed
      independently (in the same way, but with different code) as part of a
      series of patches.  I played around with trying to reconcile Ben's patch
      series with Li's patch, but concluded that it was simpler to just revert
      Li's, given that Ben's patch series contained essentially the same fix.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f4db74a9
  20. 30 Jul, 2009 2 commits
    • This patch removes the restriction that a cgroup hierarchy must have at · f83c0e34
      Paul Menage authored
      least one bound subsystem.  The mount option "none" is treated as an
      explicit request for no bound subsystems.
      
      A hierarchy with no subsystems can be useful for plain task tracking, and
      is also a step towards the support for multiply-bindable subsystems.
      
      As part of this change, the hierarchy id is no longer calculated from the
      bitmask of subsystems in the hierarchy (since this is not guaranteed to be
      unique) but is allocated via an ida.  Reference counts on cgroups from
      css_set objects are now taken explicitly one per hierarchy, rather than
      one per subsystem.
      
      Example usage:
      
      mount -t cgroup -o none,name=foo cgroup /mnt/cgroup
      
      Based on the "no-op"/"none" subsystem concept proposed by
      kamezawa.hiroyu@jp.fujitsu.com
      Signed-off-by: Paul Menage <menage@google.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f83c0e34
    • Currently the cgroups code makes the assumption that the subsystem · 41a6731a
      Paul Menage authored
      pointers in a struct css_set uniquely identify the hierarchy->cgroup
      mappings associated with the css_set; and there's no way to directly
      identify the associated set of cgroups other than by indirecting through
      the appropriate subsystem state pointers.
      
      This patch removes the need for that assumption by adding a back-pointer
      from struct cg_cgroup_link object to its associated cgroup; this allows
      the set of cgroups to be determined by traversing the cg_links list in the
      struct css_set.
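      
      A sketch of the link object with its new back-pointer (field names as in
      the description; details may differ):
      
      	struct cg_cgroup_link {
      		struct list_head cgrp_link_list; /* on cgrp->css_sets */
      		struct cgroup *cgrp;		 /* new: the cgroup itself */
      		struct list_head cg_link_list;	 /* on css_set->cg_links */
      		struct css_set *cg;		 /* the css_set being linked */
      	};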
      Signed-off-by: Paul Menage <menage@google.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      41a6731a