1. 21 Jul, 2009 1 commit
  2. 04 Aug, 2009 1 commit
  3. 21 Jul, 2009 3 commits
    • Organize cgroups over soft limit in a RB-Tree · fd9213af
      Balbir Singh authored
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates: they take place only after a particular
         time interval has passed.
      2. Remove the node from the RB-Tree when the usage goes below the soft
         limit.
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it when we
      face memory contention.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
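
      A minimal C sketch of the pattern described in the entry above: keep
      over-soft-limit groups in an RB-Tree ordered by how far they exceed the
      limit, throttle insertions by a time interval, and drop a node once usage
      falls back under the limit.  All names and the interval are illustrative,
      not the actual mm/memcontrol.c symbols.

      #include <linux/rbtree.h>
      #include <linux/jiffies.h>
      #include <linux/spinlock.h>
      #include <linux/types.h>

      /* Illustrative stand-in; not the real struct mem_cgroup. */
      struct memcg_node {
              struct rb_node node;
              unsigned long excess;           /* usage - soft_limit */
              unsigned long next_update;      /* jiffies of next allowed update */
              bool on_tree;
      };

      static struct rb_root soft_limit_tree = RB_ROOT;
      static DEFINE_SPINLOCK(soft_limit_lock);
      #define SOFT_LIMIT_UPDATE_INTERVAL (HZ / 4)     /* interval is assumed */

      /* Insert (or re-insert) a group that exceeds its soft limit, but only
       * after the update interval has elapsed. */
      static void soft_limit_tree_update(struct memcg_node *mz, unsigned long excess)
      {
              struct rb_node **p, *parent = NULL;

              if (time_before(jiffies, mz->next_update))
                      return;
              mz->next_update = jiffies + SOFT_LIMIT_UPDATE_INTERVAL;

              spin_lock(&soft_limit_lock);
              if (mz->on_tree)
                      rb_erase(&mz->node, &soft_limit_tree);
              mz->excess = excess;
              p = &soft_limit_tree.rb_node;
              while (*p) {
                      struct memcg_node *cur;

                      parent = *p;
                      cur = rb_entry(parent, struct memcg_node, node);
                      if (mz->excess < cur->excess)
                              p = &(*p)->rb_left;
                      else
                              p = &(*p)->rb_right;
              }
              rb_link_node(&mz->node, parent, p);
              rb_insert_color(&mz->node, &soft_limit_tree);
              mz->on_tree = true;
              spin_unlock(&soft_limit_lock);
      }

      /* Remove a group from the tree once usage drops below the soft limit. */
      static void soft_limit_tree_remove(struct memcg_node *mz)
      {
              spin_lock(&soft_limit_lock);
              if (mz->on_tree) {
                      rb_erase(&mz->node, &soft_limit_tree);
                      mz->on_tree = false;
              }
              spin_unlock(&soft_limit_lock);
      }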
    • Add an interface to allow get/set of soft limits. Soft limits for memory · 4f02e2ec
      Balbir Singh authored
      plus swap controller (memsw) are currently not supported.  Resource
      counters have been enhanced to support soft limits, and a new type,
      RES_SOFT_LIMIT, has been added.  Unlike hard limits, soft limits can be
      set directly and do not need any reclaim or checks before being set to
      a new value.
      
      Kamezawa-San raised a question as to whether soft limit should belong to
      res_counter.  Since all resources understand the basic concepts of hard
      and soft limits, it is justified to add soft limits here.  Soft limits are
      a generic resource usage feature; even file system quotas support soft
      limits.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
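
      A hedged sketch of what a soft limit added to a resource counter could
      look like: setting it needs no reclaim or usage check, and a helper
      reports how far usage exceeds it.  Field and function names here are
      assumptions for illustration, not the exact res_counter additions.

      #include <linux/spinlock.h>
      #include <linux/types.h>

      /* Simplified stand-in for a resource counter with a soft limit added. */
      struct res_counter_sketch {
              unsigned long long usage;
              unsigned long long limit;       /* hard limit */
              unsigned long long soft_limit;  /* new: may be exceeded until reclaim */
              spinlock_t lock;
      };

      /* Unlike the hard limit, a soft limit can be set directly: no reclaim
       * or usage check is needed, it only records the preferred ceiling. */
      static void res_counter_sketch_set_soft_limit(struct res_counter_sketch *cnt,
                                                    unsigned long long val)
      {
              spin_lock(&cnt->lock);
              cnt->soft_limit = val;
              spin_unlock(&cnt->lock);
      }

      /* Report whether (and by how much) usage exceeds the soft limit. */
      static bool res_counter_sketch_soft_excess(struct res_counter_sketch *cnt,
                                                 unsigned long long *excess)
      {
              bool over;

              spin_lock(&cnt->lock);
              over = cnt->usage > cnt->soft_limit;
              *excess = over ? cnt->usage - cnt->soft_limit : 0;
              spin_unlock(&cnt->lock);
              return over;
      }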
    • Soft limits are a new feature for the memory resource controller; something · c09550ff
      Balbir Singh authored
      similar has existed in the group scheduler in the form of shares.  The CPU
      controller's interpretation of shares is very different, though.
      
      Soft limits are the most useful feature to have for environments where the
      administrator wants to overcommit the system, such that only on memory
      contention do the limits become active.  The current soft limits
      implementation provides a soft_limit_in_bytes interface for the memory
      controller but not for the memory+swap controller.  The implementation
      maintains an RB-Tree of groups that exceed their soft limit and starts
      reclaiming from the group that exceeds this limit by the maximum amount.
      
      This patch:
      
      Add documentation for soft limits
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 26 Jun, 2009 2 commits
    • ERROR: code indent should use tabs where possible · 13575798
      Andrew Morton authored
      #22: FILE: mm/memcontrol.c:1138:
      + ^I * We access a page_cgroup asynchronously without lock_page_cgroup().$
      
      ERROR: code indent should use tabs where possible
      #23: FILE: mm/memcontrol.c:1139:
      + ^I * Especially when a page_cgroup is taken from a page, pc->mem_cgroup$
      
      ERROR: code indent should use tabs where possible
      #24: FILE: mm/memcontrol.c:1140:
      + ^I * is accessed after testing USED bit. To make pc->mem_cgroup visible$
      
      ERROR: code indent should use tabs where possible
      #25: FILE: mm/memcontrol.c:1141:
      + ^I * before USED bit, we need memory barrier here.$
      
      ERROR: code indent should use tabs where possible
      #26: FILE: mm/memcontrol.c:1142:
      + ^I * See mem_cgroup_add_lru_list(), etc.$
      
      ERROR: code indent should use tabs where possible
      #27: FILE: mm/memcontrol.c:1143:
      + ^I */$
      
      total: 6 errors, 0 warnings, 13 lines checked
      
      ./patches/memcg-add-comments-explaining-memory-barriers.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge(). · 9c983a9e
      KAMEZAWA Hiroyuki authored
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
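
      A schematic sketch of the barrier pairing those comments describe: the
      writer publishes pc->mem_cgroup before setting the USED bit, and the
      reader issues a paired read barrier after testing the bit.  The types
      and helpers below are illustrative, not the real page_cgroup code.

      #include <asm/barrier.h>
      #include <linux/bitops.h>
      #include <linux/types.h>

      /* Illustrative stand-in for struct page_cgroup. */
      struct pc_sketch {
              unsigned long flags;            /* bit 0 plays the role of USED */
              void *mem_cgroup;
      };
      #define PC_USED 0

      /* Writer (commit_charge): publish mem_cgroup before setting USED. */
      static void pc_commit(struct pc_sketch *pc, void *memcg)
      {
              pc->mem_cgroup = memcg;
              smp_wmb();      /* order the store above before the USED bit */
              set_bit(PC_USED, &pc->flags);
      }

      /* Reader: only trust mem_cgroup after seeing USED, with a paired barrier. */
      static void *pc_lookup(struct pc_sketch *pc)
      {
              if (!test_bit(PC_USED, &pc->flags))
                      return NULL;
              smp_rmb();      /* pairs with smp_wmb() in pc_commit() */
              return pc->mem_cgroup;
      }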
  5. 29 Jun, 2009 1 commit
  6. 23 Jun, 2009 2 commits
  7. 20 Aug, 2009 9 commits
    • Add functionality that enables users to move all threads in a threadgroup · aba7244f
      Ben Blum authored
      at once to a cgroup by writing the tgid to the 'cgroup.procs' file.  The
      current implementation makes use of a per-threadgroup rwsem that's taken
      for reading in the fork() path to prevent newly forking threads within the
      threadgroup from "escaping" while the move is in progress.
      
      Cgroups subsystems that need to perform per-thread actions in their
      "attach" callback are (currently) responsible for doing their own
      synchronization, since this occurs outside of the critical section that
      locks against cloning within a thread group.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
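
      A sketch of the locking pattern described above, with illustrative names
      rather than the patch's actual symbols: fork() takes the per-threadgroup
      rwsem for reading, and the cgroup.procs write path takes it for writing
      around the whole move.

      #include <linux/rwsem.h>

      /* Illustrative per-threadgroup state; not the patch's actual symbols. */
      struct tg_sketch {
              struct rw_semaphore fork_sem;
      };

      /* fork()/clone(CLONE_THREAD) path: taken for reading, so concurrent
       * forks in the same group do not serialize against each other. */
      static void tg_fork_enter(struct tg_sketch *tg)
      {
              down_read(&tg->fork_sem);
      }

      static void tg_fork_exit(struct tg_sketch *tg)
      {
              up_read(&tg->fork_sem);
      }

      /* "echo <tgid> > cgroup.procs" path: taken for writing around the whole
       * move, so no newly cloned thread can escape into the old cgroup. */
      static int tg_attach_all(struct tg_sketch *tg)
      {
              int ret = 0;

              down_write(&tg->fork_sem);
              /* ... walk every thread in the group and attach it ... */
              up_write(&tg->fork_sem);
              return ret;
      }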
    • Cc: "Eric W. Biederman" <ebiederm@xmission.com> · bd3246a9
      Andrew Morton authored
      Cc: Ben Blum <bblum@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Add an rwsem that lives in a threadgroup's sighand_struct (next to the · d70f0c1f
      Ben Blum authored
      sighand's atomic count, to piggyback on its cacheline), and two functions
      in kernel/cgroup.c (for now) for easily+safely obtaining and releasing it.
      
      If another part of the kernel later wants to use such a locking mechanism,
      the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that
      CGROUPS and the other system would both depend on, and the lock/unlock
      functions could be moved to sched.c or so.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
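
      A sketch of where such a lock could sit, per the description above; the
      field name and CONFIG guard are assumptions, not the actual
      sighand_struct layout.

      #include <linux/atomic.h>
      #include <linux/rwsem.h>

      /* Illustrative layout only: the rwsem sits right after the sighand
       * refcount so both share a hot cacheline; the lock/unlock wrappers
       * would live in kernel/cgroup.c behind CONFIG_CGROUPS. */
      struct sighand_sketch {
              atomic_t count;                            /* existing refcount */
      #ifdef CONFIG_CGROUPS
              struct rw_semaphore threadgroup_fork_sem;  /* new, name assumed */
      #endif
              /* ... action[], siglock, signalfd_wqh ... */
      };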
    • Alter the ss->can_attach and ss->attach functions to be able to deal with · abb5bfda
      Ben Blum authored
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, the new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      to be added (for example, one that counts tasks within a group), this
      interface would need to be reworked to tell the subsystem the right
      information.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
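
      A hedged sketch of a subsystem callback extended with a threadgroup flag;
      the signature, helper, and use of the era's thread_group list are
      illustrative, not the exact cgroup_subsys interface.

      #include <linux/rculist.h>
      #include <linux/rcupdate.h>
      #include <linux/sched.h>
      #include <linux/types.h>

      struct cgroup_sketch;                           /* stand-in for struct cgroup */
      static int can_attach_one(struct cgroup_sketch *cgrp, struct task_struct *t);

      /* With the threadgroup flag set, every thread in the leader's group is
       * checked, not just the leader; the subsystem is still only told the
       * leader's old cgroup.  (Uses the thread_group list of this era.) */
      static int example_can_attach(struct cgroup_sketch *cgrp,
                                    struct task_struct *leader, bool threadgroup)
      {
              struct task_struct *t;
              int ret;

              ret = can_attach_one(cgrp, leader);
              if (ret || !threadgroup)
                      return ret;

              rcu_read_lock();
              list_for_each_entry_rcu(t, &leader->thread_group, thread_group) {
                      ret = can_attach_one(cgrp, t);
                      if (ret)
                              break;
              }
              rcu_read_unlock();
              return ret;
      }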
    • Changes css_set freeing mechanism to be under RCU · eabd4e3a
      Ben Blum authored
      This is a pre-patch for making the procs file writable.  In order to free
      each task's old css_set as the task is moved, the freeing mechanism must be
      RCU-protected; otherwise we would need a call to synchronize_rcu() for each
      task before freeing its old css_set.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
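
      A sketch of the RCU-deferred freeing described above, using a stand-in
      structure rather than the real css_set: embed an rcu_head and hand the
      final free to call_rcu() instead of freeing directly.

      #include <linux/kernel.h>
      #include <linux/rcupdate.h>
      #include <linux/slab.h>

      /* Stand-in object; the real css_set would gain an rcu_head the same way. */
      struct cset_sketch {
              /* ... refcount, links, hash node ... */
              struct rcu_head rcu_head;
      };

      static void cset_free_rcu(struct rcu_head *head)
      {
              kfree(container_of(head, struct cset_sketch, rcu_head));
      }

      /* Instead of freeing directly (or calling synchronize_rcu() per task),
       * defer the free until readers that may still see the old css_set under
       * rcu_read_lock() have finished. */
      static void cset_release(struct cset_sketch *cset)
      {
              call_rcu(&cset->rcu_head, cset_free_rcu);
      }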
    • Separate all pidlist allocation requests into a single helper that · bd43d26a
      Ben Blum authored
      decides, based on the requested size, whether the array needs to be
      vmalloced or can be obtained via kmalloc, and similarly for kfree/vfree.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
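
      A sketch of such a size-based helper pair (names and the PAGE_SIZE
      threshold are illustrative, not necessarily the patch's own):

      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/types.h>
      #include <linux/vmalloc.h>

      /* Small pid arrays come from kmalloc; large ones fall back to vmalloc. */
      static void *pidlist_allocate(int count)
      {
              size_t size = count * sizeof(pid_t);

              if (size > PAGE_SIZE)
                      return vmalloc(size);
              return kmalloc(size, GFP_KERNEL);
      }

      /* The free side checks which allocator was used. */
      static void pidlist_free(void *p)
      {
              if (is_vmalloc_addr(p))
                      vfree(p);
              else
                      kfree(p);
      }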
    • Previously there was the problem in which two processes from different pid · 532a060d
      Ben Blum authored
      namespaces reading the tasks or procs file could result in one process
      seeing results from the other's namespace.  Rather than one pidlist for
      each file in a cgroup, we now keep a list of pidlists keyed by namespace
      and file type (tasks versus procs) in which entries are placed on demand. 
      Each pidlist has its own lock, and because the pidlists themselves are
      passed around in the seq_file's private pointer, we don't have to touch the
      cgroup or its master list except when creating and destroying entries.
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
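
      An illustrative layout for a per-(namespace, file type) pidlist with its
      own lock, as described above; the structure and field names are
      assumptions, not the actual kernel/cgroup.c definitions.

      #include <linux/list.h>
      #include <linux/mutex.h>
      #include <linux/types.h>

      struct pid_namespace;                   /* opaque here */

      enum pidlist_type { PIDLIST_TASKS, PIDLIST_PROCS };

      /* One of these per (pid namespace, file type), created on demand and
       * chained off the cgroup; readers in different namespaces therefore
       * never share an array. */
      struct pidlist_sketch {
              enum pidlist_type type;
              struct pid_namespace *ns;       /* key: the reader's namespace */
              pid_t *array;                   /* sorted pids or tgids */
              int length;
              struct mutex lock;              /* per-list, not per-cgroup */
              struct list_head links;         /* on the cgroup's list of pidlists */
      };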
    • struct cgroup used to have a bunch of fields for keeping track of the · 69f3e71d
      Ben Blum authored
      pidlist for the tasks file.  Those are now separated into a new struct
      cgroup_pidlist, of which each cgroup keeps two: one for procs and one for
      tasks.  The way the seq_file operations are set up is changed so that just
      the pidlist struct gets passed around as the private data.
      
      Interface example: Suppose a multithreaded process has pid 1000 and other
      threads with ids 1001, 1002, 1003:
      $ cat tasks
      1000
      1001
      1002
      1003
      $ cat cgroup.procs
      1000
      $
      Signed-off-by: Ben Blum <bblum@google.com>
      Signed-off-by: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
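
      A sketch of wiring a pidlist through the seq_file private pointer, as
      described above; the trimmed structure and operation names are
      illustrative, not the actual cgroup seq_file code.

      #include <linux/seq_file.h>
      #include <linux/types.h>

      /* Trimmed version of the pidlist from the sketch above. */
      struct pidlist_view {
              pid_t *array;
              int length;
      };

      /* The seq_file's private pointer carries the pidlist, so nothing needs
       * to be stashed in struct cgroup for an open tasks/procs file. */
      static void *pidlist_start(struct seq_file *m, loff_t *pos)
      {
              struct pidlist_view *l = m->private;

              if (*pos >= l->length)
                      return NULL;
              return l->array + *pos;
      }

      static void *pidlist_next(struct seq_file *m, void *v, loff_t *pos)
      {
              struct pidlist_view *l = m->private;

              if (++*pos >= l->length)
                      return NULL;
              return l->array + *pos;
      }

      static void pidlist_stop(struct seq_file *m, void *v)
      {
      }

      static int pidlist_show(struct seq_file *m, void *v)
      {
              seq_printf(m, "%d\n", *(int *)v);
              return 0;
      }

      static const struct seq_operations pidlist_seq_ops = {
              .start = pidlist_start,
              .next  = pidlist_next,
              .stop  = pidlist_stop,
              .show  = pidlist_show,
      };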
    • The following series adds a "cgroup.procs" file to each cgroup that · f4db74a9
      Paul Menage authored
      reports unique tgids rather than pids, and allows all threads in a
      threadgroup to be atomically moved to a new cgroup.
      
      The subsystem "attach" interface is modified to support attaching whole
      threadgroups at a time, which could introduce potential problems if any
      subsystem were to need to access the old cgroup of every thread being
      moved.  The attach interface may need to be revised if this becomes the
      case.
      
      Also added is functionality for read/write locking all CLONE_THREAD
      fork()ing within a threadgroup, by means of an rwsem that lives in the
      sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
      with the sighand's atomic count.  This scheme should introduce no extra
      overhead in the fork path when there's no contention.
      
      The final patch reveals potential for a race when forking before a
      subsystem's attach function is called - one potential solution in case any
      subsystem has this problem is to hang on to the group's fork mutex through
      the attach() calls, though no subsystem yet demonstrates need for an
      extended critical section.
      
      This patch:
      
      Revert
      
      commit 096b7fe0
      Author:     Li Zefan <lizf@cn.fujitsu.com>
      AuthorDate: Wed Jul 29 15:04:04 2009 -0700
      Commit:     Linus Torvalds <torvalds@linux-foundation.org>
      CommitDate: Wed Jul 29 19:10:35 2009 -0700
      
          cgroups: fix pid namespace bug
      
      
      This is in preparation for some clashing cgroups changes that subsume the
      original commit's functionality.
      
      The original commit fixed a pid namespace bug which Ben Blum fixed
      independently (in the same way, but with different code) as part of a
      series of patches.  I played around with trying to reconcile Ben's patch
      series with Li's patch, but concluded that it was simpler to just revert
      Li's, given that Ben's patch series contained essentially the same fix.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. 30 Jul, 2009 4 commits
  9. 24 Jul, 2009 1 commit
  10. 18 Aug, 2009 1 commit
  11. 14 Aug, 2009 1 commit
  12. 04 Aug, 2009 1 commit
  13. 31 Jul, 2009 4 commits
  14. 13 Jul, 2009 1 commit
  15. 21 Nov, 2008 1 commit
    • This is a patchset to change the way that the HFS+ filesystem detects · 0ba79e5d
      Warren Turkal authored
      whether a volume has a journal or not.
      
      The code currently mounts an HFS+ volume read-only by default when a
      journal is detected.  One can force a read/write mount by giving the
      "force" mount option.  The current code has this behavior since there is
      no support for the HFS+ journal.
      
      My problem is that the detection of the journal could be better.  The
      current code tests the attribute bit in the volume header that indicates
      there is a journal.  If that bit is set, the code assumes that there is a
      journal.
      
      Unfortunately, this is not enough to really determine if there is a
      journal or not.  When that bit is set, one must also examine the journal
      info block field of the volume header.  If this field is 0, there is no
      journal, and the volume can be mounted read/write.
      
      
      This patch:
      
      The HFS+ support in the kernel currently will mount an HFS+ volume
      read-only if the volume header has the attribute bit set that indicates
      there is a journal.  The kernel does this because there is no support for
      a journalled HFS+ volume.
      
      The problem is that this is only half of what needs to be checked to see
      if there really is a journal.  There is also an entry in the volume header
      that tells you where to find the journal info block.  In the kernel's
      version of the volume header structure, this 4-byte field is labeled
      reserved.  This patch identifies the journal info block in the header.
      Signed-off-by: Warren Turkal <wt@penguintechs.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
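
      A hedged sketch of the two-part check described above: the journal
      attribute bit must be set and the journal info block must be non-zero.
      The structure, field names, and bit value are illustrative approximations
      of the on-disk volume header, not the kernel's structs.

      #include <asm/byteorder.h>
      #include <linux/types.h>

      #define VOL_ATTR_JOURNALED (1 << 13)    /* bit position assumed here */

      /* Trimmed, illustrative view of the volume header fields involved. */
      struct hfsplus_vh_sketch {
              __be32 attributes;
              __be32 journal_info_block;      /* the formerly "reserved" 4 bytes */
      };

      /* A journal really exists only when the attribute bit is set AND the
       * journal info block is non-zero; otherwise read/write is safe. */
      static bool volume_is_journaled(const struct hfsplus_vh_sketch *vh)
      {
              return (be32_to_cpu(vh->attributes) & VOL_ATTR_JOURNALED) &&
                     be32_to_cpu(vh->journal_info_block) != 0;
      }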
  16. 17 Aug, 2009 1 commit
  17. 18 Jul, 2009 2 commits
    • ERROR: "(foo*)" should be "(foo *)" · ce6d8bf8
      Andrew Morton authored
      #62: FILE: fs/minix/dir.c:481:
      +		struct inode *inode = (struct inode*)mapping->host;
      
      total: 1 errors, 0 warnings, 46 lines checked
      
      ./patches/v3-minixfs-add-missing-directory-type-checking.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: "Doug Graham" <dgraham@nortel.com>
      Cc: Doug Graham <dgraham@nortel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ce6d8bf8
    • There are a few places in the Minix FS code where the "inode" field of a · 38a369d4
      Doug Graham authored
      minix_dir_entry is used without checking first to see if the dirent is
      really a minix3_dir_entry.  The inode number in a V1/V2 dirent is 16 bits,
      whereas that in a V3 dirent is 32 bits.
      
      Accessing it as a 16-bit field when it really should be accessed as a
      32-bit field happens to mostly work on a little-endian machine, but leads
      to some rather odd behaviour on big-endian machines.
      Signed-off-by: Doug Graham <dgraham@nortel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
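
      A sketch of the width-aware access the text calls for; the trimmed dirent
      layouts and helper are illustrative, not the actual fs/minix definitions.

      #include <linux/types.h>

      /* Trimmed, illustrative dirent layouts: V1/V2 store a 16-bit inode
       * number, V3 a 32-bit one. */
      struct minix_de_sketch {
              __u16 inode;
              char name[];
      };

      struct minix3_de_sketch {
              __u32 inode;
              char name[];
      };

      /* Pick the right width based on the filesystem version instead of
       * always reading 16 bits. */
      static unsigned int dirent_ino(const void *de, int is_v3)
      {
              if (is_v3)
                      return ((const struct minix3_de_sketch *)de)->inode;
              return ((const struct minix_de_sketch *)de)->inode;
      }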
  18. 13 Jul, 2009 1 commit
  19. 20 Aug, 2009 2 commits
  20. 06 Aug, 2009 1 commit