  1. 12 Nov, 2009 8 commits
    • By returning early if the node is not online, we can unindent the · de737fa9
      Alex Chiang authored
      interesting code by one level.
      
      No functional change.
      Signed-off-by: Alex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de737fa9
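The "return early" restructuring described in this commit can be illustrated with a small sketch. The struct and function names below are hypothetical and are not taken from the kernel; both variants compute the same result, which is the sense of "no functional change".

```c
#include <stdbool.h>

/* Hypothetical stand-in for a node; not the kernel's data structure. */
struct node {
	bool online;
	int nr_sections;
};

/* Before: the interesting work sits one level deep inside the check. */
static int count_sections_nested(const struct node *n)
{
	int count = 0;

	if (n->online) {
		for (int i = 0; i < n->nr_sections; i++)
			count++;
	}
	return count;
}

/* After: bail out first, so the main path loses one indentation level. */
static int count_sections_early_return(const struct node *n)
{
	int count = 0;

	if (!n->online)
		return 0;

	for (int i = 0; i < n->nr_sections; i++)
		count++;
	return count;
}
```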
    • Commit c04fc586 (mm: show node to memory section relationship with · a969a507
      Alex Chiang authored
      symlinks in sysfs) created symlinks from nodes to memory sections, e.g.
      
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      
      If you're examining the memory section though and are wondering what node
      it might belong to, you can find it by grovelling around in sysfs, but
      it's a little cumbersome.
      
      Add a reverse symlink for each memory section that points back to the
      node to which it belongs.
      Signed-off-by: Alex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a969a507
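With a reverse symlink in place, user space could find the owning node by looking at the entries of a memory section directory. The helper below is a minimal sketch that assumes the reverse link is named "node<N>"; that name is an assumption for illustration, and the function itself is hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the node id encoded in a directory entry named "node<N>",
 * or -1 if the name does not have that form.
 * Assumption: the reverse symlink is named after the node, e.g. "node1".
 */
static int node_id_from_entry(const char *name)
{
	char *end;
	long id;

	if (strncmp(name, "node", 4) != 0)
		return -1;
	/* Require a digit right after the prefix (no sign, no whitespace). */
	if (!isdigit((unsigned char)name[4]))
		return -1;
	id = strtol(name + 4, &end, 10);
	if (*end != '\0')
		return -1;
	return (int)id;
}
```

A caller would typically walk the section directory with opendir()/readdir() and pass each d_name to this helper, stopping at the first non-negative result.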
    • When do_nonlinear_fault() realizes that the page table must have been · e5c975d9
      Hugh Dickins authored
      corrupted for it to have been called, it does print_bad_pte() and returns
      ...  VM_FAULT_OOM, which is hard to understand.
      
      It made some sense when I did it for 2.6.15, when do_page_fault() just
      killed the current process; but nowadays it lets the OOM killer decide who
      to kill - so page table corruption in one process would be liable to kill
      another.
      
      Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
      the process will be killed, but is good enough for such a rare
      abnormality, accompanied as it is by the "BUG: Bad page map" message.
      
      And recent HWPOISON work has copied that code into do_swap_page(), when it
      finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e5c975d9
    • CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t, · f1526059
      Hugh Dickins authored
      and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
      enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.
      
      When 2.6.15 placed the split page table lock inside struct page (usually
      sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
      we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
      letting the spinlock_t occupy both page->private and page->mapping).
      
      Should these debugging options be allowed to double the size of a struct
      page, when only one minority use of the page (as a page table) needs to
      fit a spinlock in there?  Perhaps not.
      
      Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC is in force.  I've sometimes tried to be cleverer,
      kmallocing a cacheline for the spinlock when it doesn't fit, but given up
      each time.  Falling back to mm->page_table_lock (as we do when ptlock is
      not split) lets lockdep check out the strictest path anyway.
      
      And now that some arches allow 8192 cpus, use 999999 for infinity.
      
      (What has this got to do with KSM swapping?  It doesn't care about the
      size of struct page, but may care about random junk in page->mapping - to
      be explained separately later.)
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f1526059
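The byte counts quoted in this commit message add up to exactly the usual size of struct page, which is why the debugging options could double it. The sketch below only redoes that arithmetic; the table and function names are made up for illustration.

```c
#include <stddef.h>

/* Byte counts quoted in the commit message; index 0 = 32-bit, 1 = 64-bit. */
static const size_t debug_spinlock_extra[2] = { 12, 16 };
static const size_t lock_alloc_extra[2]     = { 12, 24 };
static const size_t lock_stat_extra[2]      = {  8, 16 };
static const size_t usual_struct_page[2]    = { 32, 56 };

/*
 * Total bytes added to spinlock_t when DEBUG_SPINLOCK, DEBUG_LOCK_ALLOC
 * and LOCK_STAT are all enabled. is_64bit must be 0 or 1.
 */
static size_t lockdep_extra_bytes(int is_64bit)
{
	return debug_spinlock_extra[is_64bit] +
	       lock_alloc_extra[is_64bit] +
	       lock_stat_extra[is_64bit];
}
```

On both word sizes the sum (12 + 12 + 8 and 16 + 24 + 16) matches the corresponding entry in usual_struct_page.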
    • KSM swapping will know where page_referenced_one() and try_to_unmap_one() · 816d8b98
      Hugh Dickins authored
      should look.  It could hack page->index to get them to do what it wants,
      but it seems cleaner now to pass the address down to them.
      
      Make the same change to page_mkclean_one(), since it follows the same
      pattern; but there's no real need in its case.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      816d8b98
    • Remove three degrees of obfuscation, left over from when we had · 3c40c0f6
      Hugh Dickins authored
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3c40c0f6
    • There's contorted mlock/munlock handling in try_to_unmap_anon() and · 981cbd03
      Hugh Dickins authored
      try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping. 
      Simplify it by moving it all down into try_to_unmap_one().
      
      One thing is then lost, try_to_munlock()'s distinction between when no vma
      holds the page mlocked, and when a vma does mlock it, but we could not get
      mmap_sem to set the page flag.  But its only caller takes no interest in
      that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
      the code simple and return SWAP_AGAIN for both cases.
      
      try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
      amusing: once unravelled, it turns out to have been choosing between two
      different ways of doing the same nothing.  Ah, no, one way was actually
      returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      981cbd03
    • At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set · ed4047fa
      Hugh Dickins authored
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed4047fa
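The low-bit tagging scheme described in this commit can be sketched in user-space C. This is a simplified illustration, not the kernel source: struct page is reduced to the single tagged word, struct anon_vma is a placeholder, and the flag values (1 and 2) are assumed from the "low two bits" wording.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values: the low two bits of page->mapping. */
#define PAGE_MAPPING_ANON  1UL
#define PAGE_MAPPING_KSM   2UL
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

struct anon_vma { int dummy; };      /* placeholder, not the kernel struct */
struct page { uintptr_t mapping; };  /* simplified: only the tagged word */

/* Anonymous page: the ANON bit is set. */
static int PageAnon(const struct page *page)
{
	return (page->mapping & PAGE_MAPPING_ANON) != 0;
}

/* KSM page: both flag bits are set (KSM is always set along with ANON). */
static int PageKsm(const struct page *page)
{
	return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_FLAGS;
}

/* The pointer part of page->mapping, with the flag bits masked off. */
static void *page_rmapping(const struct page *page)
{
	return (void *)(page->mapping & ~(uintptr_t)PAGE_MAPPING_FLAGS);
}

/* The anon_vma pointer, but only when the word really holds one. */
static struct anon_vma *page_anon_vma(const struct page *page)
{
	if ((page->mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return page_rmapping(page);
}
```

The scheme relies on the pointed-to objects being at least 4-byte aligned, so the two low bits of a valid pointer are always zero and free to carry flags.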
  2. 14 Nov, 2009 1 commit
  3. 12 Nov, 2009 4 commits
    • After kswapd balances all zones in a pgdat, it goes to sleep. In the · 0460a08d
      Mel Gorman authored
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there is a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for a sufficient length of time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0460a08d
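The two-stage sleep decision can be sketched as a pure function. This is not the kswapd code: the enum, the function name and its two boolean inputs are hypothetical, and the sketch covers only the decision taken after the short trial sleep (HZ/10 in the patch).

```c
#include <stdbool.h>

enum sleep_result { SLEEP_PREMATURE, SLEEP_FULL };

/*
 * Decide what happens after the short trial sleep.
 * woken_early:       another process woke kswapd during the trial sleep
 * high_watermark_ok: the high watermark still holds after the trial sleep
 */
static enum sleep_result after_trial_sleep(bool woken_early,
					   bool high_watermark_ok)
{
	if (woken_early || !high_watermark_ok)
		return SLEEP_PREMATURE;	/* keep balancing the zones */
	return SLEEP_FULL;		/* safe to sleep until woken */
}
```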
    • Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least · a6e043a4
      Mel Gorman authored
      the commits 373c0a7e 8aa7e847 dramatically increased the number of
      GFP_ATOMIC failures that were occurring within a wireless driver. 
      Reverting this patch seemed to help a lot even though it was pointed out
      that the congestion changes were very far away from high-order atomic
      allocations.
      
      The key to why the revert makes such a big difference is down to timing
      and how long direct reclaimers wait versus kswapd.  With the patch
      reverted, the congestion_wait() is on the SYNC queue instead of the ASYNC.
       As a significant part of the workload involved reads, it makes sense that
      the SYNC list is what was truly congested and with the revert processes
      were waiting on congestion as expected.  Hence, direct reclaimers stalled
      properly and kswapd was able to do its job with fewer stalls.
      
      This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
      for direct reclaimers.  Instead of making the congestion_wait() on the
      SYNC queue which would only fix a particular type of workload, this patch
      adds a third type of congestion_wait - BLK_RW_BOTH which first waits on
      the ASYNC and then the SYNC queue if the timeout has not been reached.  In
      tests, this counter-intuitively results in kswapd stalling less and
      freeing up pages resulting in fewer allocation failures and fewer
      direct-reclaim-orientated stalls.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a6e043a4
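The BLK_RW_BOTH behaviour (wait on ASYNC first, then on SYNC if the timeout has not been used up) can be sketched as follows. The wait itself is replaced by a fake that simply charges a fixed number of ticks per queue; fake_wait(), wait_on_both() and the tick costs are all hypothetical and stand in for the real congestion wait.

```c
enum { BLK_RW_ASYNC = 0, BLK_RW_SYNC = 1 };

static int waits_issued;	/* demo bookkeeping only */

/*
 * Hypothetical stand-in for waiting on one queue: pretends the ASYNC
 * wait uses 3 ticks and the SYNC wait 4, and returns the unused part
 * of the timeout (0 if the timeout was used up).
 */
static long fake_wait(int queue, long timeout)
{
	long cost = (queue == BLK_RW_ASYNC) ? 3 : 4;

	waits_issued++;
	return timeout > cost ? timeout - cost : 0;
}

/* Wait on ASYNC first, then on SYNC with whatever time is left. */
static long wait_on_both(long timeout)
{
	long remaining = fake_wait(BLK_RW_ASYNC, timeout);

	if (remaining > 0)
		remaining = fake_wait(BLK_RW_SYNC, remaining);
	return remaining;
}
```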
    • ERROR: code indent should use tabs where possible · 0f295e12
      Andrew Morton authored
      #99: FILE: mm/oom_kill.c:209:
      + ^I * to kill current.We have to random task kill in this case.$
      
      ERROR: code indent should use tabs where possible
      #100: FILE: mm/oom_kill.c:210:
      + ^I * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.$
      
      ERROR: code indent should use tabs where possible
      #101: FILE: mm/oom_kill.c:211:
      + ^I */$
      
      ERROR: code indent should use tabs where possible
      #107: FILE: mm/oom_kill.c:216:
      + ^I * The nodemask here is a nodemask passed to alloc_pages(). Now,$
      
      ERROR: code indent should use tabs where possible
      #108: FILE: mm/oom_kill.c:217:
      + ^I * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy$
      
      ERROR: code indent should use tabs where possible
      #109: FILE: mm/oom_kill.c:218:
      + ^I * feature. mempolicy is an only user of nodemask here.$
      
      ERROR: code indent should use tabs where possible
      #111: FILE: mm/oom_kill.c:220:
      + ^I */$
      
      ERROR: code indent should use tabs where possible
      #169: FILE: mm/page_alloc.c:1672:
      +^I ^I* GFP_THISNODE contains __GFP_NORETRY and we never hit this.$
      
      ERROR: code indent should use tabs where possible
      #170: FILE: mm/page_alloc.c:1673:
      +^I ^I* Sanity check for bare calls of __GFP_THISNODE, not real OOM.$
      
      ERROR: code indent should use tabs where possible
      #171: FILE: mm/page_alloc.c:1674:
      +^I ^I* The caller should handle page allocation failure by itself if$
      
      ERROR: code indent should use tabs where possible
      #172: FILE: mm/page_alloc.c:1675:
      +^I ^I* it specifies __GFP_THISNODE.$
      
      ERROR: code indent should use tabs where possible
      #173: FILE: mm/page_alloc.c:1676:
      +^I ^I* Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.$
      
      ERROR: code indent should use tabs where possible
      #174: FILE: mm/page_alloc.c:1677:
      +^I ^I*/$
      
      total: 13 errors, 0 warnings, 125 lines checked
      
      ./patches/oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0f295e12
    • Fix node-oriented allocation handling in oom-kill.c.  I myself think of this · 96a826cc
      KAMEZAWA Hiroyuki authored
      as a bugfix, not as an enhancement.
      
      These days, things have changed:
        - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
        - mempolicy doesn't maintain its own private zonelists.
        (And cpuset doesn't use nodemask for __alloc_pages_nodemask())
      
      So, the current oom-killer's check function is wrong.
      
      This patch does
        - check nodemask, if nodemask && nodemask doesn't cover all
          node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
        - Scan all zonelist under nodemask, if it hits cpuset's wall
          this failure is from cpuset.
      And
        - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
          This doesn't change "current" behavior. If a caller uses __GFP_THISNODE,
          it should handle "page allocation failure" by itself.
      
        - handle __GFP_NOFAIL+__GFP_THISNODE path.
          This is something like a FIXME but this gfpmask is not used now.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96a826cc
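The first check listed in this commit message (a nodemask that does not cover every node with memory means a mempolicy constraint) can be sketched with plain bitmasks. This is a simplification: real nodemasks are not single words, the function name is hypothetical, and the cpuset zonelist scan is left out entirely.

```c
#include <stddef.h>

enum oom_constraint { CONSTRAINT_NONE, CONSTRAINT_MEMORY_POLICY };

/*
 * nodemask:          bitmask passed with the allocation, or NULL
 * nodes_with_memory: bitmask standing in for node_states[N_HIGH_MEMORY]
 */
static enum oom_constraint
constraint_from_nodemask(const unsigned long *nodemask,
			 unsigned long nodes_with_memory)
{
	/* A nodemask was given and it leaves out some node that has memory. */
	if (nodemask && (nodes_with_memory & ~*nodemask) != 0)
		return CONSTRAINT_MEMORY_POLICY;
	return CONSTRAINT_NONE;
}
```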
  4. 11 Nov, 2009 1 commit
    • fix race, add pid & comm to message · 7007f322
      David Rientjes authored
      On Tue, 10 Nov 2009, akpm@linux-foundation.org wrote:
      
      > diff -puN mm/oom_kill.c~oom-kill-show-virtual-size-and-rss-information-of-the-killed-process mm/oom_kill.c
      > --- a/mm/oom_kill.c~oom-kill-show-virtual-size-and-rss-information-of-the-killed-process
      > +++ a/mm/oom_kill.c
      > @@ -352,6 +352,8 @@ static void dump_header(gfp_t gfp_mask,
      >  		dump_tasks(mem);
      >  }
      >
      > +#define K(x) ((x) << (PAGE_SHIFT-10))
      > +
      >  /*
      >   * Send SIGKILL to the selected  process irrespective of  CAP_SYS_RAW_IO
      >   * flag though it's unlikely that  we select a process with CAP_SYS_RAW_IO
      > @@ -371,9 +373,16 @@ static void __oom_kill_task(struct task_
      >  		return;
      >  	}
      >
      > -	if (verbose)
      > -		printk(KERN_ERR "Killed process %d (%s)\n",
      > -				task_pid_nr(p), p->comm);
      > +	if (verbose) {
      > +		task_lock(p);
      > +		printk(KERN_ERR "Killed process %d (%s) "
      > +		       "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
      > +		       task_pid_nr(p), p->comm,
      > +		       K(p->mm->total_vm),
      > +		       K(get_mm_counter(p->mm, anon_rss)),
      > +		       K(get_mm_counter(p->mm, file_rss)));
      > +		task_unlock(p);
      > +	}
      >
      >  	/*
      >  	 * We give our sacrificial lamb high priority and access to
      
      There's a race there which can dereference a NULL p->mm.
      
      p->mm is protected by task_lock(), but there's no check added here that
      ensures p->mm is still valid.  The previous check for !p->mm in
      __oom_kill_task() is not protected by task_lock(), so there's a race:
      
      	select_bad_process()
      	oom_kill_process(p)
      					do_exit()
      					exit_signals(p) /* PF_EXITING */
      	oom_kill_task(p)
      	__oom_kill_task(p)
      					exit_mm(p)
      					task_lock(p)
      					p->mm = NULL
      					task_unlock(p)
      	printk() of p->mm->total_vm
      
      Please merge this as a fix.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7007f322
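One way to close a race of this shape is to test p->mm again while holding the lock that protects it, and only then read the counters. The sketch below is a single-threaded user-space illustration, not the merged fix: struct task and struct mm are simplified stand-ins, task_lock()/task_unlock() are no-op stubs, format_kill_message() is a hypothetical helper, and PAGE_SHIFT is assumed to be 12 (4 kB pages). The K() macro is the one from the quoted diff.

```c
#include <stddef.h>
#include <stdio.h>

#define PAGE_SHIFT 12				/* assumption: 4 kB pages */
#define K(x) ((x) << (PAGE_SHIFT - 10))		/* pages -> kB */

/* Simplified stand-ins for the kernel structures. */
struct mm { unsigned long total_vm, anon_rss, file_rss; };
struct task { int pid; const char *comm; struct mm *mm; };

/* No-op stubs; in the kernel these take and release the task's lock. */
static void task_lock(struct task *p)   { (void)p; }
static void task_unlock(struct task *p) { (void)p; }

/*
 * Format the "Killed process" line. Returns the snprintf() result, or 0
 * if the task no longer has an mm, in which case buf is left untouched.
 */
static int format_kill_message(struct task *p, char *buf, size_t len)
{
	int written = 0;

	task_lock(p);
	if (p->mm)	/* re-check under the lock: the task may have exited */
		written = snprintf(buf, len,
			"Killed process %d (%s) "
			"vsz:%lukB, anon-rss:%lukB, file-rss:%lukB",
			p->pid, p->comm, K(p->mm->total_vm),
			K(p->mm->anon_rss), K(p->mm->file_rss));
	task_unlock(p);
	return written;
}
```

With PAGE_SHIFT at 12, K(x) is simply x << 2, so each page counts as 4 kB.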
  5. 10 Nov, 2009 1 commit
    • In a typical oom analysis scenario, we frequently want to know whether the · fa8680c6
      KOSAKI Motohiro authored
      killed process has a memory leak or not at the first step.  This patch
      adds vsz and rss information to the oom log to help this analysis and to
      save debugging time.
      
      example:
      ===================================================================
      rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
      Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
      Call Trace:
      [<ffffffff8132e35b>] ? _spin_unlock+0x2b/0x40
      [<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0
      
      (snip)
      
      492283 pages non-shared
      Out of memory: kill process 2341 (memhog) score 527276 or a child
      Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
      ===========================================================================
                                   ^
                                   |
                                  here
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa8680c6
  6. 31 Oct, 2009 1 commit
    • It's reported that OOM-Killer kills Gnome/KDE first. And yes, we can · 4a3d5e53
      KAMEZAWA Hiroyuki authored
      reproduce it easily.
      
      Now, oom-killer uses mm->total_vm as its base value.  But in recent
      applications, there is a big gap between VM size and RSS size, because
      
        - Applications attach many dynamic libraries. (Gnome, KDE, etc...)
        - Applications may allocate a big VM area but use only a small part of it.
          (Java and multi-threaded applications have this tendency because
           of the default stack size.)
      
      I think using mm->total_vm as the score for oom-kill is not good.  For the same
      reason, overcommit memory can't work as expected.  (In other words, if we
      depend on total_vm, using overcommit more actively is a good choice.)
      
      This patch uses mm->anon_rss/file_rss as base value for calculating badness.
      
      The following are the changes to the OOM score (badness) in an environment with
      1.6G of memory plus memory eaters (500M & 1G).
      
      Top 10 badness scores (the highest one is the first candidate to be killed):
      Before
      badness program
      91228	gnome-settings-
      94210	clock-applet
      103202	mixer_applet2
      106563	tomboy
      112947	gnome-terminal
      128944	mmap              <----------- 500M malloc
      129332	nautilus
      215476	bash              <----------- parent of 2 mallocs.
      256944	mmap              <----------- 1G malloc
      423586	gnome-session
      
      After
      badness
      1911	mixer_applet2
      1955	clock-applet
      1986	xinit
      1989	gnome-session
      2293	nautilus
      2955	gnome-terminal
      4113	tomboy
      104163	mmap             <----------- 500M malloc.
      168577	bash             <----------- parent of 2 mallocs
      232375	mmap             <----------- 1G malloc
      
      Seems good to me.  Maybe we can tweak this patch more, but this one will
      be a good starting point.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4a3d5e53
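The difference between the two base values can be sketched as below. This assumes the new base is the sum of the anon and file RSS counters, and it ignores every later adjustment the real badness calculation applies; the struct and function names are hypothetical.

```c
/* Simplified per-process counters, all in pages. */
struct mm_counts {
	unsigned long total_vm;
	unsigned long anon_rss;
	unsigned long file_rss;
};

/* Old base value: everything the process has mapped. */
static unsigned long badness_base_total_vm(const struct mm_counts *mm)
{
	return mm->total_vm;
}

/* New base value (assumed sum): what the process actually has resident. */
static unsigned long badness_base_rss(const struct mm_counts *mm)
{
	return mm->anon_rss + mm->file_rss;
}
```

A process that maps many libraries but touches little memory scores high under the first function and low under the second, which is the reordering the Before and After tables show.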
  7. 03 Nov, 2009 2 commits
  8. 16 Oct, 2009 2 commits
  9. 15 Oct, 2009 1 commit
  10. 09 Nov, 2009 5 commits
  11. 03 Nov, 2009 2 commits
    • Sorry, just noticed what the diff contexts don't show: Jiri's patch is · 16f80590
      Hugh Dickins authored
      initializing p->first_swap_extent.list at a point before p has been
      decided - we may kfree that newly allocated p and go on to reuse an
      existing free entry for p.
      
      Now, the patch is not actually wrong: an existing free entry will have a
      good empty first_swap_extent.list; but it looks suspicious, it seems
      strange to initialize a field in something we're about to kfree, and I'd
      rather we put that initialization back to where it was in 2.6.32.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16f80590
    • Double swapon on a device causes a crash: · e9032c7e
      Jiri Slaby authored
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff810af160>] sys_swapon+0x1f0/0xc60
      PGD 1dc0b067 PUD 1dc09067 PMD 0
      Oops: 0000 [#1] SMP
      last sysfs file:
      CPU 1
      Modules linked in:
      Pid: 562, comm: swapon Tainted: G        W  2.6.32-rc5-mm1_64 #867
      RIP: 0010:[<ffffffff810af160>]  [<ffffffff810af160>] sys_swapon+0x1f0/0xc60
      ...
      
      It is due to swap_info_struct->first_swap_extent.list not being
      initialized. ->next is NULL in such a situation and
      destroy_swap_extents fails to iterate over the list with the BUG
      above.
      
      Introduced by swap_info-include-first_swap_extent.patch. Revert the
      INIT_LIST_HEAD move.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e9032c7e
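Why an uninitialized list head breaks iteration can be seen in a minimal user-space re-creation of a circular doubly linked list. This is a sketch in the style of the kernel's list API, not the kernel header itself; list_count() and list_is_initialized() are hypothetical helpers added for the demonstration.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/*
 * Count entries by walking until we are back at the head.
 * Only valid on an initialized head: with next == NULL the very first
 * step would dereference a NULL pointer, as in the oops above.
 */
static int list_count(const struct list_head *head)
{
	int n = 0;

	for (const struct list_head *pos = head->next; pos != head;
	     pos = pos->next)
		n++;
	return n;
}

/* Demo-only check that the head has been set up. */
static int list_is_initialized(const struct list_head *head)
{
	return head->next != NULL && head->prev != NULL;
}
```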
  12. 09 Nov, 2009 3 commits
  13. 13 Oct, 2009 2 commits
    • - avoid wasting more precious resources (DMA or DMA32 pools), when · 4704daff
      Jan Beulich authored
        being called through vmalloc_32{,_user}()
      - explicitly allow using high memory here even if the outer allocation
        request doesn't allow it
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4704daff
    • Objects passed to NODEMASK_ALLOC() are relatively small in size and are · 111cdcfa
      David Rientjes authored
      backed by slab caches that are not of large order, traditionally never
      greater than PAGE_ALLOC_COSTLY_ORDER.
      
      Thus, using GFP_KERNEL for these allocations on large machines when
      CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
      the allocation attempt, each time invoking either direct reclaim or the oom
      killer.
      
      This is of particular interest when using NODEMASK_ALLOC() from a
      mempolicy context (either directly in mm/mempolicy.c or the mempolicy
      constrained hugetlb allocations) since the oom killer always kills current
      when allocations are constrained by mempolicies.  So for all present use
      cases in the kernel, current would end up being oom killed when direct
      reclaim fails.  That would allow the NODEMASK_ALLOC() to succeed but
      current would have sacrificed itself upon returning.
      
      This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
      CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations. 
      All current use cases either directly from hugetlb code or indirectly via
      NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
      killer when the slab allocator needs to allocate additional pages.
      
      The side-effect of this change is that all current use cases of either
      NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
      when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).  All
      current use cases were audited and do have appropriate error handling at
      this time.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      111cdcfa
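The shape of an allocation macro that takes flags only on large configurations can be sketched in user space. Everything here is an analogy under stated assumptions: malloc() stands in for kmalloc(), demo_alloc() and DEMO_GFP_NORETRY are hypothetical, NODES_SHIFT is fixed at 10 for the demo, and the macro bodies are an illustration rather than the kernel's definitions.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NODES_SHIFT 10		/* assumption: a "large machine" config */
#define DEMO_GFP_NORETRY 0x1u	/* hypothetical flag, ignored by the demo */

typedef struct {
	unsigned long bits[(1 << NODES_SHIFT) / (8 * sizeof(unsigned long))];
} nodemask_t;

/* Stand-in for kmalloc(): the flags are accepted but not interpreted. */
static void *demo_alloc(size_t size, unsigned int flags)
{
	(void)flags;
	return malloc(size);
}

#if NODES_SHIFT > 8
/* Large masks: allocate dynamically, so the result may be NULL. */
#define NODEMASK_ALLOC(type, name, flags) \
	type *name = demo_alloc(sizeof(*name), (flags))
#define NODEMASK_FREE(name) free(name)
#else
/* Small masks: live on the stack, the flags argument is a no-op. */
#define NODEMASK_ALLOC(type, name, flags) type _##name, *name = &_##name
#define NODEMASK_FREE(name) do { } while (0)
#endif

/* A caller has to handle allocation failure, here by returning -ENOMEM. */
static int use_scratch_mask(unsigned int flags)
{
	int ret;
	NODEMASK_ALLOC(nodemask_t, mask, flags);

	if (!mask)
		return -ENOMEM;
	memset(mask, 0, sizeof(*mask));
	mask->bits[0] = 1UL;			/* pretend node 0 is selected */
	ret = (int)(mask->bits[0] & 1UL);
	NODEMASK_FREE(mask);
	return ret;
}
```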
  14. 03 Nov, 2009 1 commit
  15. 13 Oct, 2009 2 commits
    • Offload the registration and unregistration of per node hstate sysfs · d95a3cf7
      Lee Schermerhorn authored
      attributes to a worker thread rather than attempt the
      allocation/attachment or detachment/freeing of the attributes in the
      context of the memory hotplug handler.
      
      I don't know that this is absolutely required, but the registration can
      sleep in allocations and other mem hot plug handlers do it this way.  If
      it turns out this is NOT required, we can drop this patch.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Lee Schermerhorn authored · 10839393
      Register per node hstate attributes only for nodes with memory, as
      suggested by David Rientjes.
      
      With Memory Hotplug, memory can be added to a memoryless node and a node
      with memory can become memoryless.  Therefore, add a memory on/off-line
      notifier callback to [un]register a node's attributes on transition
      to/from memoryless state.
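
The transition handling described above can be modelled in user space. The following is a minimal sketch, not the kernel implementation: the class `NodeAttrRegistry`, its methods, and the page-count bookkeeping are hypothetical names chosen for illustration. It only shows the rule that attributes are registered when a node first gains memory and unregistered when its last memory goes offline.

```python
from typing import Dict, Set


class NodeAttrRegistry:
    """Toy model: which nodes currently have hstate attributes registered."""

    def __init__(self) -> None:
        self.registered: Set[int] = set()          # nodes with attributes
        self.present_pages: Dict[int, int] = {}    # hypothetical per-node page count

    def memory_online(self, nid: int, pages: int) -> None:
        """Memory came online on node nid; register on memoryless -> has memory."""
        was_memoryless = self.present_pages.get(nid, 0) == 0
        self.present_pages[nid] = self.present_pages.get(nid, 0) + pages
        if was_memoryless and self.present_pages[nid] > 0:
            self.registered.add(nid)

    def memory_offline(self, nid: int, pages: int) -> None:
        """Memory went offline on node nid; unregister when the node is memoryless."""
        remaining = max(self.present_pages.get(nid, 0) - pages, 0)
        self.present_pages[nid] = remaining
        if remaining == 0:
            self.registered.discard(nid)
```

Only the two edge transitions change the registered set; onlining more memory on a node that already has some, or offlining part of it, leaves the attributes alone.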
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  16. 09 Nov, 2009 1 commit
    • David Rientjes authored · 80b3efc8
      When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
      there are no present pages left.
      
      In such a situation, kswapd must also be stopped since it has nothing left
      to do.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  17. 13 Oct, 2009 3 commits
    • Lee Schermerhorn authored · d8d062d7
      Register per node hstate sysfs attributes only for nodes with memory.
      Global replacement of "all online nodes" with "all nodes with memory" in
      mm/hugetlb.c.  Suggested by David Rientjes.
      
      A subsequent patch will handle adding/removing of per node hstate sysfs
      attributes when nodes transition to/from memoryless state via memory
      hotplug.
      
      NOTE: this patch has not been tested with memoryless nodes.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Lee Schermerhorn authored · ca9680b9
      Update the kernel huge tlb documentation to describe the NUMA memory
      policy based huge page management.  Additionally, the patch includes a fair
      amount of rework to improve consistency, eliminate duplication and set the
      context for documenting the memory policy interaction.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Lee Schermerhorn authored · c3971e64
      Add the per huge page size control/query attributes to the per node
      sysdevs:
      
      /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
      	nr_hugepages       - r/w
      	free_hugepages     - r/o
      	surplus_hugepages  - r/o
      
      The patch attempts to re-use/share as much of the existing global hstate
      attribute initialization and handling, and the "nodes_allowed" constraint
      processing as possible.
      
      Calling set_max_huge_pages() with no node indicates a change to global
      hstate parameters.  In this case, any non-default task mempolicy will be
      used to generate the nodes_allowed mask.  A valid node id indicates an
      update to that node's hstate parameters, and the count argument specifies
      the target count for the specified node.  From this info, we compute the
      target global count for the hstate and construct a nodes_allowed node mask
      containing only the specified node.
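
The per node case described above can be sketched as a small calculation. This is a hedged illustration, not the kernel code: the function name `per_node_request` and its parameters are hypothetical, and the arithmetic (replace the node's current contribution to the global total with the requested count) is an assumption drawn from the description.

```python
from typing import Dict, Set, Tuple


def per_node_request(nid: int, count: int, nr_huge_pages: int,
                     nr_huge_pages_node: Dict[int, int]) -> Tuple[int, Set[int]]:
    """Translate a per node nr_hugepages write into a global request.

    nid                 node whose attribute was written
    count               target number of huge pages on that node
    nr_huge_pages       current global total for this hstate
    nr_huge_pages_node  current per node totals for this hstate
    """
    # Assumed rule: swap the node's current share of the global total
    # for the requested per node count.
    global_target = nr_huge_pages - nr_huge_pages_node[nid] + count
    # Only the specified node may be used to satisfy the change.
    nodes_allowed = {nid}
    return global_target, nodes_allowed
```

With an empty pool, `per_node_request(2, 16, 0, {0: 0, 1: 0, 2: 0, 3: 0})` gives a global target of 16 restricted to node 2.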
      
      Setting the node specific nr_hugepages via the per node attribute
      effectively ignores any task mempolicy or cpuset constraints.
      
      With this patch:
      
      (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
      ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
      
      Starting from:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     0
      Node 2 HugePages_Free:      0
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 0
      
      Allocate 16 persistent huge pages on node 2:
      (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
      
      [Note that this is equivalent to:
      	numactl -m 2 hugeadm --pool-pages-min 2M:+16
      ]
      
      Yields:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:    16
      Node 2 HugePages_Free:     16
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 16
      
      Global controls work as expected--reduce pool to 8 persistent huge pages:
      (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     8
      Node 2 HugePages_Free:      8
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
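
The per node counts shown above can also be collected by reading the new attribute files directly. The sketch below assumes the directory layout and attribute names listed in this message; the helper name `read_node_hugepages` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Dict

# Attribute file names as shown in the ls output above.
ATTRS = ("nr_hugepages", "free_hugepages", "surplus_hugepages")


def read_node_hugepages(node_root: Path,
                        size_dir: str = "hugepages-2048kB") -> Dict[str, Dict[str, int]]:
    """Return {node name: {attribute: value}} for nodes exposing the attributes."""
    result: Dict[str, Dict[str, int]] = {}
    for node_dir in sorted(node_root.glob("node[0-9]*")):
        attr_dir = node_dir / "hugepages" / size_dir
        if not attr_dir.is_dir():
            # Node has no per node hstate attributes; skip it.
            continue
        result[node_dir.name] = {
            name: int((attr_dir / name).read_text().strip()) for name in ATTRS
        }
    return result
```

On a system with this patch applied, it would be called as `read_node_hugepages(Path("/sys/devices/system/node"))`.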
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>