- 12 Nov, 2009 16 commits
-
-
KOSAKI Motohiro authored
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
KOSAKI Motohiro authored
Then we can remove it entirely.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
KOSAKI Motohiro authored
sc.swap_cluster_max misuse. A huge sc.swap_cluster_max creates unnecessary OOM risk and brings no performance benefit. Now we can stop this insanity.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
KOSAKI Motohiro authored
memory shrinker) for hibernate performance improvement. And sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing memory for suspend), which reported:

Without the patch:
Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
Freed 88563 pages in 14719 jiffies = 23.50 MB/s
Freed 205734 pages in 32389 jiffies = 24.81 MB/s

With the patch:
Freed 68252 pages in 496 jiffies = 537.52 MB/s
Freed 116464 pages in 569 jiffies = 798.54 MB/s
Freed 209699 pages in 705 jiffies = 1161.89 MB/s

At that time, the patch was well worth it. However, modern hardware trends and recent VM improvements have eroded its worth. For several reasons, I think we should remove shrink_all_zones() altogether.

Details:

1) In the old days, shrink_zone()'s slowness was mainly caused by stupid io-throttling under no I/O congestion; the current shrink_zone() is sane and not slow.

2) shrink_all_zones() tries to shrink all pages at a time, but that doesn't work well on NUMA systems. Example: a system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.
   optimal: steal 500MB from each node.
   shrink_all_zones: steal 1GB from node 0. Oh, the cache-balancing logic is broken. ;)
   Unfortunately, desktop systems have moved to NUMA nowadays. (Side note: if hibernation required 2GB, shrink_all_zones() could never succeed on the machine above.)

3) If a node has several in-flight I/O pages, shrink_all_zones() produces a pretty bad result. Scenario: hibernation needs 1GB.
   1) shrink_all_zones() tries to reclaim 1GB from node 0,
   2) but it only reclaims 990MB;
   3) it then stupidly tries to reclaim 1GB from node 1,
   4) and reclaims 990MB there.
   Oh well, it reclaimed twice as much as required. The current shrink_zone(), on the other hand, has sane bail-out logic, so it doesn't over-reclaim; with it, we lose shrink_all_zones()'s risk.

4) The SplitLRU VM always keeps the active/inactive ratio very carefully. Shrinking only the inactive list breaks that assumption and creates unnecessary OOM risk; it is obviously suboptimal.

Now shrink_all_memory() is only a wrapper around do_try_to_free_pages(). That brings good reviewability and debuggability, and solves the above problems.

Side note: unifying the reclaim logic has two good side effects.
- It fixes a recursive-reclaim bug in shrink_all_memory(): it forgot to use PF_MEMALLOC, meaning the system could get stuck in a deadlock.
- shrink_all_memory() now has lockdep awareness, which brings good debuggability.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
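To make point 3) concrete, here is a toy userspace C sketch of the over-reclaim arithmetic, assuming two nodes that can each give back at most 990MB per pass; all names and numbers are invented for illustration, not kernel code:

#include <stdio.h>

#define NR_NODES 2

/* Stub: pretend each node can give back at most 990 MB per pass. */
static long reclaim_from_node(int node, long want_mb)
{
    long got = want_mb < 990 ? want_mb : 990;
    printf("node %d: reclaimed %ld MB\n", node, got);
    return got;
}

int main(void)
{
    long target_mb = 1024, freed;

    /* Old shrink_all_zones()-style behavior: ask every node for the
     * full target, ignoring what earlier nodes already gave back. */
    freed = 0;
    for (int n = 0; n < NR_NODES; n++)
        freed += reclaim_from_node(n, target_mb);
    printf("no bail-out: freed %ld MB for a %ld MB target\n\n", freed, target_mb);

    /* Sane bail-out: stop as soon as the global target is met. */
    freed = 0;
    for (int n = 0; n < NR_NODES && freed < target_mb; n++)
        freed += reclaim_from_node(n, target_mb - freed);
    printf("with bail-out: freed %ld MB for a %ld MB target\n", freed, target_mb);
    return 0;
}

The bail-out loop frees roughly what was asked for (990MB + 34MB); the loop without it frees nearly double the target.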
-
KOSAKI Motohiro authored
1) the reclaim batch size, passed as isolate_lru_pages()'s argument, and 2) the reclaim bail-out threshold. The two meanings are quite unrelated, so let's separate them. This patch doesn't change any behavior.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alex Chiang authored
Signed-off-by: Alex Chiang <achiang@hp.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alex Chiang authored
/sys/devices/system/node/node#/ However, it's not convenient to go in the other direction, when looking at /sys/devices/system/cpu/cpu#/ Yes, you can muck about in sysfs, but adding these symlinks makes life a lot more convenient.

Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alex Chiang authored
interesting code by two levels. No functional change.

Signed-off-by: Alex Chiang <achiang@hp.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alex Chiang authored
interesting code by one level. No functional change.

Signed-off-by: Alex Chiang <achiang@hp.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Alex Chiang authored
symlinks in sysfs) created symlinks from nodes to memory sections, e.g.

/sys/devices/system/node/node1/memory135 -> ../../memory/memory135

If you're examining the memory section though and are wondering what node it might belong to, you can find it by grovelling around in sysfs, but it's a little cumbersome.

Add a reverse symlink for each memory section that points back to the node to which it belongs.

Signed-off-by: Alex Chiang <achiang@hp.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
corrupted for it to have been called, it does print_bad_pte() and returns ... VM_FAULT_OOM, which is hard to understand.

It made some sense when I did it for 2.6.15, when do_page_fault() just killed the current process; but nowadays it lets the OOM killer decide who to kill - so page table corruption in one process would be liable to kill another.

Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that the process will be killed, but is good enough for such a rare abnormality, accompanied as it is by the "BUG: Bad page map" message.

And recent HWPOISON work has copied that code into do_swap_page(), when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.

When 2.6.15 placed the split page table lock inside struct page (usually sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by letting the spinlock_t occupy both page->private and page->mapping).

Should these debugging options be allowed to double the size of a struct page, when only one minority use of the page (as a page table) needs to fit a spinlock in there? Perhaps not.

Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer, kmallocing a cacheline for the spinlock when it doesn't fit, but given up each time. Falling back to mm->page_table_lock (as we do when ptlock is not split) lets lockdep check out the strictest path anyway.

And now that some arches allow 8192 cpus, use 999999 for infinity.

(What has this got to do with KSM swapping? It doesn't care about the size of struct page, but may care about random junk in page->mapping - to be explained separately later.)

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
should look. It could hack page->index to get them to do what it wants, but it seems cleaner now to pass the address down to them.

Make the same change to page_mkclean_one(), since it follows the same pattern; but there's no real need in its case.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only built when CONFIG_MMU, so don't need such conditions at all.

Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from 169 defconfigs: leave those to evolve in due course.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping. Simplify it by moving it all down into try_to_unmap_one().

One thing is then lost, try_to_munlock()'s distinction between when no vma holds the page mlocked, and when a vma does mlock it, but we could not get mmap_sem to set the page flag. But its only caller takes no interest in that distinction (and is better testing SWAP_MLOCK anyway), so let's keep the code simple and return SWAP_AGAIN for both cases.

try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly amusing: once unravelled, it turns out to have been choosing between two different ways of doing the same nothing. Ah, no, one way was actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
in page->mapping, with the higher bits a pointer to the anon_vma; and have defined PageKsm(page) as that with NULL anon_vma.

But KSM swapping will need to store a pointer there: so in preparation for that, now define PAGE_MAPPING_FLAGS as the low two bits, including PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some other use for the bit emerges).

Declare page_rmapping(page) to return the pointer part of page->mapping, and page_anon_vma(page) to return the anon_vma pointer when that's what it is. Use these in a few appropriate places: notably, unuse_vma() has been testing page->mapping, but is better to be testing page_anon_vma() (cases may be added in which flag bits are set without any pointer).

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
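The trick here, stashing flags in the low bits of an aligned pointer and masking them off to read it back, can be sketched outside the kernel. A minimal userspace C illustration (invented analogue names, not the kernel's own macros or structs):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAPPING_ANON 0x1UL   /* analogous to PAGE_MAPPING_ANON */
#define MAPPING_KSM  0x2UL   /* analogous to PAGE_MAPPING_KSM  */
#define MAPPING_FLAGS (MAPPING_ANON | MAPPING_KSM)

struct anon_vma { int dummy; };

/* Recover the pointer part, analogous to page_rmapping(). */
static void *rmapping(void *mapping)
{
    return (void *)((uintptr_t)mapping & ~MAPPING_FLAGS);
}

int main(void)
{
    static struct anon_vma av;   /* statics are suitably aligned */

    /* The tagging only works if the low bits are free to begin with. */
    assert(((uintptr_t)&av & MAPPING_FLAGS) == 0);

    void *mapping = (void *)((uintptr_t)&av | MAPPING_ANON | MAPPING_KSM);
    printf("flags: %#lx, pointer recovered: %s\n",
           (unsigned long)((uintptr_t)mapping & MAPPING_FLAGS),
           rmapping(mapping) == &av ? "yes" : "no");
    return 0;
}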
-
- 14 Nov, 2009 1 commit
-
-
KOSAKI Motohiro authored
Once the priority is higher, kswapd starts waiting on congestion. However, if the zone is below the min watermark then kswapd needs to continue working without delay as there is a danger of an increased rate of GFP_ATOMIC allocation failure.

This patch changes the conditions under which kswapd waits on congestion by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: Add stats to track how relevant the logic is]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 12 Nov, 2009 4 commits
-
-
Mel Gorman authored
event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there is a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for a sufficient length of time.

This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit very shortly after kswapd went to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Frans Pop <elendil@planet.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
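The two-stage idea can be sketched as plain userspace C, with the watermark test stubbed out and the wake-up-by-another-process case omitted; all names below are illustrative, not the kernel's:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Stub: in the kernel this would re-check the zone watermarks. */
static bool high_watermark_ok(void) { return true; }

static void short_nap(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void kswapd_try_to_sleep(void)
{
    /* Stage 1: doze for roughly HZ/10 worth of time. */
    short_nap(100);

    /* If the watermark no longer holds, the sleep was premature:
     * go back to reclaim work instead of sleeping fully. */
    if (!high_watermark_ok()) {
        printf("premature sleep, resume reclaim\n");
        return;
    }

    /* Stage 2: the watermark held through the nap, sleep fully. */
    printf("watermark held, full sleep\n");
}

int main(void) { kswapd_try_to_sleep(); return 0; }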
-
Mel Gorman authored
that the commits 373c0a7e 8aa7e847 dramatically increased the number of GFP_ATOMIC failures that were occurring within a wireless driver. Reverting this patch seemed to help a lot even though it was pointed out that the congestion changes were very far away from high-order atomic allocations.

The key to why the revert makes such a big difference is down to timing and how long direct reclaimers wait versus kswapd. With the patch reverted, the congestion_wait() is on the SYNC queue instead of the ASYNC. As a significant part of the workload involved reads, it makes sense that the SYNC list is what was truly congested, and with the revert, processes were waiting on congestion as expected. Hence, direct reclaimers stalled properly and kswapd was able to do its job with fewer stalls.

This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC for direct reclaimers. Instead of making the congestion_wait() on the SYNC queue, which would only fix a particular type of workload, this patch adds a third type of congestion_wait - BLK_RW_BOTH - which first waits on the ASYNC and then the SYNC queue if the timeout has not been reached. In tests, this counter-intuitively results in kswapd stalling less and freeing up pages, resulting in fewer allocation failures and fewer direct-reclaim-orientated stalls.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Frans Pop <elendil@planet.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
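A rough userspace C sketch of that BLK_RW_BOTH ordering, with a stubbed per-queue wait standing in for the real block-layer congestion machinery; every name below is invented for illustration:

#include <stdio.h>

/* Stub: wait for congestion to clear on one queue, up to 'timeout'
 * ticks; returns the ticks left over (0 if the wait ran out). */
static long wait_queue_congested(const char *queue, long timeout)
{
    long consumed = timeout / 2;          /* pretend half was used */
    printf("waited %ld ticks on %s\n", consumed, queue);
    return timeout - consumed;
}

/* The BLK_RW_BOTH idea: first the ASYNC queue, then, if any of the
 * timeout remains, the SYNC queue. */
static void congestion_wait_both(long timeout)
{
    long remaining = wait_queue_congested("ASYNC", timeout);
    if (remaining > 0)
        wait_queue_congested("SYNC", remaining);
}

int main(void) { congestion_wait_both(100); return 0; }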
-
Andrew Morton authored
#99: FILE: mm/oom_kill.c:209:
+ ^I * to kill current.We have to random task kill in this case.$

ERROR: code indent should use tabs where possible
#100: FILE: mm/oom_kill.c:210:
+ ^I * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.$

ERROR: code indent should use tabs where possible
#101: FILE: mm/oom_kill.c:211:
+ ^I */$

ERROR: code indent should use tabs where possible
#107: FILE: mm/oom_kill.c:216:
+ ^I * The nodemask here is a nodemask passed to alloc_pages(). Now,$

ERROR: code indent should use tabs where possible
#108: FILE: mm/oom_kill.c:217:
+ ^I * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy$

ERROR: code indent should use tabs where possible
#109: FILE: mm/oom_kill.c:218:
+ ^I * feature. mempolicy is an only user of nodemask here.$

ERROR: code indent should use tabs where possible
#111: FILE: mm/oom_kill.c:220:
+ ^I */$

ERROR: code indent should use tabs where possible
#169: FILE: mm/page_alloc.c:1672:
+^I ^I* GFP_THISNODE contains __GFP_NORETRY and we never hit this.$

ERROR: code indent should use tabs where possible
#170: FILE: mm/page_alloc.c:1673:
+^I ^I* Sanity check for bare calls of __GFP_THISNODE, not real OOM.$

ERROR: code indent should use tabs where possible
#171: FILE: mm/page_alloc.c:1674:
+^I ^I* The caller should handle page allocation failure by itself if$

ERROR: code indent should use tabs where possible
#172: FILE: mm/page_alloc.c:1675:
+^I ^I* it specifies __GFP_THISNODE.$

ERROR: code indent should use tabs where possible
#173: FILE: mm/page_alloc.c:1676:
+^I ^I* Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.$

ERROR: code indent should use tabs where possible
#174: FILE: mm/page_alloc.c:1677:
+^I ^I*/$

total: 13 errors, 0 warnings, 125 lines checked

./patches/oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
as a bugfix, not as an enhancement. Since then, things have changed:
- alloc_pages() takes a nodemask as its argument, via __alloc_pages_nodemask().
- mempolicy doesn't maintain its own private zonelists. (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

So the current oom-killer's check function is wrong. This patch:
- checks the nodemask: if a nodemask is passed and it doesn't cover all of node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY;
- scans all zonelists under the nodemask: if a zone hits the cpuset's wall, the failure is from the cpuset.

And it:
- modifies the caller of out_of_memory() not to invoke the oom-killer if __GFP_THISNODE. This doesn't change the current behavior: if callers use __GFP_THISNODE, they should handle "page allocation failure" by themselves;
- handles the __GFP_NOFAIL+__GFP_THISNODE path. This is something like a FIXME, but that gfpmask is not used now.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
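The constraint-detection order can be sketched with toy types, a plain bitmask standing in for nodemask_t and a stubbed cpuset-wall test; this is only an illustration of the check described above, not the kernel code:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

typedef unsigned int nodemask;  /* one bit per node, toy stand-in */
static const nodemask all_memory_nodes = (1u << NR_NODES) - 1;

enum constraint { CONSTRAINT_NONE, CONSTRAINT_MEMORY_POLICY, CONSTRAINT_CPUSET };

/* Stub standing in for the cpuset hardwall test on the zonelist. */
static bool zonelist_blocked_by_cpuset(void) { return false; }

static enum constraint detect_constraint(const nodemask *mask)
{
    /* A nodemask that doesn't cover every memory node can only come
     * from mempolicy, so the OOM is policy-constrained. */
    if (mask && (*mask & all_memory_nodes) != all_memory_nodes)
        return CONSTRAINT_MEMORY_POLICY;

    /* Otherwise scan the zonelist: hitting the cpuset wall means the
     * failure came from the cpuset. */
    if (zonelist_blocked_by_cpuset())
        return CONSTRAINT_CPUSET;

    return CONSTRAINT_NONE;
}

int main(void)
{
    nodemask policy_nodes = 0x3;  /* nodes 0-1 out of 0-3 */
    printf("constraint = %d\n", detect_constraint(&policy_nodes));
    return 0;
}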
-
- 11 Nov, 2009 1 commit
-
-
David Rientjes authored
On Tue, 10 Nov 2009, akpm@linux-foundation.org wrote:

> diff -puN mm/oom_kill.c~oom-kill-show-virtual-size-and-rss-information-of-the-killed-process mm/oom_kill.c
> --- a/mm/oom_kill.c~oom-kill-show-virtual-size-and-rss-information-of-the-killed-process
> +++ a/mm/oom_kill.c
> @@ -352,6 +352,8 @@ static void dump_header(gfp_t gfp_mask,
> 	dump_tasks(mem);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +
> /*
>  * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
>  * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
> @@ -371,9 +373,16 @@ static void __oom_kill_task(struct task_
> 		return;
> 	}
>
> -	if (verbose)
> -		printk(KERN_ERR "Killed process %d (%s)\n",
> -			task_pid_nr(p), p->comm);
> +	if (verbose) {
> +		task_lock(p);
> +		printk(KERN_ERR "Killed process %d (%s) "
> +			"vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> +			task_pid_nr(p), p->comm,
> +			K(p->mm->total_vm),
> +			K(get_mm_counter(p->mm, anon_rss)),
> +			K(get_mm_counter(p->mm, file_rss)));
> +		task_unlock(p);
> +	}
>
> 	/*
> 	 * We give our sacrificial lamb high priority and access to

There's a race there which can dereference a NULL p->mm.

p->mm is protected by task_lock(), but there's no check added here that ensures p->mm is still valid. The previous check for !p->mm in __oom_kill_task() is not protected by task_lock(), so there's a race:

select_bad_process()
oom_kill_process(p)
do_exit()
exit_signals(p) /* PF_EXITING */
oom_kill_task(p)
__oom_kill_task(p)
exit_mm(p)
task_lock(p)
p->mm = NULL
task_unlock(p)
printk() of p->mm->total_vm

Please merge this as a fix.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 10 Nov, 2009 1 commit
-
-
KOSAKI Motohiro authored
killed process has a memory leak or not at the first step. This patch adds vsz and rss information to the oom log to help that analysis and save debugging time.

example:
===================================================================
rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
Call Trace:
[<ffffffff8132e35b>] ?_spin_unlock+0x2b/0x40
[<ffffffff810f186e>] oom_kill_process+0xbe/0x2b0
(snip)
492283 pages non-shared
Out of memory: kill process 2341 (memhog) score 527276 or a child
Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
===========================================================================
                                             ^
                                             |
                                            here

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 30 Oct, 2009 1 commit
-
-
KAMEZAWA Hiroyuki authored
reproduce it easily. Now, the oom-killer uses mm->total_vm as its base value. But in recent applications, there is a big gap between VM size and RSS size, because:
- applications link many dynamic libraries (Gnome, KDE, etc.), and
- applications may allocate a big VM area but use only a small part of it (Java and multi-threaded applications have this tendency because of the default stack size).

I think using mm->total_vm as the score for oom-kill is not good. For the same reason, overcommit accounting can't work as expected. (In other words, if we depend on total_vm, using a more permissive overcommit mode is the better choice.)

This patch uses mm->anon_rss/file_rss as the base value for calculating badness. The following are the changes to the OOM score (badness) on an environment with 1.6GB of memory plus memory-eaters (500M and 1G). Top 10 badness scores (the highest one is the first candidate to be killed):

Before
badness   program
91228     gnome-settings-
94210     clock-applet
103202    mixer_applet2
106563    tomboy
112947    gnome-terminal
128944    mmap            <----------- 500M malloc
129332    nautilus
215476    bash            <----------- parent of 2 mallocs
256944    mmap            <----------- 1G malloc
423586    gnome-session

After
badness
1911      mixer_applet2
1955      clock-applet
1986      xinit
1989      gnome-session
2293      nautilus
2955      gnome-terminal
4113      tomboy
104163    mmap            <----------- 500M malloc
168577    bash            <----------- parent of 2 mallocs
232375    mmap            <----------- 1G malloc

This seems good to me. Maybe we can tweak this patch more, but it will be a good starting point.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
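A toy C sketch of the change of base value, with an invented struct standing in for the mm counters; the real badness() also folds in runtime, nice level and capabilities, all omitted here:

#include <stdio.h>

struct mm_stats {
    unsigned long total_vm;  /* pages mapped */
    unsigned long anon_rss;  /* anonymous pages resident */
    unsigned long file_rss;  /* file-backed pages resident */
};

/* Old base value: total mapped size, resident or not. */
static unsigned long badness_old(const struct mm_stats *mm)
{
    return mm->total_vm;
}

/* New base value: what the process actually has resident. */
static unsigned long badness_new(const struct mm_stats *mm)
{
    return mm->anon_rss + mm->file_rss;
}

int main(void)
{
    /* A Java-like process: huge VM, small RSS (numbers invented). */
    struct mm_stats java = { .total_vm = 400000, .anon_rss = 20000, .file_rss = 5000 };
    /* A memory hog: VM and RSS both large. */
    struct mm_stats hog  = { .total_vm = 260000, .anon_rss = 250000, .file_rss = 100 };

    printf("old: java=%lu hog=%lu\n", badness_old(&java), badness_old(&hog));
    printf("new: java=%lu hog=%lu\n", badness_new(&java), badness_new(&hog));
    return 0;
}

Under the old base value the Java-like process outscores the hog; under the new one the hog is the clear first candidate.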
-
- 03 Nov, 2009 2 commits
-
-
Huang Shijie authored
no need to check it.

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Huang Shijie authored
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 16 Oct, 2009 2 commits
-
-
Huang Shijie authored
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vincent Li authored
cleanups") removed generic_file_write() in filemap. Change the comment in vmscan pageout() to __generic_file_aio_write(). Signed-off-by: Vincent Li <macli@brc.ubc.ca> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 15 Oct, 2009 1 commit
-
-
Hugh Dickins authored
use of its cachelines: it's good for swap_duplicate() in particular if unsigned int max and swap_map are near the start.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 11 Nov, 2009 5 commits
-
-
Hugh Dickins authored
value to shmem/tmpfs swap pages: their swap counts are never incremented, and it helps swapoff's try_to_unuse() a little if it can immediately distinguish those pages from process pages.

Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED, we might as well use that 0xbf value for SWAP_MAP_SHMEM.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
swap page is inserted into another mm (when forking finds a swap entry in place of a pte, or when reclaim unmaps a pte to insert the swap entry).

swap_info_struct's vmalloc'ed swap_map is the array of these reference counts: but what happens when the unsigned short (or unsigned char since the preceding patch) is full? (and its high bit is kept for a cache flag) We then lose track of it, never freeing, leaving it in use until swapoff: at which point we _hope_ that a single pass will have found all instances, assume there are no more, and will lose user data if we're wrong.

Swapping of KSM pages has not yet been enabled; but it is implemented, and makes it very easy for a user to overflow the maximum swap count: possible with ordinary process pages, but unlikely, even when pid_max has been raised from PID_MAX_DEFAULT.

This patch implements swap count continuations: when the count overflows, a continuation page is allocated and linked to the original vmalloc'ed map page, and this used to hold the continuation counts for that entry and its neighbours. These continuation pages are seldom referenced: the common paths all work on the original swap_map, only referring to a continuation page when the low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
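The carry arithmetic is easier to see in miniature. A toy userspace C sketch, with a small low "digit" that overflows into a flat continuation array; in the kernel the continuation counts live in pages linked from the vmalloc'ed swap_map block, and all constants here are invented:

#include <stdio.h>

#define MAP_MAX   0x3f                /* low "digit" saturates here   */
#define CONTINUED 0x80                /* flag: also look in cont[]    */

static unsigned char swap_map[8];     /* low digits, one per entry    */
static unsigned long cont[8];         /* continuation ("high digits") */

static void swap_ref(int entry)
{
    if ((swap_map[entry] & MAP_MAX) == MAP_MAX) {
        cont[entry]++;                /* carry into the continuation  */
        swap_map[entry] = CONTINUED;  /* low digit wraps back to 0    */
    } else {
        swap_map[entry]++;
    }
}

static unsigned long swap_count(int entry)
{
    return cont[entry] * (MAP_MAX + 1) + (swap_map[entry] & MAP_MAX);
}

int main(void)
{
    for (int i = 0; i < 200; i++)
        swap_ref(0);
    printf("count = %lu\n", swap_count(0));  /* 200, despite 8-bit map */
    return 0;
}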
-
Hugh Dickins authored
chars: it's still very unusual to reach a swap count of 126, and the next patch allows it to be extended indefinitely.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
encode_swapmap() obscure what happens in the swap_map entry, just at those points where I need to understand it. Remove them, and pass more usable "usage" values to scan_swap_map(), swap_entry_free() and __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
block, remove extraneous whitespace and return, fix typo in a comment.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 03 Nov, 2009 2 commits
-
-
Hugh Dickins authored
initializing p->first_swap_extent.list at a point before p has been decided - we may kfree that newly allocated p and go on to reuse an existing free entry for p.

Now, the patch is not actually wrong: an existing free entry will have a good empty first_swap_extent.list; but it looks suspicious, it seems strange to initialize a field in something we're about to kfree, and I'd rather we put that initialization back to where it was in 2.6.32.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Jiri Slaby authored
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810af160>] sys_swapon+0x1f0/0xc60
PGD 1dc0b067 PUD 1dc09067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 562, comm: swapon Tainted: G        W  2.6.32-rc5-mm1_64 #867
RIP: 0010:[<ffffffff810af160>] [<ffffffff810af160>] sys_swapon+0x1f0/0xc60
...

It is due to swap_info_struct->first_swap_extent.list not being initialized. ->next is NULL in such a situation and destroy_swap_extents fails to iterate over the list with the BUG above.

Introduced by swap_info-include-first_swap_extent.patch. Revert the INIT_LIST_HEAD move.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 11 Nov, 2009 3 commits
-
-
Hugh Dickins authored
swap_info_struct, instead of just the list_head: swap partitions need only that one, and for others it's used as a circular list anyway.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Hugh Dickins authored
to reserve an array of about 30 of them in bss, when most people will want only one. Change swap_info[] to an array of pointers.

That does need a "type" field in the structure: pack it as a char with next type and short prio (aha, char is unsigned by default on PowerPC). Use the (admittedly peculiar) name "type" throughout for this index.

/proc/swaps does not take swap_lock: I wouldn't want it to, but do take care with barriers when adding a new item to the array (never removed).

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
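A toy sketch of the packing described above (an illustrative struct, not the kernel's): the index, the next type and the priority sit alongside each other, with 'signed char' spelled out because plain char is unsigned by default on PowerPC and 'next' must be able to hold -1:

#include <stdio.h>

#define MAX_SWAPFILES 32

/* Illustrative stand-in for swap_info_struct's header fields. */
struct swap_info {
    signed char type;   /* index into swap_info[] */
    signed char next;   /* next type in swap_list, -1 = none */
    short       prio;   /* swap priority */
    /* ... map, counts, extents ... */
};

/* An array of pointers, so each entry is allocated only when a swap
 * area is actually enabled, instead of ~30 structs sitting in bss. */
static struct swap_info *swap_info[MAX_SWAPFILES];

int main(void)
{
    static struct swap_info si = { .type = 0, .next = -1, .prio = -1 };
    swap_info[0] = &si;
    printf("entry %d: next=%d prio=%d (header %zu bytes)\n",
           si.type, si.next, si.prio, sizeof si);
    return 0;
}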
-
Hugh Dickins authored
one other in-tree user: get_swap_bio(). Adjust its interface to map_swap_page(), so that we can then remove get_swap_info_struct().

But there is a popular user out-of-tree, TuxOnIce: so leave the declaration of swap_info_struct in linux/swap.h.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nigel Cunningham <ncunningham@crca.org.au>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 13 Oct, 2009 1 commit
-
-
Jan Beulich authored
being called through vmalloc_32{,_user}() - explicitly allow using high memory here even if the outer allocation request doesn't allow it.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-