1. 28 Oct, 2009 1 commit
    • Andrew Lunn's avatar
      There is a race condition in um_request_irq(). The SIGIO handling is · 356968d7
      Andrew Lunn authored
      first enabled with the call to activate_fd().  The irq handler is then
      registered with request_irq().  What can happen is that directly after
      activate_fd() the SIGIO goes off and the IRQ source is disabled at the low
      level, the pollfd is set to -1.  Since no irq handler has yet been
      registered, the interrupt it left disabled at the low level.  The
      interrupt handler is then registered, but its too late, the interrupt
      source has been disabled at the lower level and is never re-enabled.  To
      fix this race condition i swapped the order.  First request_irq() then
      activate_fd() the interrupt sources.
      
      There is a second bug.  In 2.6.31 there was a change to the way __do_IRQ()
      and friends work for chained interrupt sources.  The old way was that all
      chained interrupt handlers were called.  The new way is that the chain is
      walked only until a handler returns IRQ_HANDLED or IRQ_WAKE_THREAD. 
      uml_net_interrupt() would always return IRQ_HANDLED, irrespective of if
      the device really did receive an interrupt or not.  This mean with the new
      code only the first device on a chained interrupt ever got its interrupts
      handled.  The second/third/...  device never got any interrupts processed.
       I changed uml_net_interrupt() to always return IRQ_NONE so that all
      handlers get called on a chained interrupt.
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      356968d7
  2. 30 Sep, 2009 1 commit
  3. 14 Oct, 2009 1 commit
    • Arjan van de Ven's avatar
      gcc is not convinced that the floppy.c ioctl has sufficient bound checks: · ca4665c7
      Arjan van de Ven authored
      In function `copy_from_user',
          inlined from `fd_copyin' at drivers/block/floppy.c:3080,
          inlined from `fd_ioctl' at drivers/block/floppy.c:3503:
      /home/arjan/linux/arch/x86/include/asm/uaccess_32.h:211:
      warning: call to `copy_from_user_overflow' declared with attribute
      warning: copy_from_user buffer size is not provably correct
      
      And frankly, as a human I have a hard time proving the same more or less
      (the size comes from the ioctl argument.  humpf.  maybe.  the code isn't
      very nice)
      
      This patch adds an explicit check to make 100% sure it's safe, better than
      finding out later that there indeed was a gap.
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ca4665c7
  4. 12 Oct, 2009 1 commit
  5. 23 Jul, 2009 1 commit
  6. 03 Nov, 2009 1 commit
  7. 17 Nov, 2009 1 commit
  8. 09 Nov, 2009 1 commit
  9. 17 Nov, 2009 1 commit
  10. 03 Nov, 2009 1 commit
  11. 14 Feb, 2009 2 commits
  12. 09 Nov, 2009 1 commit
  13. 17 Nov, 2009 1 commit
    • Larry Woodman's avatar
      hugetlb_fault() takes the mm->page_table_lock spinlock then calls · 7147153d
      Larry Woodman authored
      hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to an
      insufficient huge page pool it calls unmap_ref_private() with the
      mm->page_table_lock held.  unmap_ref_private() then calls
      unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
      
      [<ffffffff810928c3>] print_circular_bug_tail+0x80/0x9f
       [<ffffffff8109280b>] ? check_noncircular+0xb0/0xe8
       [<ffffffff810935e0>] __lock_acquire+0x956/0xc0e
       [<ffffffff81093986>] lock_acquire+0xee/0x12e
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff814c348d>] _spin_lock+0x40/0x89
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111afee>] ? alloc_huge_page+0x218/0x318
       [<ffffffff8111a7a6>] unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111b2d0>] hugetlb_cow+0x1e2/0x3f4
       [<ffffffff8111b935>] ? hugetlb_fault+0x453/0x4f6
       [<ffffffff8111b962>] hugetlb_fault+0x480/0x4f6
       [<ffffffff8111baee>] follow_hugetlb_page+0x116/0x2d9
       [<ffffffff814c31a7>] ? _spin_unlock_irq+0x3a/0x5c
       [<ffffffff81107b4d>] __get_user_pages+0x2a3/0x427
       [<ffffffff81107d0f>] get_user_pages+0x3e/0x54
       [<ffffffff81040b8b>] get_user_pages_fast+0x170/0x1b5
       [<ffffffff81160352>] dio_get_page+0x64/0x14a
       [<ffffffff8116112a>] __blockdev_direct_IO+0x4b7/0xb31
       [<ffffffff8115ef91>] blkdev_direct_IO+0x58/0x6e
       [<ffffffff8115e0a4>] ? blkdev_get_blocks+0x0/0xb8
       [<ffffffff810ed2c5>] generic_file_aio_read+0xdd/0x528
       [<ffffffff81219da3>] ? avc_has_perm+0x66/0x8c
       [<ffffffff81132842>] do_sync_read+0xf5/0x146
       [<ffffffff8107da00>] ? autoremove_wake_function+0x0/0x5a
       [<ffffffff81211857>] ? security_file_permission+0x24/0x3a
       [<ffffffff81132fd8>] vfs_read+0xb5/0x126
       [<ffffffff81133f6b>] ? fget_light+0x5e/0xf8
       [<ffffffff81133131>] sys_read+0x54/0x8c
       [<ffffffff81011e42>] system_call_fastpath+0x16/0x1b
      
      This can be fixed by dropping the mm->page_table_lock around the call to
      unmap_ref_private() if alloc_huge_page() fails, its dropped right below in
      the normal path anyway.
      Signed-off-by: default avatarLarry Woodman <lwooman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7147153d
  14. 14 Nov, 2009 1 commit
  15. 13 Nov, 2009 6 commits
    • Hugh Dickins's avatar
      Add a pointer to the ksm page into struct stable_node, holding a reference · b7ceb250
      Hugh Dickins authored
      to the page while the node exists.  Put a pointer to the stable_node into
      the ksm page's ->mapping.
      
      Then we don't need get_ksm_page() while traversing the stable tree: the
      page to compare against is sure to be present and correct, even if it's no
      longer visible through any of its existing rmap_items.
      
      And we can handle the forked ksm page case more efficiently: no need to
      memcmp our way through the tree to find its match.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b7ceb250
    • Hugh Dickins's avatar
      Though we still do well to keep rmap_items in the unstable tree without a · 38e81a30
      Hugh Dickins authored
      separate tree_item at the node, for several reasons it becomes awkward to
      keep rmap_items in the stable tree without a separate stable_node: lack of
      space in the nicely-sized rmap_item, the need for an anchor as rmap_items
      are removed, the need for a node even when temporarily no rmap_items are
      attached to it.
      
      So declare struct stable_node (rb_node to place it in the tree and
      hlist_head for the rmap_items hanging off it), and convert stable tree
      handling to use it: without yet taking advantage of it.  Note how one
      stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
      two rmap_items being merged.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      38e81a30
    • Hugh Dickins's avatar
      Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a · d02561bf
      Hugh Dickins authored
      singly-linked list: we always traverse that list sequentially, and we
      don't even lose any prefetches (but should consider adding a few later). 
      Name it rmap_list throughout.
      
      Do we need to free up that pointer?  Not immediately, and in the end, we
      could continue to avoid it with a union; but having done the conversion,
      let's keep it this way, since there's no downside, and maybe we'll want
      more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
      and 64 bytes on 64-bit, so we shall want to avoid expanding it).
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d02561bf
    • Hugh Dickins's avatar
      Cleanup: make argument names more consistent from cmp_and_merge_page() · 93c5cded
      Hugh Dickins authored
      down to replace_page(), so that it's easier to follow the rmap_item's page
      and the matching tree_page and the merged kpage through that code.
      
      In some places, e.g.  break_cow(), pass rmap_item instead of separate mm
      and address.
      
      cmp_and_merge_page() initialize tree_page to NULL, to avoid a "may be used
      uninitialized" warning seen in one config by Anil SB.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      93c5cded
    • Hugh Dickins's avatar
      There is no need for replace_page() to calculate a write-protected prot · 68f51208
      Hugh Dickins authored
      vm_page_prot must already be write-protected for an anonymous page (see
      mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot).
      
      There is no need for try_to_merge_one_page() to get_page and put_page on
      newpage and oldpage: in every case we already hold a reference to each of
      them.
      
      But some instinct makes me move try_to_merge_one_page()'s unlock_page of
      oldpage down after replace_page(): that doesn't increase contention on the
      ksm page, and makes thinking about the transition easier.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      68f51208
    • Hugh Dickins's avatar
      1. remove_rmap_item_from_tree() is called as a precaution from · 5bfe68c0
      Hugh Dickins authored
         various places: don't dirty the rmap_item cacheline unnecessarily,
         just mask the flags out of the address when they have been set.
      
      2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
         then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
         from its tree: it's easier just to do both at once (but definitely keep
         the BUG_ON(age > 1) which guards against a future omission).
      
      3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
         tree, it does its own rb_erase() and accounting: that's better
         expressed by remove_rmap_item_from_tree().
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5bfe68c0
  16. 12 Nov, 2009 16 commits
    • KOSAKI Motohiro's avatar
      Fix small inconsistent of ">" and ">=". · 365ccf7e
      KOSAKI Motohiro authored
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      365ccf7e
    • KOSAKI Motohiro's avatar
      Now, All caller of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX. · 8cad761e
      KOSAKI Motohiro authored
      Then, we can remove it perfectly.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8cad761e
    • KOSAKI Motohiro's avatar
      In old days, we didn't have sc.nr_to_reclaim and it brought · c0ebe14e
      KOSAKI Motohiro authored
      sc.swap_cluster_max misuse.
      
      huge sc.swap_cluster_max might makes unnecessary OOM risk and no
      performance benefit.
      
      Now, we can stop its insane thing.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c0ebe14e
    • KOSAKI Motohiro's avatar
      shrink_all_zone() was introduced by commit d6277db4 (swsusp: rework · 27010999
      KOSAKI Motohiro authored
      memory shrinker) for hibernate performance improvement.  and
      sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
      memory for suspend).
      
      commit a06fe4d307 said
      
         Without the patch:
         Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
         Freed  88563 pages in 14719 jiffies = 23.50 MB/s
         Freed 205734 pages in 32389 jiffies = 24.81 MB/s
      
         With the patch:
         Freed  68252 pages in   496 jiffies = 537.52 MB/s
         Freed 116464 pages in   569 jiffies = 798.54 MB/s
         Freed 209699 pages in   705 jiffies = 1161.89 MB/s
      
      At that time, their patch was pretty worth.  However, Modern Hardware
      trend and recent VM improvement broke its worth.  From several reason, I
      think we should remove shrink_all_zones() at all.
      
      detail:
      
      1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
        at no i/o congestion.
        but current shrink_zone() is sane, not slow.
      
      2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
        fine on numa system.
        example)
          System has 4GB memory and each node have 2GB. and hibernate need 1GB.
      
          optimal)
             steal 500MB from each node.
          shrink_all_zones)
             steal 1GB from node-0.
      
        Oh, Cache balancing logic was broken. ;)
        Unfortunately, Desktop system moved ahead NUMA at nowadays.
        (Side note, if hibernate require 2GB, shrink_all_zones() never success
         on above machine)
      
      3) if the node has several I/O flighting pages, shrink_all_zones() makes
        pretty bad result.
      
        schenario) hibernate need 1GB
      
        1) shrink_all_zones() try to reclaim 1GB from Node-0
        2) but it only reclaimed 990MB
        3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
        4) it reclaimed 990MB
      
        Oh, well. it reclaimed twice much than required.
        In the other hand, current shrink_zone() has sane baling out logic.
        then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
      
      4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
        shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
      
      Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
      it bring good reviewability and debuggability, and solve above problems.
      
      side note: Reclaim logic unificication makes two good side effect.
       - Fix recursive reclaim bug on shrink_all_memory().
         it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
       - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarRafael J. Wysocki <rjw@sisk.pl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      27010999
    • KOSAKI Motohiro's avatar
      Currently, sc.scap_cluster_max has double meanings. · f0f7902d
      KOSAKI Motohiro authored
       1) reclaim batch size as isolate_lru_pages()'s argument
       2) reclaim baling out thresolds
      
      The two meanings pretty unrelated. Thus, Let's separate it.
      this patch doesn't change any behavior.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f0f7902d
    • Alex Chiang's avatar
      Describe NUMA node symlink created for CPUs when CONFIG_NUMA is set. · 250bc948
      Alex Chiang authored
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      250bc948
    • Alex Chiang's avatar
      You can discover which CPUs belong to a NUMA node by examining · 16671a0e
      Alex Chiang authored
      /sys/devices/system/node/node#/
      
      However, it's not convenient to go in the other direction, when looking at
      /sys/devices/system/cpu/cpu#/
      
      Yes, you can muck about in sysfs, but adding these symlinks makes life a
      lot more convenient.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      16671a0e
    • Alex Chiang's avatar
      By returning early if the node is not online, we can unindent the · bbe1ea24
      Alex Chiang authored
      interesting code by two levels.
      
      No functional change.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bbe1ea24
    • Alex Chiang's avatar
      By returning early if the node is not online, we can unindent the · af2c3d47
      Alex Chiang authored
      interesting code by one level.
      
      No functional change.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      af2c3d47
    • Alex Chiang's avatar
      Commit c04fc586 (mm: show node to memory section relationship with · 056efae3
      Alex Chiang authored
      symlinks in sysfs) created symlinks from nodes to memory sections, e.g.
      
      /sys/devices/system/node/node1/memory135 -> ../../memory/memory135
      
      If you're examining the memory section though and are wondering what node
      it might belong to, you can find it by grovelling around in sysfs, but
      it's a little cumbersome.
      
      Add a reverse symlink for each memory section that points back to the
      node to which it belongs.
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      056efae3
    • Hugh Dickins's avatar
      When do_nonlinear_fault() realizes that the page table must have been · 7fc48bc3
      Hugh Dickins authored
      corrupted for it to have been called, it does print_bad_pte() and returns
      ...  VM_FAULT_OOM, which is hard to understand.
      
      It made some sense when I did it for 2.6.15, when do_page_fault() just
      killed the current process; but nowadays it lets the OOM killer decide who
      to kill - so page table corruption in one process would be liable to kill
      another.
      
      Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee that
      the process will be killed, but is good enough for such a rare
      abnormality, accompanied as it is by the "BUG: Bad page map" message.
      
      And recent HWPOISON work has copied that code into do_swap_page(), when it
      finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7fc48bc3
    • Hugh Dickins's avatar
      CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t, · 078dd777
      Hugh Dickins authored
      and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
      enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.
      
      When 2.6.15 placed the split page table lock inside struct page (usually
      sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
      we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4 by
      letting the spinlock_t occupy both page->private and page->mapping).
      
      Should these debugging options be allowed to double the size of a struct
      page, when only one minority use of the page (as a page table) needs to
      fit a spinlock in there?  Perhaps not.
      
      Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC is in force.  I've sometimes tried to be cleverer,
      kmallocing a cacheline for the spinlock when it doesn't fit, but given up
      each time.  Falling back to mm->page_table_lock (as we do when ptlock is
      not split) lets lockdep check out the strictest path anyway.
      
      And now that some arches allow 8192 cpus, use 999999 for infinity.
      
      (What has this got to do with KSM swapping?  It doesn't care about the
      size of struct page, but may care about random junk in page->mapping - to
      be explained separately later.)
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      078dd777
    • Hugh Dickins's avatar
      KSM swapping will know where page_referenced_one() and try_to_unmap_one() · 5c930dff
      Hugh Dickins authored
      should look.  It could hack page->index to get them to do what it wants,
      but it seems cleaner now to pass the address down to them.
      
      Make the same change to page_mkclean_one(), since it follows the same
      pattern; but there's no real need in its case.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5c930dff
    • Hugh Dickins's avatar
      Remove three degrees of obfuscation, left over from when we had · 71228787
      Hugh Dickins authored
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      71228787
    • Hugh Dickins's avatar
      There's contorted mlock/munlock handling in try_to_unmap_anon() and · 2e6922d3
      Hugh Dickins authored
      try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping. 
      Simplify it by moving it all down into try_to_unmap_one().
      
      One thing is then lost, try_to_munlock()'s distinction between when no vma
      holds the page mlocked, and when a vma does mlock it, but we could not get
      mmap_sem to set the page flag.  But its only caller takes no interest in
      that distinction (and is better testing SWAP_MLOCK anyway), so let's keep
      the code simple and return SWAP_AGAIN for both cases.
      
      try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
      amusing: once unravelled, it turns out to have been choosing between two
      different ways of doing the same nothing.  Ah, no, one way was actually
      returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2e6922d3
    • Hugh Dickins's avatar
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set · a7f80f7d
      Hugh Dickins authored
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a7f80f7d
  17. 14 Nov, 2009 1 commit
  18. 12 Nov, 2009 2 commits
    • Mel Gorman's avatar
      After kswapd balances all zones in a pgdat, it goes to sleep. In the · 2d6e38ae
      Mel Gorman authored
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there are a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for sufficient length time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2d6e38ae
    • Mel Gorman's avatar
      Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least · a70768e7
      Mel Gorman authored
      that the commits 373c0a7e 8aa7e847 dramatically increased the number of
      GFP_ATOMIC failures that were occuring within a wireless driver. 
      Reverting this patch seemed to help a lot even though it was pointed out
      that the congestion changes were very far away from high-order atomic
      allocations.
      
      The key to why the revert makes such a big difference is down to timing
      and how long direct reclaimers wait versus kswapd.  With the patch
      reverted, the congestion_wait() is on the SYNC queue instead of the ASYNC.
       As a significant part of the workload involved reads, it makes sense that
      the SYNC list is what was truely congested and with the revert processes
      were waiting on congestion as expected.  Hence, direct reclaimers stalled
      properly and kswapd was able to do its job with fewer stalls.
      
      This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
      for direct reclaimers.  Instead of making the congestion_wait() on the
      SYNC queue which would only fix a particular type of workload, this patch
      adds a third type of congestion_wait - BLK_RW_BOTH which first waits on
      the ASYNC and then the SYNC queue if the timeout has not been reached.  In
      tests, this counter-intuitively results in kswapd stalling less and
      freeing up pages resulting in fewer allocation failures and fewer
      direct-reclaim-orientated stalls.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a70768e7