1. 24 Aug, 2009 2 commits
  2. 11 Aug, 2009 2 commits
  3. 12 Aug, 2009 2 commits
  4. 18 Aug, 2009 1 commit
  5. 24 Aug, 2009 1 commit
  6. 12 Aug, 2009 1 commit
    • Daisuke Nishimura's avatar
      After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"), · f80362c0
      Daisuke Nishimura authored
      read_swap_cache_async() will busy-wait while a entry doesn't exist in swap
      cache but it has SWAP_HAS_CACHE flag.
      
      Such entries can exist on add/delete path of swap cache.  On add path,
      add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and
      on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is
      cleared) soon after __delete_from_swap_cache() is called.  So, the
      busy-wait works well in most cases.
      
      But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
      read_swap_cache_async() tries to swap-in the same entry on the same cpu.
      
      This patch calls radix_tree_preload() before swapcache_prepare() and
      divides add_to_swap_cache() into two part: radix_tree_preload() part and
      radix_tree_insert() part(define it as __add_to_swap_cache()).
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f80362c0
  7. 11 Aug, 2009 3 commits
    • Mel Gorman's avatar
      Knowing tracepoints exist is not quite the same as knowing what they · 0aba8dc8
      Mel Gorman authored
      should be used for.  This patch adds a document giving a basic description
      of the kmem tracepoints and why they might be useful to a performance
      analyst.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0aba8dc8
    • Mel Gorman's avatar
      The documentation for ftrace, events and tracepoints is pretty extensive. · 30e9dec2
      Mel Gorman authored
      Similarly, the perf PCL tools help files --help are there and the code
      simple enough to figure out what much of the switches mean.  However,
      pulling the discrete bits and pieces together and translating that into
      "how do I solve a problem" requires a fair amount of imagination.
      
      This patch adds a simple document intended to get someone started on the
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      30e9dec2
    • Mel Gorman's avatar
      This patch adds a simple post-processing script for the · 2e607c90
      Mel Gorman authored
      page-allocator-related trace events.  It can be used to give an indication
      of who the most allocator-intensive processes are and how often the zone
      lock was taken during the tracing period.  Example output looks like
      
      Process                   Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                  allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                      under lock     direct  pagevec      drain
      swapper-0                     0          0          2        0          0        0        0          0        0        0        0        0        0
      Xorg-3770                 10603       5952       3685     6978       5996      194      192          0        0        0        0        0        0
      modprobe-21397               51          0          0       86         31        1        0          0        0        0        0        0        0
      xchat-5370                  228         93          0        0          0        0        3          0        0        0        0        0        0
      awesome-4317                 32         32          0        0          0        0       32          0        0        0        0        0        0
      thinkfan-3863                 2          0          1        1          0        0        0          0        0        0        0        0        0
      hald-addon-stor-3935          2          0          0        0          0        0        0          0        0        0        0        0        0
      akregator-4506                1          1          0        0          0        0        1          0        0        0        0        0        0
      xmms-14888                    0          0          1        0          0        0        0          0        0        0        0        0        0
      khelper-12                    1          0          0        0          0        0        0          0        0        0        0        0        0
      
      Optionally, the output can include information on the parent or aggregate
      based on process name instead of aggregating based on each pid. Example output
      including parent information and stripped out the PID looks something like;
      
      Process                        Pages      Pages      Pages    Pages       PCPU     PCPU     PCPU   Fragment Fragment  MigType Fragment Fragment  Unknown
      details                       allocd     allocd      freed    freed      pages   drains  refills   Fallback  Causing  Changed   Severe Moderate
                                           under lock     direct  pagevec      drain
      gdm-3756 :: Xorg-3770           3796       2976         99     3813       3224      104       98          0        0        0        0        0        0
      init-1 :: hald-3892                1          0          0        0          0        0        0          0        0        0        0        0        0
      git-21447 :: editor-21448          4          0          4        0          0        0        0          0        0        0        0        0        0
      
      This says that Xorg allocated 3796 pages and it's parent process is gdm
      with a PID of 3756;
      
      The postprocessor parses the text output of tracing.  While there is a
      binary format, the expectation is that the binary output can be readily
      translated into text and post-processed offline.  Obviously if the text
      format changes, the parser will break but the regular expression parser is
      fairly rudimentary so should be readily adjustable.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2e607c90
  8. 13 Aug, 2009 1 commit
    • Andrew Morton's avatar
      mm/page_alloc.c: In function 'free_pages_bulk': · 2d04241d
      Andrew Morton authored
      mm/page_alloc.c:549: error: implicit declaration of function 'trace_mm_page_pcpu_drain'
      mm/page_alloc.c: In function '__rmqueue_fallback':
      mm/page_alloc.c:879: error: implicit declaration of function 'trace_mm_page_alloc_extfrag'
      mm/page_alloc.c: In function '__rmqueue':
      mm/page_alloc.c:915: error: implicit declaration of function 'trace_mm_page_alloc_zone_locked'
      mm/page_alloc.c: In function 'free_hot_page':
      mm/page_alloc.c:1106: error: implicit declaration of function 'trace_mm_page_free_direct'
      mm/page_alloc.c: In function '__alloc_pages_nodemask':
      mm/page_alloc.c:1951: error: implicit declaration of function 'trace_mm_page_alloc'
      mm/page_alloc.c: In function '__pagevec_free':
      mm/page_alloc.c:1987: error: implicit declaration of function 'trace_mm_pagevec_free'
      
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2d04241d
  9. 11 Aug, 2009 3 commits
    • Mel Gorman's avatar
      The page allocation trace event reports that a page was successfully · de4c81a5
      Mel Gorman authored
      allocated but it does not specify where it came from.  When analysing
      performance, it can be important to distinguish between pages coming from
      the per-cpu allocator and pages coming from the buddy lists as the latter
      requires the zone lock to the taken and more data structures to be
      examined.
      
      This patch adds a trace event for __rmqueue reporting when a page is being
      allocated from the buddy lists.  It distinguishes between being called to
      refill the per-cpu lists or whether it is a high-order allocation. 
      Similarly, this patch adds an event to catch when the PCP lists are being
      drained a little and pages are going back to the buddy lists.
      
      This is trickier to draw conclusions from but high activity on those
      events could explain why there were a large number of cache misses on a
      page-allocator-intensive workload.  The coalescing and splitting of
      buddies involves a lot of writing of page metadata and cache line bounces
      not to mention the acquisition of an interrupt-safe lock necessary to
      enter this path.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      de4c81a5
    • Mel Gorman's avatar
      Fragmentation avoidance depends on being able to use free pages from lists · 3a561180
      Mel Gorman authored
      of the appropriate migrate type.  In the event this is not possible,
      __rmqueue_fallback() selects a different list and in some circumstances
      change the migratetype of the pageblock.  Simplistically, the more times
      this event occurs, the more likely that fragmentation will be a problem
      later for hugepage allocation at least but there are other considerations
      such as the order of page being split to satisfy the allocation.
      
      This patch adds a trace event for __rmqueue_fallback() that reports what
      page is being used for the fallback, the orders of relevant pages, the
      desired migratetype and the migratetype of the lists being used, whether
      the pageblock changed type and whether this event is important with
      respect to fragmentation avoidance or not.  This information can be used
      to help analyse fragmentation avoidance and help decide whether
      min_free_kbytes should be increased or not.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3a561180
    • Mel Gorman's avatar
      This patch adds trace events for the allocation and freeing of pages, · ec2ebbd0
      Mel Gorman authored
      including the freeing of pagevecs.  Using the events, it will be known
      what struct page and pfns are being allocated and freed and what the call
      site was in many cases.
      
      The page alloc tracepoints be used as an indicator as to whether the
      workload was heavily dependant on the page allocator or not.  You can make
      a guess based on vmstat but you can't get a per-process breakdown. 
      Depending on the call path, the call_site for page allocation may be
      __get_free_pages() instead of a useful callsite.  Instead of passing down
      a return address similar to slab debugging, the user should enable the
      stacktrace and seg-addr options to get a proper stack trace.
      
      The pagevec free tracepoint has a different usecase.  It can be used to
      get a idea of how many pages are being dumped off the LRU and whether it
      is kswapd doing the work or a process doing direct reclaim.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ec2ebbd0
  10. 24 Aug, 2009 1 commit
  11. 04 Aug, 2009 1 commit
    • Andrew Morton's avatar
      ERROR: code indent should use tabs where possible · b2e67f8a
      Andrew Morton authored
      #219: FILE: arch/s390/mm/init.c:108:
      +                nr_free_pages() << (PAGE_SHIFT-10),$
      
      total: 1 errors, 0 warnings, 162 lines checked
      
      ./patches/arches-drop-superfluous-casts-in-nr_free_pages-callers.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b2e67f8a
  12. 24 Aug, 2009 1 commit
  13. 04 Aug, 2009 1 commit
  14. 24 Aug, 2009 2 commits
  15. 09 Sep, 2009 1 commit
  16. 24 Aug, 2009 1 commit
  17. 10 Sep, 2009 1 commit
    • Mel Gorman's avatar
      Calculate the number of pageblocks within a range properly · 4553616e
      Mel Gorman authored
      Patch
      page-allocator-change-migratetype-for-all-pageblocks-within-a-high-order-page-during-__rmqueue_fallback
      is meant to change the pageblock ownership of each pageblock within a
      given range. This is necessary when the buddy to be split is of higher
      order than the pageblock_order. However, the calculation was wrong
      leading to crashes on ia-64 and slightly incorrect behaviour on x86.
      This patch corrects the calculation.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4553616e
  18. 24 Aug, 2009 4 commits
  19. 10 Sep, 2009 3 commits
  20. 24 Aug, 2009 1 commit
  21. 04 Aug, 2009 2 commits
  22. 03 Sep, 2009 1 commit
    • Andrea Arcangeli's avatar
      Rawhide users have reported hang at startup when cryptsetup is run: the · 93c93e98
      Andrea Arcangeli authored
      same problem can be simply reproduced by running a program int main() {
      mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }
      
      The problem is that exit_mmap() applies munlock_vma_pages_all() to
      clean up VM_LOCKED areas, and its current implementation (stupidly)
      tries to fault in absent pages, for example where PROT_NONE prevented
      them being faulted in when mlocking.  Whereas the "ksm: fix oom
      deadlock" patch, knowing there's a race by which KSM might try to fault
      in pages after exit_mmap() had finally zapped the range, backs out of
      such faults doing nothing when its ksm_test_exit() notices mm_users 0.
      
      So revert that part of "ksm: fix oom deadlock" which moved the
      ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
      and remove those ksm_test_exit() checks from the page fault paths, so
      allowing the munlocking to proceed without interference.
      
      ksm_exit, if there are rmap_items still chained on this mm slot, takes
      mmap_sem write side: so preventing KSM from working on an mm while
      exit_mmap runs.  And KSM will bail out as soon as it notices that
      mm_users is already zero, thanks to its internal ksm_test_exit checks. 
      So that when a task is killed by OOM killer or the user, KSM will not
      indefinitely prevent it from running exit_mmap to release its memory.
      
      This does break a part of what "ksm: fix oom deadlock" was trying to
      achieve.  When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
      when ksmd itself has to cancel a KSM page, it is possible that the
      first OOM-kill victim would be the KSM process being faulted: then its
      memory won't be freed until a second victim has been selected (freeing
      memory for the unmerging fault to complete).
      
      But the OOM killer is already liable to kill a second victim once the
      intended victim's p->mm goes to NULL: so there's not much point in
      rejecting this KSM patch before fixing that OOM behaviour.  It is very
      much more important to allow KSM users to boot up, than to haggle over
      an unlikely and poorly supported OOM case.
      
      We also intend to fix munlocking to not fault pages: at which point
      this patch _could_ be reverted; though that would be controversial, so
      we hope to find a better solution.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJustin M. Forbes <jforbes@redhat.com>
      Acked-for-now-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      93c93e98
  23. 24 Aug, 2009 1 commit
    • Hugh Dickins's avatar
      There's a now-obvious deadlock in KSM's out-of-memory handling: · cb80fbbd
      Hugh Dickins authored
      imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
      trying to allocate a page to break KSM in an mm which becomes the
      OOM victim (quite likely in the unmerge case): it's killed and goes
      to exit, and hangs there waiting to acquire ksm_thread_mutex.
      
      Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
      though that made everything else: perhaps use mmap_sem somehow?
      And part of the answer lies in the comments on unmerge_ksm_pages:
      __ksm_exit should also leave all the rmap_item removal to ksmd.
      
      But there's a fundamental problem, that KSM relies upon mmap_sem to
      guarantee the consistency of the mm it's dealing with, yet exit_mmap
      tears down an mm without taking mmap_sem.  And bumping mm_users won't
      help at all, that just ensures that the pages the OOM killer assumes
      are on their way to being freed will not be freed.
      
      The best answer seems to be, to move the ksm_exit callout from just
      before exit_mmap, to the middle of exit_mmap: after the mm's pages
      have been freed (if the mmu_gather is flushed), but before its page
      tables and vma structures have been freed; and down_write,up_write
      mmap_sem there to serialize with KSM's own reliance on mmap_sem.
      
      But KSM then needs to be careful, whenever it downs mmap_sem, to
      check that the mm is not already exiting: there's a danger of using
      find_vma on a layout that's being torn apart, or writing into page
      tables which have been freed for reuse; and even do_anonymous_page
      and __do_fault need to check they're not being called by break_ksm
      to reinstate a pte after zap_pte_range has zapped that page table.
      
      Though it might be clearer to add an exiting flag, set while holding
      mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
      a zapped pte.  All we need is to check whether mm_users is 0 - but
      must remember that ksmd may detect that before __ksm_exit is reached.
      So, ksm_test_exit(mm) added to comment such checks on mm->mm_users.
      
      __ksm_exit now has to leave clearing up the rmap_items to ksmd,
      that needs ksm_thread_mutex; but shift the exiting mm just after the
      ksm_scan cursor so that it will soon be dealt with.  __ksm_enter raise
      mm_count to hold the mm_struct, ksmd's exit processing (exactly like
      its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it,
      similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd).
      
      But also give __ksm_exit a fast path: when there's no complication
      (no rmap_items attached to mm and it's not at the ksm_scan cursor),
      it can safely do all the exiting work itself.  This is not just an
      optimization: when ksmd is not running, the raised mm_count would
      otherwise leak mm_structs.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cb80fbbd
  24. 04 Aug, 2009 3 commits
    • Hugh Dickins's avatar
      Do some housekeeping in ksm.c, to help make the next patch easier · 45c6ffbc
      Hugh Dickins authored
      to understand: remove the function remove_mm_from_lists, distributing
      its code to its callsites scan_get_next_rmap_item and __ksm_exit.
      
      That turns out to be a win in scan_get_next_rmap_item: move its
      remove_trailing_rmap_items and cursor advancement up, and it becomes
      simpler than before.  __ksm_exit becomes messier, but will change
      again; and moving its remove_trailing_rmap_items up lets us strengthen
      the unstable tree item's age condition in remove_rmap_item_from_tree.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      45c6ffbc
    • Hugh Dickins's avatar
      break_ksm has been looping endlessly ignoring VM_FAULT_OOM: that should · e3170486
      Hugh Dickins authored
      only be a problem for ksmd when a memory control group imposes limits
      (normally the OOM killer will kill others with an mm until it succeeds);
      but in general (especially for MADV_UNMERGEABLE and KSM_RUN_UNMERGE) we
      do need to route the error (or kill) back to the caller (or sighandling).
      
      Test signal_pending in unmerge_ksm_pages, which could be a lengthy
      procedure if it has to spill into swap: returning -ERESTARTSYS so that
      trivial signals will restart but fatals will terminate (is that right?
      we do different things in different places in mm, none exactly this).
      
      unmerge_and_remove_all_rmap_items was forgetting to lock when going
      down the mm_list: fix that.  Whether it's successful or not, reset
      ksm_scan cursor to head; but only if it's successful, reset seqnr
      (shown in full_scans) - page counts will have gone down to zero.
      
      This patch leaves a significant OOM deadlock, but it's a good step
      on the way, and that deadlock is fixed in a subsequent patch.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e3170486
    • Hugh Dickins's avatar
      1. We don't use __break_cow entry point now: merge it into break_cow. · b2f860c2
      Hugh Dickins authored
      2. remove_all_slot_rmap_items is just a special case of
         remove_trailing_rmap_items: use the latter instead.
      3. Extend comment on unmerge_ksm_pages and rmap_items.
      4. try_to_merge_two_pages should use try_to_merge_with_ksm_page
         instead of duplicating its code; and so swap them around.
      5. Comment on cmp_and_merge_page described last year's: update it.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b2f860c2