1. 22 Mar, 2006 40 commits
    • [PATCH] optimize follow_hugetlb_page · d5d4b0aa
      Chen, Kenneth W authored
      follow_hugetlb_page() walks a range of user virtual addresses and fills in
      a list of struct page * in an array passed in through the argument list.
      It also takes a reference on each page via get_page().  For a compound
      page, get_page() actually traverses back to the head page via the
      page_private() macro and then adds a reference count to the head page.
      Since we are doing a virt to pte lookup, the kernel already has a struct
      page pointer to the head page.  So instead of traversing into the small
      unit page struct and then following a link back to the head page, optimize
      that by incrementing the reference count directly on the head page.
      
      The benefit is that we don't take a cache miss on accessing the page
      struct for the corresponding user address and, more importantly, we don't
      pollute the cache with a "not very useful" round trip of pointer chasing.
      This adds a moderate performance gain on an I/O intensive database
      transaction workload.
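      
      The difference between the two paths can be sketched in userspace as
      follows.  This is only an illustrative model, not the kernel code: the
      simplified struct page, its head field (standing in for the link reached
      via page_private()) and both function names are assumptions.
      
      ```c
      #include <assert.h>
      #include <stddef.h>
      
      /* Simplified userspace model of struct page; not the kernel layout. */
      struct page {
          int count;          /* reference count, kept on the head page */
          struct page *head;  /* link back to the head page, NULL on the head */
      };
      
      /* Old path: touch the tail page struct, then chase the link to the head. */
      static void get_page_via_tail(struct page *tail)
      {
          struct page *head = tail->head ? tail->head : tail;
      
          head->count++;
      }
      
      /* Optimized path: the caller already holds the head page pointer. */
      static struct page *get_subpage_ref(struct page *head, size_t idx)
      {
          head->count++;      /* the tail page struct is never read */
          return head + idx;  /* the tail pointer is computed, not dereferenced */
      }
      ```
      
      In the second function the only page struct whose memory is accessed is
      the head page, which is the cache behaviour the commit message describes.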
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] convert hugetlbfs_counter to atomic · bba1e9b2
      Chen, Kenneth W authored
      Implementation of hugetlbfs_counter() is functionally equivalent to
      atomic_inc_return().  Use the simpler atomic form.
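      
      The "increment and return the new value" semantics can be illustrated in
      userspace with C11 atomics.  This is a minimal sketch, not the kernel
      implementation: counter_next() is a hypothetical name standing in for the
      kernel's atomic_inc_return().
      
      ```c
      #include <assert.h>
      #include <stdatomic.h>
      
      /* Hypothetical userspace stand-in for the kernel's atomic_inc_return(). */
      static int counter_next(atomic_int *v)
      {
          /* atomic_fetch_add() returns the old value; add 1 to get the new one. */
          return atomic_fetch_add(v, 1) + 1;
      }
      ```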
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: is_aligned_hugepage_range() cleanup · 42b88bef
      David Gibson authored
      Quite a long time back, prepare_hugepage_range() replaced
      is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
      verify if an address range is suitable for a hugepage mapping.
      is_aligned_hugepage_range() stuck around, but only to implement
      prepare_hugepage_range() on archs which didn't implement their own.
      
      Most archs (everything except ia64 and powerpc) used the same
      implementation of is_aligned_hugepage_range().  On powerpc, which
      implements its own prepare_hugepage_range(), the custom version was never
      used.
      
      In addition, "is_aligned_hugepage_range()" was a bad name, because it
      suggests it returns true iff the given range is a good hugepage range,
      whereas in fact it returns 0-or-error (so the sense is reversed).
      
      This patch cleans up by abolishing is_aligned_hugepage_range().  Instead
      prepare_hugepage_range() is defined directly.  Most archs use the default
      version, which simply checks the given region is aligned to the size of a
      hugepage.  ia64 and powerpc define custom versions.  The ia64 one simply
      checks that the range is in the correct address space region in addition to
      being suitably aligned.  The powerpc version (just as previously) checks
      for suitable addresses, and if necessary performs low-level MMU frobbing to
      set up new areas for use by hugepages.
      
      No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).
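      
      The default check described above (alignment only, returning 0 or an
      error) might look roughly like the sketch below.  It is a userspace
      illustration under stated assumptions: the 2 MiB hugepage size and the
      _sketch suffix on the function name are not taken from the patch.
      
      ```c
      #include <assert.h>
      #include <errno.h>
      
      /* Assumed hugepage size for illustration only: 2 MiB. */
      #define HPAGE_SHIFT 21
      #define HPAGE_SIZE  (1UL << HPAGE_SHIFT)
      #define HPAGE_MASK  (~(HPAGE_SIZE - 1))
      
      /* Returns 0 if the range is hugepage aligned, or a negative errno. */
      static int prepare_hugepage_range_sketch(unsigned long addr, unsigned long len)
      {
          if (len & ~HPAGE_MASK)
              return -EINVAL;   /* length is not a multiple of the hugepage size */
          if (addr & ~HPAGE_MASK)
              return -EINVAL;   /* start address is not hugepage aligned */
          return 0;
      }
      ```
      
      Note the 0-or-error convention: a return value of 0 means the range is
      acceptable, which is why the old "is_aligned" name was misleading.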
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Move hugetlb_free_pgd_range() prototype to hugetlb.h · 3915bcf3
      David Gibson authored
      The optional hugepage callback, hugetlb_free_pgd_range() is presently
      implemented non-trivially only on ia64 (but I plan to add one for powerpc
      shortly).  It has its own prototype for the function in asm-ia64/pgtable.h.
      However, since the function is called from generic code, it makes sense for
      its prototype to be in the generic hugetlb.h header file, as the prototypes
      of other arch callbacks already are (prepare_hugepage_range(),
      set_huge_pte_at(), etc.).  This patch makes it so.
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Fix hugepage logic in free_pgtables() harder · 4866920b
      David Gibson authored
      Turns out the hugepage logic in free_pgtables() was doubly broken.  The
      loop coalescing multiple normal page VMAs into one call to free_pgd_range()
      had an off by one error, which could mean it would coalesce one hugepage
      VMA into the same bundle (checking 'vma' not 'next' in the loop).  I
      transferred this bug into the new is_vm_hugetlb_page() based version.
      Here's the fix.
      
      This one didn't bite on powerpc previously for the same reason the
      is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range()
      is identical to free_pgd_range().  It didn't bite on ia64 because the
      hugepage region is distant enough from any other region that the separated
      PMD_SIZE distance test would always prevent coalescing the two together.
      
      No libhugetlbfs testsuite regressions (ppc64, POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Fix hugepage logic in free_pgtables() · 9da61aef
      David Gibson authored
      free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
      of the normal free_pgd_range() on hugepage VMAs.  However, the test it uses
      to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
      range at the start of the vma.  is_hugepage_only_range() will return true
      if the given range has any intersection with a hugepage address region, and
      in this case the given region need not be hugepage aligned.  So, for
      example, this test can return true if called on, say, a 4k VMA immediately
      preceding a (nicely aligned) hugepage VMA.
      
      At present we get away with this because the powerpc version of
      hugetlb_free_pgd_range() is just a call to free_pgd_range().  On ia64 (the
      only other arch with a non-trivial is_hugepage_only_range()) we get away
      with it for a different reason; the hugepage area is not contiguous with
      the rest of the user address space, and VMAs are not permitted in between,
      so the test can't return a false positive there.
      
      Nonetheless this should be fixed.  We do that in the patch below by
      replacing the is_hugepage_only_range() test with an explicit test of the
      VMA using is_vm_hugetlb_page().
      
      This in turn changes behaviour for platforms where is_hugepage_only_range()
      returns false always (everything except powerpc and ia64).  We address this
      by ensuring that hugetlb_free_pgd_range() is defined to be identical to
      free_pgd_range() (instead of a no-op) on everything except ia64.  Even so,
      it will prevent some otherwise possible coalescing of calls down to
      free_pgd_range().  Since this only happens for hugepage VMAs, removing this
      small optimization seems unlikely to cause any trouble.
      
      This patch causes no regressions on the libhugetlbfs testsuite - ppc64
      POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Make {alloc,free}_huge_page() local · 27a85ef1
      David Gibson authored
      Originally, mm/hugetlb.c just handled the hugepage physical allocation path
      and its {alloc,free}_huge_page() functions were used from the arch specific
      hugepage code.  These days those functions are only used within mm/hugetlb.c
      itself.  Therefore, this patch makes them static and removes their
      prototypes from hugetlb.h.  This requires a small rearrangement of code in
      mm/hugetlb.c to avoid a forward declaration.
      
      This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
      POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Strict page reservation for hugepage inodes · b45b5bd6
      David Gibson authored
      These days, hugepages are demand-allocated at first fault time.  There's a
      somewhat dubious (and racy) heuristic when making a new mmap() to check if
      there are enough available hugepages to fully satisfy that mapping.
      
      A particularly obvious case where the heuristic breaks down is where a
      process maps its hugepages not as a single chunk, but as a bunch of
      individually mmap()ed (or shmat()ed) blocks without touching and
      instantiating the pages in between allocations.  In this case the size of
      each block is compared against the total number of available hugepages.
      It's thus easy for the process to become overcommitted, because each block
      mapping will succeed, although the total number of hugepages required by
      all blocks exceeds the number available.  In particular, this defeats such
      a program which will detect a mapping failure and adjust its hugepage usage
      downward accordingly.
      
      The patch below addresses this problem, by strictly reserving a number of
      physical hugepages for hugepage inodes which have been mapped, but not
      instantiated.  MAP_SHARED mappings are thus "safe" - they will fail on
      mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
      trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
      only if the sysadmin explicitly reduces the hugepage pool between mapping
      and instantiation.)
      
      This patch appears to address the problem at hand - it allows DB2 to start
      correctly, for instance, which previously suffered the failure described
      above.
      
      This patch causes no regressions on the libhugetlbfs testsuite, and makes a
      test (designed to catch this problem) pass which previously failed (ppc64,
      POWER5).
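      
      The idea of strict reservation can be modelled in a few lines of
      userspace C.  This is a sketch of the accounting principle only; the
      struct, its fields and both function names are hypothetical, and the
      patch's actual per-inode bookkeeping is not reproduced here.
      
      ```c
      #include <assert.h>
      #include <errno.h>
      
      /* Hypothetical pool state; the kernel's real bookkeeping differs. */
      struct hugepage_pool {
          unsigned long free_pages;      /* hugepages not yet instantiated */
          unsigned long reserved_pages;  /* promised to mappings, not yet faulted */
      };
      
      /* Reserve at mmap() time, so failure is reported there and not at fault. */
      static int pool_reserve(struct hugepage_pool *pool, unsigned long npages)
      {
          if (npages > pool->free_pages - pool->reserved_pages)
              return -ENOMEM;
          pool->reserved_pages += npages;
          return 0;
      }
      
      /* Instantiate one previously reserved page at first-fault time. */
      static void pool_instantiate_reserved(struct hugepage_pool *pool)
      {
          assert(pool->reserved_pages > 0 && pool->free_pages > 0);
          pool->reserved_pages--;
          pool->free_pages--;
      }
      ```
      
      Because each request is checked against what is left after earlier
      reservations, a series of individually small mappings can no longer
      overcommit the pool the way the per-block heuristic allowed.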
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: serialize hugepage allocation and instantiation · 3935baa9
      David Gibson authored
      Currently, no lock or mutex is held between allocating a hugepage and
      inserting it into the pagetables / page cache.  When we do go to insert the
      page into pagetables or page cache, we recheck and may free the newly
      allocated hugepage.  However, since the number of hugepages in the system
      is strictly limited, and it's usual to want to use all of them, this can
      still lead to spurious allocation failures.
      
      For example, suppose two processes are both mapping (MAP_SHARED) the same
      hugepage file, large enough to consume the entire available hugepage pool.
      If they race instantiating the last page in the mapping, they will both
      attempt to allocate the last available hugepage.  One will fail, of course,
      returning OOM from the fault and thus causing the process to be killed,
      despite the fact that the entire mapping can, in fact, be instantiated.
      
      The patch fixes this race by the simple method of adding a (sleeping) mutex
      to serialize the hugepage fault path between allocation and insertion into
      pagetables and/or page cache.  It would be possible to avoid the
      serialization by catching the allocation failures, waiting on some
      condition, then rechecking to see if someone else has instantiated the page
      for us.  Given the likely frequency of hugepage instantiations, it seems
      very doubtful it's worth the extra complexity.
      
      This patch causes no regression on the libhugetlbfs testsuite, and one
      test, which can trigger this race, now passes where it previously failed.
      
      Actually, the test still sometimes fails, though less often and only as a
      shmat() failure, rather than processes getting OOM killed by the VM.  The
      dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough
      hugepage space aren't protected by the new mutex, and it would be ugly to
      make them so; there's still a race there.  Another patch to replace those
      tests with something saner, for this reason as well as others, is coming...
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] hugepage: Small fixes to hugepage clear/copy path · 79ac6ba4
      David Gibson authored
      Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
      own functions for clarity.  As we do so, we add some checks of need_resched
      - we are, after all, copying megabytes of memory here.  We also add
      might_sleep() accordingly.  We generally drop locks around the clear and
      copy already, but not everyone has PREEMPT enabled, so we should still be
      checking explicitly.
      
      For this to work, we need to remove the clear_huge_page() from
      alloc_huge_page(), which is called with the page_table_lock held in the COW
      path.  We move the clear_huge_page() to just after the alloc_huge_page() in
      the hugepage no-page path.  In the COW path, the new page is about to be
      copied over, so clearing it was just a waste of time anyway.  So as a side
      effect we also fix the fact that we held the page_table_lock for far too
      long in this path by calling alloc_huge_page() under it.
      
      It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Enable mprotect on huge pages · 8f860591
      Zhang, Yanmin authored
      2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
      mprotect.
      
      From: David Gibson <david@gibson.dropbear.id.au>
      
        Remove a test from the mprotect() path which checks that the mprotect()ed
        range on a hugepage VMA is hugepage aligned (yes, really, the sense of
        is_aligned_hugepage_range() is the opposite of what you'd guess :-/).
      
        In fact, we don't need this test.  If the given addresses match the
        beginning/end of a hugepage VMA they must already be suitably aligned.  If
        they don't, then mprotect_fixup() will attempt to split the VMA.  The very
        first test in split_vma() will check for a badly aligned address on a
        hugepage VMA and return -EINVAL if necessary.
      
      From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      
        On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
        identity of a hugetlb pte is lost when changing page protection via
        mprotect.  A page fault that occurs later will trigger a bug check in
        huge_pte_alloc().
      
        The fix is to always make the new pte a hugetlb pte and also to clean up
        legacy code where _PAGE_PRESENT is forced on from the pre-faulting days.
      Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] readahead: fix initial window size calculation · aed75ff3
      Steven Pratt authored
      The current get_init_ra_size is not optimal across different IO
      sizes and max_readahead values.  Here is a quick summary of sizes computed
      under current design and under the attached patch.  All of these assume 1st
      IO at offset 0, or 1st detected sequential IO.
      
      	32k max, 4k request
      
      	old         new
      	-----------------
      	 8k        8k
      	16k       16k
      	32k       32k
      
      	128k max, 4k request
      	old         new
      	-----------------
      	32k         16k
      	64k         32k
      	128k        64k
      	128k       128k
      
      	128k max, 32k request
      	old         new
      	-----------------
      	32k         64k    <-----
      	64k        128k
      	128k       128k
      
      	512k max, 4k request
      	old         new
      	-----------------
      	4k         32k     <----
      	16k        64k
      	64k       128k
      	128k      256k
      	512k      512k
      
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] readahead: ->prev_page can overrun the ahead window · a564da39
      Oleg Nesterov authored
      If get_next_ra_size() does not grow fast enough, ->prev_page can overrun
      the ahead window.  This means the caller will read the pages from
      ->ahead_start + ->ahead_size to ->prev_page synchronously.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] shmem: inline to avoid warning · d15c023b
      Hugh Dickins authored
      shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings:
      shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and
      only called from that one site, so mark it inline like its non-NUMA stub.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vmscan: remove obsolete checks from shrink_list() and fix unlikely in refill_inactive_zone() · 6e5ef1a9
      Christoph Lameter authored
      As suggested by Marcelo:
      
      1. The optimization introduced recently for not calling
         page_referenced() during zone reclaim makes two additional checks in
         shrink_list unnecessary.
      
      2. The if (unlikely(sc->may_swap)) in refill_inactive_zone is optimized
         for the zone_reclaim case.  However, most people's systems only do swap.
         Undo that.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Uninline sys_mmap common code (reduce binary size) · a7290ee0
      Michael Buesch authored
      Remove the inlining of the new vs old mmap system call common code.  This
      reduces the size of the resulting vmlinux for defconfig as follows:
      
      mb@pc1:~/develop/git/linux-2.6$ size vmlinux.mmap*
         text    data     bss     dec     hex filename
      3303749  521524  186564 4011837  3d373d vmlinux.mmapinline
      3303557  521524  186564 4011645  3d367d vmlinux.mmapnoinline
      
      The new sys_mmap2() also has one level of function call overhead removed
      now.  (It was probably already optimized to a jmp before, but anyway...)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: optimise page_count · 617d2214
      Nick Piggin authored
      Optimise page_count compound page test and make it consistent with similar
      functions.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: more CONFIG_DEBUG_VM · b7ab795b
      Nick Piggin authored
      Put a few more checks under CONFIG_DEBUG_VM
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: prep_zero_page() in irq is a bug · 6626c5d5
      Andrew Morton authored
      prep_zero_page() uses KM_USER0 and hence may not be used from IRQ context, at
      least for highmem pages.
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: cleanup prep_ stuff · 17cf4406
      Nick Piggin authored
      Move the prep_ stuff into prep_new_page.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove set_page_count() outside mm/ · 7835e98b
      Nick Piggin authored
      set_page_count usage outside mm/ is limited to setting the refcount to 1.
      Remove set_page_count from outside mm/, and replace those users with
      init_page_count() and set_page_refcounted().
      
      This allows more debug checking, and tighter control on how code is allowed
      to play around with page->_count.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove set_page_count(page, 0) users (outside mm) · 70dc991d
      Nick Piggin authored
      A couple of places call set_page_count(page, 1) when they don't need to.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: nommu use compound pages · 84097518
      Nick Piggin authored
      Now that compound page handling is properly fixed in the VM, move nommu
      over to using compound pages rather than rolling their own refcounting.
      
      nommu vm page refcounting is broken anyway, but there is no need to have
      divergent code in the core VM now, nor when it gets fixed.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      
      (Needs testing, please).
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: make __put_page internal · 0f8053a5
      Nick Piggin authored
      Remove __put_page from outside the core mm/.  It is dangerous because it does
      not handle compound pages nicely, and misses 1->0 transitions.  If a user
      later appears that really needs the extra speed we can reevaluate.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86_64: pageattr remove __put_page · 4fa4f53b
      Nick Piggin authored
      Remove page_count and __put_page from x86-64 pageattr
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86_64: pageattr use single list · 20aaffd6
      Nick Piggin authored
      Use page->lru.next to implement the singly linked list of pages rather than
      the struct deferred_page which needs to be allocated and freed for each
      page.
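      
      Threading a singly linked list through page->lru.next, so that no
      separate node has to be allocated per page, can be sketched as below.
      The types are simplified userspace stand-ins and the names
      deferred_pages, defer_page() and pop_deferred_page() are hypothetical,
      chosen only for this illustration.
      
      ```c
      #include <assert.h>
      #include <stddef.h>
      
      /* Simplified stand-ins for the kernel types; not the real definitions. */
      struct list_head { struct list_head *next, *prev; };
      struct page { struct list_head lru; int id; };
      
      /* Head of the deferred list; NULL means the list is empty. */
      static struct page *deferred_pages;
      
      /* Push a page, reusing page->lru.next as the link: nothing is allocated. */
      static void defer_page(struct page *pg)
      {
          pg->lru.next = (struct list_head *)deferred_pages;
          deferred_pages = pg;
      }
      
      /* Pop the most recently deferred page, or return NULL if there is none. */
      static struct page *pop_deferred_page(void)
      {
          struct page *pg = deferred_pages;
      
          if (pg)
              deferred_pages = (struct page *)pg->lru.next;
          return pg;
      }
      ```
      
      The link field already exists in every page, so queuing a page cannot
      fail for lack of memory, unlike a scheme that allocates a node per page.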
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] i386: pageattr remove __put_page · 84d1c054
      Nick Piggin authored
      Stop using __put_page and page_count in i386 pageattr.c
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sg: use compound pages · f9aed0e2
      Nick Piggin authored
      sg increments the refcount of constituent pages in its higher order memory
      allocations when they are about to be mapped by userspace.  This is done so
      the subsequent get_page/put_page when doing the mapping and unmapping does not
      free the page.
      
      Move over to the preferred way, that is, using compound pages instead.  This
      fixes a whole class of possible obscure bugs where a get_user_pages on a
      constituent page may outlast the user mappings or even the driver.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Douglas Gilbert <dougg@torque.net>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove VM_DONTCOPY bogosities · a6f563db
      Hugh Dickins authored
      Now that it's madvisable, remove two pieces of VM_DONTCOPY bogosity:
      
      1. There was and is no logical reason why VM_DONTCOPY should be in the
         list of flags which forbid vma merging (and those drivers which set
         it are also setting VM_IO, which itself forbids the merge).
      
      2. It's hard to understand the purpose of the VM_HUGETLB, VM_DONTCOPY
         block in vm_stat_account: but never mind, it's under CONFIG_HUGETLB,
         which (unlike CONFIG_HUGETLB_PAGE or CONFIG_HUGETLBFS) has never been
         defined.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: shrink_inactive_list() nr_scan accounting fix · fb8d14e1
      Wu Fengguang authored
      In shrink_inactive_list(), nr_scan is not accounted when nr_taken is 0.
      But 0 pages taken does not mean 0 pages scanned.
      
      Move the goto statement below the accounting code to fix it.
      Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: isolate_lru_pages() scan count fix · c9b02d97
      Wu Fengguang authored
      In isolate_lru_pages(), *scanned reports one more scan because the scan
      counter is increased one more time on exit of the while-loop.
      
      Change the while-loop to for-loop to fix it.
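      
      The off-by-one can be reproduced with a small userspace model.  This is
      an illustrative sketch, not the kernel function: scan_while() and
      scan_for() are hypothetical names, and the avail parameter stands in for
      the number of pages left on the source list.
      
      ```c
      #include <assert.h>
      
      /* Old form: the post-increment also runs on the final, failing test. */
      static unsigned long scan_while(unsigned long nr_to_scan, unsigned long avail)
      {
          unsigned long scan = 0;
      
          while (scan++ < nr_to_scan && avail > 0)
              avail--;    /* take one page off the list */
          return scan;
      }
      
      /* Fixed form: the counter advances only after a completed iteration. */
      static unsigned long scan_for(unsigned long nr_to_scan, unsigned long avail)
      {
          unsigned long scan;
      
          for (scan = 0; scan < nr_to_scan && avail > 0; scan++)
              avail--;    /* take one page off the list */
          return scan;
      }
      ```
      
      Whether the loop stops because the budget is used up or because the list
      runs dry, the while form reports one scan more than was performed.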
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] zone_reclaim: additional comments and cleanup · 7fb2d46d
      Christoph Lameter authored
      Add some comments to explain how zone reclaim works, and fix the
      following issues:
      
      - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write
        out pages to swap. Currently RECLAIM_SWAP may not do that.
      
      - remove setting nr_reclaimed pages after slab reclaim since the slab shrinking
        code does not use that and the nr_reclaimed pages is just right for the
        intended follow up action.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] vmscan: rename functions · 1742f19f
      Andrew Morton authored
      We have:
      
      	try_to_free_pages
      	->shrink_caches(struct zone **zones, ..)
      	  ->shrink_zone(struct zone *, ...)
      	    ->shrink_cache(struct zone *, ...)
      	      ->shrink_list(struct list_head *, ...)
      	    ->refill_inactive_list(struct zone *, ...)
      
      which is fairly irrational.
      
      Rename things so that we have
      
       	try_to_free_pages
       	->shrink_zones(struct zone **zones, ..)
       	  ->shrink_zone(struct zone *, ...)
       	    ->shrink_inactive_list(struct zone *, ...)
       	      ->shrink_page_list(struct list_head *, ...)
      	    ->shrink_active_list(struct zone *, ...)
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1742f19f
    • [PATCH] vmscan return nr_reclaimed · 05ff5137
      Andrew Morton authored
      Change all the vmscan functions to return the number-of-reclaimed pages and
      remove scan_control.nr_reclaimed.
      
      Saves ten-odd bytes of text and makes things clearer and more consistent.
      
      The patch also changes the behaviour of zone_reclaim() when it falls back to
      slab shrinking.  Christoph says:
      
        "Setting this to one means that we will rescan and shrink the slab for
        each allocation if we are out of zone memory and RECLAIM_SLAB is set.  Plus
        if we do an order 0 allocation we do not go off node as intended.
      
        "We better set this to zero.  This means the allocation will go offnode
        despite us having potentially freed lots of memory on the zone.  Future
        allocations can then again be done from this zone."
      
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      05ff5137
    • [PATCH] vmscan: use unsigned longs · 69e05944
      Andrew Morton authored
      Turn basically everything in vmscan.c into `unsigned long'.  This is to avoid
      the possibility that some piece of code in there might decide to operate upon
      more than 4G (or even 2G) of pages in one hit.
      
      This might be silly, but we'll need it one day.
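
To make the 4G limit concrete, here is a small user-space sketch. It assumes a 32-bit unsigned int and, for the second helper, an ABI where unsigned long is 64 bits wide (as on 64-bit Linux); count_as_uint() and count_as_ulong() are hypothetical names, not kernel functions.

```c
/* Store a page count in an unsigned int.  Unsigned conversion is
 * well-defined in C: the value is reduced modulo UINT_MAX + 1. */
unsigned int count_as_uint(unsigned long long nr_pages)
{
	return (unsigned int)nr_pages;
}

/* Store the same count in an unsigned long.  On an ABI with a 64-bit
 * unsigned long the value survives intact. */
unsigned long count_as_ulong(unsigned long long nr_pages)
{
	return (unsigned long)nr_pages;
}
```

With a 32-bit unsigned int, a count of exactly 4G pages (1ULL << 32) comes back as 0 from count_as_uint(), while count_as_ulong() preserves it where unsigned long is 64 bits wide. On 32-bit ABIs both types are the same width, so the change makes no difference there.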
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      69e05944
    • [PATCH] vmscan: scan_control cleanup · 179e9639
      Andrew Morton authored
      Initialise as much of scan_control as possible at the declaration site.  This
      tidies things up a bit and assures us that all unmentioned fields are zeroed
      out.
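
The zeroing guarantee comes from C's initializer rules: when a struct is initialised with a (designated) initializer list, every member not named in the list is initialised as if it had static storage duration, that is, to zero. A minimal sketch, using a hypothetical trimmed-down struct rather than the real struct scan_control, and an illustrative value of 32 for swap_cluster_max:

```c
/* Hypothetical, trimmed-down stand-in for struct scan_control. */
struct scan_control_sketch {
	unsigned long nr_mapped;
	unsigned int gfp_mask;
	int may_writepage;
	int may_swap;
	int swap_cluster_max;
};

struct scan_control_sketch make_scan_control(unsigned int gfp_mask)
{
	/* Only three members are named; nr_mapped and may_writepage
	 * are implicitly zero-initialised. */
	struct scan_control_sketch sc = {
		.gfp_mask = gfp_mask,
		.may_swap = 1,
		.swap_cluster_max = 32,	/* illustrative value */
	};

	return sc;
}
```

Compared with declaring the struct and assigning fields one by one, this form cannot leave a newly added field uninitialised.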
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      179e9639
    • [PATCH] Thin out scan_control: remove nr_to_scan and priority · 8695949a
      Christoph Lameter authored
      Make nr_to_scan and priority parameters instead of putting them into
      scan_control.  This allows various small optimizations and IMHO makes the
      code easier to read.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8695949a
    • [PATCH] slab: use on_each_cpu() · a07fa394
      Andrew Morton authored
      Slab duplicates on_each_cpu().
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a07fa394
    • [PATCH] on_each_cpu(): disable local interrupts · 78eef01b
      Andrew Morton authored
      When on_each_cpu() runs the callback on other CPUs, it runs with local
      interrupts disabled.  So we should run the function with local interrupts
      disabled on this CPU, too.
      
      And do the same for UP, so the callback is run in the same environment on both
      UP and SMP.  (strictly it should do preempt_disable() too, but I think
      local_irq_disable is sufficiently equivalent).
      
      Also uninlines on_each_cpu().  softirq.c was the most appropriate file I could
      find, but it doesn't seem to justify creating a new file.
      
      Oh, and fix up that comment over (under?) x86's smp_call_function().  It
      drives me nuts.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      78eef01b
    • [PATCH] slab: Remove SLAB_NO_REAP option · ac2b898c
      Christoph Lameter authored
      SLAB_NO_REAP is documented as an option that will cause this slab not to be
      reaped under memory pressure.  However, that is not what happens.  The only
      thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused
      slab elements that were allocated in batch in cache_reap().  cache_reap()
      is run every few seconds independently of memory pressure.
      
      Could we remove the whole thing?  It is only used by three slabs anyway and
      I cannot find a reason for having this option.
      
      There is an additional problem with SLAB_NO_REAP.  If set then the recovery
      of objects from alien caches is switched off.  Objects not freed on the
      same node where they were initially allocated will only be reused if a
      certain amount of objects accumulates from one alien node (not very likely)
      or if the cache is explicitly shrunk.  (Strangely __cache_shrink does not
      check for SLAB_NO_REAP)
      
      Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ac2b898c